The last couple of weeks have proved how important data is. When the market crashed, everybody was hunting for data, looking at the charts, trying to figure out what had just happened and what to do next. And it turned out that getting the right data is complicated, in many respects.
TL;DR: we need a Data Insights Core Unit in the DAO.
To analyze the data you need to have the data first. In the blockchain world all the data is public and effectively lying in the street. Everybody is free to query a node and see the full history of any state and action. Additionally, there are many good sources of reliable and free off-chain data (e.g. market prices, NFT metadata repositories, etc.).
So why are there so many questions and so few answers? There are two main reasons:
- data is complicated
- nodes are slow
On-chain data is complicated because it is very raw and technical. Additionally, due to the growing complexity of DeFi and high gas prices, everything gets so squeezed and optimized that it is really hard to make sense of the real meaning of a transaction. Now, with the addition of L2s, rollups, etc., the entry barrier is enormous. People want to see operations in a business context (e.g. liquidation, vote, arbitrage) but all they see are long bytestrings.
Nodes are slow because they are not designed for querying historical data. Try asking a node for the total gas cost of reverted collateral deposits. Even if you know how to do it, you will wait days (or weeks) for the answer.
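To make this concrete, here is a rough sketch of what the naive approach looks like, assuming web3.py and an archive node. The endpoint, the join adapter address and the function selector are placeholders, not real Protocol values; the point is the shape of the work, one RPC round-trip per candidate transaction over millions of blocks.

```python
# A naive scan, purely illustrative: the endpoint, adapter address and
# selector below are placeholders, not real Protocol values.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://your-archive-node"))  # placeholder endpoint

JOIN_ADAPTER = "0xPlaceholderJoinAdapterAddress"  # hypothetical collateral adapter
DEPOSIT_SELECTOR = "0xdeadbeef"                   # hypothetical 4-byte method selector

total_cost_wei = 0
for block_number in range(12_000_000, 12_010_000):           # only 10k blocks of many millions
    block = w3.eth.get_block(block_number, full_transactions=True)
    for tx in block.transactions:
        calldata = tx["input"]
        calldata = calldata.hex() if hasattr(calldata, "hex") else calldata
        if tx["to"] != JOIN_ADAPTER or not calldata.startswith(DEPOSIT_SELECTOR):
            continue
        receipt = w3.eth.get_transaction_receipt(tx["hash"])  # one more RPC call per candidate tx
        if receipt["status"] == 0:                            # 0 means the transaction reverted
            total_cost_wei += receipt["gasUsed"] * tx["gasPrice"]

print(f"{total_cost_wei / 1e18:.4f} ETH spent on reverted deposits in this range")
```

A dedicated data pipeline inverts this: it ingests and decodes everything once, stores it in a queryable form, and the same question becomes a single aggregate query.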
The same data can be used for different purposes. Smart contract designers will be interested in statistics of called methods and revert reasons, risk researchers will look at derived financial indicators and their distributions, and product analysts will try to understand usage patterns or figure out which GUI was used to make a transaction.
This is mostly the same data, but provided and used in a different way. Requirements and focus change a lot depending on the use case.
Things to be considered:
- data timeliness and finality
- data quality
- level of decoding and classification
Data timeliness is one of the main things to consider. Sometimes you urgently need very current data ('This transfer is confirmed on Etherscan, so why is there no money in my account?!', 'This vault should already be liquidated, so why do I still see it here?'). Sometimes you do not care at all what happened today, because your focus is on the history ('What is the average lifespan of vaults that were ever liquidated? Do people abandon them or keep using them?'). Typically this is the monitoring vs. analytics point of view. And data finality is a beast specific to blockchains, because if you push too hard for up-to-the-last-second data you have to deal with reorgs and the like.
And the bad news is that you cannot easily have real-time and historical data at the same time. These two approaches require completely different technologies, architectures and skill sets.
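As a small illustration of the trade-off, one common compromise (assumed here, not anything Protocol-specific) is to serve the chain head for monitoring while only ingesting blocks below a chosen confirmation depth into the historical store, and to re-check stored blocks against the canonical chain:

```python
# A minimal sketch, assuming web3.py; the confirmation depth is a policy choice,
# not a guarantee provided by the chain itself.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://your-node"))  # placeholder endpoint
CONFIRMATIONS = 20                                  # assumed safety margin against shallow reorgs

head = w3.eth.block_number
live_view = w3.eth.get_block(head)                  # monitoring: accept the reorg risk
settled_cutoff = head - CONFIRMATIONS               # history: only ingest below this height

def is_still_canonical(stored_number: int, stored_hash: str) -> bool:
    """Verify that a block ingested earlier is still on the canonical chain."""
    return w3.eth.get_block(stored_number)["hash"].hex() == stored_hash
```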
Data quality is not free. It requires significant, continuous time and effort to ensure that the data is current, complete, consistent, non-duplicated and usable, especially in such a highly volatile environment. And the requirements for quality differ depending on the use case. A slightly mismatched debt calculation is crucial for the vault owner but irrelevant for analysts building a DeFi systemic risk model. Completeness of protocol data is very important in the latter case but meaningless for a regular user.
Thus, depending on the requirements, proper data quality rules should be agreed upon and enforced. But the stricter you want to be, the more it will cost you.
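Just to show what such rules can look like in practice, here is a small sketch over a hypothetical table of vault operations (the column names are made up for the example, not an existing dataset):

```python
# A minimal sketch of explicit data quality rules, assuming a pandas DataFrame
# of vault operations with hypothetical columns: tx_hash, debt, osm_price.
import pandas as pd

def quality_report(ops: pd.DataFrame) -> dict:
    """Return a simple pass/fail report for a few basic quality rules."""
    return {
        # non-duplicated: the same transaction must not be ingested twice
        "no_duplicate_txs": not ops["tx_hash"].duplicated().any(),
        # consistency: vault debt can never be negative
        "no_negative_debt": bool((ops["debt"] >= 0).all()),
        # usability: every operation carries the OSM price at that moment
        "no_missing_prices": not ops["osm_price"].isna().any(),
    }
```

Every additional rule (cross-checking totals against on-chain state, validating against independent sources, etc.) adds cost, which is exactly the trade-off above.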
Decoding and classification are the Holy Grail for data analysts. If you are looking for valuable insights you simply have neither the time nor the patience to transform raw data into something usable, over and over again. Data Engineering and Data Science are completely separate realms and should not be confused.
Different analyses require different data preparation. Some things can and should be done in a standardized and uniform way (e.g. decoding calls and events based on the ABI, decoding state and actions based on standards like ERC20). But the real usefulness of such data, which is the typical offering of many existing data providers, is quite limited. Only by applying protocol-specific semantics do you get information that can be used to build meaningful reports, dashboards and models. Which means you have to understand the protocol all the way down.
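For reference, the generic, standards-based layer mentioned above is roughly this (a sketch assuming a recent web3.py; the node URL is a placeholder). It turns raw logs into named ERC20 Transfer events, but deciding whether a given transfer is a liquidation payout, a flash loan leg or an arbitrage step is the protocol-specific part that still has to be built on top.

```python
# A minimal sketch of generic ABI-based decoding, assuming a recent web3.py.
from web3 import Web3
from web3.logs import DISCARD

w3 = Web3(Web3.HTTPProvider("https://your-node"))  # placeholder endpoint

# Standard ERC20 Transfer event definition, enough to decode matching logs
ERC20_TRANSFER_ABI = [{
    "anonymous": False,
    "name": "Transfer",
    "type": "event",
    "inputs": [
        {"indexed": True,  "name": "from",  "type": "address"},
        {"indexed": True,  "name": "to",    "type": "address"},
        {"indexed": False, "name": "value", "type": "uint256"},
    ],
}]

def decode_transfers(tx_hash: str):
    """Decode all ERC20 Transfer events emitted by a transaction."""
    receipt = w3.eth.get_transaction_receipt(tx_hash)
    erc20 = w3.eth.contract(abi=ERC20_TRANSFER_ABI)
    # Logs that do not match the Transfer signature are silently discarded
    return erc20.events.Transfer().process_receipt(receipt, errors=DISCARD)
```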
Data in itself does not meet the needs of users. Context is everything. Without being able to interpret the data and the conditions that existed around each data point, at worst the wrong (or conflicting) conclusions will be drawn and the wrong decisions will be made.
It’s typical to build data pipelines and reporting systems that repeatedly provide the same information, only refreshed over time. Whenever you need a new chart or table, you just add it to the system and keep reading it every day.
But the real problem starts in times of panic and firefighting. There is no time to design and implement a new fancy dashboard. You just need answers. You need them now!
This is the moment of truth for your data providers. You quickly realize that either you have no one to ask or you wait in a long queue to get what you need.
A properly designed data environment should be flexible enough to provide quick answers to the most unexpected questions. That requires a wide scope of stored data, a robust data model and a skilled team of people ready to use the tools when they are needed.
MakerDAO and the Maker Community need Data and the ability to apply Context to the data in order to interpret it. Detailed histories of Protocol state and actions (vault and liquidation stats, governance events, MKR ownership, etc.) are just examples of datasets that are very interesting to many users.
This data cannot be easily obtained from existing data providers (Google BigQuery, Dune, etc.) because they offer very generic datasets that require heavy post-processing. Also, some very detailed questions require access to the historical state of contract variables, which nobody currently offers.
An even more important consideration is the need for a dedicated Team that has data-related skills, knows the Protocol inside out and is always available when needed. Such a team should serve all other Core Units by ensuring data availability, quality and usability, by delivering bespoke reports and datasets, answering ad-hoc questions, providing analytical skills, etc. Critically, such a CU, with its deep protocol and ecosystem understanding, can apply vital context to the data and will allow the DAO and the Community to draw insights from it.
If it needs proving, let’s just look at the recent situation with a crashing market and waves of liquidations. Some of the existing APIs built by the Foundation’s Data Services team (e.g. MCDState, Liq2.0) turned out to be so important that people started using them for real-time dashboards and decision-making processes, even though they were never meant for this and, for architectural reasons, cannot be updated more often than every 15-20 minutes.
Of course it is possible to create a more robust solution that offers historical and near-real-time information at the same time. And the scope of available data can and should also be broadened. Such a solution would also require proper infrastructure and ongoing maintenance (from both the technical and business perspective).
But who should do this?
My thesis is that the DAO strongly needs a dedicated Data Insights CU.
I’d like to start a discussion between all Core Units to understand their data needs and come up with a plan on how to effectively meet them in the long term.
But based on my current discussions with members of MakerDAO, I see many potential data products that could quickly be delivered to other teams in the form of fully maintained and quality-controlled datasets/APIs on a Service Level Agreement basis:
- full Vaults history, decoded into a usable form and put into a proper context, e.g. operations (successful and reverted) with debt, collateralization, accrued fees, OSM and market prices at the moment of the transaction,
- full Liquidations history, also provided in a context of vault state, market state and keepers involved,
- protocol PnL analysis, user acquisition and churn, user lifetime value, etc.,
- full Governance history, also properly decoded, e.g. MKR staking (including future Delegate Contracts), voting, polls, etc.,
- MKR current and historical balances and transfers, taking into account the real ownership of the token (e.g. MKR locked in Uniswap still belongs to the liquidity provider, not to Uniswap), decentralization level, flash voting or collusion risks, etc.,
- DAI usage in time with proper semantics of used protocols (e.g. trading, depositing, lending, yield farming), external services (e.g. credit cards, shops) and L2 solutions,
- analysis of source protocols and wallets used…
… and many more, because data needs are endless and only growing in time.
The details of the potential Data Insights CU are yet to be defined. I’m in the process of drafting the Mandate and Budget documents, but I do not want to go too far without discussing the real needs and demands.
I’m also looking for people interested in bootstrapping such a team.
My name is Tomek Mierzwa (RC: @tmierzwa). I have over 20 years of experience with data engineering, data quality management, data governance and business intelligence.
For the last year and a half I’ve been leading the Data Services team at the Foundation. I had the pleasure of working with many of you, providing data and insights, answering questions and building data-related products. With the help and inspiration of many people I created the EthTx.info transaction decoder as well as the MCDState, MCDGov and Liq2.0 datasets.
Currently I’m leaving the Foundation and I’m involved in building Token Flow - a venture dedicated to monetizing blockchain data inside and outside the DeFi ecosystem. Regardless, I’d love to help create the Data Insights CU in the DAO by sharing my experience with data and the Protocol.