The Need for a Data Insights Core Unit

The last couple of weeks proved how important data is. When the market crashed, everybody was hunting for data, looking at charts and trying to figure out what had just happened and what to do next. And it turned out that getting the right data is complicated in many respects.

TL;DR: we need a Data Insights Core Unit in the DAO.


Data sources

To analyze data you need to have the data first. In the blockchain world all of it is public and practically lying in the street: everybody is free to query a node and see the full history of any state and action. Additionally, there are many good sources of reliable and free off-chain data (e.g. market prices, NFT metadata repositories, etc.).

So why are there so many questions and so few answers? There are two main reasons:

  • data is complicated
  • nodes are slow

On-chain data is complicated because it is very raw and technical. Additionally, due to the growing complexity of DeFi and high gas prices, everything gets so squeezed and optimized that it is really hard to work out the real meaning of a transaction. Now, with the addition of L2s, rollups, etc., the entry barrier is enormous. People want to see operations in a business context (e.g. a liquidation, a vote, an arbitrage), but all they see are long bytestrings.
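To make that concrete, here is a tiny sketch of what even the simplest decoding step involves: unpacking a raw ERC-20 Transfer log by hand. The topic/data layout comes from the ERC-20 standard, but the sample log values are made up, and even after decoding the business context (a trade? a liquidation payout? an internal protocol move?) is still missing.

```python
# Minimal, illustrative decoding of a raw ERC-20 Transfer log (plain Python).
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def decode_erc20_transfer(log: dict) -> dict:
    """Turn a raw Transfer(address,address,uint256) log into readable fields."""
    assert log["topics"][0] == TRANSFER_TOPIC, "not an ERC-20 Transfer"
    return {
        "token": log["address"],
        "from": "0x" + log["topics"][1][-40:],   # indexed address, left-padded to 32 bytes
        "to": "0x" + log["topics"][2][-40:],
        "amount_raw": int(log["data"], 16),      # still needs token decimals and context
    }

# Hypothetical log, shaped like an entry from an eth_getLogs response:
sample = {
    "address": "0x6b175474e89094c44da98b954eedeac495271d0f",  # DAI token contract
    "topics": [
        TRANSFER_TOPIC,
        "0x" + "0" * 24 + "1111111111111111111111111111111111111111",
        "0x" + "0" * 24 + "2222222222222222222222222222222222222222",
    ],
    "data": hex(1500 * 10**18),   # 1500 DAI in the token's smallest unit
}

print(decode_erc20_transfer(sample))
```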

Nodes are slow because they are not designed for querying historical data. Try asking a node for the total gas cost of reverted collateral deposits. Even if you know how to do this, you will wait days (or weeks) for the answer.
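As a rough illustration of why, this is roughly what that question forces you to do against a bare node. It is a sketch only, assuming web3.py v6 and an archive node; JOIN_ADAPTER and DEPOSIT_SELECTOR are placeholders, not real Maker addresses or selectors.

```python
from web3 import Web3

NODE_URL = "http://localhost:8545"                             # assumption: your own archive node
JOIN_ADAPTER = "0x0000000000000000000000000000000000000000"    # placeholder contract address
DEPOSIT_SELECTOR = "0xdeadbeef"                                # placeholder 4-byte selector

w3 = Web3(Web3.HTTPProvider(NODE_URL))

def wasted_gas_on_reverted_deposits(start_block: int, end_block: int) -> int:
    """Sum the gas cost (in wei) of reverted deposit calls, the slow way."""
    total = 0
    for number in range(start_block, end_block + 1):
        block = w3.eth.get_block(number, full_transactions=True)
        for tx in block.transactions:
            to = (tx["to"] or "").lower()
            calldata = w3.to_hex(tx["input"])
            if to != JOIN_ADAPTER.lower() or not calldata.startswith(DEPOSIT_SELECTOR):
                continue
            receipt = w3.eth.get_transaction_receipt(tx["hash"])
            if receipt["status"] == 0:                         # the call reverted
                total += receipt["gasUsed"] * tx["gasPrice"]
    return total
```

Two RPC round-trips per block over millions of blocks is where the "days (or weeks)" come from; an indexed, pre-decoded dataset answers the same question with a single query.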

Use cases

The same data can be used for different purposes. Smart contract designers will be interested in called methods' stats and revert reasons, risk researchers will look at derived financial indicators and their distributions, and product analysts will try to understand usage patterns or figure out which GUI was used to make a transaction.

This is mostly the same data, but provided and used in different ways. Requirements and focus change a lot depending on the use case.

Things to be considered:

  • data timeliness and finality
  • data quality
  • level of decoding and classification

Data timeliness is one of the main things to consider. Sometimes you urgently need very current data (‘This transfer is confirmed on Etherscan, so why is there no money in my account?’, ‘This vault should already be liquidated, so why do I still see it here?’). Sometimes you do not care at all what happened today, because your focus is on history (‘What is the average lifespan of vaults that were ever liquidated? Do people abandon them or keep using them?’). Typically this is the monitoring vs. analysis point of view. And data finality is a beast specific to blockchains, because if you push too hard for up-to-the-second data you have to deal with reorgs and similar issues.

And the bad news is that you cannot easily have real-time and historical data at the same time. These two approaches require completely different technology, architecture and skill sets.
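A minimal sketch of that trade-off as an indexer would see it, with stand-in functions for the node client rather than any existing Maker tooling:

```python
CONFIRMATIONS = 20   # assumption: deep enough for analytics; a "real-time" view wants 0-1 and must embrace reorgs

def next_action(last_indexed_number: int, last_indexed_hash: str,
                get_latest_number, get_block):
    """Decide whether to wait, index the next block, or unwind after a reorg."""
    candidate = last_indexed_number + 1
    if get_latest_number() - candidate < CONFIRMATIONS:
        return ("wait", None)                    # block is too fresh to trust yet
    block = get_block(candidate)
    if block["parentHash"] != last_indexed_hash:
        return ("reorg", last_indexed_number)    # our history diverged: roll back and re-index
    return ("index", block)
```

Push CONFIRMATIONS towards zero and you get freshness at the price of constant reorg handling; push it up and you get stable history that always lags the chain.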

Data quality is not free. It requires significant time and effort to constantly ensure that the data is current, complete, consistent, non-duplicated and usable, especially in such a highly volatile environment. And the quality requirements differ depending on the use case. A slightly mismatched debt calculation is crucial for a vault owner but irrelevant for analysts building a DeFi systemic-risk model; completeness of protocol data is very important in the latter case but meaningless to a regular user.

Thus, depending on the requirements, appropriate data quality rules should be agreed on and enforced. But the stricter you want to be, the more it will cost you.
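For illustration, a few such rules run against a hypothetical vault-operations table might look like the snippet below; the column names are assumptions for the sketch, not an existing Maker dataset.

```python
import pandas as pd

def quality_report(ops: pd.DataFrame) -> dict:
    """Cheap automated checks; every failed rule means an investigation, which is where the cost is."""
    return {
        # completeness: every operation should reference a vault and a block
        "missing_vault_id": int(ops["vault_id"].isna().sum()),
        "missing_block":    int(ops["block_number"].isna().sum()),
        # non-duplication: the same tx/log position must not be loaded twice
        "duplicates":       int(ops.duplicated(subset=["tx_hash", "log_index"]).sum()),
        # consistency: debt can never be negative
        "negative_debt":    int((ops["debt_dai"] < 0).sum()),
        # freshness: how far the dataset currently reaches
        "max_block_loaded": int(ops["block_number"].max()),
    }
```

How strict the thresholds are (zero duplicates? five minutes of lag?) is exactly the per-use-case agreement mentioned above, and stricter means more expensive.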

Decoding and classification are the Holy Grail of data analysts. If you are looking for valuable insights you simply have neither the time nor the patience to transform raw data into something usable, over and over again. Data engineering and data science are completely separate realms and should not be confused.

Different analyses require different data preparation. Some things can and should be done in a standardized and uniform way (e.g. decoding calls and events based on the ABI, or decoding state and actions based on standards like ERC-20). But the real usefulness of such data, which is the typical offering of many existing data providers, is quite limited. Only by applying protocol-specific semantics do you get information that can be used to build meaningful reports, dashboards and models, which means you have to understand the protocol all the way down.
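A toy example of the difference, using Maker's Vat.frob as the protocol-specific part: generic ABI decoding stops at the raw tuple (ilk, u, v, w, dink, dart), while protocol semantics turn the signed collateral/debt deltas into an operation a vault owner would recognise. The sign-based mapping below is a simplification for illustration only.

```python
def classify_frob(dink: int, dart: int) -> str:
    """Map a decoded Vat.frob call to a business-level vault operation (simplified)."""
    actions = []
    if dink > 0:
        actions.append("deposit collateral")
    elif dink < 0:
        actions.append("withdraw collateral")
    if dart > 0:
        actions.append("generate DAI")
    elif dart < 0:
        actions.append("pay back DAI")
    return " and ".join(actions) or "no-op / dust adjustment"

print(classify_frob(dink=10 * 10**18, dart=0))   # -> "deposit collateral"
print(classify_frob(dink=0, dart=-5 * 10**18))   # -> "pay back DAI"
```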

The importance of context

Data by itself does not meet users' needs; context is everything. Without the ability to interpret the data and the conditions that existed around each data point, at worst the wrong (or conflicting) conclusions will be drawn and the wrong decisions made.

Ad-hoc requests

It is typical to build data pipelines and reporting systems that repeatedly provide the same information, just updated over time. Whenever you need a new chart or table you add it to the system and keep reading it every day.

But the real problems start in times of panic and fire-fighting. There is no time to design and implement a fancy new dashboard. You just need answers, and you need them now!

This is the moment of truth for your data providers. You quickly realize that either you have no one to ask or you wait in a long queue to get what you need.

A properly designed data environment should be flexible enough to provide quick answers to the most unexpected questions. That requires a wide scope of stored data, a robust data model and a skilled team ready to use the tools when they are needed.


MakerDAO and the Maker Community need data, and the ability to apply context to that data in order to interpret it. A detailed history of the Protocol's state and actions, vault and liquidation stats, governance events, MKR ownership, etc. are just examples of datasets that are of great interest to many users.

This data cannot be easily obtained from existing data providers (Google BigQuery, Dune, etc.) because they offer very generic datasets that require heavy post-processing. Also, some very detailed questions require access to the historical state of contract variables, which nobody currently offers.

An even more important consideration is the need for a dedicated team that has data-related skills, knows the Protocol inside out and is always available when needed. Such a team should serve all other Core Units by assuring data availability, quality and usability, delivering bespoke reports and datasets, answering ad-hoc questions, providing analytical skills, etc. Critically, a CU with deep protocol and ecosystem understanding can apply vital context to the data and will allow the DAO and the Community to draw insight from it.

If it needs proving, just look at the recent situation with a crashing market and waves of liquidations. Some of the existing APIs built by the Foundation's Data Services team (e.g. MCDState, Liq2.0) turned out to be so important that people started using them for real-time dashboards and decision-making processes, even though they were never meant for this and, from an architectural perspective, cannot be updated more often than every 15-20 minutes.

Of course it is possible to create a more robust solution that provides historical and near-real-time information at the same time. The scope of available data can and should also be broadened. Such a solution would, however, require proper infrastructure and ongoing maintenance (from both a technical and a business perspective).

But who should do this?
My thesis is that the DAO strongly needs a dedicated Data Insights CU.

I’d like to start a discussion between all Core Units to understand their data needs and come up with a plan for how to meet them effectively in the long term.

But based on my discussions so far with members of MakerDAO, I see many potential data products that could quickly be delivered to other teams in the form of fully maintained, quality-controlled datasets/APIs on a Service Level Agreement basis:

  • full Vaults history, decoded into a usable form and put into proper context (see the sketch after this list), e.g. operations (successful and reverted) with debt, collateralization, accrued fees, and OSM and market prices at the moment of the transaction,
  • full Liquidations history, also provided in the context of vault state, market state and the keepers involved,
  • protocol PnL analysis, user acquisition and churn, user lifetime value, etc.,
  • full Governance history, also properly decoded, e.g. MKR staking (including future Delegate Contracts), voting, polls, etc.,
  • MKR current and historical balances and transfers, taking into account the real ownership of the token (e.g. MKR locked in Uniswap still belongs to the liquidity provider, not to Uniswap), decentralization level, flash-voting or collusion risks, etc.,
  • DAI usage over time with proper semantics for the protocols used (e.g. trading, depositing, lending, yield farming), external services (e.g. credit cards, shops) and L2 solutions,
  • analysis of source protocols and wallets used…

… and many more, because data needs are endless and only grow over time.
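To make the first item on that list a bit more tangible, here is a purely illustrative record layout for a vault-operations dataset. The field names are a sketch, not a committed schema.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass
class VaultOperation:
    """One vault operation, already joined with the context valid at that block."""
    block_number: int
    tx_hash: str
    vault_id: int
    ilk: str                          # collateral type, e.g. "ETH-A"
    operation: str                    # e.g. "deposit", "generate", "repay", "liquidation"
    reverted: bool
    collateral_delta: Decimal
    debt_delta: Decimal
    debt_after: Decimal               # including accrued stability fees
    collateralization_after: Decimal
    osm_price: Decimal                # oracle price in force at this block
    market_price: Decimal             # off-chain market price at the same timestamp
```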

The details of the potential Data Insights CU are yet to be defined. I’m in the process of drafting the Mandate and Budget documents, but I do not want to go too far without discussing the real needs and demands.

I’m also looking for people interested in bootstrapping such a team.


My name is Tomek Mierzwa (RC: @tmierzwa). I have over 20 years of experience with data engineering, data quality management, data governance and business intelligence.

For the last year and a half I’ve been leading the Data Services team at the Foundation. I have had the pleasure of working with many of you, providing data and insights, answering questions and building data-related products. With the help and inspiration of many people I created the EthTx.info transaction decoder as well as the MCDState, MCDGov and Liq2.0 datasets.

Currently I’m leaving the Foundation and I’m involved in building Token Flow - a venture dedicated to monetizing blockchain data inside and outside the DeFi ecosystem. Regardless, I’d love to help create the Data Insights CU in the DAO by sharing my experience with data and the Protocol.

28 Likes

We need this yesterday. I look forward to seeing the formal proposal and budget.

3 Likes

So huge. Not only do you need a team that can access the data, but one that understands the ecosystem the data is pulled from at that point in time. I don’t think it’s an easy mission, but it is a crucial one.

Speaking as a member of GovAlpha, I know that there are so many Governance projects we have been exploring or would like to explore that could benefit from being able to answer questions of historical use and number of unique wallet interactions. Currently tools like:
https://beta.mcdgov.info/

can serve us for basic inquiries about voting, but there could be a lot more insight from the data if we were able to clean it and distill helpful queries like: how much does gas cost affect voter participation?

Would be in heavy support of this Core Unit forming, happy to toss around ideas with the team in terms of more concrete inquiries if that would be helpful!

2 Likes

Thanks for starting the discussion - it’s an interesting proposal!

I don’t disagree with many of the things you’ve said about the importance of data and insights but it’s not clear if there needs to be a special data insights CU. It may well be that letting the existing CUs handle our data insight needs and communicate among themselves well enough to avoid double work is a better solution.

Almost all our CUs already collect, process, store and get insights from data from many sources. Some of the points you mentioned do exist within our CUs (e.g. the Risk CU presentations). Now, having a new CU that’s supposed to do this job for them could be a net negative by introducing an extra communication layer that acts as a source of friction.

The question is: how much overlapping work in terms of data collection, processing and presentation is going on right now? If the answer is not too much, I’m in favour of letting the individual CUs handle their needs, recruit their own data science teams and generally function in as self-sufficient a way as possible. Even if there’s a lot of overlap and we need to build APIs, have better quality control and then provide them as a service both to other CUs and the wider community, that’s still something that can be done at a CU level.

My concern is that the rate at which new CUs are being proposed leads to a risk of a bloated organization with too many cooks in the kitchen.

I agree, but Maker is both starved for data and not doing a good job advertising internally what data we do have.

Example: Before I asked and made a very crude attempt, did we have any idea whether liquidated vaults continued to be used by paying customers?

Example 2: Again, a very crude, ad hoc attempt at what percentage of our outstanding DAI came from how many users.

Example 3: Annualized revenue by ilk, forward or backwards looking

Most of the data for all three of these came from @tmierzwa. To find @tmierzwa, I pretty much followed a long chain of people saying, “I don’t have that data, but try so-and-so.” So I am hesitant to say we have lots of data already floating around waiting to be used.

These are just three random — but fairly important for us to know — questions. I’m sure anyone here can come up with three that are equally important or even more so. We have loads of energetic, curious people scattered across lots of CUs who would love to have/do analysis on our data, but it’s not easy to find it. Even those that already have it usually have other duties and don’t really need to be burdened with answering data requests all the time.

We have a Content CU to make materials for CUs so they don’t have to. I think relieving most CUs of the burden of running down or even analyzing data makes a lot of sense.

3 Likes

My last budget proposal has an open position for a Data person. So if you happen to love SQL, know a bit of Python (maybe even Airflow?) and the blockchain, ping me if you want to work on MakerDAO data.

Having worked on business intelligence for years, my philosophy is that data insights work best when there is a purpose instead of being a tech thing. But even a CU producing amazing pieces of data content is good to have.

I also think that CUs are easier to manage when they are autonomous. We all have different objectives and prioritization in a decentralized organization is no simple thing.

7 Likes

Fair points but given that we’re in a transition period now, a bit of internal disorganization is probably to be expected. If each CU has data people, like Seb is looking for, and they’re all fully operational, I suspect it’ll get a lot easier. At that point, it isn’t clear if a separate data insights CU will be adding value or not.

This works well in theory but in practice, there’ll be quite a bit of friction. I say this as someone who has seen how data science teams work in a large company. Given that everything is on the blockchain (i.e. data availability isn’t an issue), it’s likely a lot quicker and simpler for the CUs to extract what they need directly than go through an intermediary CU.

All this is not to be a party pooper - it’s just that the idea of having a common data store from which CUs can extract insights they need is something that sounds good in theory but tends to not work as well in practice. The same set of people of the proposed CU can instead be integrated within existing CUs. This will let them perform their job a lot better in my opinion.

Given that @tmierzwa has been the data guy so far, perhaps he’s got the best overview on this one to judge what the best way forward is.

2 Likes

Don’t worry. If there’s one thing that you can probably guess about me, it’s that when budget renewal comes around, I’m happy to be That Guy to ask if a CU is superfluous.

As an aside, I’d feel more comfortable if the DAO’s “eyes” worked directly for the DAO, and not as the employees/contractors of the various CUs — who at the end of the day are our contractors. That doesn’t mean we shouldn’t have data positions within CUs, but there also needs to be an independent way for the DAO in general to inspect data and draw conclusions so we’re not stuck evaluating CU performance only by self-reported data.

Maybe we don’t need a permanent Data CU, maybe we do. I’m agnostic on that. But we definitely do not have a firm grasp on describing — much less analyzing — our business quantitatively right this moment. If this is the fastest way to close that gaping hole, I’m all for it*

*Pending what the actual mandate and budget proposals say

1 Like

Thanks a lot for your valuable and kind feedback.

I’ll try to give more detailed answers later, but I want to address the one point that seems most disputed so far: centralized vs. decentralized data functions.

  1. It’s obvious to me that most CUs should have their internal ‘data guys’. But what exactly is such a role? Probably he or she is a data analyst / data scientist, somebody with great analytical skills and domain knowledge; somebody with a business, financial and/or sociological background, most probably with experience from the non-crypto world (I guess the supply of great analysts with crypto experience is rather limited these days :slight_smile: ). Now try to teach this person how to extract on-chain data. Or buy them access to Dune so they can aggregate some Vat events to get vault users’ profitability over time. A bit of SQL and Python might not be enough to get the required information quickly. The gap between analytical and engineering skills is huge in the traditional data world, but in the blockchain space it is simply enormous.

  2. I’m not suggesting that data roles should be centralized in a single place. Quite the opposite: I believe that each Core Unit should have a strong analytical team. What I’m saying is that centralizing the data collection, quality assurance, semantic decoding and ongoing maintenance processes is the enabler of such a model. Without it, Core Units will burn time and money on tedious, repetitive tasks. In the corporate non-crypto world, which I know very well, it’s typical to waste 70-80% of the expensive time of data analysts / data scientists on data-hunting tasks which they truly hate. I’m sure that in our world it’s even worse.

  3. As I discussed in my post, every Core Unit has completely different and specific information use cases. But typically their needs are based on the same data, just processed and aggregated in different ways. So having a single source of truth, i.e. well-maintained, properly documented and trusted datasets, is the simplest way to avoid:

  • the cost of repeating the same data collection, integration, enrichment and control tasks
  • potentially conflicting reports and disputes over which data is correct
  • frantic searches for new data when it is unexpectedly and urgently needed
  • demotivated data analysts bogged down with source-data problems

10 Likes

Big support for such a Core Unit, and I’ll tell you why, having worked about 2.5 years for Maker independently, outside of the Foundation.

If anyone thinks it is easy to find “a data guy” who knows the Maker architecture well enough to surface important data, they are wrong. I can almost say this was my most traumatic experience when working for Maker independently, especially after mkr.tools became unusable following the MCD launch.

Such a person is very costly and needs time to get to know every piece of Maker’s code in order to surface the data that is needed. And even if you were able to find such a person, it was almost better for them to do smart contract work. @hexonaut is a good example: I actually wished he could help us with on-demand on-chain data and be part of the team (he even developed mkranalytics.com), but obviously he is much better utilized working for the Protocol Engineering Core Unit.

Putting it all together, I can’t imagine every CU paying its own data scientist for this; it would be too costly and inefficient. And yes, I am obviously biased because such a Core Unit would bring a lot of benefits for us, but I am sure it would for many others as well.

12 Likes

@tmierzwa Great to see this idea coming to life. I think this is in line with a post I made a little while ago. Having worked in analytics for the best part of the last 5 years or so, I recognise how tricky working with blockchain/node data can be, and maintaining it is even trickier. A number of projects have already tried cracking this, from Alethio back in the day to many others. And with the upcoming complexities of L2, I can only see the engineering required to retrieve valuable, insightful data in this space becoming more and more complex.

That said, I’ve worked with a variety of teams at different stages of maturity in their data journeys and their engagement with “data consumers” (data analysts, data viz, data scientists, end users, etc.). Usually the projects with the most success were the ones where data engineering (“data producers”) was embedded into the teams it served (“data consumers”), rather than working in isolation. Data governance is increasingly moving to a federated rather than centralised model of serving users. So my 2 cents would be to embed data people within the Core Units, working close to the problems those teams are facing, and to transform all CUs into data-driven teams from within. Then, to complement that, make sure those embedded data engineers share common frameworks across the CUs they serve.

7 Likes

Will, you hit the nail on the head…

This federated model, with lightweight centralized core engineering processes, metadata management and technical standards supporting data producers and consumers embedded in Core Units, is the best TO-BE model for the DAO. But it requires time to build competences and mature processes, so it will not happen over a weekend.

I believe the whole journey should start with a somewhat more centralized model that slowly dissolves over time, rather than every CU sourcing and processing data in isolation. The latter approach means higher costs, higher risks and tons of frustration.

6 Likes

I guess a bit of a roadmap & solution architecting for moving from the centralised data model to a federated data engineering solution would be the first mission to sort out :wink: Fully supportive. @tmierzwa Count on me if you need “data consumer” input into the design.

3 Likes

Now the hard part: where do we find the person(s) to fill this very niche role? They would certainly be highly sought-after individuals.

@tmierzwa It’s great to see a proposal that will help generate and share blockchain data insights. Within the Protocol Engineering Team, I see this as working side-by-side with us to explore and better understand DAI usage and metrics, identify pools of liquidity when considering L2s or other protocols, do transaction analysis when needed, and perform liquidation analysis for system improvements, etc. Looking forward to working with you in these and other areas.

7 Likes

Following up, because I’d like to see this unit happen.

1 Like

Hi @PaperImperium, thanks for this post :slight_smile:
In the background we are trying to structure the initial team and calculate the budget needs.
It should result in the CU proposal documents within the next two weeks, hopefully sooner.
I’ll do my best to make this happen, because I strongly believe it’s a good thing for the DAO.

6 Likes

I’m not an IT professional, but this waste of professional manpower on routine tasks is really typical; I see it in other professions as well.

Since this post was written, AI assisted programming has become available, opening up new possibilities for more useful and faster day to day operations.

Again, as I’m not an IT professional, I can’t judge how useful this might be in solving this problem, or in speeding up other programming tasks within Maker. It seems to me that it could have as much impact on programming as DeFi has had on traditional finance.

I would be interested to hear your views on whether this technology could be useful to you.

1 Like