MIP39c2-SP29: Adding TechOps Core Unit (TECH-001)

Preamble

MIP39c2-SP: #29
Author(s): @simonkp
Contributors: @georgen, @dumitru, @lukaszb, @wouter
Tags: core-unit, cu-tech-001, mandate
Status: Formal Submission
Date Applied: 2021-12-08
Date Ratified: <yyyy-mm-dd>
Forum URL: https://forum.makerdao.com/t/mip39c2-sp29-adding-techops-core-unit-tocu-001

Sentence Summary

MIP39c2-SP29 adds the TechOps Core Unit (TECH-001) to handle the system administration and technical support needs of the Maker Protocol and its Core Units.

Paragraph Summary

The TechOps Core Unit will handle the system administration and technical support needs of the Maker Protocol and its Core Units. TECH-001 will strive to improve communication and collaboration between developers, end users, and other stakeholders by applying DevOps principles to software delivery and first-class technical support.

Specification

Motivation

TECH-001 looks forward to passionately supporting secure, reliable and transparent infrastructure in order to improve MakerDAO’s collaboration, agility, and resilience. Stakeholders in the Maker Ecosystem, such as the Maker Community, Maker Governance, other CUs, and users of the Maker Protocol (DAI holders), can rely on an experienced team to set up and securely maintain their infrastructure.

Core Unit ID

TECH-001

Core Unit Name

TechOps

Core Unit Team

The essential factor of TechOps services is 24/7/365 support. This requires proper coverage across all time zones. The Core Unit team already covers all time zones, with Engineers in the Americas (1 Engineer), Europe (2) and APAC (1). We plan to hire another 3 Engineers in the near future to strengthen this coverage in the Americas, Europe and APAC.

Another important factor of TechOps services is cost effectiveness. To combine these two factors (24/7/365 support and cost effectiveness), the team will also provide services to other Core Units and commercial entities (subject to separate arrangements), and settle the accounts based on FTE (Full Time Equivalent) allocation.
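As an illustration only, one possible reading of settling accounts by FTE allocation is a proportional split of a shared cost. The function name, the client labels and the figures below are hypothetical and are not taken from this proposal or its budget.

```python
from __future__ import annotations


def split_cost_by_fte(total_cost: float, fte_by_client: dict[str, float]) -> dict[str, float]:
    """Split a shared cost in proportion to the FTE allocated to each client.

    Hypothetical helper: the proposal only states that accounts are settled
    based on FTE allocation, not how the split is computed.
    """
    total_fte = sum(fte_by_client.values())
    if total_fte <= 0:
        raise ValueError("total FTE allocation must be positive")
    return {
        client: total_cost * fte / total_fte
        for client, fte in fte_by_client.items()
    }


# Example with made-up numbers: 3.0 FTE for one client, 1.0 FTE for another.
shares = split_cost_by_fte(1000.0, {"client_a": 3.0, "client_b": 1.0})
```

Under this assumption, each client's share scales linearly with the FTE it is allocated, so changing one allocation changes every client's share of the same total.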

The structure below presents the current team structure built around the mission of providing technical support services to the MakerDAO Ecosystem.

Role                          People   FTE
Facilitator/DevOps Engineer   2        1.6
DevOps Engineer               2        1.6
Project Manager               0        0.0
Operations Consultant         1        0.2
TOTAL                         6        3.4

The structure below presents a prospective team structure with planned additional hires.

Role                          People   FTE
Facilitator/DevOps Engineer   2        1.6
DevOps Engineer               5        3.4
Project Manager               1        0.5
Operations Consultant         1        0.2
TOTAL                         9        5.7

Core Unit Facilitator - 2

  • Communications with Governance and MakerDAO Community
  • Agile workload management
  • Managing budget and strategy
  • 24-hour availability due to time zone coverage
  • No single point of failure
  • The Facilitators will also have the responsibilities of the DevOps Engineer role

Engineering - currently 4

  • Securely manage existing and provision new infrastructure
  • Operate the internal support desk
  • Select, secure, and manage administrative cloud services
  • Manage logging, monitoring, detection and recovery of hosted services
  • Software development and life cycle support
  • 24-hour availability due to coverage of different time zones
  • No single point of failure

Project Manager - 0.5 FTE

We will be looking to hire for this role in the near future. The main objective is to free the Facilitators to focus more on engineering tasks and communications with the community.

Core Unit Mandate

Mission

To provide technical support services to MakerDAO stakeholders and liaise with external service providers while ensuring the effectiveness, reliability and security of the MakerDAO infrastructure layer.

Vision

TECH-001 is a team of passionate professionals with a quality-first attitude, extensive experience in the Maker Ecosystem and a strong interest in the ever-developing Web3 space.

The infrastructure we deliver is reinforced by:

  • Reliability - Secure and reliable operations, resulting in high service availability, robust monitoring, and regular safe deployments
  • Support - 24/7 detection and incident response with high level of redundancy between critical components and team members
  • Transparency - Accessibility and transparency to other CUs and the broader MakerDAO community. Stakeholders will be kept informed regularly about the state of the infrastructure, its cost structure and the tradeoffs involved

TECH-001 closely collaborates with the stakeholders of the Maker Ecosystem facilitating:

  • Education - TECH-001 properly educates stakeholders about operational security best practices and regularly reviews them for improvements
  • Point of contact - TECH-001 facilitates incident response and acts as a first line of support for external security researchers and Maker Ecosystem participants
  • Integration - New participants in the Maker Ecosystem and third-party integrators can access and reuse infrastructure scripts and recipes from the service catalogue created and maintained by TECH-001

Strategy

By following the rapid continuous improvement and innovation cycle between TechOps and other Core Units, we strive to improve communication and collaboration between the developers, end users, and other stakeholders (that is, all participants in the MakerDAO Ecosystem, not just technical Core Units) by applying DevOps and SRE (Site Reliability Engineering) principles.


Figure (not reproduced here). Source: atlassian.com

Plan

  • Establish clear priorities: Ensure that urgent or time sensitive work is prioritised appropriately
  • Conduct peer code reviews: A minimum of two other engineers looking at every code change
  • Limit WIP (Work in Progress): Minimise context switching and improve quality of work
  • Knowledge sharing sessions: Analyse our performance and update the process as necessary

Provision Infrastructure with Infrastructure as Code

  • Easy to understand and share with others
  • Simple and fast to change, upgrade, and scale
  • Fast feedback from problems

Service Building & Continuous Delivery

We work closely with the developers (and other Core Units) to provide them with service delivery pipelines for their service’s code repositories. This allows them to get their work done in smaller batches and automatically deploy new changes, leading to higher quality software and faster feedback from tests and the user.

Monitoring Applications, Protocol & Infrastructure

Providing monitoring components such as:

  • Performance monitoring
  • Insight into system components
  • Metrics, Logs and Dashboards
  • Alerting infrastructure

Continuous Feedback & Transparency

  • From the Stakeholders: Regular meetings that give stakeholders a chance to share feedback
  • Metrics and Logs: Monitoring systems set up to gather a constant stream of data that helps us make the infrastructure more reliable

Read the full version of our Mission, Vision and Strategy

Products and Services

The main areas of responsibilities are:

  • Hosting and supporting Critical Maker Components in collaboration with other CUs
  • General System Administration
  • DevOps Services to deliver services reliably
  • Research and Development to future proof the infrastructure we manage

Note: Due to the limited number of FTE resources and costly cloud infrastructure (see Budget MIP), it is up to TECH-001 to triage incoming requests and, in some cases, to declare work out of scope and infrastructure non-critical. TECH-001 will not use DAO funds lightly to support non-critical infrastructure.

Critical Maker Components

The TechOps Core Unit will maintain a registry of critical Maker infrastructure and support the activities that it will offer. The team will work with the relevant stakeholders (for example, the Immunefi Security Core Unit, IS-001) to keep this list up-to-date and ensure that it evolves with the changing needs of the MakerDAO Ecosystem and the protocol.

The initial list is included below:

  • Collaboration with other Core Units
    • Protocol Engineering (PE-001) - Ethereum nodes provisioning, administration and monitoring

    • Oracle (ORA-001)

      • Administration, Monitoring and new Collateral onboarding
      • The addition of each new collateral type touches various systems that need to be reconfigured to accept that collateral, such as Oracle Feeds and Relayers, monitoring, dashboards, keepers, the changelog and spell whitelisting.

      Note: A different MIP will be put forward to the community to transfer the ownership of the current Oracle Feed and Relayers from the Maker Foundation to TECH-001.

    • Development & UX (DUX-001) - GovPollDB hosting & maintenance

    • GovAlpha (GOV-001) - Monitoring spells and voting

    • Immunefi Security (IS-001) - Runbooks for incident response and emergency procedures
    • VulcanizeDB - Maintenance, hosting, and new collateral support
    • SAI API - Legacy support

Critical Maker protocol components such as:

  • Flap auctions

  • Forum*

  • Website*

  • Blog*

  • Discord administration

  • Various keepers - open source services to facilitate Maker smart contracts operations

    * Note: TECH-001 will only take responsibility for the infrastructure hosting exclusively as Platform Manager, and will not be responsible for the content hosted on these platforms.
    The team will work with the community and follow the procedures laid out in MIP60 to establish Content Managers for the respective platforms.
    At the very minimum the contact details of the Content Manager need to be publicly available on the platform. Platforms without a designated Content Manager will be taken offline after a notice period of 70 days.

    ** Note: Gas costs are not included in our MIP40c3-SP28 budget proposal. TECH-001 will work with the relevant stakeholders and MakerDAO governance to organise the provisioning of ETH to cover these costs. TECH-001 will at all times manage these funds in service of the MakerDAO community and will at no time take ownership of the assets involved in the process. The Maker Protocol will be added as beneficiary in the multisigs that are part of this setup wherever practically feasible.

  • Infrastructure Monitoring & Alerts: Dashboards, response to alerts and Reports

    Note: TECH-001 is not responsible for any actions taken from looking at the metrics presented through the dashboards hosted by TECH-001. All the data is public and available to be consumed by everyone.

  • Development & QA

  • Infrastructure design, CI/CD pipelines and staging environments

  • Technical Support: Supporting the CUs we work with by setting up accounts, helping with infrastructure setup and configuration, educating on security best practices, and so on. 24/7 follow-the-sun support, assessment and remediation

  • Tools administration and development: 3rd party services administration and management, such as:

    • Development of Discord bots and integration with other services
    • Grafana dashboards development, which are then made available to others
    • PagerDuty alarms development and integration with other services
    • Development of various keeper services (those listed above and others per community needs)
    • Maintenance of projects used by the keepers (pymaker, pygasprice-client)

General System Administration

  • Infrastructure hosting for the CUs that we work with
  • Documentation of critical components
  • Cloud providers management
    • Multiple cloud providers to prevent lock-in, add pricing options and introduce fault tolerance. Automated with Infrastructure as Code.
  • Network & Security
    • Virtual Private Cloud (VPC) and Firewall management
  • Load balancing
    • Dynamic upstreams and SSL certificates automation
  • Testing (services and infrastructure)
  • Database administration
  • Secrets and service credentials management
  • Backups & Restore
    • Database
    • Stateful services filesystem
    • Regular automated restore tests

DevOps Services

  • Infrastructure as Code Automation
    • Cloud Environment provisioning
    • Cost management & Optimisations
  • Source control and artefacts management
    • GitHub and Docker repositories
  • CI/CD - Setting up automated delivery and testing pipelines to deploy Maker services to various environments with confidence
    • GitHub Actions and other CI systems
  • Monitoring, Metrics & Alerts
    • Various server metrics
    • Service availability and performance monitoring
  • Log Aggregation
    • Centralised log data storage for easy developer access, analysis and optional alerting
  • Knowledge Sharing and Training
    • Expected to be continuously learning
    • Provide environment for safe experimentation
    • Regular knowledge sharing presentations within the team and to outside stakeholders

Roadmap and R&D/POC

  • Ongoing improvements of the cloud infrastructure redundancy by further diversifying availability zones, regions, and service providers
  • Explore new decentralised infrastructure architectures introduced by Web3, e.g., IPFS
  • Eth2 nodes administration
  • Container orchestration on Kubernetes
  • Chaos Engineering implementation for testing the redundancy of components
  • Mapping and establishing Service Level Indicators (SLIs) and Service Level Objectives (SLOs). Made popular by Google, SLOs are a tool to help determine what engineering work to prioritise, therefore increasing reliability of services
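As a minimal sketch of how an SLI and an SLO relate, the example below computes an availability SLI from request counts and the share of the error budget still unspent. The function names, the 99.9% target and the request counts are assumptions chosen for illustration; they are not values proposed by TECH-001.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully in a window."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    The error budget is the unreliability the SLO tolerates: 1 - slo_target.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    observed_failure = 1.0 - sli
    return 1.0 - observed_failure / allowed_failure


# Illustrative numbers only: 9,995 good requests out of 10,000, 99.9% SLO.
sli = availability_sli(9_995, 10_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
```

A remaining budget close to zero or below it suggests prioritising reliability work over new changes, which is the prioritisation role of SLOs described in the last roadmap item.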

Related Documents


Sorry I meant to add this post here and not the DAI budget post.

Super excited to hear you’ll be working with Forta (innovative). Will you also deploy agent scripts to monitor new code, and if so, on which components (code, API integrations, etc.)? I’m wondering how you’re thinking about threat monitoring.

Also, can you please talk about how you’re thinking about adopting decentralized serverless architecture (hopefully in the near future) and the integration of innovative tools and activities, like AI/ML?

And last but not least, Flip Flop Flap Delegate is a supporter of the “Clean Money” Initiative, and I’m wondering if your team has thought about a holistic approach to automating components and, on a human level, avoiding personal burnout. I see you folks have a 2-facilitator structure and I’m wondering if you can expand on that. :)

Thank you in advance, and for providing this MIP40 application to the Maker Community.

So happy to see this proposal! Tech Ops is the best! I like the idea of having 2 facilitators, I’m not sure if any other CUs have this structure but it definitely makes sense in this scenario.


Hi @ElProgreso ,

Thank you for your questions and input.

We have already started working with the Forta team and Nethermind, who wrote code for a couple of agents that we suggested and tested in the first phase. Forta agents are designed to watch on-chain events (that is, they run on each mined block and evaluate its transactions), so they’re not quite suitable for off-chain monitoring like API integrations - but we can check with the Forta team whether that’s something they have considered. As part of the Proof of Concept we suggested and tested agents for monitoring:

  • Oracles Security Module (checking that price update happens at the top of each hour, price deviation between current and queued price does not exceed a certain threshold, notify on rely / deny events)

  • Emergency Shutdown Module (events like MKR deposited in ESM contract or ESM triggered event)

  • Governance Module (monitor events like a new hat being lifted and make sure it’s a valid one, check that the MKR on the hat does not fall below a certain threshold)
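To make the Oracle Security Module checks above more concrete, here is a rough Python sketch of the two price-related rules. It is not the actual Forta agent code and does not use the Forta SDK: the OsmSnapshot type, the 6% deviation threshold and the 10-minute grace period are all assumptions made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class OsmSnapshot:
    """Hypothetical view of an Oracle Security Module at one block."""
    current_price: float   # price currently in effect
    queued_price: float    # price that takes effect at the next update
    last_update_ts: int    # Unix timestamp of the most recent price update


def price_deviation(snapshot: OsmSnapshot) -> float:
    """Relative difference between the queued and the current price."""
    return abs(snapshot.queued_price - snapshot.current_price) / snapshot.current_price


def check_osm(snapshot: OsmSnapshot, now_ts: int,
              max_deviation: float = 0.06, grace_seconds: int = 600) -> list[str]:
    """Return alert messages for one snapshot (thresholds are illustrative)."""
    alerts = []
    if price_deviation(snapshot) > max_deviation:
        alerts.append("price deviation above threshold")
    # An update is expected at the top of each hour. Alert if the latest top
    # of the hour passed more than grace_seconds ago with no update since.
    top_of_hour = now_ts - (now_ts % 3600)
    if snapshot.last_update_ts < top_of_hour and now_ts - top_of_hour > grace_seconds:
        alerts.append("hourly price update missed")
    return alerts
```

In a real agent the snapshot would be built from on-chain data for each block, and rely / deny events would be handled as a separate check.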

Looking further ahead, we’d like to add other agents, such as ones watching big movements of MKR tokens (which could signal potential governance attacks), MKR liquidity on DEXes, etc.

As part of the TOCU mission we’re going to draw up and propose to the community a list of agents that we can start with and improve. We’ll then supervise the entire cycle, from developing, testing and publishing agents to running them.

Note that most of the above is already covered by the existing monitoring setup (built over the years since the MCD launch) - by deploying Forta agents we’re looking to complement this setup and to improve the protocol’s resiliency in a decentralized way.

As stated in the proposal, TechOps uses an Infrastructure as Code approach, where most of the configuration and setup that we do is done through code updates. This code is currently in private repositories, but we plan to make it public and available to everybody. We haven’t taken any steps towards AI or ML yet, mostly because we are busy with multiple projects and tasks. We have been thinking about it, but at the moment I would not make any promises.

We do use an automation approach, where most tasks and jobs run without human interaction. Some tasks that were previously done manually are now done automatically through code. However, frankly speaking, many of the updates needed whenever there is a change in the system still have to be done manually. We recognise that this is an unavoidable part of the work and that it takes our time away from more important tasks. We have plans to improve some of these processes, e.g. take new contracts on-chain, use auto-updates in routine tasks, etc.

You are right that we proposed two facilitators - myself (Dumitru) and Simon. That is due to the “always on” nature of the team. Almost all our work is based on 24/7/365 monitoring, monitoring infrastructure, creating environments, deploying code, and improving CI/CD processes. Everything has to be done without interruptions. The rest of the team is also set up (and planned to expand) in a similar way - to cover all the time zones, day and night. We believe that having two facilitators (in the EU and APAC regions) right from the start of the Core Unit will consolidate the team’s global presence, solve the ‘team being disconnected’ problem, and keep the team constantly listening and reporting to the community.

Let me know if you have any questions about the above. I would be very happy to answer.


Thank you for the thorough response Dumitru. Very much appreciated.
BTW, really liking the “TOCU” acronym :)

Isn’t there a rule that when you make a core unit, you can’t assume other core units will want to work with you?

Have these other core units stated anywhere that they would like to partner with tech ops?


Hey @Zarevok, thanks for taking the time to go through TOCU MIPs. First, by definition a Core Unit is a single component of a greater system. So if we’re making up rules, I’d say we should definitely assume future collaboration between CUs. Second, the team is already collaborating with the CUs mentioned. Perhaps this can be stated more clearly.

To be fair I didn’t make the guideline up.
I’m not able to find the exact source but I do recall reading the argument in something published by @SES-Core-Unit. Essentially they argued that it would not be scalable to, say, have every core unit directly interface with Protocol Engineering. Obviously collaboration will occur if both core units are aligned; however, one core unit can’t assume another core unit is required to collaborate.

Can you better articulate the services you are already providing? Could some of the other core unit facilitators you are working with substantiate your claims?

While there are many statues of Monarchs, Politicians, Generals, Actors and Poets, there are few statues of Engineers, even though engineers have in many cases created long-lasting impacts that have saved lives and improved living conditions for millions of people.

TechOps is in the engineering category. There is in general not a lot of awareness of how important TechOps is for a robust and reliable digital infrastructure.

While TechOps hasn’t craved a lot of attention, it should be clear that the Maker Protocol is set for failure sooner or later without reliable and efficient TechOps support.

I have worked with most of the team behind this proposal at the Maker Foundation. They have consistently delivered above expectations without drawing a lot of attention. They have pulled through with a Can-Do spirit and have never scoffed at the work they were asked to perform. When the PagerDuty alarms came through during nights and weekends (of which we have had quite a few), the team were always the first responders and stayed on until the incident was resolved.
This team has my best recommendation.


Tldr another maker foundation legacy core unit that gets a big overpriced package to compensate them for bootstrapping the project.

Got it.

For the record, I don’t necessarily think that’s a bad thing under that context.

That being said, I’m curious… who exactly owns the TechOps CU? Is it owned by the CU facilitator + team or do they work for another entity?


The service descriptions you see next to each Core Unit in the above MIP are exactly what we’re providing at this time.

The CU doesn’t have an owner, as it’s a DAO abstraction. That being said, there is a commercial entity that employs the team in the more traditional sense. As stated in the MIP above, the team will provide services to other CUs and commercial entities (subject to separate arrangements), and settle the accounts based on FTE (Full Time Equivalent) allocation.

A couple of general updates for everyone:

First, George (@georgen) is planning to take a break starting February 1, 2022. Because of that, we are revising our hiring plan for the new year from the original 3 new team members to 4:

  • 3 DevOps Engineers
  • 1 Project Manager

The number of current FTEs and the MKR vesting table have been updated to reflect the above.

Second, during the TOCU Pod Session call we received some valuable feedback from the Community. One of the things mentioned was a public roadmap. Since we are not a product team and most of our work is related to supporting the protocol’s health, it is challenging to predict the services that we’ll be offering down the road. However, we decided to come up with a list of mainly POC- and R&D-focused items that we plan to achieve in the year after TOCU’s planned start of operations.

We should be finalising a version, presented together with the official MIP, sometime in January 2022. We will also have to choose a platform on which to publish the roadmap for best consumption. Right now we’re choosing between Notion, Canny and a couple of others.

As always, keep the questions coming and happy holidays!

Can you share the name of the commercial entity?
Which country is the entity based out of?
Is this commercial entity owned by members of the core unit or is it a subsidiary of another entity?

Hey @Zarevok, great questions, keep them coming.
There are still a few things we’re sorting out with the commercial entity, mainly because it is brand new, but I can tell you it is called Techops Services, it is registered in Estonia and it is owned by the members of the CU.
