11/25/20 Oracle Outage Post-Mortem
On Wednesday, November 25th between 22:46 UTC and 02:23 UTC the following morning, the Maker Oracle Protocol experienced an outage. During this time, no Oracles were updated. The outage was due to a hardcoded “state size” limit in Scuttlebot, a p2p gossip network protocol utilized by the Maker Oracle Protocol. No liquidations were affected during this time.
Root Cause Analysis
Scuttlebot is similar to a blockchain in that messages are hash-linked and stored locally by every peer on the network. As more messages are gossiped on the Scuttlebot network, the state size grows larger, which is reflected in the size of a file called log.offset. Since Feeds and Relayers are connected to have complete visibility of each other in the network, state size will tend to be relatively consistent across different peers. Scuttlebot has a hardcoded limit for log.offset of 4294967295 bytes (4.29 GB) after which the Scuttlebot server will crash and refuse to restart.
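The quoted limit of 4294967295 is exactly 2^32 - 1, the largest value an unsigned 32-bit integer can hold. The sketch below illustrates that arithmetic with a generic 32-bit offset encoding. The helper names are hypothetical and this is not Scuttlebot's actual codec; it only shows why a byte offset stored in 32 bits cannot address anything past that point.

```python
import struct

# Limit quoted in this post-mortem for the size of log.offset, in bytes.
LOG_OFFSET_LIMIT = 4294967295

# The limit coincides with the largest unsigned 32-bit integer.
assert LOG_OFFSET_LIMIT == 2**32 - 1


def encode_offset_u32(offset: int) -> bytes:
    """Pack a byte offset into 4 bytes (big-endian, unsigned 32-bit).

    Hypothetical helper for illustration only; it is an assumption that
    a 32-bit encoding of this kind is what produces the limit.
    """
    return struct.pack(">I", offset)


def fits_in_u32(offset: int) -> bool:
    """Return True if the offset can be represented in 32 unsigned bits."""
    try:
        encode_offset_u32(offset)
        return True
    except struct.error:
        # struct refuses values outside 0 .. 2**32 - 1 for the "I" format.
        return False
```

Under this assumption, `fits_in_u32(LOG_OFFSET_LIMIT)` holds while `fits_in_u32(LOG_OFFSET_LIMIT + 1)` does not, which matches the behavior of a log that cannot grow by even one more byte.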
The Oracle Domain Team has been aware of this issue since September 6, 2019, when it encountered the bug in testing prior to the release of Multi-Collateral Dai. At that time, the team opened a GitHub issue in the Scuttlebot repository documenting the bug. The issue was traced to a limitation of an OffsetCodec used by Scuttlebot.
Calculations were made estimating how long it would take the log.offset to reach 80% of its maximum threshold given the number of Feeds, the number of collateral types, and assuming the frequency of price updates was equivalent to that seen during the previous year of testing. These calculations led the Oracle Domain Team to estimate a figure of approximately 18 months before a migration to a new Scuttlebot network would be necessary. Based on this figure, the team correctly prioritized Transport Layer redundancy with a timeline that would precede the estimated recurrence of the Scuttlebot bug.
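A minimal sketch of this kind of estimate follows. The function names, parameters, and the simple linear growth model are assumptions made for illustration; the team's actual calculation and inputs are not given in this post-mortem.

```python
# Limit quoted in this post-mortem for the size of log.offset, in bytes.
LOG_OFFSET_LIMIT = 4294967295


def growth_bytes_per_day(num_feeds: int,
                         num_collateral_types: int,
                         updates_per_type_per_day: float,
                         bytes_per_message: float) -> float:
    """Estimate daily log growth, assuming every Feed gossips one message
    per price update per collateral type (a simplifying assumption)."""
    return (num_feeds * num_collateral_types
            * updates_per_type_per_day * bytes_per_message)


def days_until_threshold(current_bytes: float,
                         bytes_per_day: float,
                         threshold_fraction: float = 0.8,
                         limit: int = LOG_OFFSET_LIMIT) -> float:
    """Days until the log reaches threshold_fraction of the limit,
    assuming growth stays constant at bytes_per_day."""
    if bytes_per_day <= 0:
        raise ValueError("bytes_per_day must be positive")
    remaining = threshold_fraction * limit - current_bytes
    return max(remaining, 0.0) / bytes_per_day
```

In a linear model like this, the result is inversely proportional to the update frequency: if prices update twice as often as assumed, the threshold arrives in half the time.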
Further analysis of what caused the state size to grow much more quickly than initially estimated is still ongoing. One factor is that price updates have been significantly more frequent since Multi-Collateral Dai launched than during the preceding year of testing. This could be caused by cryptocurrencies being much more volatile in 2020 than in 2019. Another factor may be that performance optimizations in the Oracle implementation, made to reduce the latency of fetching prices, led to more frequent price updates than the estimate accounted for.
Mitigations and improvements
Since the initial launch of the Maker Oracle Protocol, the Oracle Domain Team roadmap has been focused on an initiative called “Oracle De-Risk” which seeks to mitigate Oracle outages through improving the resiliency of the Oracle. The Maker Oracle Protocol today has full infrastructure redundancy. Any Feed, Relayer, or data source can go offline without affecting the availability of the Oracle. This is a critical feature, showcasing the decentralization of the Maker Oracle Protocol. The key next step is to achieve implementation redundancy, such that no dependency exists on a single implementation for any component of the Oracle stack (price sourcing, transport layer, feed client, relayer client, etc.).
Over the past year, the Oracle Domain Team has made considerable progress on this front. This includes delivering a new price sourcing implementation such that no bug in a single price sourcing implementation affects Oracle availability.
The team also spent a considerable amount of time researching alternative Transport Layers to Scuttlebot. Transport Layers are the medium through which Oracle Feeds distribute signed price messages to Relayers. Running multiple Transport Layers in parallel keeps the Oracle available should any single Transport Layer fail or be attacked.
After extensive research, the team selected libp2p as the prime candidate. The Oracle Domain Team has been working on implementing libp2p as a Transport Layer for the Maker Oracle Protocol for all of Q4. A functional prototype has been completed, and a production-grade implementation is in the later stages of development.
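The idea of parallel Transport Layers can be sketched as follows. This is an illustrative outline only: the `broadcast` function and the callable-based transport interface are hypothetical and do not reflect the actual Feed client or any real Scuttlebot or libp2p API.

```python
from typing import Callable, Dict, List

# A transport is modeled as any callable that sends raw bytes and raises
# on failure. This interface is an assumption made for the sketch.
Transport = Callable[[bytes], None]


def broadcast(message: bytes, transports: Dict[str, Transport]) -> List[str]:
    """Send a signed price message over every configured transport.

    Returns the names of the transports that accepted the message.
    A failure of one transport does not stop delivery over the others;
    an error is raised only if every transport fails.
    """
    delivered: List[str] = []
    for name, send in transports.items():
        try:
            send(message)
            delivered.append(name)
        except Exception:
            # One transport being down must not block the others.
            continue
    if not delivered:
        raise RuntimeError("all transports failed")
    return delivered
```

With two transports configured, the loss of either one leaves `broadcast` succeeding through the other, which is the property that would have kept price messages flowing during this incident.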
However, the Scuttlebot bug recurred earlier than our estimates predicted, and we were unable to complete the libp2p integration and push it into production in time to mitigate the issue. This is quite frustrating: had the Scuttlebot bug struck just a month later, libp2p would have been running in production and the Scuttlebot failure would not have resulted in an Oracle outage. The Oracle Domain Team was well aware of the problem in the Maker Oracle Protocol and of the solution required to solve it, and was in the midst of executing on that solution.
To prevent a recurrence of this issue, the Oracle Domain Team has put in place real-time monitoring of the Scuttlebot state size. This should give advance warning of any unexpected state size increase that could trigger the Scuttlebot bug. The introduction of libp2p into the Maker Oracle Protocol is the more permanent solution and is expected to be released into production early next year.
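A state size check of this kind could look roughly like the sketch below. The function name, the 80% warning threshold, and the returned fields are assumptions for illustration; the team's actual monitoring setup is not described here.

```python
import os

# Limit quoted in this post-mortem for the size of log.offset, in bytes.
LOG_OFFSET_LIMIT = 4294967295


def state_size_status(path: str,
                      warn_fraction: float = 0.8,
                      limit: int = LOG_OFFSET_LIMIT) -> dict:
    """Report how close the file at `path` is to the size limit.

    `path` would point at a peer's log.offset file; its location on disk
    depends on the deployment and is not assumed here.
    """
    size = os.path.getsize(path)
    used = size / limit
    return {
        "size_bytes": size,
        "used_fraction": used,
        # True once the file has reached the warning threshold.
        "warn": used >= warn_fraction,
    }
```

Run on a schedule on each Feed and Relayer, a check like this would raise the `warn` flag well before the limit is reached, leaving time to plan a migration.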
Timeline (all times UTC)
Wednesday 20:58 A community member notifies the Oracle Domain Team of an Oracle Relayer failure.
Wednesday 21:09 The Oracle Domain Team confirms the cause is related to a known issue associated with Scuttlebot. The team begins investigating whether the same issue could potentially affect other nodes in the Scuttlebot network.
Wednesday 21:15 The Oracle Domain Team begins reaching out to Feeds and Relayers to request information to assess the situation.
Wednesday 21:31 After confirming the size of the log.offset file with several Feeds and Relayers, the team determines that the issue is critical, time-sensitive, and may lead to a loss of quorum (the number of Feeds needed to update the price of an Oracle). Calculations show some Feeds will begin to go offline due to the issue within an hour.
Wednesday 21:55 After comparing potential solutions, the Oracle Domain Team decides to execute an emergency migration. The Oracle Domain Team alerts all Feeds to the critical issue and asks them to be on standby for an emergency migration.
Wednesday 22:21 The first Oracle Feeds begin to drop offline.
Wednesday 22:25 The Oracle Domain Team completes bootstrapping a new Scuttlebot network.
Wednesday 22:46 Enough Feeds have dropped offline from the critical Scuttlebot issue that quorum is lost. At this point all Maker Oracles stop updating.
Wednesday 22:55 The Oracle Domain Team begins reaching out to all customers of the Maker Oracles informing them of the outage.
Wednesday 23:12 The Oracle Domain Team has completed the emergency migration documentation and tested it to verify completeness and correct behavior. The documentation is shared with all Oracle Feeds.
Wednesday 23:34 The first 3 Feeds have completed the emergency migration and are fully functional on the new Scuttlebot network.
Thursday 02:17 13 Feeds have completed the emergency migration and quorum is restored (the Maker Oracle Protocol is online again).
Thursday 02:23 A relayer submits the first price update to the ETH/USD Oracle from the new Scuttlebot network.
Thursday 02:30 Relayers begin updating all other Oracles from the new Scuttlebot network.
Thursday 02:36 The Oracle Domain Team begins informing customers that the Maker Oracles are fully operational.
Thursday 11:31 12 hours after the migration began, 21 out of 26 Feeds have completed the emergency migration.
Tuesday 10:35 All 26 Feeds have completed the emergency migration.