Start: 2019-11-18 16:00:04 UTC (MCD launch)
End: 2021-04-26 14:02:08 UTC
Authors: Kurt Barry (kmbarry1)
Summary: The End module orchestrates the global settlement of the Maker Protocol after an Emergency Shutdown. Its primary responsibility is allowing the redemption of DAI for approximately 1 USD of backing collateral. As part of running simulation tests against
mainnet state in preparation for the upgrade to LIQ-2.0 on Friday, April 16th, a bug was discovered that would have prevented the autonomous completion of the End process. Specifically, a numeric overflow would cause a revert when calculating the amount of a given collateral to disburse per DAI in
End.flow(). The exact condition to trigger this is that the amount of a collateral needed to cover the DAI it backs exceeds approximately 115,792. In particular, this was the case for the
ETH-A collateral type, which backed the largest amount of DAI of any collateral at the time. The End process would be unable to proceed for any overflowing collateral type and thus that particular collateral could not be permissionlessly distributed.
Impact: Liquidations 2.0 launch was delayed for a few days while a fix was developed.
Root Causes: When MCD was designed and implemented, the overflow potential of a multiplication of a
ray and a
rad was overlooked. The pre-launch audits found several such issues, and a few months post-launch, a
ray * rad overflow forced an upgrade of the Cat contract. Motivated by these findings, a task was created on 2020-09-09 to audit all multiplications in MCD. This task was marked complete on 2021-01-05. However, that internal code audit failed to catch the specific multiplication that led to this postmortem. Beyond “fallability of human auditing”, the cause of this mistake is conjectured to be because this overflow occurred as a result of a precision boost within a safe math function, introducing a non-locality in deducing correct behavior (i.e. a need to consider multiple code sites at once when deciding on safety).
Extensive testing was performed on the Emergency Shutdown process–not only using unit tests, but also using the cage keeper on a private
testnet and the public
kovan testchain, and manually on
mainnet on a secret pre-launch deployment of MCD. However, none of the tests used amounts of DAI and collateral large enough to trigger this bug.
Trigger: Emergency Shutdown when more than a certain amount of a collateral is reserved for disbursment based on its price, the amount of DAI it backs, and the collateralization levels of associated Vaults.
Resolution: The overflowing calculation was restructured such that much larger collateral amounts could be gracefully handled with minimal accuracy degradation. A new End contract with this change was deployed and included in the executive spell that added Liquidations 2.0. This spell executed on Apr 26 2021 14:02 UTC, fully resolving the issue.
Detection: Discovered when RPC tests written for the MIP45
mainnet spell failed unexpectedly.
|Action Item||Type||Owner (GH handle)||Ticket|
||detect||godsflaw and brianmcmichael||524|
||detect||godsflaw and brianmcmichael||525|
|Define expected magnitude ranges for all
|Design and implement over/underflow verification for
|Add developer documentation about the risk of
What Went Well
- The bug was caught by our own testing procedures.
- We were able to reach high confidence in a fix quickly.
- Everyone worked hard and collaborated well in a stressful situation–we were able to coordinate implementation, review, and testing over a single weekend.
- We were able to minimize the delay to the Liq-2.0 initial rollout, pushing back only the initial spell a few days and sticking to the rest of the schedule (combination of quick execution, a simple fix, and needing to replace the End for Liq 2.0 anyway).
- No emergency shutdown was necessary while the bug was triggerable.
What Went Wrong
- The bug made it into the live system, and was only one precondition (initiation of ES) away from triggering.
- We were forced to further delay Liquidations 2.0 launch (but only by 3 days).
- An internal audit intended to catch and eliminate all such bugs failed to catch this instance.
- All previous tests of Emergency Shutdown failed to explore this case, despite it occurring at practical magnitudes.
- The original MCD audits missed this issue (despite finding other potential
Even though several examples of this type of bug had already been discovered, a code audit performed by a skilled engineer failed to catch this issue. The original MCD audits also missed the issue. Ideally, programmatic methods should be applied in addition to human auditing.
Define Expected Magnitudes, and Incorporate Such Assumptions Into Development, Verification, and Testing
Given the effort that went into formally verifying MCD, it would have been relatively simple to add logic for checking that no overflows occurred so long as all values fell within realistic numerical ranges. Less comprehensive but still likely-effective techniques such as unit tests with very large (and/or very small) values or randomized testing (“fuzzing”) could also have been used to probe magnitude-safety.
2019-11-18: MCD system launched with the overflow-prone End implementation.
2019-11-24: Earliest date the bug was likely to trigger during an Emergency Shutdown (script used to determine this).
2021-04-16 (all times UTC):
- around 15:30: gbalabasquer tags kmbarry1 in a comment on the executive spell PR noting a failure in a test for the replaced End module and conjecturing what the underlying issue might be.
- around 15:45: kmbarry1, suspecting a
ray * radoverflow issue, joins the ongoing spell review call to discuss it, initially with gbalabasquer and brianmcmichael. The bug is quickly confirmed.
- 16:00 onwards: the rest of the SC domain team, as well as a few Foundation and community stakeholders, are brought onto the call to discuss the mitigation plan.
- around 17:00: consensus is reached to delay Liq-2.0 launch as little as possible, and use a limited initial disclosure as the reason.
- around 22:00: after further async discussion, the SC domain team decides describe the delay as due to a “technical issue”, avoiding the word “bug” or any indication of the nature of the issue, to ensure a fix can be developed without undue pressure.
- 22:18: a forum post is made announcing the delay.
- later in the evening: kmbarry1 writes a fix and tests, sending a PR to a private repository for others to review.
2021-04-17: SC domain team members begin reviewing and testing the fix; a first draft of the disclosure forum post is made.
2021-04-18: reviewing and testing of the fix continues.
2021-04-19 (all times UTC):
- 13:00 - 19:00: the finalized fix is merged, new contracts are deployed, the executive spell is modified/reviewed/tested, and the disclosure forum post is reviewed and revised.
- 19:14: the bug disclosure forum post is published.
- 19:50: the executive spell is deployed. After a brief testing period, it goes live in the governance portal.
2021-04-22 1:32: the spell overtakes the
hat and is scheduled.
2021-04-26 14:02: the spell is cast, fully mitigating the issue.
Script used to estimate earliest date at which this bug would have affected an Emergency Shutdown:
Forum post announcing the last-minute delay due to a “technical issue”:
Forum post disclosing the issue and fix: