End.flow() Overflow Bug Postmortem

Start: 2019-11-18 16:00:04 UTC (MCD launch)

End: 2021-04-26 14:02:08 UTC

Authors: Kurt Barry (kmbarry1)

Status: Resolved

Summary: The End module orchestrates the global settlement of the Maker Protocol after an Emergency Shutdown. Its primary responsibility is allowing the redemption of DAI for approximately 1 USD of backing collateral. As part of running simulation tests against mainnet state in preparation for the upgrade to LIQ-2.0 on Friday, April 16th, a bug was discovered that would have prevented the autonomous completion of the End process. Specifically, a numeric overflow would cause a revert when calculating the amount of a given collateral to disburse per DAI in End.flow(). The exact condition to trigger this is that the amount of a collateral needed to cover the DAI it backs exceeds approximately 115,792. In particular, this was the case for the ETH-A collateral type, which backed the largest amount of DAI of any collateral at the time. The End process would be unable to proceed for any overflowing collateral type and thus that particular collateral could not be permissionlessly distributed.

Impact: Liquidations 2.0 launch was delayed for a few days while a fix was developed.

Root Causes: When MCD was designed and implemented, the overflow potential of a multiplication of a ray and a rad was overlooked. The pre-launch audits found several such issues, and a few months post-launch, a ray * rad overflow forced an upgrade of the Cat contract. Motivated by these findings, a task was created on 2020-09-09 to audit all multiplications in MCD. This task was marked complete on 2021-01-05. However, that internal code audit failed to catch the specific multiplication that led to this postmortem. Beyond “fallability of human auditing”, the cause of this mistake is conjectured to be because this overflow occurred as a result of a precision boost within a safe math function, introducing a non-locality in deducing correct behavior (i.e. a need to consider multiple code sites at once when deciding on safety).

Extensive testing was performed on the Emergency Shutdown process–not only using unit tests, but also using the cage keeper on a private testnet and the public kovan testchain, and manually on mainnet on a secret pre-launch deployment of MCD. However, none of the tests used amounts of DAI and collateral large enough to trigger this bug.

Trigger: Emergency Shutdown when more than a certain amount of a collateral is reserved for disbursment based on its price, the amount of DAI it backs, and the collateralization levels of associated Vaults.

Resolution: The overflowing calculation was restructured such that much larger collateral amounts could be gracefully handled with minimal accuracy degradation. A new End contract with this change was deployed and included in the executive spell that added Liquidations 2.0. This spell executed on Apr 26 2021 14:02 UTC, fully resolving the issue.

Detection: Discovered when RPC tests written for the MIP45 mainnet spell failed unexpectedly.

Action Items:

Action Item Type Owner (GH handle) Ticket
Create mainnet integration test suite including ES coverage detect godsflaw and brianmcmichael 524
Automate periodic mainnet integration testing detect godsflaw and brianmcmichael 525
Add the ray * rad pattern to the Maker code review checklist prevent kmbarry1 526
Define expected magnitude ranges for all dss parameters and quantities process kmbarry1 527
Design and implement over/underflow verification for dss based on the expected magnitude ranges defined in the previous AI prevent WilfredTA 528
Add developer documentation about the risk of ray * rad overflows to help integrators avoid this type of mistake prevent hexonaut 529

Lessons Learned

What Went Well

  • The bug was caught by our own testing procedures.
  • We were able to reach high confidence in a fix quickly.
  • Everyone worked hard and collaborated well in a stressful situation–we were able to coordinate implementation, review, and testing over a single weekend.
  • We were able to minimize the delay to the Liq-2.0 initial rollout, pushing back only the initial spell a few days and sticking to the rest of the schedule (combination of quick execution, a simple fix, and needing to replace the End for Liq 2.0 anyway).
  • No emergency shutdown was necessary while the bug was triggerable.

What Went Wrong

  • The bug made it into the live system, and was only one precondition (initiation of ES) away from triggering.
  • We were forced to further delay Liquidations 2.0 launch (but only by 3 days).
  • An internal audit intended to catch and eliminate all such bugs failed to catch this instance.
  • All previous tests of Emergency Shutdown failed to explore this case, despite it occurring at practical magnitudes.
  • The original MCD audits missed this issue (despite finding other potential ray * rad overflows).

Human Auditing Is Unreliable

Even though several examples of this type of bug had already been discovered, a code audit performed by a skilled engineer failed to catch this issue. The original MCD audits also missed the issue. Ideally, programmatic methods should be applied in addition to human auditing.

Define Expected Magnitudes, and Incorporate Such Assumptions Into Development, Verification, and Testing

Given the effort that went into formally verifying MCD, it would have been relatively simple to add logic for checking that no overflows occurred so long as all values fell within realistic numerical ranges. Less comprehensive but still likely-effective techniques such as unit tests with very large (and/or very small) values or randomized testing (“fuzzing”) could also have been used to probe magnitude-safety.

Timeline

2019-11-18: MCD system launched with the overflow-prone End implementation.

2019-11-24: Earliest date the bug was likely to trigger during an Emergency Shutdown (script used to determine this).

2021-04-16 (all times UTC):

  • around 15:30: gbalabasquer tags kmbarry1 in a comment on the executive spell PR noting a failure in a test for the replaced End module and conjecturing what the underlying issue might be.
  • around 15:45: kmbarry1, suspecting a ray * rad overflow issue, joins the ongoing spell review call to discuss it, initially with gbalabasquer and brianmcmichael. The bug is quickly confirmed.
  • 16:00 onwards: the rest of the SC domain team, as well as a few Foundation and community stakeholders, are brought onto the call to discuss the mitigation plan.
  • around 17:00: consensus is reached to delay Liq-2.0 launch as little as possible, and use a limited initial disclosure as the reason.
  • around 22:00: after further async discussion, the SC domain team decides describe the delay as due to a “technical issue”, avoiding the word “bug” or any indication of the nature of the issue, to ensure a fix can be developed without undue pressure.
  • 22:18: a forum post is made announcing the delay.
  • later in the evening: kmbarry1 writes a fix and tests, sending a PR to a private repository for others to review.

2021-04-17: SC domain team members begin reviewing and testing the fix; a first draft of the disclosure forum post is made.

2021-04-18: reviewing and testing of the fix continues.

2021-04-19 (all times UTC):

  • 13:00 - 19:00: the finalized fix is merged, new contracts are deployed, the executive spell is modified/reviewed/tested, and the disclosure forum post is reviewed and revised.
  • 19:14: the bug disclosure forum post is published.
  • 19:50: the executive spell is deployed. After a brief testing period, it goes live in the governance portal.

2021-04-22 1:32: the spell overtakes the hat and is scheduled.

2021-04-26 14:02: the spell is cast, fully mitigating the issue.

Supporting Information

Script used to estimate earliest date at which this bug would have affected an Emergency Shutdown:

Forum post announcing the last-minute delay due to a “technical issue”:

Forum post disclosing the issue and fix:

23 Likes

Great work!

2 Likes

Who was/were the external auditor?
Do you have the original audit?

The original MCD audits can be found here: Audit Reports - Security

This code was out-of-scope for the Liq 2.0 audits.

1 Like

Someone of you should twit this out, this industry is mainly based on reputation. Finding overflow seems exactly what they should be expert at.

When you pay for an audit you pay for the reputation and the name they put on.

I think it would be very bad form to try to shame or embarrass the original auditors for missing this, for a number of reasons:

  1. We live in a glass house. The primary responsibility for code correctness lies with those who create it, not the auditors they hire. We missed this before launch in testing and formal verification, and we missed it even after going back over the code specifically looking for this type of issue. It would show a lack of responsibility, integrity, and maturity to point our fingers at our auditors.

  2. It detracts from the real lesson, which is that “auditing is not sufficient to guarantee safety or security”. An audit, even by a top firm, is never an absolute assurance, and the abundant examples of bugs found in audited code confirm this. Rather, we should encourage the mindset that auditing is a defense-in-depth measure, and not something that replaces good processes, good tooling, and intelligent risk-management.

  3. It would be unnecessarily antagonistic. People are quite capable of tracking who audited what and keeping score if they so desire, without us creating drama. Bash your auditors in public (especially without good cause), and you may find it very difficult to retain top firms in the future. If we have an issue with an auditor’s work, we prefer to address that with them privately or simply not engage with them in the future. And to be clear, I do not believe this is such a case. The original MCD auditors did their best to evaluate a large and complex codebase with the time and resources they had, and made important contributions to the safety and security of the system. It’s about the value added by the auditor, not just “Did they miss something?”.

16 Likes

Fair enough,
But I would be very interesting to hear what they have to say about it. It seems to me this type of overflow bug is what an automatic analysis should deter. May be I am wrong, but this seems to me in the top 10 of they check.

edit : ok fair enough, I checked out the code, not that easy. Good job! I am still confused how that can overflowed.

for those interested in it, changed line :
from : fix[ilk] = rdiv(mul(sub(wad, gap[ilk]), RAY), debt);
to : fix[ilk] = mul(sub(wad, gap[ilk]), RAY) / (debt / RAY);

If I understand correctly, it would not overflow but failed cause of the overflowing, therefore the price won’t be settled. right?

Well find, good job! Great quality team.

3 Likes

Yes, that’s right. It was not an uncaught overflow, but rather a calculation done in such a way that an intermediate result would overflow and cause the call to fail, preventing execution of the function (this sort of thing is sometimes called “phantom overflow”). Specifically it happened inside of the rdiv call, when multiplying the first argument by RAY to boost precision.

3 Likes

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.