In systems software, crash recovery refers to the techniques that allow a system to recover from a crash while preserving failure atomicity: the guarantee that each operation is either executed in full or not at all.

There are a few techniques available for ensuring failure atomicity at the software level:

  • Shadow copying operates on a working copy of the data to be modified, which is then atomically swapped with the original at commit time.

  • Write-ahead logging (WAL)

  • Undo logging

  • Redo logging
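
The shadow-copy technique from the list above can be sketched in a few lines. This is a minimal illustration rather than a definitive implementation: the helper name `atomic_write` and the `.shadow-` prefix are made up for this example, and it assumes the working copy lives on the same filesystem as the original so that `os.replace` (an atomic rename) can act as the commit point.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the contents of `path` with `data` via a shadow copy.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the working copy next to the original so that the final
    # rename stays within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".shadow-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the working copy to storage
        # Commit point: the rename atomically swaps in the new version.
        os.replace(tmp_path, path)
    except BaseException:
        # Failure before the commit point: drop the working copy.
        # The original file has not been touched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if os.name == "posix":
        # Assumption: on POSIX, also flush the directory so the renamed
        # entry itself is durable.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A reader of `path` sees either the old contents or the new contents, never a mixture: a crash before `os.replace` leaves the old file in place, and a crash after it leaves the new one.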

  • assume each put operation has a unique timestamp across all clients

  • stable storage — storage never fails — in practice done with schemes like RAID

  • flushed — data in DRAM copied to storage — now they’re synced

  • failure atomicity

    • if the operation doesn’t complete — we must revert
    • atomic commit on disk?
  • shadow copy — much like what vim does

    • commit point - done with a “swap” syscall (e.g. an atomic rename of the working copy over the original)
    • crash at pre-commit — save to disk
    • crash at commit — complete afterwards
  • undo logging

    • example: arrow indicates dependency: change must happen before install
  • redo logging

    • ok if a crash occurs during recovery
    • because replaying the log is idempotent?
  • logging costs

    • “log writes almost come for free because they’re sequential”
    • battery backed RAM — only on crash we write to disk
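
The redo-logging notes above, in particular the point that a crash during recovery is harmless, can be illustrated with a small sketch. Everything here is an assumption made for the example: the function names `log_put` and `replay`, the one-JSON-record-per-line log format, and the in-memory dictionary standing in for the data store.

```python
import json
import os


def log_put(log_path, key, value):
    """Append a redo record and flush it before the data itself is updated."""
    with open(log_path, "a", encoding="utf-8") as log:
        # Record the new value itself, not a delta such as "add 1".
        log.write(json.dumps({"key": key, "value": value}) + "\n")
        log.flush()
        os.fsync(log.fileno())  # the record must reach storage first


def replay(log_path, state):
    """Apply every complete log record to `state`, in log order."""
    if not os.path.exists(log_path):
        return state
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Torn final record from a crash mid-append: that
                # operation never committed, so stop here.
                break
            state[record["key"]] = record["value"]
    return state
```

Because each record stores the value to install, applying it once or many times gives the same result. If recovery crashes halfway through, it can simply be started again from the beginning of the log and will arrive at the same state.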