Distributed Backup and Disaster Recovery for AFS
Dan Hyde & Steve Simmons, University of Michigan
The umich.edu AFS cell is currently 9.4TB and growing at a rate that doubles
its size every 18 months. Backup and disaster recovery have become major
issues. We are currently implementing a disk-based backup system that should
allow the nightly fulls and incrementals to complete in a small number of
hours without ever impinging on production time. This is done by distributing
backups across a number of systems so that every server can (if needed) have
a dedicated backup host, with all servers backing up in parallel. We expect
this implementation to be in production by June 1, 2006.
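The parallel layout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The round-robin policy and the names `assign_backup_hosts` and `nightly_window_hours` are hypothetical. The sketch also assumes that per-server dump times are known, and that servers sharing a backup host are dumped one after another. It shows why adding backup hosts shrinks the nightly window: hosts run concurrently, so the busiest host sets the window, not the total amount of data.

```python
from collections import defaultdict


def assign_backup_hosts(servers, backup_hosts):
    """Map each AFS file server to a backup host (hypothetical round-robin policy).

    With at least as many backup hosts as servers, every server gets a
    dedicated host; otherwise servers share hosts in round-robin order.
    """
    if not backup_hosts:
        raise ValueError("need at least one backup host")
    plan = defaultdict(list)
    for i, server in enumerate(servers):
        plan[backup_hosts[i % len(backup_hosts)]].append(server)
    return dict(plan)


def nightly_window_hours(plan, hours_per_server):
    """Wall-clock backup time if all backup hosts run in parallel.

    Assumption: servers sharing one host are dumped sequentially, so each
    host's time is the sum of its servers' times; the window is the slowest host.
    """
    return max(
        (sum(hours_per_server[s] for s in assigned) for assigned in plan.values()),
        default=0.0,
    )


if __name__ == "__main__":
    # Made-up per-server dump times in hours, for illustration only.
    hours = {"fs1": 2.0, "fs2": 1.5, "fs3": 1.0, "fs4": 3.0}
    servers = list(hours)
    shared = assign_backup_hosts(servers, ["bk1", "bk2"])
    dedicated = assign_backup_hosts(servers, ["bk1", "bk2", "bk3", "bk4"])
    print(nightly_window_hours(shared, hours))     # 4.5
    print(nightly_window_hours(dedicated, hours))  # 3.0
```

With two hosts for four servers, the busiest host needs 4.5 hours. Giving each server its own host (the "if needed" case above) reduces the window to the slowest single server. Round-robin ignores server size, so a real deployment would likely balance hosts by data volume instead.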
Dan and Steve have been designing a disaster recovery system for AFS servers based on shadow volumes. The work has two core parts: tightening up the definition of shadow volumes and their interaction with the rest of AFS (and writing the code to support that definition), and implementing the hardware and processes needed to build the disaster recovery system itself.