Two weeks ago, due to a power failure, a hard drive in a (UPS-protected) server at VTK crashed. The machine was unable to boot: unrecoverable IDE errors,...
The machine was only used as a database server for now (both MySQL and PostgreSQL), and did not make external backups (yet). Yeah, I know, stupid, but managing our network isn't that easy.
Anyway, the hard drive consisted of 3 partitions:
- /boot, ext3
- /, ext3
- an LVM2 PV
Neither /boot nor / was of any use (except maybe some files in /etc), but the database files inside the LVM2 VG were pretty important, so they had to be recovered.
Do note all 3 partitions were the first half of a RAID1 volume; a second disk still had to be added.
Here's the recovery procedure, could be useful for some people out there:
- Get the hard drive out of the machine (other parts of it might be broken too; we want to use other hardware)
- Put the disk in a second machine, together with another hard drive big enough to contain images of all recovered partitions. This means that hard drive might need to be bigger than the original one!
- Run Linux on the machine. I had no Linux box around, so I used Knoppix (as I knew it had all the tools I needed)
- Make sure you have dd_rescue on the box (Knoppix has it), as well as dd_rhelp. In Knoppix, just download the dd_rhelp tarball, extract it, and run ./configure && make && sudo make install
- Become root: sudo su -
- Mount the new hard drive, first creating partitions/filesystems as necessary. I mounted it on /mnt/sda1
- cd /mnt/sda1
- dd_rhelp /dev/hda4 hda4-lvm-pv.img
(where hda4 is the PV's partition)
- This can take an enormous amount of time, get some coffee, sleep, a life, whatever
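If you don't have dd_rhelp or dd_rescue at hand, plain dd with conv=noerror,sync gives a cruder but similar result: it skips unreadable blocks and pads them with zeros, so the data after a bad spot stays at the correct offset (important for LVM/filesystem structures). A minimal sketch on a scratch file; src.img stands in for the real device (here that would be /dev/hda4):

```shell
# Create a scratch "partition" so the sketch is self-contained;
# in a real recovery, if= would point at the broken device.
dd if=/dev/urandom of=src.img bs=1024 count=16 2>/dev/null

# noerror: don't abort on read errors; sync: pad short/failed
# reads with zeros so offsets in the image stay correct.
dd if=src.img of=hda4-sketch.img bs=1024 conv=noerror,sync 2>/dev/null

cmp src.img hda4-sketch.img && echo "image matches source"
```

Note that plain dd retries nothing and reads bad areas slowly; dd_rescue/dd_rhelp are much smarter about skipping around damaged regions, which is why they're preferred here.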
- When the process is done, power off the machine (in a clean way). At this point, the broken hard drive was even more broken: fdisk -l /dev/hda didn't even show hda4 any longer.
- Take the broken hard drive out of the machine, and make a backup of the image you created on another hard drive, a network share,... I used another hard drive.
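Before relying on that backup copy, it's worth verifying it is byte-identical to the original image, since all further recovery depends on it. A quick sketch using checksums; the file names are just examples:

```shell
# Stand-in for the recovered image (in reality this is the
# multi-GB file produced by dd_rhelp).
echo "recovered partition data" > hda4-lvm-pv.img

# The backup copy on the other disk/share.
cp hda4-lvm-pv.img /tmp/hda4-lvm-pv.img.bak

# Both checksums should be identical before you trust the copy.
sha1sum hda4-lvm-pv.img /tmp/hda4-lvm-pv.img.bak
```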
- Remove the hard drive on which you made the backup copy from the system (just in case).
- Now real recovery can start. Boot the machine, mount the partition on which you made the image.
- From this point, all depends on how broken the disk was. As only some minor part of the 120GB PV contained important data, I was able to recover everything necessary.
- Export the image as if it were a block device:
losetup /dev/loop0 /mnt/sda1/hda4-lvm-pv.img
Now /dev/loop0 is just like the old /dev/hda4
- As the image was taken from one half of a RAID volume in this case, I had to remove the RAID signature so the LVM2 tools would recognize it as an LVM2 PV:
mdadm --zero-superblock /dev/loop0
- Let the system search for LVM2 PVs:
pvscan
This should show you it found a PV
- Activate the volume groups:
vgchange -a y
- Now the old LVs should be back. If your VG used to be called "vg", they'll appear under /dev/vg/ (lvscan lists them)
- fsck the necessary filesystems:
fsck /dev/vg/var-lib-mysql
If this passes nicely, you're almost saved
- mkdir /mnt/mysql
mount /dev/vg/var-lib-mysql /mnt/mysql
- Now you can use the files in /mnt/mysql to regenerate all necessary services. As I needed the databases up and running ASAP, I did a quick Debian install with MySQL 4.1 (which was also the version running on the crashed server), and moved the recovered MySQL database files to the correct locations. Some "CHECK TABLE", "myisamchk", "myisamchk -r" and "REPAIR TABLE" runs later, the databases were up and running nicely
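The check/repair pass on the recovered MyISAM files can be sketched like this. Treat it as an outline, not a recipe: the data directory, database name, and init script path are examples from a Debian-era setup, and the right myisamchk flags depend on the damage:

```shell
# Stop mysqld first so the table files aren't in use.
/etc/init.d/mysql stop

# Check every MyISAM index file in the recovered database
# directory; repair the ones that fail the check.
for t in /var/lib/mysql/mydb/*.MYI; do
    myisamchk "$t" || myisamchk -r "$t"
done

/etc/init.d/mysql start
# Anything still broken can then be handled from the client
# with CHECK TABLE / REPAIR TABLE.
```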
That's about it... Obviously, everything should be adapted to your specific setup/situation. The MySQL recovery might not work at all when the MySQL versions and binary formats differ, or when the files are damaged too much. See the application-specific documentation for recovery techniques.
The new hard drive we bought to put the images on crashed too, after 5 hours of operation. Luckily it could be replaced ;-)
For the record: the first broken hard drive was a Maxtor 120GB PATA (not bought by me); the second one a Seagate 320GB PATA (bought by me, although I wanted a Western Digital, but the shop had no more and I needed a drive quickly). The Seagate was then replaced by a WD 320GB SATA2 (which forced me to buy another PCI SATA controller), which works fine. Now don't start a brand war.
/me is pretty happy all data is back, after all... Now on to reinstalling some webservices that were running on the crashed server.