The importance of testing recovery

by Jay Stanley, Database Specialists

In the day-to-day work of a database administrator, it is easy to forget one of our primary responsibilities: ensuring that, after some event or disaster, our precious databases can be recovered reliably, quickly, and easily, without any panic. Databases play a key role in nearly every business today, and in many cases can be considered the 'crown jewels' of the company itself. A pharmaceutical company without its database of drug tests, a marketing company without its databases of leads and ongoing business, an electronic component manufacturer without its databases of designs and yield histories, or an online advertising firm without its information on advertisements and advertising targets is not a viable business.

As a DBA since before relational databases were used, I've noticed that the pace of change in database environments has sped up, in the past 10 years in particular. Oracle is releasing new versions, with lots of new features, more frequently. Patches (now both CPUs and PSUs) arrive more often than before, and each is bigger than the one before. Databases are growing faster than ever, which makes the job of the database administrator busier and more stressful than it has ever been. In addition, newer software tools and the sheer amount of development being done today make the rate of change of the database design itself, measured as the number of production update changes, higher than it has ever been.

It has always been true in working with computers that the higher the rate of change in a system, the more likely it is that something will go wrong, sometimes horrifyingly wrong. The factors that increase the rate of change today contribute to the likelihood of failure. In my experience, Murphy's Law rules, and disasters arrive at the worst possible time. They happen when everyone is asleep or on vacation and not ready to respond, which makes it even more difficult to respond to them successfully.

The volume of data in databases has exploded as well over the past 10 years; what was considered a 'large' database even 5 years ago (say 500GB) is no longer considered large at all. Disk speed has not kept up; capacity has increased far, far faster. This means that the mean recovery time of most business databases has increased a lot: on the same hardware, restoring a 1TB database, common today, takes about 1,000 times longer than restoring a 1GB database, common 10 years ago. If it took 1 hour to recover a 1GB database 10 years ago, it takes 1,000 hours, or nearly 42 days, to recover a 1TB database today, given the same hardware. And since database capacity increases happen constantly, it is very easy to lose track of how long a recovery will really take.
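The arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming that restore time scales linearly with database size and that 1TB is taken as 1,000GB; the function name `estimated_restore_hours` and the throughput figure are hypothetical and would be replaced by numbers measured in an actual test recovery.

```python
def estimated_restore_hours(size_gb: float, throughput_gb_per_hour: float) -> float:
    """Estimate restore time, assuming it scales linearly with database size."""
    if throughput_gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_hour


# Figure from the text: a 1GB database restored in 1 hour, i.e. 1 GB/hour.
# In practice this value should come from a timed test recovery.
throughput = 1.0

old_hours = estimated_restore_hours(1, throughput)      # 1GB database
new_hours = estimated_restore_hours(1000, throughput)   # 1TB taken as 1,000GB
new_days = new_hours / 24

print(f"1GB: {old_hours:.0f} h; 1TB: {new_hours:.0f} h (~{new_days:.1f} days)")
```

Real restores rarely scale this neatly, which is one more reason to measure the throughput rather than assume it.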

Faced with these increasing rates of change, and an economy that usually prohibits hiring more DBAs, it is increasingly common for what I would consider the main responsibility of a production DBA, ensuring that databases can be recovered, to be pushed down in priority. However, it is of the utmost importance to remember that in today's environment, the question is not if a problem will occur one day, but rather when, or how soon, it will occur.

In this environment of high rates of change, there is only ONE method that can be reliably used to ensure that a) recoveries will be successful, and b) management has very good estimates of how long such a recovery is likely to take. That method is to test recovery completely: to dedicate hardware, or a subset of production hardware, along with dedicated DBA time, to doing a test recovery at regular intervals. From a practical standpoint, this can be done once a quarter or once a year, but it is very important to perform it, and to clearly document the results for higher management so that risks are known and understood. The entire procedure really needs to be "under the fingers" of the DBAs or system administrators doing the recovery.

Regular testing of database recovery will ensure that the backup and recovery procedures in place are actually working as they should. It is not enough to simply back up a database and consider the task done; it is actually very common for such backups to turn out to be non-recoverable.

Accomplishing this has one additional positive effect: it will reduce the stress on DBAs and on production managers, who will be assured that recovery is easily accomplished, will know exactly what to expect, and will know that everything is in place to complete it successfully.

Author: Jay Stanley of Database Specialists <jstanley@dbspecialists.com>

Date: April 22, 2010
