A Sanity Check for External Redundancy
Like any DBA, I spend my days on tasks that can seem unrelated to the profession: writing reports, attending meetings, installing releases, planning capacity, answering alerts, patching binaries and running upgrades. It is easy to forget the three things I should be focusing on as a remote DBA:
- Security
- Availability
- Performance
Some aspects of these core functions can be delegated to other groups. For example, Network Engineering may manage LDAP to augment database authentication (security) or, as in this case, disk failures may be left to the storage array by using external redundancy in ASM (availability).
Beware: YOU MAY DELEGATE AUTHORITY, HOWEVER RESPONSIBILITY CAN BE VERY STICKY. Do not think that because you authorize someone or something to take over part of a task, you escape responsibility when things go wrong. In this recent example, I noticed warning messages in the ASM alert log after engineers completed FRU battery maintenance on a SAN.
ORA-17502: ksfdcre:4 Failed to create file +FRA
ORA-00600: internal error code, arguments: [kffbAddBlk04], [],
Note: A recoverable backup was taken prior to the storage maintenance per standard operating procedures.
At this point we took the database out of the cluster and redirected db_recovery_file_dest to local storage while we investigated why the FRA disk group would mount and then crash the instance with an ORA-600 shortly afterwards.
Suspecting metadata corruption, we attempted a repair:
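For readers who have not done it before, a redirect of db_recovery_file_dest like the one described above might look roughly like this. This is a minimal sketch, not the exact commands we ran: the path and size are hypothetical placeholders, and SCOPE=BOTH assumes the instance was started from an spfile.

```sql
-- Hypothetical local path and size; substitute values for your own host.
-- The size limit has to be set before the destination can be set.
ALTER SYSTEM SET db_recovery_file_dest_size = 100G SCOPE=BOTH;
ALTER SYSTEM SET db_recovery_file_dest = '/u01/app/oracle/fra_local' SCOPE=BOTH;

-- Confirm where the instance now writes recovery files.
SELECT name, space_limit, space_used FROM v$recovery_file_dest;
```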
SQL> alter diskgroup FRA check all repair
NOTE: starting check of diskgroup FRA
SUCCESS: check of diskgroup FRA found no errors
A little more time went by, and then:
ORA-00600: internal error code, arguments: [kccpb_sanity_check_2]…
Shutting down instance (abort)
After contacting Oracle Support and exhausting our options to mount the FRA, we ended up initializing the disks with dd and rebuilding the disk group.
Summary: Backups to a Flash Recovery Area may not be as reliable as you think - especially with external redundancy on a single physical or virtual disk. In these situations it is good practice to maintain additional redo log members on at least two disk groups, and to use RMAN to regularly copy archivelogs and backups to a secondary location.
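As a rough illustration of the redo log advice, the statements below check how a disk group is protected and add a redo log member on a second disk group. Treat it as a sketch under assumptions: +DATA2 is a hypothetical disk group name, and group 1 stands in for whatever group numbers your own database reports.

```sql
-- TYPE = 'EXTERN' means ASM keeps no mirror copies of its own
-- and relies entirely on the storage layer underneath.
SELECT name, type, state FROM v$asm_diskgroup;

-- List each redo log group and where its members currently live.
SELECT group#, member FROM v$logfile ORDER BY group#;

-- Add a member on a second disk group (+DATA2 is a hypothetical name).
-- Repeat for every group number returned by the query above.
ALTER DATABASE ADD LOGFILE MEMBER '+DATA2' TO GROUP 1;
```

Rerunning the v$logfile query afterwards should show each group with members on more than one disk group.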
