ASM Grave Errors: ORA-15042 & ORA-15063

ORA-15042: ASM disk “n” is missing
ORA-15063: ASM discovered an insufficient number of disks for diskgroup “diskgroup name”

These 2 errors mean that one or more disks are missing from an ASM disk group. The initial inspection is basic: check that the disk still exists and that the Oracle user has the correct permissions on it. But once you have done that, and you are not using ASM disk mirroring, what do you do?
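The initial inspection can be scripted. Below is a minimal sketch in Python, assuming the disks are visible as ordinary device paths and are meant to be owned by a single OS user. The function name check_disk, the device paths, and the owner name "oracle" are all hypothetical; substitute whatever your installation uses.

```python
import os
import pwd
import stat


def check_disk(path, owner):
    """Return a list of problems found for one candidate ASM disk path."""
    if not os.path.exists(path):
        return [f"{path}: does not exist"]
    problems = []
    st = os.stat(path)
    try:
        actual_owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # The uid has no passwd entry; report the numeric uid instead.
        actual_owner = str(st.st_uid)
    if actual_owner != owner:
        problems.append(f"{path}: owned by {actual_owner}, expected {owner}")
    # The owning user needs both read and write permission on the device.
    if not (st.st_mode & stat.S_IRUSR and st.st_mode & stat.S_IWUSR):
        problems.append(f"{path}: owner lacks read/write permission")
    return problems


# Hypothetical device paths and owner name, for illustration only.
for disk in ["/dev/rdsk/c1t0d0s6", "/dev/rdsk/c1t1d0s6"]:
    for line in check_disk(disk, "oracle") or [f"{disk}: ok"]:
        print(line)
```

This only covers existence, ownership, and owner permissions. It does not tell you whether the disk still carries a valid ASM header.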

In my case, the system administrator took away 7 ASM disks belonging to 2 different disk groups. One of these groups was the archive destination and the other held data. I later found out that the same thing had happened 2 other times on other production systems and 1 time on a development system, each time caused by a different system administrator, on both HP and Solaris. It goes to show how difficult it is to fit new Oracle system tools into big organizations.

If you have a disk group with 100 disks and no ASM mirroring, and you lose 1 disk, the entire disk group is dismounted. Attempts to mount the disk group fail with the following errors:

ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk “n” is missing

What I found out, the hard way unfortunately, is that taking a disk away from ASM is the mother of all mistakes. The only solution at this point is to use the dd command to wipe the headers of all the disks in the disk group, recreate the disk group with the same name, and then restore from backup. In my case we had to do a point-in-time recovery because we had also lost the archive files. Data loss was inevitable.

In a nutshell, losing 1 disk from a disk group = losing the entire disk group. But why is this?

It is because ASM is designed to take each file and spread (stripe) it across the disks in your disk group for performance. You can find the stripe size, or allocation unit (AU) size, from asmcmd -> lsdg. So if you lose 1 disk, you lose parts, or units, of every file in the disk group.
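To see why one missing disk touches every file, here is a toy model in Python. It is a simplified sketch, not ASM's actual allocation algorithm: it deals fixed-size allocation units round-robin across the disks and then asks which files had a unit on the lost disk. The file names, sizes, disk count, and 1 MB AU size are made-up values.

```python
def place_extents(file_sizes_mb, disk_count, au_mb=1):
    """Map each file to the set of disk numbers holding at least one of its units.

    Simplified model: units of au_mb each are dealt round-robin across the
    disks. Real ASM placement is more involved.
    """
    placement = {}
    next_disk = 0
    for name, size_mb in file_sizes_mb.items():
        units = -(-size_mb // au_mb)  # ceiling division
        disks = set()
        for _ in range(units):
            disks.add(next_disk)
            next_disk = (next_disk + 1) % disk_count
        placement[name] = disks
    return placement


def files_damaged(placement, lost_disk):
    """Return the sorted names of files with at least one unit on lost_disk."""
    return sorted(name for name, disks in placement.items() if lost_disk in disks)


# Hypothetical files (sizes in MB) in a 10-disk group.
files = {"system01.dbf": 700, "users01.dbf": 2048, "control01.ctl": 20}
placement = place_extents(files, disk_count=10)
print(files_damaged(placement, lost_disk=3))
# prints ['control01.ctl', 'system01.dbf', 'users01.dbf']
```

In this model, only a file with fewer units than there are disks can escape, and with a 1 MB unit that means a very small file.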

This got me thinking about a recommendation I read a while back: create only 2 disk groups to simplify maintenance, 1 for the archive destination and the other for data. But if you come across a scenario like this and you have a large database, you would lose all your control files and face a prolonged downtime while you recover your entire database.

If you feel you need to design your environment to handle such a scenario, then here are some recommendations:

  1. Size each disk group so that it can be restored within the time limit allowed for recovery. That means you will have more disk groups, not just 1 or 2.
  2. Multiplex your control files across different disk groups. In my case, I lost all control files.
  3. If possible, multiplex your archive destination to a file system outside ASM: 1 destination in ASM and the other on a file system. This will allow you to recover to the point of failure even if you lose all your ASM disks.
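
The first recommendation comes down to simple arithmetic. The sketch below is a rough back-of-the-envelope calculation in Python, not a sizing formula from Oracle. The restore throughput, the recovery window, and the 25% of the window set aside for applying archive logs and other non-copy work are all assumed numbers you would replace with your own measurements.

```python
import math


def max_disk_group_gb(restore_mb_per_sec, recovery_window_hours, overhead=0.25):
    """Largest disk group (in GB) that can be restored inside the window.

    overhead is the assumed fraction of the window reserved for work other
    than copying data back, such as applying archive logs.
    """
    usable_seconds = recovery_window_hours * 3600 * (1 - overhead)
    return restore_mb_per_sec * usable_seconds / 1024


def disk_groups_needed(database_gb, restore_mb_per_sec, recovery_window_hours):
    """Number of disk groups so that any single group fits the window."""
    limit = max_disk_group_gb(restore_mb_per_sec, recovery_window_hours)
    return math.ceil(database_gb / limit)


# Hypothetical figures: 10 TB database, 200 MB/s restore rate, 4-hour window.
print(max_disk_group_gb(200, 4))
print(disk_groups_needed(10000, 200, 4))
```

With these made-up figures the limit is roughly 2,100 GB per disk group, so the 10 TB database would need 5 disk groups. This assumes you lose only one disk group at a time.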

Hazem Ameen
Senior Oracle DBA