Vidya Bala's Blog

Monday, February 26, 2007

A Strange Production Problem!!!

I suddenly got a call that the front-end applications had frozen (those are the worst calls…). I logged on to the database server but was unable to log in to the database, and at the same time got another call that the…

Network Appliance filer experienced a kernel panic or a low-level system-related lockup. The device then rebooted itself to correct the problem and proceeded normally through the startup process.

The database was a 2-node RAC cluster, with both nodes accessing the NetApp device via NFS mount points. After the NetApp rebooted itself:

NodeA looked fine: ORACM was up on the server, and I could log in to the database from NodeA.
NodeB: ORACM was down, and the instance on NodeB was down.

Net result: the application was still unable to connect to either node using TAF.
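For context, TAF is driven by the client-side connect descriptor; a typical one looks roughly like the sketch below (the alias, host names, and service name are hypothetical, but FAILOVER_MODE with TYPE/METHOD/RETRIES/DELAY is the standard TAF syntax):

    APPDB =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (LOAD_BALANCE = yes)
          (ADDRESS = (PROTOCOL = TCP)(HOST = nodea)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = nodeb)(PORT = 1521))
        )
        (CONNECT_DATA =
          (SERVICE_NAME = appdb)
          (FAILOVER_MODE =
            (TYPE = SELECT)
            (METHOD = BASIC)
            (RETRIES = 20)
            (DELAY = 5)
          )
        )
      )

With METHOD = BASIC the client only tries the other address once its session dies, so TAF still depends on at least one instance actually accepting connections, which is exactly what we did not have.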

Since the applications were down anyway, the decision was made to restart the Cluster Manager on both nodes and start both instances. This restored operations fairly quickly (not much time was spent on roll-forward and rollback, since we did not have any long-running transactions at the time of the abort).
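For anyone curious, on a standard 9i RAC Linux install that restart amounts to roughly the following on each node (a sketch; the ocmstart.sh path is the one laid down by the standard install and may differ per environment):

    # as root: restart the 9i Oracle Cluster Manager
    $ORACLE_HOME/oracm/bin/ocmstart.sh

    # as the oracle user: start the local instance
    sqlplus "/ as sysdba"
    SQL> startup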

An SR has been opened to discuss whether the above was the expected behavior.

With RAC I would have expected the following to happen:


Each Oracle instance registers with the local Cluster Manager. The Cluster Manager monitors the status of local Oracle instances and propagates this information to Cluster Managers on other nodes. If the Oracle instance fails on one of the nodes, the following events occur:
1. The Cluster Manager on the node with the failed Oracle instance informs the Watchdog daemon about the failure.
2. The Watchdog daemon requests the Watchdog timer to reset the failed node.
3. The Watchdog timer resets the node.
4. The Cluster Managers on the surviving nodes inform their local Oracle instances that the failed node is removed from the cluster.
5. Oracle instances in the surviving nodes start the Oracle9i Real Application Clusters reconfiguration procedure.
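Once that reconfiguration finishes, a quick sanity check from any surviving node is to ask the cluster which instances it can still see (a sketch using the standard GV$INSTANCE view; instance and host names will of course be site-specific):

    -- run from any open instance
    SELECT inst_id, instance_name, host_name, status
    FROM   gv$instance;

An instance that has been removed from the cluster simply will not appear in the output.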

The nodes must reset if an Oracle instance fails. This ensures that:
· No physical I/O requests to the shared disks from the failed node occur after the Oracle instance fails.
· Surviving nodes can start the cluster reconfiguration procedure without corrupting the data on the shared disk.

In 9i, cluster reconfiguration is supposed to be fast, remastering resources only where necessary, and processes on NodeA should have been able to resume active work during reconfiguration since their locks and resources need not be moved.
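One way to check how client sessions actually came through such an event is the failover columns on GV$SESSION (standard columns in 9i; filter on your application schemas as appropriate):

    -- did connected sessions fail over, and with what TAF settings?
    SELECT inst_id, username, failover_type, failover_method, failed_over
    FROM   gv$session
    WHERE  username IS NOT NULL;

Sessions that relocated via TAF show FAILED_OVER = 'YES'; in our case the clients never got that far, since they could not connect to either node at all.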

However, this was not the behavior we saw when one node went down completely in our case. RAC is great at helping you load balance your requests, but does it really help with disaster recovery?