IBM-AUSTRIA - PC-HW-Support    30 Aug 1999

Servicer Assistance Information for Shared Disk Clusters



Record number: H161480


Abstract: PROBLEM DIAGNOSTICS IN A SHARED DISK CLUSTER ENVIRONMENT_V02


Summary: Servicer Assistance Information for Shared Disk Clusters.

Shared Disk Cluster - Problem Diagnostics 

The intent of this document is to assist with the diagnosis of problems when running the following on the PC Servers and Netfinity Servers that support each:



The IBM PC Servers and Netfinity Servers certified by IBM for clustering are:


The IBM Storage Enclosures certified by IBM for clustering are:



NOTE: Current certified Server and Storage Enclosure configurations are listed at the following IBM
Website URL: http://www.pc.ibm.com/us/compat/clustering/matrix.shtm

1.0 Shared disk cluster identification
2.0 Strategies to prioritize where to begin problem diagnosis
2.1 Problem Determination Guidelines
2.2 Recovering a down cluster before problem determination
2.3 Specific Cluster/Node Strategies
3.0 Replacing a failed ServeRAID II Adapter
4.0 Contacting Support
5.0 Service tips
5.1 Diagnostics
5.2 Shared Disk Subsystems
5.3 Communication Link/LAN Adapters
6.0 Glossary of terms


This document should be used in conjunction with the Servicer Training Video Series Volume 19, Shared Disk Clustering.

This document will be updated as necessary. The latest version of this document can be found at:

http://www.us.pc.ibm.com/support (You may search on the title of this document)

Maintaining cluster high availability requires that the cluster always be available for LAN-attached users to access. With this in mind, always consider keeping at least one node supporting a cluster online while performing problem isolation/diagnostics on other nodes that are part of the cluster.


1.0 Shared disk cluster identification: 

Provided below are ways to identify shared disk cluster configurations. A positive identification from one of the tasks below does not guarantee that you will be working with an active shared disk cluster, but it should allow you to take appropriate steps as necessary. The problem determination steps outlined in this document are safe to use with stand-alone server configurations.

  1.  Ask the customer if they are using shared disk clustering.
  2.  Ask the customer if they are running one of the following packages:

  3.  Look for cluster identifiers such as:
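
NOTE: As a supplementary, illustrative check only (not an IBM-supplied tool), a servicer with access to the node can list the started services and look for a cluster-related service name. The Python sketch below assumes a Windows NT node where the built-in "net start" command lists the display names of started services; the display names searched for ("Cluster Server", "Cluster Service") are assumptions and may differ by product and release.

    # Illustrative sketch only. Assumes a Windows NT node where the built-in
    # "net start" command lists the display names of started services; the
    # cluster-related display names below are assumptions and may differ by
    # product and release.
    import subprocess

    def looks_like_cluster_node():
        """Return True if a cluster-related service appears in 'net start' output."""
        result = subprocess.run(["net", "start"], capture_output=True, text=True)
        output = result.stdout.lower()
        return any(name in output for name in ("cluster server", "cluster service"))

    if __name__ == "__main__":
        if looks_like_cluster_node():
            print("Possible shared disk cluster node")
        else:
            print("No cluster service detected")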


2.0 Strategies to prioritize where to begin problem diagnosis: 

When performing problem diagnosis on a node in a cluster or on a cluster, maintaining cluster high availability should be the greatest priority.

  Remember:

Maintaining cluster high availability requires that the cluster always be available for LAN-attached users to access.
With this in mind, always consider keeping at least one node supporting a cluster online while performing problem isolation/diagnostics on other nodes that are part of the cluster.

  1.  Try to get cluster operations running on at least one node if multiple nodes are down.
  2.  Do problem determination that allows the cluster to continue to run.
  3.  Perform problem determination that requires the entire cluster to be shut down only if absolutely necessary.


2.1 Problem determination guidelines: 

The guidelines below can assist in maintaining cluster high availability and achieving quick problem resolution. Problem determination applied to a node that requires the node NOT be available for cluster operation:


Problem determination requiring the shutdown of an entire Cluster:


2.2 Recovering a down cluster before problem determination 


2.3 Specific Cluster/Node Strategies: 

If a cluster is down and each node is in a different node state, use the strategy of the highest-ranking node state:

       Rank              Node State
        1                Node failure
        2                Hang/Trap condition
        3                Running with errors
        4                Running without errors


The following Problem Determination Guideline Matrix cross-references cluster states with node states; each entry points to the corresponding document section that follows.

  +------------------------------------------------------------------------------+
  | Cluster      |   Node(s)     |   Node(s)     |   Node(s)     |   Node(s)     |
  | State        |   State       |   State       |   State       |   State       |
  |--------------|---------------|---------------|---------------|---------------|
  | Cluster      |  Node failure |   Node(s) in  |  Node(s) run  | Node(s) run   |
  | Down         |               | a Hang/Trap   |   with errors |   w/o errors  |
  |              |        2.3.1  |       2.3.2   |       2.3.3   |       2.3.4   |
  |--------------|---------------|---------------|---------------|---------------|
  | Cluster      |  Node failure |   Node in a   | Node running  |  Node(s) run  |
  | in           |               |   Hang/Trap   |   with errors |    w/o errors |
  | failover     |       2.3.5   |       2.3.6   |       2.3.7   |       2.3.8   |
  | mode         |               |               |               |               |
  +------------------------------------------------------------------------------+


The following strategies should be used in conjunction with the Problem Determination Guideline Matrix above.

Note: Items are rated "High" (most probable failure) to "Low" (least probable failure).
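
For illustration only, the rank table and the Problem Determination Guideline Matrix above can be captured as a small lookup. The Python sketch below is not part of the service procedure; the data-structure and function names are hypothetical, but the rankings and section numbers are copied directly from the tables above.

    # Illustrative sketch only; names are hypothetical.
    # Lower number = higher rank (1 = Node failure is the highest-ranking state).
    NODE_STATE_RANK = {
        "node failure": 1,
        "hang/trap": 2,
        "running with errors": 3,
        "running without errors": 4,
    }

    # (cluster state, node state) -> document section, copied from the matrix above.
    GUIDELINE_MATRIX = {
        ("cluster down", "node failure"): "2.3.1",
        ("cluster down", "hang/trap"): "2.3.2",
        ("cluster down", "running with errors"): "2.3.3",
        ("cluster down", "running without errors"): "2.3.4",
        ("failover mode", "node failure"): "2.3.5",
        ("failover mode", "hang/trap"): "2.3.6",
        ("failover mode", "running with errors"): "2.3.7",
        ("failover mode", "running without errors"): "2.3.8",
    }

    def section_to_use(cluster_state, node_states):
        """Pick the highest-ranking node state and return the matrix section."""
        governing = min(node_states, key=lambda s: NODE_STATE_RANK[s])
        return GUIDELINE_MATRIX[(cluster_state, governing)]

    # Example: cluster is down, one node is hung and one runs without errors;
    # the hang/trap state governs, so section 2.3.2 applies.
    print(section_to_use("cluster down", ["hang/trap", "running without errors"]))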


2.3.1 Cluster down and node failure. 

               * Shared Disk Subsystem -------------------------------   High
               * Cables to shared disk systems -----------------------   High
               * Shared disk subsystem host adapter ------------------   Medium
               * Node failure ----------------------------------------   Low
               * Software errors or configuration --------------------   Low
               * Communication Link ----------------------------------   Low

2.3.2 Cluster down and node(s) in a hang trap condition. 

               * Software errors or configuration --------------------   High
               * Cables to shared disk systems (termination) ---------   High
               * Subsystem host adapters (SCSI ID) -------------------   High
               * Shared disk subsystem -------------------------------   Medium
               * Communication Link ----------------------------------   Low
               * Node failure ----------------------------------------   Low

2.3.3 Cluster down and node(s) running with errors. 

               * Shared disk subsystem -------------------------------   High
               * Subsystem host adapters -----------------------------   High
               * Cables to shared disk systems -----------------------   High
               * Software errors or configuration --------------------   Medium
               * Node failure ----------------------------------------   Low
               * Communication link ----------------------------------   Low

2.3.4 Cluster down and node(s) running without errors. 

               * Communication link ----------------------------------   High
               * Software errors or configuration --------------------   High
               * Shared disk subsystem -------------------------------   Low
               * Cables to shared disk systems -----------------------   Low
               * Subsystem host adapters -----------------------------   Low
               * Node failure ----------------------------------------   Low

2.3.5 Cluster in failover mode and node failure. 

               * Node failure ----------------------------------------   High
               * Cables to shared disk systems -----------------------   Medium
               * Subsystem host adapters -----------------------------   Medium
               * Communication link ----------------------------------   Medium
               * Software error or configuration ---------------------   Low
               * Shared disk subsystems ------------------------------   Low

2.3.6 Cluster in failover mode and a node in a hang trap condition. 

               * Software errors or configuration --------------------   High
               * Node failure ----------------------------------------   High
               * Subsystem host adapter ------------------------------   Medium
               * Cables to shared disk systems (termination) ---------   Medium
               * Shared disk subsystems ------------------------------   Low
               * Communication link ----------------------------------   Low

2.3.7 Cluster in failover mode and a node running with errors. 

               * Communication link ----------------------------------   High
               * Subsystem host adapters -----------------------------   High
               * Cables to shared disk systems -----------------------   High
               * Node failure ----------------------------------------   Low
               * Software errors or configuration --------------------   Low
               * Shared disk subsystems ------------------------------   Low

2.3.8 Cluster in failover mode and node(s) running without errors. 

               * Communication link ----------------------------------   High
               * Cables to shared disk systems -----------------------   High
               * Subsystem host adapters -----------------------------   Medium
               * Software errors or configuration --------------------   Medium
               * Shared disk subsystem -------------------------------   Low
               * Node failure ----------------------------------------   Low

3.0 Replacing a failed ServeRAID II adapter in a High-Availability configuration. 

NOTE: The following procedure requires that specific configuration settings of the ServeRAID II adapter either can be obtained from the adapter that is being replaced, or were noted when the adapter was previously configured and are available for reconfiguring the new adapter.

NOTE: Obtaining the correct information for these settings is the responsibility of the user and is required to accomplish this procedure.

Step 1:


Step 2:


Tip:

SCSI Bus Initiator_Ids for non-shared SCSI channels will normally be set to 7. However, for shared SCSI channels the IDs will usually be 7 or 6 and must be different from the SCSI Bus Initiator_Ids for the corresponding SCSI channels of the cluster partner adapter. You may obtain the SCSI Bus Initiator_Ids from the corresponding cluster partner adapter by booting the ServeRAID Configuration Diskette on the cluster partner system and selecting the "Display/Change Adapter Params" option from the "Advanced Functions" menu. From this information, the correct settings for the replacement adapter can be determined. For example, if the cluster partner's shared SCSI bus Initiator_Ids were set to 7, then the replacement adapter would typically need to be set to 6.

The proper settings for the Host_Id and Cluster Partner's Host_Id of the adapter being replaced may be determined by reading the settings from the cluster partner system by using the "Display/Change Adapter Params" option. In this case, the adapter being replaced should have its Host_Id set to the same value as is defined for the Cluster Partner's Host_Id on the corresponding adapter in the cluster partner system. The Cluster Partner's Host_Id of the replacement adapter should be set to the same value as is defined in the Host_Id of the corresponding adapter in the cluster partner system.


Example:

                                          Node A          Node B
    SCSI Bus Initiator_Ids                   7               6
    Adapter Host_Id                      Server001       Server002
    Cluster Partner's Host_Id            Server002       Server001
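
For illustration only, the rule described in the Tip above can be sketched in Python as follows. The function and field names are hypothetical; the derivation logic follows the Tip and the example above (shared Initiator_Ids differ between partners, and the Host_Id / Cluster Partner's Host_Id values are swapped relative to the partner adapter).

    # Illustrative sketch only; function and field names are hypothetical.
    def replacement_adapter_settings(partner):
        """Derive replacement adapter settings from the cluster partner adapter."""
        return {
            # If the partner's shared SCSI bus Initiator_Id is 7, use 6, and vice versa.
            "shared_initiator_id": 6 if partner["shared_initiator_id"] == 7 else 7,
            # Replacement Host_Id = partner's "Cluster Partner's Host_Id".
            "host_id": partner["cluster_partner_host_id"],
            # Replacement "Cluster Partner's Host_Id" = partner's own Host_Id.
            "cluster_partner_host_id": partner["host_id"],
        }

    # Using the example above: the adapter in Node B is replaced, and its
    # cluster partner adapter is in Node A.
    node_a = {"shared_initiator_id": 7,
              "host_id": "Server001",
              "cluster_partner_host_id": "Server002"}
    print(replacement_adapter_settings(node_a))
    # -> {'shared_initiator_id': 6, 'host_id': 'Server002',
    #     'cluster_partner_host_id': 'Server001'}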


Step 3:



Step 4:

Step 5:

Step 6:

Step 7:

Step 8:

Step 9:


NOTE: The "IBM ServeRaid, ServeRAID II, and onboard ServeRAID configuration diskette" MUST NOT be used to perform failover/rollback to merge/unmerge drives belonging to the other node. Failover/rollback to merge/unmerge drives belonging to the other Node is normally handled by the Operating System software and/or Cluster Support Software.

Step 10:

_____________________________________________________________________________________________________


4.0 Contacting support

When it is necessary to contact IBM support for assistance in resolving a cluster problem, the information below will help support understand the cluster environment and the problem more quickly.


5.0 Service tips


5.1 Diagnostics


5.2 Shared Disk Subsystems


5.3 Communication Link/LAN Adapters


6.0 Glossary of terms

Cluster:
A collection of interconnected whole computers utilized as a single unified computing resource.

Node:
A server participating in a cluster.

Communication Link:
The link between the nodes of the cluster used for Cluster communication. This is usually an Ethernet link.

Failover:
The action where processes and applications are stopped on one node and restarted on the other node.

Failback:
The action where processes and/or applications return to the node they are configured to run on during normal cluster operation.

Cluster down:
A state where either multiple nodes are physically not functioning or clients cannot access the cluster or virtual servers configured on the cluster.

Failover mode:
A state where one node in the cluster is handling cluster activity while the other node is offline or not functioning.

Node failure:
A state where a node hardware failure is exhibited by one of the following attributes:


Hang trap condition:
A state where a software failure has halted operation of a node exhibited by any of the following:


Running with errors:
The operating system on the node is capable of running and is reporting errors. Cluster activity on this node has ceased to function.

Running without errors:
The operating system on the node is capable of running and is NOT reporting errors. Cluster activity on this node is not functioning.

_____________________________________________________________________________________________________


Windows NT and Microsoft Cluster Server are trademarks of Microsoft Corporation.
Microsoft is a registered trademark of Microsoft Corporation.
NetWare and IntranetWare are trademarks of Novell, Inc.
SYMplicity is a trademark of Symbios Logic, Inc.
MetaStor is a trademark of Symbios Logic, Inc.

_____________________________________________________________________________________________________

