Precautionary Action Plan for Hardware Failure in High Availability Database Servers

  • Posted on October 11, 2014 at 4:25 pm

Hardware failures are failures of components such as disks, controllers, CPUs, memory, routers, gateways, cables, tape drives, fans, and so on. Such failures generally call for the malfunctioning component to be repaired or replaced. Once that's done, regular operations can be restarted. It is important to maintain redundant units of the components that are susceptible to frequent failure. Single points of failure are a common cause of long downtime. For instance, twin-tailed disks (disks connected across two nodes) or mirrored disks allow systems to remain functional even when a node or mirror goes down. In the case of a node going down, there may be a brief disruption, depending on the fail-over solution implemented.

If just a single component is maintained, service will be unavailable during the repair window, that is, the time it takes to repair or replace that component. Sometimes, the time required to obtain a component can be horrendously long. For example, when we blew a disk controller at a client site, the hardware engineers determined that the component needed to be replaced entirely. It was not available locally and had to be flown in from out of state, taking at least two days to arrive and be installed. Situations like these can cause tremendous downtime.
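The cost of a long repair window can be estimated with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is only illustrative: the 48-hour repair time corresponds to the two-day replacement described above, but the MTBF figure is an assumed number, not a measurement from that incident, and the redundancy calculation assumes the two units fail independently.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` independent units is enough."""
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(avail: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Assumed figures: 50,000 h MTBF (hypothetical), 48 h to obtain and fit a part.
    single = availability(mtbf_hours=50_000, mttr_hours=48)
    pair = redundant_availability(single, copies=2)
    print(f"single controller: {expected_downtime_hours(single):.2f} h/year")
    print(f"redundant pair:    {expected_downtime_hours(pair):.4f} h/year")
```

Shortening the repair time (a local spare, a vendor guarantee) or adding a second unit both reduce expected downtime. The second usually helps far more, as long as the units really do fail independently.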

Hard disks and disk controllers can fail too, and the important point is that we cannot predict this kind of hardware failure. As a precaution, a high availability database server environment needs a backup server. If a backup server is not possible in every case, we need a strong Service Level Agreement (SLA) with the hardware vendors, with strict conditions and response times. Sometimes we also need to add a penalty clause tied to the measured response time for replacing a failed component and to the availability of that component. With this approach we can keep downtime for high availability database servers to a minimum.

A two-day service interruption may put a company out of business in certain cases. To avoid situations like these, some of my clients have a component availability guarantee clause put in all hardware contracts, wherein the vendor ensures that specific components prone to failure are available locally or can be obtained within an acceptable, predetermined time period from specific remote sources for immediate replacement. Identify all critical components and augment them with clones.
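Identifying the critical components that still lack a clone can start from a simple inventory. The following is a minimal sketch under stated assumptions: the `Component` record, its fields, and the sample inventory are all hypothetical, and "covered" is taken to mean at least two units counting both installed units and local spares.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    critical: bool   # would its failure stop the database service?
    installed: int   # units currently in service
    spares: int      # units held locally for immediate replacement


def single_points_of_failure(inventory):
    """Names of critical components with fewer than two units in total."""
    return [
        c.name
        for c in inventory
        if c.critical and c.installed + c.spares < 2
    ]


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = [
        Component("disk controller", critical=True, installed=1, spares=0),
        Component("mirrored disk", critical=True, installed=2, spares=0),
        Component("power supply", critical=True, installed=1, spares=1),
        Component("tape drive", critical=False, installed=1, spares=0),
    ]
    for name in single_points_of_failure(inventory):
        print(f"no clone or local spare: {name}")
```

Anything this check reports is a candidate either for a clone or for a component availability guarantee clause in the vendor contract.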

Additionally, all hardware should be kept securely inside cabinets to prevent accidental or intentional damage. All guidelines from the manufacturer (temperature, proximity to other devices, and so on) should be strictly followed. Cabling should be done as neatly as possible, so that it does not get in the way of people walking about. Raised floors in your data centers greatly help in preventing accidents.