Technical support
A quick guide to the essentials of application and data availability.
High-availability cluster
High-availability clusters (also known as HA clusters or failover clusters) are computer clusters implemented primarily to provide high availability of the services the cluster offers. They operate by having redundant computers, or nodes, which are used to provide service when system components fail. Normally, if a server running a particular application crashes, the application will be unavailable until someone fixes the crashed server. HA clustering remedies this situation by detecting hardware/software faults and immediately restarting the application on another system without requiring administrative intervention, a process known as failover. As part of this process, the clustering software may configure the node before starting the application on it: appropriate filesystems may need to be imported and mounted, network hardware may have to be configured, and some supporting applications may need to be running as well. HA clusters are often used for critical databases, file sharing on a network, business applications, and customer services such as electronic commerce websites.
HA cluster implementations attempt to build redundancy into a cluster to eliminate single points of failure, including multiple network connections and data storage that is multiply connected via storage area networks (SANs).
HA clusters usually use a private heartbeat network connection to monitor the health and status of each node in the cluster. One subtle but serious condition that all clustering software must be able to handle is split-brain. Split-brain occurs when all of the private links go down simultaneously but the cluster nodes are still running. If that happens, each node in the cluster may mistakenly decide that every other node has gone down and attempt to start services that other nodes are still running. Having duplicate instances of services may cause data corruption on the shared storage.
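A common way clustering software avoids split-brain is a majority quorum: a partition may run services only if it can see more than half of the cluster. The following is a minimal sketch of that rule; the function name and node counts are illustrative assumptions, not taken from any particular cluster suite.

```python
# Hypothetical sketch: majority-quorum check to guard against split-brain.

def has_quorum(visible_nodes: int, total_nodes: int) -> bool:
    """A node may run services only if it can see a strict majority of
    the cluster, so two isolated halves can never both be active."""
    return visible_nodes > total_nodes // 2

# In a 5-node cluster split 3/2 by a network failure:
print(has_quorum(3, 5))  # the 3-node partition keeps running -> True
print(has_quorum(2, 5))  # the 2-node partition stops its services -> False
```

Because a strict majority can exist in at most one partition, duplicate service instances, and the resulting data corruption on shared storage, are ruled out by construction.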
In computing, failover is the capability to switch over automatically to a redundant or standby computer server, system, or network upon the failure or abnormal termination of the previously active application, server, system, or network. Failover happens without human intervention and generally without warning, unlike switchover.
Systems designers usually provide failover capability in servers, systems or networks requiring continuous availability and a high degree of reliability.
At the server level, failover automation takes place using a "heartbeat" cable that connects two servers. As long as a regular "pulse" or "heartbeat" continues between the main server and the second server, the second server will not initiate its systems. There may also be a third, "spare parts" server with running spare components for "hot" switching to prevent downtime.
The second server takes over the work of the first as soon as it detects an alteration in the "heartbeat" of the first machine. Some systems can also page or send a message to a pre-assigned technician or center.
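The detection step described above usually amounts to checking how long it has been since the last pulse arrived. This is a minimal sketch of that check, assuming an illustrative one-second pulse interval and a three-missed-pulse threshold; the names and timings are not from any specific product.

```python
# Hypothetical sketch of heartbeat-based failover detection.

HEARTBEAT_INTERVAL = 1.0   # assumed seconds between expected pulses
MISSED_LIMIT = 3           # assumed pulses to miss before declaring failure

def primary_failed(last_pulse_time: float, now: float) -> bool:
    """The standby declares the primary failed once the age of the last
    received pulse exceeds the allowed number of missed intervals."""
    return (now - last_pulse_time) > HEARTBEAT_INTERVAL * MISSED_LIMIT

# The standby periodically checks the age of the last pulse it received:
print(primary_failed(last_pulse_time=100.0, now=102.0))  # 2.0 s old -> False
print(primary_failed(last_pulse_time=100.0, now=104.5))  # 4.5 s old -> True
```

In practice the threshold is a trade-off: too short and transient network hiccups trigger needless failovers; too long and the outage before takeover grows.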
Failback, conversely, is the process of restoring a system, component, or service that is in a state of failover back to its original, pre-failure state.
Why do you need a cluster?
With the growth of the global economy, organizations around the world depend increasingly on their IT systems, and e-commerce makes continuous business, 24 hours a day, seven days a week, possible. This new business model places new demands on computers and servers, especially for commercial organizations and social institutions. The trend is clear: we need to be able to rely on stable computer systems at all times, and this growing demand makes high availability very important. The business of many corporations and organizations depends to a large extent on computer systems; any downtime causes serious losses, and a failure of a key IT system can quickly bring the entire business operation to a halt. Every minute of downtime can mean lost revenue, production, and profit, and may even weaken the organization's market position.
RPO-Recovery point objective
Recovery Point Objective (RPO) describes the acceptable amount of data loss measured in time.
The Recovery Point Objective (RPO) is the point in time to which you must recover data, as defined by your organization. It is generally what an organization determines to be an "acceptable loss" in a disaster situation. If a company's RPO is 2 hours and it takes 5 hours to get the data back into production, the RPO is still 2 hours: based on this RPO, the data must be restored to within 2 hours of the disaster.
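Since RPO is measured in time, it translates directly into a constraint on how often data must be protected. The following sketch checks an assumed backup schedule against the two-hour RPO from the example above; the function names and intervals are illustrative.

```python
# Illustrative sketch: checking a backup schedule against an RPO.

RPO_HOURS = 2.0  # acceptable data loss measured in time, per the example

def worst_case_loss(backup_interval_hours: float) -> float:
    """In the worst case, a disaster strikes just before the next backup,
    so the data lost spans the full interval between backups."""
    return backup_interval_hours

def meets_rpo(backup_interval_hours: float) -> bool:
    return worst_case_loss(backup_interval_hours) <= RPO_HOURS

print(meets_rpo(1.0))  # hourly backups -> True
print(meets_rpo(4.0))  # four-hourly backups -> False
```

Note that, as the text says, the 5 hours needed to restore the data does not appear here at all: restore duration belongs to the RTO, not the RPO.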
RTO-Recovery time objective
The Recovery Time Objective (RTO) is the duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.
It includes the time spent trying to fix the problem without a recovery, the recovery itself, tests, and communication with the users. The decision time of the users' representative is not included.
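The phases listed above add up to the elapsed recovery time that must fit inside the RTO. This is a minimal sketch of that sum; the four-hour RTO and the phase durations are assumed values for illustration only.

```python
# Hypothetical sketch: summing the phases of a recovery and comparing
# the total with the RTO. All durations are assumed values in hours.

RTO_HOURS = 4.0

def recovery_time(fix_attempt: float, recovery: float,
                  tests: float, communication: float) -> float:
    """Per the definition above, the clock covers the attempt to fix the
    problem without a recovery, the recovery itself, tests, and the
    communication to the users; the users' representative's decision
    time is excluded and so has no parameter here."""
    return fix_attempt + recovery + tests + communication

elapsed = recovery_time(fix_attempt=0.5, recovery=2.0,
                        tests=0.5, communication=0.5)
print(elapsed)               # 3.5
print(elapsed <= RTO_HOURS)  # True
```

Budgeting each phase this way makes it clear that a long, failed fix attempt eats directly into the time left for the recovery itself.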
The business continuity timeline usually runs parallel with an incident management timeline and may start at the same, or different, points.