Cluster node accessibility diagnostics


The diagnostics system is based on corosync and our custom tool corolistener.

Corosync

Corosync automatically discovers the cluster members using multicast or unicast packets sent to a configured IP address and port. For details on the transport, IP address and port that corosync uses, see "Setting up cloud functions".

The corosync configuration file (/etc/corosync/corosync.conf) contains the following information:

The main section (this example uses multicast):

 totem {
 	interface {
 		bindnetaddr: 172.31.223.47
 		mcastaddr: 239.255.1.1
 		mcastport: 5405
 		ttl: 1
 	}
 	config_version: 1
 	cluster_name: VMmanager
 }
 quorum {
 	expected_votes: 3
 }
 ihttpd {
 	count: 1
 	port0: 1500
 }

The totem section:

  • bindnetaddr - IP address that the corosync service binds to;
  • mcastaddr and mcastport - multicast address and port;
  • config_version - version of the configuration file. The configuration file changes whenever you add servers to the cluster or remove them from it, and the version number changes with it. This makes it possible to keep the corosync configuration file up to date on all cluster nodes.
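
As a rough illustration of how config_version can be used to spot a node whose configuration is out of date, the Python sketch below extracts the value from the text of corosync.conf and compares it across nodes. The function names and the way the file contents are collected from each node are assumptions made for this example; they are not part of corosync or VMmanager.

```python
import re


def read_config_version(conf_text):
    """Return config_version from corosync.conf text, or None if absent."""
    match = re.search(r"^\s*config_version:\s*(\d+)\s*$", conf_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def outdated_nodes(conf_by_node):
    """Return the nodes whose config_version is lower than the highest one seen.

    conf_by_node maps a node address to the text of its corosync.conf
    (how the text is fetched from each node is outside this sketch).
    A file without config_version is treated as version 0.
    """
    versions = {node: read_config_version(text) or 0
                for node, text in conf_by_node.items()}
    newest = max(versions.values(), default=0)
    return sorted(node for node, version in versions.items() if version < newest)
```

A node returned by outdated_nodes would be one that still has to fetch the current configuration file.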

The quorum section:

  • expected_votes - total number of servers in the cluster. This value is required to calculate the quorum.
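
This page does not state how the quorum threshold is derived from expected_votes. Assuming the usual majority rule (strictly more than half of the expected votes) and one vote per node, the calculation can be sketched in Python as follows. Both function names are made up for this example.

```python
def quorum_threshold(expected_votes):
    """Smallest number of votes that forms a strict majority (assumed rule)."""
    return expected_votes // 2 + 1


def has_quorum(available_votes, expected_votes):
    """True when the reachable nodes hold a majority of the expected votes."""
    return available_votes >= quorum_threshold(expected_votes)
```

Under this assumption, with expected_votes: 3 as in the example above, two reachable nodes keep the quorum and a single node does not.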

The ihttpd section:

  • count and port0 - the number of ports and the port on which ihttpd will run on the new master server (if the master server fails and VMmanager is started on an available node).

Cluster node list section

nodelist {
	node {
		ring0_addr: 172.31.223.46
		nodeid: 2
		prio: 99
		replication: on
	}
	node {
		ring0_addr: 172.31.223.47
		nodeid: 4
		prio: 100
		replication: on
	}
	node {
		ring0_addr: 172.31.223.48
		nodeid: 8
		prio: 98
		replication: on
	}
}

The nodelist section contains information about all cluster nodes:

  • ring0_addr - IP address of the cluster node;
  • prio - priority of the cluster node (you can change it in VMmanager -> Cluster nodes -> Edit). When the master server goes down, the server with the highest priority among those in quorum becomes the new master server;
  • replication - "on" means that VMmanager database replication is enabled; "off" means that it is disabled.
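
To make the structure of the nodelist section concrete, here is a minimal Python sketch that reads the node blocks from the text of corosync.conf into dictionaries. The function name parse_nodelist is hypothetical, and the parser assumes the simple "key: value" layout shown above with no nested braces inside a node block.

```python
import re


def parse_nodelist(conf_text):
    """Return a list of dicts, one per node { ... } block in the nodelist."""
    nodes = []
    # \bnode\s*\{ matches "node {" but not "nodelist {" or "nodeid:".
    for block in re.findall(r"\bnode\s*\{([^{}]*)\}", conf_text):
        node = {}
        for line in block.splitlines():
            key, sep, value = line.strip().partition(":")
            if sep:  # skip blank lines and anything that is not "key: value"
                node[key.strip()] = value.strip()
        nodes.append(node)
    return nodes
```

All values are kept as strings, so the priority has to be converted before comparison, for example max(nodes, key=lambda n: int(n["prio"])). For the nodelist above, that expression selects the node with ring0_addr 172.31.223.47.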

Corolistener

corolistener is an in-house tool developed by ISPsystem. It analyses information from corosync and decides whether to restore the services or move virtual machines to another node. corolistener is enabled as soon as you add a server to a cluster with cloud functions enabled, or when you activate Cloud functions. The corolistener executable is located at /usr/local/mgr5/sbin/corolistener.

Corolistener processes corosync events on every cluster node. If the number of cluster nodes changes, corolistener analyses the quorum on every cluster node and takes one of the following actions:

  • The number of available nodes is less than the quorum: the system and virtual machines are suspended.
  • The number of available nodes is larger than the quorum, and the node has the highest priority among the available nodes: this node is considered the master. The system performs the following operations:
    • Adds the cluster (license) IP address to the cluster node interface specified by the CloudIpDev directive of the VMmanager configuration file (the default value is vmbr0). The mask is specified by the CloudMask directive of the same file.
    • Creates the /tmp/.lock.vmmgr.service file and starts VMmanager. The file indicates that VMmanager can start on that node. VMmanager then checks whether it was moved to the server or simply restarted.
    • If the /tmp/.lock.vmmgr.relocated file is missing, VMmanager considers that it has been moved to that node. It downloads the database replica and relocates the virtual machines from the failed cluster nodes.
    • Corolistener changes the priority of the node, updates the corosync configuration file, and informs the remaining cluster nodes that they should request the new configuration file from the master.
  • The number of available nodes is larger than the quorum, and the cluster node does not have the highest priority: the cluster node keeps running as a slave.
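
The three cases above can be summarised in a short Python sketch. It is an illustration of the decision only, not corolistener's actual code. It assumes the majority rule for the quorum, treats a node count equal to the quorum as having quorum, and breaks priority ties by the lowest node id; none of these details is confirmed by this page, and the function name decide_role is hypothetical.

```python
def decide_role(local_id, available, expected_votes):
    """Return "suspend", "master" or "slave" for the local node.

    available maps the id of every reachable node (including the local one)
    to its priority; expected_votes is the value from the quorum section.
    """
    quorum = expected_votes // 2 + 1  # assumed majority rule
    if len(available) < quorum:
        return "suspend"
    # Assumption: highest priority wins, ties are broken by the lowest node id.
    best = min(available, key=lambda node_id: (-available[node_id], node_id))
    return "master" if best == local_id else "slave"
```

With the priorities from the nodelist example (node 2: 99, node 4: 100, node 8: 98), node 4 is the master while all three nodes are reachable. If node 4 fails, decide_role returns "master" on node 2 and "slave" on node 8.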

To check the current state of the cluster nodes, you can use corosync:

 # corosync-quorumtool -l
 Membership information
 ----------------------
     Nodeid      Votes Name
         7          1 172.31.224.72 (local)
         9          1 172.31.224.74
         13          1 172.31.224.80

or corolistener:

 # /usr/local/mgr5/sbin/corolistener -l
 VMmanager-cloud node list
 =============================================================
         Id               Ip    Status  Master/Slave  Priority
          7    172.31.224.72    joined             M       100
          9    172.31.224.74    joined             S        40
         13    172.31.224.80    joined             S        10

The corolistener output gives more information: it also shows which node is the master and the priority of each node.
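
If the node list needs to be processed by a script, the table can be parsed with a few lines of Python. The sketch below assumes the column layout shown in the sample output above (Id, Ip, Status, Master/Slave, Priority) and numeric priorities. The function names are hypothetical, and capturing the command output is left to the caller.

```python
def parse_node_list(output):
    """Parse the table printed by `corolistener -l` into a list of dicts."""
    nodes = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have five columns and start with a numeric node id;
        # the title, separator and header lines are skipped.
        if len(fields) == 5 and fields[0].isdigit():
            node_id, ip, status, role, priority = fields
            nodes.append({"id": int(node_id), "ip": ip, "status": status,
                          "role": role, "priority": int(priority)})
    return nodes


def find_master(nodes):
    """Return the IP of the node marked "M", or None if there is none."""
    return next((n["ip"] for n in nodes if n["role"] == "M"), None)
```

For the sample output above, find_master(parse_node_list(output)) gives 172.31.224.72.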

corolistener writes its log to /usr/local/mgr5/var/corolistener.log.