DCImanager server diagnostics


Server diagnostics lets you perform the following tasks:

  • add server data to DCImanager easily,
  • clear disks (in some cases OS installation may fail because of remnants of a previously installed operating system on a hard drive),
  • specify MAC-addresses for a server with two or more network cards; they are needed for a correct DHCP-server configuration file and can be specified only during diagnostics.

Starting the process

To run server diagnostics, select a server in the "Servers" list and click the "Operations" button. In the form that opens, select "Run diagnostics" in the "Operation type" drop-down list.

Also select a diagnostics template; "Diag-x86_64" is the default value set by DCImanager. You can use any other template. (Read more about OS templates).

If you need the disks to be cleared after the diagnostics, check the "Clear disks" box. You will also see the "Full hard drive erase" option. Selecting this check box zeroes the whole hard drive, which increases the diagnostics time. If you don't select it, only the first 512 MB of the hard drive are zeroed.
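The effect of the two check boxes can be summarized in a minimal sketch. The function and constant names are hypothetical, and reading "512 MB" as 512 MiB is an assumption; this is not DCImanager's own code.

```python
# Hypothetical helper; names and the MiB interpretation are assumptions.
PARTIAL_WIPE_BYTES = 512 * 1024 * 1024  # "first 512 MB", read here as 512 MiB


def bytes_to_zero(disk_size_bytes: int, clear_disks: bool, full_erase: bool) -> int:
    """Return how many bytes from the start of the disk would be zeroed."""
    if not clear_disks:
        # "Clear disks" is not checked: nothing is zeroed.
        return 0
    if full_erase:
        # "Full hard drive erase" is checked: the whole disk is zeroed.
        return disk_size_bytes
    # Only the beginning of the disk is zeroed (capped at the disk size).
    return min(disk_size_bytes, PARTIAL_WIPE_BYTES)
```

For a 1 TB disk this gives 536870912 bytes for a partial clear and the full disk size for a full erase, which is why the full erase takes noticeably longer.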

The diagnostics process also starts automatically when automatic server search is configured, or when a server is released (provided that the "Check before releasing" option is selected in the "Global settings" module).

Requirements

  • A MAC-address and an IP-address must be specified for the target server (hereafter "Server").
  • The "Server" must be reachable from the DCImanager server (hereafter "DCImanager").
  • Network boot must be set up on the "Server".
  • The "Server" must be connected to a "PDU" or IPMI. Otherwise, when the diagnostics start, the "Server" will require a manual restart.
  • "DHCP" must be configured on "DCImanager" ("Global settings" -> "DHCP settings" -> "Interfaces").

How it works

After the diagnostics process starts, the system creates a block in the DHCP configuration file for the server's MAC-address. Once the "Server" passes the authorization procedure via DHCP, the diagnostics template is uploaded to the server.
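As an illustration of what a per-server DHCP block can look like, here is a minimal sketch that renders a host declaration in ISC dhcpd syntax. That the DHCP server uses this syntax is an assumption, the function name and sample values are hypothetical, and the block DCImanager actually writes may contain further options (for example, network boot settings).

```python
def render_dhcp_host_block(name: str, mac: str, ip: str) -> str:
    """Render an ISC dhcpd-style host declaration for one server.

    Illustrative only: the real block may carry additional options.
    """
    return (
        f"host {name} {{\n"
        f"    hardware ethernet {mac.lower()};\n"
        f"    fixed-address {ip};\n"
        f"}}\n"
    )


# Sample values are made up (documentation IP range).
block = render_dhcp_host_block("server42", "00:25:90:AB:CD:EF", "192.0.2.10")
print(block)
```

Binding the declaration to the MAC-address is what lets the DHCP server recognize the "Server" when it boots from the network.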

Once the upload is complete, the server check script (the "Diag-x86_64" template) starts and determines the processor model, the amount of RAM, the presence of a hardware RAID controller, and the availability of HDDs. It also checks the local connection speed, and the read rate and SMART information of the HDDs.

IPMI, if present, will be configured: it will be assigned an IP-address, mask, gateway, user, and password. If the "Add IPMI automatically" option is set in the "Global settings" module, IPMI will be added to the server in DCImanager. All the information will be sent to "DCImanager".

DCImanager will verify that the server platform corresponds to the received data; if they do not match, it will automatically create a new platform and assign it to the server.

If the "Power off servers upon checking" option is selected in the "Global settings" module, the server will be powered off or switched into its normal operation after the diagnostics is finished.

Processing results

If a hardware RAID is not found on the server, "smartctl" won't be able to show correct information on the hard drives. In that case the "Server has hardware issues" icon will be added, and the server edit form will ask you to specify the HDDs manually. During the diagnostics process, all HDDs will be unassigned from the server. If a hardware RAID is found, the system will unassign only those HDDs that were assigned after diagnostics (the disks specified manually will remain unchanged).

If a "Platform type" is not specified for the server, it will show the "Server has hardware problems" icon.

If a "Platform type" is selected, the system will compare it with the detected hardware: the number of processors (cannot be 0 or exceed the value specified for the platform type), the amount of RAM (cannot be 0 or exceed the value specified for the platform type), and the number of HDDs (cannot exceed the value specified for the platform type). If the results differ from the values specified for the platform type, the "Server has hardware problems" icon will be displayed for that server.
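The comparison rules can be expressed as a short sketch. The class, field, and function names are hypothetical and only mirror the limits listed in the text; this is not the product's internal data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlatformType:
    """Limits configured for a platform type (hypothetical field names)."""
    max_cpus: int
    max_ram_mb: int
    max_hdds: int


@dataclass
class DetectedHardware:
    """Values reported by the diagnostics run (hypothetical field names)."""
    cpus: int
    ram_mb: int
    hdds: int


def platform_mismatches(platform: PlatformType, detected: DetectedHardware) -> List[str]:
    """Return the names of the checks that fail; an empty list means a match."""
    problems = []
    # Processors: cannot be 0 or exceed the platform value.
    if detected.cpus == 0 or detected.cpus > platform.max_cpus:
        problems.append("cpu")
    # RAM: cannot be 0 or exceed the platform value.
    if detected.ram_mb == 0 or detected.ram_mb > platform.max_ram_mb:
        problems.append("ram")
    # HDDs: cannot exceed the platform value (0 is allowed).
    if detected.hdds > platform.max_hdds:
        problems.append("hdd")
    return problems
```

A non-empty result corresponds to the case in which the "Server has hardware problems" icon is displayed.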


If HDD information is available, the following parameters will be checked: read rate (the threshold value can be set in "Hardware types" -> "HDDs" -> "HDD types"; 100 Mb/sec by default) and SMART parameters (the parameters to check are set in "Hardware types" -> "HDDs" -> "HDD types"; Reallocated_Sector, Seek_Error_Rate, UDMA_CRC_Error_Count, Current_Pending_Sector, Offline_Uncorrectable, and Media_Wearout_Indicator are checked by default).
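A minimal sketch of the HDD check follows. The text does not state how each SMART attribute is judged, so the sketch assumes that some SMART reader has already flagged the failing attributes and only filters them against the watched list. It also assumes that a read rate below the threshold is a problem. All names are hypothetical.

```python
from typing import Iterable, List

# Defaults taken from the text; the unit is whatever the "HDD types" form uses.
DEFAULT_READ_RATE_THRESHOLD = 100
DEFAULT_SMART_ATTRIBUTES = (
    "Reallocated_Sector",
    "Seek_Error_Rate",
    "UDMA_CRC_Error_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Media_Wearout_Indicator",
)


def hdd_problems(
    read_rate: float,
    flagged_attributes: Iterable[str],
    threshold: float = DEFAULT_READ_RATE_THRESHOLD,
    watched: Iterable[str] = DEFAULT_SMART_ATTRIBUTES,
) -> List[str]:
    """Return the failed checks for one disk; an empty list means it passed."""
    problems = []
    # Assumption: a read rate below the threshold counts as a problem.
    if read_rate < threshold:
        problems.append("read_rate")
    # Only attributes on the watched list are taken into account.
    problems.extend(sorted(set(flagged_attributes) & set(watched)))
    return problems
```

Attributes flagged by the reader but absent from the watched list are ignored, which matches the idea that the set of checked parameters is configurable per HDD type.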

If results differ from the values specified for the platform type, the "Server has hardware problems" icon will be displayed for that server.

If the local connection speed is less than LocalSpeedThreshold*Port_Speed/100, the "Server has hardware problems" icon will be displayed. The default value for "LocalSpeedThreshold" is 80.

Example: for a 100 MB/sec port, the default threshold is 80 MB/sec.
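The threshold arithmetic can be checked with a small sketch. The function names are hypothetical; the formula and the default of 80 come from the text.

```python
def min_acceptable_speed(port_speed: float, local_speed_threshold: float = 80) -> float:
    """Lowest local connection speed that does not trigger the icon."""
    return local_speed_threshold * port_speed / 100


def speed_is_problem(measured: float, port_speed: float, local_speed_threshold: float = 80) -> bool:
    """True if the measured speed is below the computed threshold."""
    return measured < min_acceptable_speed(port_speed, local_speed_threshold)


print(min_acceptable_speed(100))   # threshold for a port speed of 100
print(speed_is_problem(79, 100))   # below the threshold
print(speed_is_problem(80, 100))   # exactly at the threshold
```

Because the comparison is "less than", a measurement exactly equal to the threshold does not trigger the icon.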

Starting from version 5.94, a new mechanism lets you define the correspondence between a platform and CPUs using sockets.

If the diagnostics process detects an unknown CPU, the "Server has hardware issues" icon will be displayed.

If it detects CPUs that are not associated with a socket, the administrator will see a corresponding banner.

When the "Server has hardware issues" icon is added

Before the diagnostics starts, the "Server has hardware issues" icon is added for the server.

Once completed, the following parameters will be checked:

  • Local connection speed (from <LocalSpeedThreshold*Port_Speed/100> to <Port_Speed>),
  • HDD parameters (read speed and SMART-criteria) are within normal limits (they are specified in the disk types configuration form),
  • Hardware RAID

If no issues are found, the icon will be deleted.

If a user interrupts the diagnostics process, the icon will be displayed.
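The icon lifecycle described above reduces to a simple rule, sketched here with hypothetical names: the icon is set before the run, stays if the run does not complete, and is removed only when a completed run reports no issues.

```python
from typing import Sequence


def icon_is_shown(diagnostics_completed: bool, issues: Sequence[str]) -> bool:
    """Return True if the "Server has hardware issues" icon should remain."""
    if not diagnostics_completed:
        # The icon is added before the run and stays if the run is interrupted.
        return True
    # After a completed run the icon remains only if some check failed.
    return len(issues) > 0
```

Setting the icon up front means an interrupted run can never leave a server looking healthy by accident.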

After diagnostics is complete, when you edit the "Platform type" field in the server edit form, the system checks that the server configuration matches the parameters of the new platform. The icon is added if they do not match.

Removing the "Server has hardware issues" icon

Open the server edit form and find the lines marked in red, for example:

A platform type is not selected for this server
No information about configuration of this server. Please, run the diagnostics procedure.

Resolve the issue and run the diagnostics procedure again, if needed.