Automate the reset of the IPMI System Event Log

Last time we discovered a new localcli command to clear the IPMI System Event Log (SEL) after the “Host IPMI System Event Log Status” error. This time, we are going to automate that command with a script, so that future occurrences are handled automatically.

Basically, the script is going to:

  • identify which servers have the IPMI alarm active.
  • for these servers, enable SSH.
  • connect with plink (think: PuTTY on the command line) and run the commands that will reset the IPMI System Event Log.
  • disable SSH.

The first step is to prepare a text file which will contain the commands to pass over SSH:

localcli hardware ipmi sel clear
nohup /sbin/services.sh restart > foo.out 2> foo.err < /dev/null &
sleep 120
exit

Let’s call this file “ClearIPMILog-commands.txt”. The commands in this file clear the IPMI SEL, restart the services (nohup and the redirections keep the restart running even though it interrupts the session), wait for the services to come back up, then disconnect.

And now the script.

# Variables
$vCenterName = "your_vcenter_name"
$vCenterLogin = "domain\your_vcenter_admin"
$vCenterPassword = "your_vcenter_password"
$esxilogin = "root"
$esxipassword = "your_root_password"
$plink = "path_to_plink.exe"
$plinkoptions = "-ssh -pw $esxipassword -noagent -m"
$plinkcommands = "path_to_ClearIPMILog-commands.txt"

# script start
#Adding VMware cmdlets
$VMwareManagement = Get-PSSnapin | Where-Object { $_.Name -match "VMware.VimAutomation.Core" }
if (!$VMwareManagement) {
    Add-PSSnapin VMware.VimAutomation.Core
}

# we get the host list from the datacenter
Connect-VIServer -server $vCenterName -User $vCenterLogin -Password $vCenterPassword
$esx_all = Get-VMHost | Get-View

# cycle through the hosts and check for the IPMI SEL alarm
foreach ($esx in $esx_all){
    foreach($triggered in $esx.TriggeredAlarmState){
        $alarm = get-View -Id $triggered.Alarm
        If ($alarm.Info.Name -eq "Host IPMI System Event Log Status"){
            # if the IPMI alarm is detected we start SSH on the host
            $vmhost = get-vmhost $esx.name
            $sshService = Get-VMHostService -VMHost $vmhost | Where-Object { $_.Key -eq "TSM-SSH" }
            Start-VMHostService -HostService $sshService -Confirm:$false
            # Connect with ssh and execute the commands in the text file
            # ("echo Y" accepts the host key prompt on the first connection)
            cmd /c "echo Y | $plink $plinkoptions $plinkcommands $esxilogin@$vmhost"
            # stop SSH
            Stop-VMHostService -HostService $sshService -Confirm:$false
        }
    }
}

Save this script next to the text file containing the commands (for instance as ClearIPMILogs.ps1) and update the variables in the header of the script. The script is now ready for execution.

Running the script can take several minutes, especially when several hosts show the error, because it waits two minutes per host while the services restart. The alarm then needs another 5 minutes to clear in vCenter, so be patient!

The script is best run as a scheduled task, quite regularly (from hourly to daily, depending on your environment and the error frequency).
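
As a rough sketch of how such a task could be registered, the lines below use the built-in schtasks.exe tool from a PowerShell prompt. The task name and the script path are placeholders I made up for illustration, and the account the task runs under is left at the default, so you will probably want to set it explicitly for your environment.

```powershell
# Hypothetical values: adjust the task name and script path to your setup.
$taskName   = "ClearIPMILogs"
$scriptPath = "C:\Scripts\ClearIPMILogs.ps1"

# Command line the scheduled task will launch.
$taskCommand = "powershell.exe -NoProfile -ExecutionPolicy Bypass -File $scriptPath"

# Create the task (or overwrite an existing one with /F) and run it every hour.
schtasks.exe /Create /TN $taskName /TR $taskCommand /SC HOURLY /MO 1 /F
```

Replace "/SC HOURLY /MO 1" with "/SC DAILY" if a daily run is enough for your environment.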

A few closing remarks:

  • It is possible to avoid having the vCenter credentials in clear text in the script (by storing secure credentials in a file). We will have a look at that another time.
  • This technique is also possible for the ESXi credentials, but sadly it does not work with plink, which only supports clear text passwords. The best you can do for this special case is to restrict access to the script file as much as possible.
  • In the script, I assume that SSH is disabled on your servers (which is recommended). If you keep SSH enabled, just remove the lines that start and stop the SSH service!
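
As a small preview of the first remark, here is a minimal sketch of one common way to keep the vCenter credentials out of the script, assuming PowerShell 3.0 or later on Windows. The helper names (Save-StoredCredential, Get-StoredCredential) and the file path are my own placeholders, not part of the script above.

```powershell
# Hypothetical helpers, shown for illustration only.
function Save-StoredCredential {
    param(
        [Parameter(Mandatory=$true)] [System.Management.Automation.PSCredential] $Credential,
        [Parameter(Mandatory=$true)] [string] $Path
    )
    # Export-Clixml serializes the credential object to an XML file. On Windows
    # the password is encrypted with DPAPI, so only the same user on the same
    # machine can read it back.
    $Credential | Export-Clixml -Path $Path
}

function Get-StoredCredential {
    param(
        [Parameter(Mandatory=$true)] [string] $Path
    )
    # Import-Clixml rebuilds the PSCredential object from the file.
    Import-Clixml -Path $Path
}

# One-time interactive step, run as the account that will run the script:
#   Save-StoredCredential -Credential (Get-Credential) -Path "C:\Scripts\vcenter.cred"
#
# In the script, instead of $vCenterLogin and $vCenterPassword:
#   $cred = Get-StoredCredential -Path "C:\Scripts\vcenter.cred"
#   Connect-VIServer -Server $vCenterName -Credential $cred
```

Because the file can only be decrypted by the account that created it, the one-time step has to be run under the same account as the scheduled task.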