ISS Technology Update
Volume 7, Number 1
Fan failure recovery in ProLiant DL and ML servers
HP ProLiant DL and ML servers implement several technologies for handling high-temperature situations resulting from a fan
failure. Depending on the server model, what the server was doing when the fan failed, and the system configuration, these
technologies behave differently, as described in the scenarios in this article.
HP ProLiant features for self-managing the thermal environment
This article focuses on the thermal management features of the ProLiant 300 and 500 series DL and ML servers that use the
Integrated Lights-Out (iLO) management controller. The iLO controller resides on the system board of a host server. It contains its
own management processor, memory, and network interface that allow it to operate independently from the host server. Among
other features, the iLO controller monitors the actual temperatures within the system using thermal sensors placed to protect
essential components. In contrast, the ProLiant 100 series ML and DL servers use the ServerEngines Pilot BMC rather than the
iLO controller.
The HP ProLiant iLO2 Management Controller Driver, referred to as the health driver (Windows) or hpasm package (Linux),
determines the presence and status of the fans in the system and reports their redundancy status in the OS logs. All temperature
monitoring and fan control by the iLO2 controller takes place regardless of the state of the host operating system (OS). The
main functions of the health driver are to make environmental information available to processes running on the host (for
example, HP Systems Insight Manager and Insight Agents) and to complete tasks involving the host OS, such as a graceful
shutdown.
Advanced Configuration and Power Interface (ACPI) shutdown
If a ProLiant DL or ML server experiences a component failure that requires a system shutdown, the health driver (if configured to
do so) can initiate a shutdown. This shutdown will occur as if a system administrator had initiated it. If the health driver is not
running, then the iLO firmware will simulate a power button press by using the Advanced Configuration and Power Interface
(ACPI) mechanism. However, the shutdown is not triggered if the management console is locked. In this case, the system will
continue to operate unless a critical temperature is reached, which will trigger a loss of power to the system.
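The decision path described above can be sketched as a small function. This is only an illustrative model of the prose, not HP firmware or driver code: the names `Outcome` and `shutdown_outcome` are hypothetical, and two points are assumptions rather than statements from the article: that the locked-console condition applies to the ACPI power button path, and that a running health driver that is not configured to shut down leaves the system running until a critical temperature is reached.

```python
from enum import Enum


class Outcome(Enum):
    HEALTH_DRIVER_SHUTDOWN = "health driver initiates an OS shutdown"
    ACPI_POWER_BUTTON = "iLO firmware simulates a power button press (ACPI)"
    KEEP_RUNNING = "system continues to operate"
    POWER_LOSS = "critical temperature reached: power is removed"


def shutdown_outcome(driver_running, driver_may_shut_down,
                     console_locked, critical_temperature):
    """Model the response to a failure that requires a system shutdown."""
    # A running health driver shuts the OS down only if configured to do so.
    if driver_running and driver_may_shut_down:
        return Outcome.HEALTH_DRIVER_SHUTDOWN
    # Without the driver, iLO falls back to a simulated power button press.
    # Assumption: a locked console blocks this path.
    if not driver_running and not console_locked:
        return Outcome.ACPI_POWER_BUTTON
    # No graceful path is available: the system keeps running until a
    # critical temperature forces a loss of power.
    if critical_temperature:
        return Outcome.POWER_LOSS
    return Outcome.KEEP_RUNNING
```

For example, `shutdown_outcome(False, False, True, False)` models a server with no health driver and a locked console: it keeps running until a critical temperature is reached.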
In general, ProLiant servers attempt to operate in degraded conditions as long as possible without risking data corruption. In the
unlikely event that multiple fans fail, or in conditions where appropriate cooling cannot be maintained, the health driver and/or
iLO controller will attempt to shut down the OS to help prevent data corruption, data loss, and system failure. When the health
driver is operational, it tells the iLO controller to initiate a graceful shutdown. This shutdown takes effect 60 seconds after it has
been determined that appropriate cooling conditions cannot be maintained. When the health driver is not operational, the iLO
controller implements a graceful shutdown through the power button and ACPI mechanisms.
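As a rough illustration of the timing described above, the sketch below computes when a graceful shutdown would take effect. The function name `plan_graceful_shutdown` and its return shape are hypothetical. The 60-second delay comes from the article and is stated only for the health driver path; the article gives no delay for the iLO-only path, so the sketch returns no deadline in that case.

```python
from datetime import datetime, timedelta

# Delay stated in the article for the health driver path.
GRACEFUL_SHUTDOWN_DELAY = timedelta(seconds=60)


def plan_graceful_shutdown(cooling_lost_at, health_driver_operational):
    """Return (mechanism, time the shutdown takes effect or None).

    cooling_lost_at is the moment it was determined that appropriate
    cooling conditions cannot be maintained.
    """
    if health_driver_operational:
        # The health driver tells the iLO controller to shut down gracefully.
        return ("health driver requests graceful shutdown via iLO",
                cooling_lost_at + GRACEFUL_SHUTDOWN_DELAY)
    # The article does not state a delay for this path, so none is assumed.
    return ("iLO uses the power button and ACPI mechanisms", None)
```

Calling `plan_graceful_shutdown(datetime(2008, 1, 1, 12, 0, 0), True)` yields a shutdown time one minute after the cooling loss was detected.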
ProLiant DL and ML servers with redundant fans respond to fan failures based upon the number of fans that fail, the state of the
server when the failure occurs, and the presence and configuration of the health driver.