Thermal Anomaly Detection in Data Centers

Participants: 

Eren Erdogan
David Janusz
Timothy Sawma

Advisor: 
Prof. Dario Pompili



Introduction

The goal of our project is to design a system that uses a thermal camera to detect thermal anomalies in a data center in real time. Servers must be kept cool to function properly, so preventing overheating is essential. Overheating servers can cause major problems, such as network failures or delayed connections, for schools, companies, hospitals, and other institutions. These institutions cannot afford server crashes: much of their work is time sensitive, and a crashed server hinders their efficiency and productivity. To prevent this, a system must be in place to detect anomalies in server rooms as they occur. We address the problem by using a thermal camera to capture a live feed of a rack of servers. The main focus is where each thermal anomaly occurs and how hot the affected spot is. This research is important because the demand for servers continues to increase.

Motivation
A server can overheat because it is overloaded with data, because the room is warm, or because it absorbs heat from another server on its rack. To prevent server rooms from getting too warm, a lot of money is spent on air conditioning to keep the entire room cold. However, if a system is in place to notify an administrator of an anomaly in the server room, these companies will be at less risk of system crashes. Such a system must work in real time and detect both when and where an anomaly occurs. This anomaly detection system will help administrators maintain the servers and keep them operating at optimum efficiency.

Design
The system we designed uses a thermal camera to capture a live feed of a rack of servers and gathers data on the temperature of each pixel. Gathering data from the live stream provides more information about how the temperature behaves in different areas of the room. Each frame is processed, and the regions where a thermal anomaly is occurring are colored red in the live stream, which helps determine the location of the anomaly. The system also produces a cropped video that zooms in on the hotspot, because the data is better represented by focusing on the hotspot than on the entire frame.
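
The per-frame processing described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes each frame arrives as a grid of per-pixel temperatures in degrees C, that a pixel counts as anomalous when it is strictly above the 27 degrees C threshold, and that the crop is the bounding box of all hot pixels. The names `hot_mask`, `hotspot_box`, and `crop` are hypothetical.

```python
from typing import List, Optional, Tuple

Frame = List[List[float]]        # rows of per-pixel temperatures in degrees C
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive

THRESHOLD_C = 27.0  # anomaly threshold reported in the Results section


def hot_mask(frame: Frame, threshold: float = THRESHOLD_C) -> List[List[bool]]:
    """Mark every pixel hotter than the threshold (assumed: strictly above).

    These are the pixels that would be drawn in red on the live stream.
    """
    return [[t > threshold for t in row] for row in frame]


def hotspot_box(mask: List[List[bool]]) -> Optional[Box]:
    """Return the bounding box of all hot pixels, or None if none are hot."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, hot in enumerate(row) if hot]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the frame down to the hotspot's bounding box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in frame[top:bottom + 1]]
```

For a frame with a small warm patch, `hotspot_box(hot_mask(frame))` gives the location of the anomaly, and `crop(frame, box)` gives the zoomed-in region from which further statistics can be computed.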

Results
The histogram, skewness, and kurtosis graphs all update live alongside the video, so the user can see how the data changes. These graphs are computed from the cropped video to make them more accurate. The threshold selected for a thermal anomaly is 27 degrees C, which is marked on the histogram. The last part of the implementation is the alert system: an alert appears on the screen once an anomaly occurs at a particular server. The alert system is also designed to immediately send an email telling the user when and where the anomaly occurred.
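
As an illustration of the statistics and alert described above, the sketch below computes a histogram, skewness, and kurtosis for the temperatures in one cropped frame and builds an alert email. It rests on several assumptions not stated in the report: skewness and kurtosis are taken as the population (central-moment) definitions with non-excess kurtosis, the histogram uses equal-width bins, and the function names, the server label, and the address are all hypothetical. Sending the message (for example through an SMTP server) is left out.

```python
import math
from email.message import EmailMessage
from typing import List, Sequence, Tuple


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[int]:
    """Count values in `bins` equal-width bins spanning [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            # Clamp the index so that v == hi lands in the last bin.
            counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def skew_kurtosis(values: Sequence[float]) -> Tuple[float, float]:
    """Population skewness and non-excess kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return math.nan, math.nan  # both are undefined for constant data
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def build_alert(server: str, when: str, max_temp_c: float, to_addr: str) -> EmailMessage:
    """Build (but do not send) an email saying when and where the anomaly occurred."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = f"Thermal anomaly at {server}"
    msg.set_content(f"{when}: {server} reached {max_temp_c:.1f} degrees C")
    return msg
```

A live display would call `histogram` and `skew_kurtosis` on the flattened pixel temperatures of each cropped frame and redraw the graphs, and would call `build_alert` the first time a frame contains pixels above the threshold.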

Conclusion
The main focus of this project is to create a real-time system that determines when and where thermal anomalies occur, as well as their intensity.