TY - JOUR
T1 - A comprehensive survey of machine learning and deep learning approaches for anomaly detection in high-performance computing systems
AU - Ki, Cibin
AU - Sivakumar, Ramah
AU - Mulerikkal, Jaison
AU - Binu, A.
AU - Gupta, Manish
AU - Jan, Tony
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2025/6
Y1 - 2025/6
AB - Anomaly detection is crucial in high-performance computing (HPC) systems for maintaining effective, efficient, and secure operations. This survey reviews the current state of machine learning and deep learning applied to HPC systems for detecting performance, operational, and security anomalies. It examines current approaches based on a range of machine learning and deep learning techniques, the significance and challenges of anomaly detection in HPC systems, and the factors that the surveyed research considers when evaluating the performance of these systems. It also explores tools and frameworks built with these techniques and tailored specifically to HPC systems. Finally, it identifies the limitations of existing models and, based on them, suggests directions for further research. The findings should be useful to researchers and professionals specializing in anomaly detection within HPC systems.
KW - Anomaly detection
KW - Deep learning
KW - High-performance computing
KW - HPC systems
KW - Machine learning
UR - https://www.scopus.com/pages/publications/105007953368
U2 - 10.1007/s11227-025-07503-4
DO - 10.1007/s11227-025-07503-4
M3 - Article
AN - SCOPUS:105007953368
SN - 0920-8542
VL - 81
JO - Journal of Supercomputing
JF - Journal of Supercomputing
IS - 8
M1 - 1032
ER -