TY - JOUR
T1 - A review of major ICT failures and recovery strategies
T2 - Strengthening digital resilience
AU - Adel, Amr
AU - Alani, Noor H.S.
AU - Jan, Tony
AU - Prasad, Mukesh
N1 - Publisher Copyright: © 2025 Elsevier Ltd
PY - 2025/12
Y1 - 2025/12
N2 - This paper presents a comprehensive, cross-sector analysis of large-scale ICT failures to address the persistent gap in understanding how systemic digital breakdowns occur and propagate across platforms and industries. Through a comparative study of seven major global outages (2019–2024) — selected based on scale, technical transparency, and platform diversity — we identify recurring vulnerabilities in automation governance, configuration management, centralized infrastructure, and incident response. Using a custom analytical framework grounded in socio-technical and resilience engineering theory, the paper maps failure propagation patterns and derives a taxonomy of technical and organizational failure modes. We empirically validate a suite of resilience strategies — including rollback automation, configuration-as-code, SOAR-enabled response orchestration, and chaos engineering — and demonstrate how they address failure propagation pathways observed in real-world incidents. A conceptual model for decentralized system upgrade planning is introduced, incorporating microservice segmentation, dependency mapping, and AI-assisted fault containment. The paper culminates in a forward-looking digital resilience roadmap that integrates predictive analytics, secure software supply chains, and adaptive human–machine collaboration. Core contributions include: (1) a cross-case classification of failure archetypes, (2) evidence-based design patterns for resilience, and (3) actionable frameworks for infrastructure operators and researchers working towards next-generation ICT robustness.
KW - AI-driven recovery
KW - Automation failures
KW - Comparative review
KW - Cybersecurity infrastructure
KW - Digital resilience
KW - ICT outages
KW - Incident response
UR - https://www.scopus.com/pages/publications/105016779119
U2 - 10.1016/j.cose.2025.104678
DO - 10.1016/j.cose.2025.104678
M3 - Review article
AN - SCOPUS:105016779119
SN - 0167-4048
VL - 159
JO - Computers and Security
JF - Computers and Security
M1 - 104678
ER -