Open Access

Adaptive Chaos Engineering and AI-Driven Dependability Modeling for Resilient Cloud-Native and Safety-Critical Systems

Department of Computer Science, University of Lyon, France

Abstract

The increasing reliance on cloud-native architectures, serverless computing, and artificial intelligence-driven systems has introduced new complexities in ensuring system dependability, resilience, and safety. Traditional reliability engineering approaches, while foundational, are often insufficient to address the dynamic, distributed, and failure-prone nature of modern cloud ecosystems. This research presents a comprehensive, theoretically grounded framework that integrates chaos engineering, machine learning-based reliability modeling, and human-centered safety principles to enhance system robustness across cloud-native and safety-critical domains, including healthcare and autonomous systems.

The study synthesizes interdisciplinary perspectives from cloud computing, dependability engineering, fault injection methodologies, and AI-based safety analysis. It explores how experimental fault injection, particularly through chaos engineering practices, can be combined with predictive analytics to proactively identify and mitigate system vulnerabilities. Furthermore, the research emphasizes the importance of realism in error injection, the role of serverless architectures in resilience testing, and the integration of human factors in safety-critical environments.
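To make the combination of experimental fault injection and steady-state verification concrete, the following is a minimal, hypothetical sketch (not from the article) of a chaos-engineering experiment: a service wrapper injects timeouts and extra latency at a configurable rate, and the experiment compares measured behavior against a baseline. The `flaky_service` and `run_experiment` names, fault rates, and latency figures are illustrative assumptions.

```python
import random
import statistics

def flaky_service(call, fault_rate=0.0, extra_latency_ms=0):
    """Wrap a service call, injecting timeouts and latency (illustrative)."""
    def wrapped():
        if random.random() < fault_rate:
            raise TimeoutError("injected fault")
        return call() + extra_latency_ms
    return wrapped

def run_experiment(service, requests=1000):
    """Drive the service and record error rate and median latency."""
    latencies, errors = [], 0
    for _ in range(requests):
        try:
            latencies.append(service())
        except TimeoutError:
            errors += 1
    return {"error_rate": errors / requests,
            "median_latency_ms": statistics.median(latencies)}

random.seed(42)  # reproducible experiment runs
baseline = run_experiment(flaky_service(lambda: 20))
chaotic = run_experiment(flaky_service(lambda: 20, fault_rate=0.05,
                                       extra_latency_ms=30))
# Steady-state hypothesis: under injected faults, the error rate must
# stay below an agreed threshold and latency within an agreed budget.
print(baseline, chaotic)
```

In a real deployment the wrapper would be replaced by platform-level fault injection (network partitions, instance kills), but the structure of the experiment, baseline measurement, controlled perturbation, and hypothesis check, stays the same.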

A qualitative, theory-driven methodology is employed to construct a unified framework that bridges gaps between cloud system resilience and safety engineering in domains such as healthcare. The findings suggest that integrating chaos engineering with machine learning enhances predictive fault detection, deepens understanding of failure propagation, and supports adaptive system recovery mechanisms. Additionally, the study highlights that human-centered design and error taxonomy integration significantly contribute to reducing systemic risks in critical infrastructures.
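One way predictive fault detection can build on fault-injection data is to train a classifier on telemetry gathered during chaos experiments. The sketch below is a hypothetical, stdlib-only illustration (not the article's model): a logistic regression fitted by stochastic gradient descent to synthetic telemetry features (CPU utilization and recent error count), predicting whether a failure occurs in the next window. The feature choices, labeling rule, and hyperparameters are all assumptions for illustration.

```python
import math
import random

random.seed(0)

# Synthetic telemetry: (cpu_util, error_count) -> failure in next window?
# Labeling rule is an invented ground truth for this demo.
def sample():
    cpu, errs = random.random(), random.random()
    label = 1 if 2.0 * cpu + 3.0 * errs - 2.5 > 0 else 0
    return (cpu, errs), label

data = [sample() for _ in range(500)]

# Logistic regression trained with plain stochastic gradient descent.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(300):
    for (x1, x2), y in data:
        p = 1 / (1 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
        g = p - y  # gradient of the logistic loss w.r.t. the logit
        w[0] -= lr * g * x1
        w[1] -= lr * g * x2
        b -= lr * g

def predict(cpu, errs):
    """Estimated probability of failure in the next window."""
    return 1 / (1 + math.exp(-(w[0] * cpu + w[1] * errs + b)))

print(predict(0.9, 0.9), predict(0.1, 0.1))  # high-risk vs. low-risk state
```

In practice the same pattern scales up: features come from production monitoring, labels come from observed or injected failures, and the risk score feeds adaptive recovery actions such as pre-emptive failover or traffic shedding.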

The proposed framework offers a novel contribution by aligning chaos engineering practices with AI-driven reliability assessment and safety assurance principles. It provides a scalable and adaptable approach for organizations seeking to build resilient, trustworthy, and high-performance systems in increasingly complex technological landscapes.



