Open Access

Bridging The Gap: A Strategic Framework for Integrating Site Reliability Engineering with Legacy Retail Infrastructure

School of Computing, National University of Singapore, Singapore
Faculty of Computer Science, Higher School of Economics, Moscow, Russia

Abstract

Background: The retail sector faces intense pressure to ensure high availability and low latency, especially during peak traffic events. However, many established retailers operate on complex, monolithic legacy infrastructures that are inherently resistant to modern DevOps practices. Site Reliability Engineering (SRE), pioneered in cloud-native environments, offers a compelling model for managing reliability, yet its application in 'brownfield' legacy contexts is poorly understood.

Objectives: This study aims to (1) analyze the socio-technical friction points when implementing SRE principles within legacy retail organizations and (2) propose and evaluate a phased framework for this transition.

Methods: We employed a qualitative, multi-case study methodology, analyzing three anonymized retail organizations (grocery, e-commerce, department store) undergoing SRE adoption. Data was collected through 30 semi-structured interviews with engineering and leadership staff, supplemented by an analysis of internal documentation (postmortems, roadmaps, and monitoring data). We analyzed these cases through the lens of a proposed three-phase implementation framework: (1) Stabilize & Observe, (2) Automate & Abstract, and (3) Modernize & Scale.

Results: The findings indicate that the most significant barriers are cultural rather than technical, particularly resistance to blameless postmortems and to the adoption of error budgets. Defining meaningful Service Level Objectives (SLOs) for monolithic applications emerged as a complex initial hurdle. However, the study found that SRE-derived data (SLO breach reports, toil logs) provided a critical, objective language for prioritizing technical debt and de-risking modernization efforts, such as API abstraction and the introduction of new microservices.

Conclusion: SRE is a viable and necessary strategy for legacy retail, acting as a catalyst for incremental modernization. Successful adoption hinges on adapting SRE principles, prioritizing cultural change alongside technical automation, and using SRE metrics to bridge the divide between operations and development.


Keywords

