AI-DRIVEN ROOT CAUSE ANALYSIS FRAMEWORK FOR DISTRIBUTED MICROSERVICES ARCHITECTURES

Authors

  • Awodele S. O.; Faruna J. O.; Mustapha M. M.; Ojuawo O. O.; Olorunyomi O. B.; Chukwulobe I.; Fayemi T. A.

Keywords:

Root Cause Analysis (RCA), RCASage, AI-Driven, AI-Augmented, Causal Inference, Distributed Systems, Graph Neural Networks, Graph-based Deep Learning, Anomaly Detection, Software Defect Prediction, Continuous Integration/Continuous Deployment (CI/CD), Microservices Architecture, Explainable AI (XAI), CI/CD Proactive Monitoring

Abstract

Multi-layered IT systems suffer from persistent downtime, which imposes significant operational burdens through high costs and prolonged troubleshooting. Traditional root cause analysis (RCA) techniques are largely ineffective because they are predominantly reactive and manual, making them unsuitable for the scale and complex interdependencies of modern microservices architectures. This study addresses the critical “observability–complexity gap,” a condition in which contemporary monitoring tools generate vast amounts of correlated data but fail to provide true causal resolution. To address this challenge, the study introduces RCASage, a novel AI-augmented inference engine designed for autonomous fault identification. The core innovation of RCASage lies in its hybrid architecture, a three-stage pipeline comprising: (1) multi-modal telemetry data ingestion and dynamic dependency graph construction; (2) unsupervised anomaly detection using LSTM autoencoders combined with Natural Language Processing (NLP) classifiers; and (3) an Autonomous Inference Engine (AIE). This engine integrates Graph Neural Networks (GNNs) with the Neural Granger Causal Discovery algorithm to distinguish true root causes from downstream symptomatic effects. By shifting RCA from symptom correlation to causal inference, RCASage surpasses existing approaches such as beta-binomial inference and event-graph-based systems (e.g., GROOT). Furthermore, it bridges the gap between development and operations by incorporating Just-in-Time (JIT) defect prediction into the CI/CD pipeline. Empirical evaluation demonstrates that the proposed AI-augmented framework reduces Mean Time to Resolution (MTTR) by over 90% compared to traditional manual approaches. In addition, RCASage advances the RCA paradigm by embedding Explainable AI (XAI) principles, providing transparent causal explanations alongside visual dependency graphs. This positions RCASage as an intelligent, evidence-based, and autonomous solution for root cause analysis in contemporary digital infrastructures.
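
As a rough illustration of the distinction the abstract draws between root causes and downstream symptomatic effects, the sketch below walks a service dependency graph and keeps only those anomalous services whose own dependencies look healthy. This is a deliberately simple graph heuristic, not the GNN and Neural Granger Causal Discovery method that RCASage uses; the function name, the service names, and the anomaly set are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set


def root_cause_candidates(
    depends_on: Dict[str, List[str]], anomalous: Set[str]
) -> List[str]:
    """Return anomalous services none of whose direct dependencies are anomalous.

    Assumption: an anomaly in a service that calls another anomalous service
    is treated as a downstream symptom rather than a root cause.
    """
    candidates = []
    for service in sorted(anomalous):
        deps = depends_on.get(service, [])
        if not any(dep in anomalous for dep in deps):
            candidates.append(service)
    return candidates


# Hypothetical call graph: each service maps to the services it calls.
depends_on = {
    "frontend": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "payments": ["db"],
    "inventory": ["db"],
    "catalog": [],
    "db": [],
}

# Hypothetical output of an anomaly detection stage.
anomalous = {"frontend", "checkout", "payments", "db"}

print(root_cause_candidates(depends_on, anomalous))  # prints ['db']
```

In this toy case the anomalies in frontend, checkout, and payments are all explained by the anomalous db service they ultimately depend on. The heuristic ignores timing and has no answer for cyclic dependencies among anomalous services, which is part of why learned causal models are of interest for this problem.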

Published

2026-01-29