Retrieval-Augmented Graph Reasoning with Large Language Models for Explainable Incident Diagnosis
DOI:
https://doi.org/10.71465/fair531

Keywords:
Retrieval-Augmented Generation, Large Language Models, Graph Neural Networks, Incident Diagnosis, Root Cause Analysis, Explainable AI, Causal Reasoning

Abstract
The increasing complexity of modern distributed systems has created significant challenges in incident diagnosis and root cause analysis. Traditional approaches often lack explainability and struggle with the dynamic nature of system failures, while pure machine learning methods suffer from limited interpretability and contextual understanding. This paper proposes a novel framework that integrates Retrieval-Augmented Generation (RAG) with graph-based reasoning and Large Language Models (LLMs) to enable explainable incident diagnosis in complex systems. The proposed approach leverages knowledge graphs to capture causal relationships among system components, employs retrieval mechanisms to access relevant historical incident data, and utilizes LLMs to generate human-interpretable explanations for diagnosed incidents. Through comprehensive evaluation on real-world incident datasets, our method demonstrates superior performance in fault localization accuracy, achieving 92.3% precision while providing transparent reasoning paths that enhance engineer trust and accelerate remediation workflows. The framework addresses critical limitations in existing approaches by combining the structured reasoning capabilities of graph neural networks with the semantic understanding and generation abilities of large language models, thereby advancing the state-of-the-art in intelligent operations and system reliability engineering.
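The pipeline the abstract outlines — retrieve similar historical incidents, walk a causal knowledge graph to candidate root causes, then hand an LLM a transparent reasoning prompt — can be sketched in miniature. Everything below is a hypothetical illustration: the component names, the toy incident records, the Jaccard-overlap retriever, and the prompt format are all assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the diagnosis pipeline described in the abstract:
# (1) retrieve similar past incidents, (2) traverse a causal knowledge
# graph upstream from the failing component, (3) build an explainable
# prompt for an LLM. All data and scoring choices are illustrative.
from collections import deque

# Toy causal knowledge graph: edge A -> B means "a fault in A can cause B".
CAUSAL_GRAPH = {
    "disk": ["database"],
    "database": ["api-service"],
    "network": ["api-service"],
    "api-service": ["frontend"],
}

# Toy historical incident store (stand-in for the retrieval corpus).
HISTORICAL_INCIDENTS = [
    {"symptoms": {"frontend", "latency"}, "root_cause": "database"},
    {"symptoms": {"frontend", "timeout"}, "root_cause": "network"},
    {"symptoms": {"api-service", "errors"}, "root_cause": "disk"},
]

def retrieve(symptoms, k=2):
    """Rank past incidents by Jaccard overlap with the observed symptoms."""
    def score(inc):
        s = inc["symptoms"]
        return len(s & symptoms) / len(s | symptoms)
    return sorted(HISTORICAL_INCIDENTS, key=score, reverse=True)[:k]

def upstream_causes(component):
    """BFS backwards along causal edges to collect candidate root causes."""
    reverse = {}
    for src, dsts in CAUSAL_GRAPH.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    seen, queue = set(), deque([component])
    while queue:
        node = queue.popleft()
        for parent in reverse.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def diagnose(failing, symptoms):
    """Combine graph-structural and retrieval evidence, then draft a prompt
    that exposes the reasoning path for an LLM to verbalize."""
    candidates = upstream_causes(failing)
    for inc in retrieve(symptoms):
        if inc["root_cause"] in candidates:
            prompt = (
                f"Component '{failing}' is failing with symptoms "
                f"{sorted(symptoms)}. A similar past incident was resolved "
                f"at '{inc['root_cause']}'. Explain the causal path."
            )
            return inc["root_cause"], prompt
    return None, ""
```

The key design point mirrored here is that the graph constrains the hypothesis space (only structurally plausible upstream components qualify) while retrieval supplies historical evidence, and the LLM is used only to articulate a reasoning path already grounded in both — which is what makes the explanation auditable.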
License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.