An Intelligent Fault Self-Healing Mechanism for Cloud AI Systems via Integration of Large Language Models and Deep Reinforcement Learning

Ze Yang, Yihong Jin, Juntian Liu, Xinhe Xu · June 09, 2025

Summary

The Intelligent Fault Self-Healing Mechanism (IFSHM) combines Large Language Models (LLMs) and Deep Reinforcement Learning (DRL) to recover cloud AI systems from faults. It uses a two-stage hybrid architecture: an LLM stage identifies the fault mode, and a DRL stage then performs dynamic fault type matching. A memory-guided meta-controller lets the mechanism adapt to new failure modes, and the method is reported to improve exploration efficiency and to generalize across fault types. Compared with existing methods, the IFSHM significantly shortens system recovery time, especially in unknown fault scenarios.
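
The two-stage flow described above can be illustrated with a toy sketch. It is not the authors' implementation: Stage 1 uses keyword rules as a placeholder for an LLM call, and Stage 2 uses tabular Q-learning as a simple stand-in for DRL. The fault types, recovery actions, keywords, and reward values are all hypothetical names and numbers chosen for illustration.

```python
import random

# Hypothetical fault types and recovery actions, chosen only for illustration.
FAULT_TYPES = ["oom", "network_timeout", "disk_full"]
ACTIONS = ["restart_service", "reroute_traffic", "clean_storage"]

# Toy ground truth used by the simulator: which action repairs which fault.
CORRECT_ACTION = {
    "oom": "restart_service",
    "network_timeout": "reroute_traffic",
    "disk_full": "clean_storage",
}

# Assumed log phrases per fault type (a real system would query an LLM instead).
KEYWORDS = {
    "oom": ("out of memory",),
    "network_timeout": ("timed out", "timeout"),
    "disk_full": ("no space left", "disk full"),
}


def identify_fault_mode(log_line: str) -> str:
    """Stage 1 placeholder: keyword rules standing in for an LLM call."""
    text = log_line.lower()
    for fault, phrases in KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return fault
    return "unknown"


def train_policy(episodes: int = 500, alpha: float = 0.5,
                 epsilon: float = 0.2, seed: int = 0) -> dict:
    """Stage 2 stand-in: tabular Q-learning over (fault type, action) pairs."""
    rng = random.Random(seed)
    q = {(f, a): 0.0 for f in FAULT_TYPES for a in ACTIONS}
    for _ in range(episodes):
        fault = rng.choice(FAULT_TYPES)
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore
        else:
            action = max(ACTIONS, key=lambda a: q[(fault, a)])  # exploit
        reward = 1.0 if CORRECT_ACTION[fault] == action else -1.0
        # Episodes are one step long, so there is no bootstrapped next-state term.
        q[(fault, action)] += alpha * (reward - q[(fault, action)])
    return q


def recover(log_line: str, q: dict):
    """Run both stages: identify the fault, then pick the best-valued action."""
    fault = identify_fault_mode(log_line)
    if fault == "unknown":
        return fault, None
    return fault, max(ACTIONS, key=lambda a: q[(fault, a)])
```

Calling `q = train_policy()` and then `recover("write failed: No space left on device", q)` returns the identified fault type together with the action the learned table values most for it. A log line that matches no keyword returns `("unknown", None)`, which is the case the memory-guided meta-controller is meant to handle.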

Introduction
  Background
    Overview of cloud AI systems and their challenges
    Importance of fault self-healing in maintaining system reliability
  Objective
    To present the IFSHM as a novel approach for enhancing fault self-healing in cloud AI systems
Method
  Two-Stage Hybrid Architecture
    Stage 1: Accurate Fault Mode Identification
      Utilization of Large Language Models (LLMs)
      Techniques for extracting fault signatures
    Stage 2: Dynamic Fault Type Matching
      Application of Deep Reinforcement Learning (DRL)
      Strategies for efficient exploration and exploitation
  Memory-Guided Meta-Controller
    Role in learning from past failures
    Mechanism for adapting to new failure modes
  Exploration Efficiency and Generalization
    Enhancements in exploration efficiency
    Generalization capabilities across different fault types
Results
  System Recovery Time Improvement
    Comparative analysis with existing methods
    Performance metrics for system recovery time
  Unknown Fault Scenarios
    Case studies demonstrating adaptability
    Quantitative results on handling unseen faults
Conclusion
  Summary of Contributions
    Key innovations of the IFSHM
  Future Work
    Potential areas for further research
    Integration with emerging technologies
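
The memory-guided meta-controller listed under Method learns from past failures and adapts to new failure modes. One way to picture that role is the sketch below. It is an assumption-laden illustration, not the paper's mechanism: the token-set Jaccard similarity, the `threshold` value, and the names `EpisodeMemory` and `choose_action` are all hypothetical.

```python
from dataclasses import dataclass, field


def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap between two token sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class EpisodeMemory:
    """Stores (fault signature tokens, action, recovered) records of past incidents."""
    records: list = field(default_factory=list)

    def add(self, signature: str, action: str, recovered: bool) -> None:
        tokens = frozenset(signature.lower().split())
        self.records.append((tokens, action, recovered))

    def best_match(self, signature: str):
        """Return (similarity, action) for the closest successful past recovery."""
        query = frozenset(signature.lower().split())
        successes = [(jaccard(query, sig), act)
                     for sig, act, ok in self.records if ok]
        return max(successes, key=lambda s: s[0], default=(0.0, None))


def choose_action(memory: EpisodeMemory, signature: str,
                  fallback_action: str, threshold: float = 0.5):
    """Reuse a remembered fix for familiar faults; otherwise defer to the
    fallback (for example, an exploring DRL policy) so new modes get explored."""
    similarity, action = memory.best_match(signature)
    if action is not None and similarity >= threshold:
        return action, "memory"
    return fallback_action, "explore"


memory = EpisodeMemory()
memory.add("gpu worker out of memory", "restart_service", True)
memory.add("gpu worker out of memory", "reroute_traffic", False)

familiar = choose_action(memory, "gpu worker out of memory again", "probe")
unseen = choose_action(memory, "certificate expired on ingress", "probe")
```

Here `familiar` reuses the stored successful action because the signatures overlap heavily, while `unseen` falls back to exploration. Once the outcome of that exploration is written back with `memory.add(...)`, the same fault is handled from memory the next time it appears.
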