Revelens: AI-Powered Kubernetes Operations
Table of Contents
Revelens: From Reactive RCA to Proactive Optimization #
Revelens is an AI-driven platform that revolutionizes how Site Reliability Engineers and DevOps teams manage Kubernetes at scale, moving beyond reactive root cause analysis into proactive performance optimization.
Project Website: revelens.dev
GitHub Repository: revelens
Phase 2 Completion: Optimization Scenarios Framework #
What’s New #
Beyond failure injection and RCA, Revelens now includes optimization scenarios that challenge LLMs to improve measurable KPIs in working-but-suboptimal deployments. This enables systematic evaluation of how well different LLMs reason about performance bottlenecks, apply targeted changes, and converge on SLA targets within bounded turn limits.
Key Achievement: Multi-turn optimization feedback loop where LLMs apply a change and immediately see its KPI impact, enabling iterative refinement.
Architecture Highlights #
- Multi-Turn Feedback Loop: LLMs apply optimizations, measure KPI impact, and decide next steps in the same turn via the apply_kubectl_and_measure tool
- Scenario-Agnostic Visualization: Generic visualization system adapts to any KPI type (latency, throughput, error rate, cost) by reading metric definitions from scenario metadata
- Constraint Satisfaction: Optimization must respect resource constraints (e.g., CPU limits, memory caps) while improving performance
- Safe Remediation: All kubectl commands validated against dangerous patterns before execution
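The command-safety step can be sketched as a simple denylist check. The patterns below are illustrative assumptions, not the actual rules Revelens ships with:

```python
import re

# Hypothetical denylist of dangerous kubectl patterns (illustrative only;
# the real validation rules live in the Revelens engine).
DANGEROUS_PATTERNS = [
    r"\bdelete\s+(ns|namespace)\b",  # deleting namespaces
    r"\bdrain\b",                    # draining nodes
    r"--all(\s|$)",                  # bulk operations
    r"\bdelete\b.*\bpvc?\b",         # removing persistent volumes/claims
]

def is_safe_kubectl(cmd: str) -> bool:
    """Return False if the command matches any dangerous pattern."""
    return not any(re.search(p, cmd) for p in DANGEROUS_PATTERNS)
```

A targeted resource change like `kubectl set resources deployment/nrf --limits=cpu=250m` passes this check, while `kubectl delete namespace open5gs` is rejected.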
Initial Scenario: OPT-1 NRF SBI Latency #
Deployment: Open5GS 5G Core
Problem: NRF throttled to 50m CPU → cascading SBI latency across network functions
Objectives:
- Reduce p99 SBI latency from baseline (~600ms) to < 100ms
- Keep error rate below 1%
- Maintain NRF CPU limit ≤ 500m
Ground Truth Fix: Increase NRF CPU limit from 50m to 250–500m
Multi-Turn Example:
Turn 1: Measure baseline KPIs → identify high p99 latency
Turn 2: Check resource usage → see NRF CPU at 100% (throttled)
Turn 3: Apply CPU limit increase → re-measure → verify SLA satisfaction
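The turn sequence above can be sketched as a minimal loop. The function names and SLA shape here are illustrative, not the engine's actual API:

```python
def optimize(apply_and_measure, baseline_kpis, sla, max_turns=5):
    """Minimal multi-turn optimization loop: check the SLA, let the LLM
    apply a change and re-measure, repeat until satisfied or out of turns."""
    kpis = dict(baseline_kpis)
    for turn in range(1, max_turns + 1):
        if all(kpis[k] <= limit for k, limit in sla.items()):
            return turn - 1, kpis              # turns used, final KPIs
        kpis = apply_and_measure(turn, kpis)   # e.g. raise the NRF CPU limit
    return max_turns, kpis
```

With a stub that halves p99 latency each turn, a 600 ms baseline reaches the 100 ms target in three turns.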
Framework Components #
Optimization Engine #
- Loads scenario YAML with objective function, KPI thresholds, and constraints
- Applies baseline manifests to inject suboptimal state
- Orchestrates multi-turn LLM loop with real-time KPI feedback
- Automatic visualization generation at completion
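A scenario spec of this kind might look as follows. The field names are an assumed shape for illustration, not the exact Revelens schema:

```yaml
# Hypothetical optimization scenario spec (illustrative field names)
id: opt-1
description: NRF SBI latency under CPU throttling
objective:
  kpis:
    - name: sbi_p99_latency_ms
      target: "< 100"
    - name: sbi_error_rate
      target: "< 0.01"
constraints:
  - resource: nrf.cpu.limit
    max: 500m
baseline_manifests:
  - nrf-throttled.yaml   # injects the suboptimal 50m CPU limit
max_turns: 5
```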
KPI Tools #
- measure_sbi_latency(): HTTP/2 SBI latency probes via kubectl exec; returns p50/p95/p99, error rate, and constraint satisfaction
- apply_kubectl_and_measure(): Apply a kubectl change, wait for rollout, and measure KPIs in a single turn
- get_nf_resource_usage(): Show configured limits and live kubectl top metrics
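As a rough illustration of what a latency-probe summary could return, the sketch below computes nearest-rank percentiles and an error rate from raw samples (an assumption about the output shape, not the tool's actual implementation):

```python
def summarize_latencies(samples_ms, failures=0):
    """Summarize latency probes: percentiles over successful samples
    plus an overall error rate. Illustrative, not the Revelens code."""
    xs = sorted(samples_ms)

    def pct(p):
        # nearest-rank percentile over the sorted samples
        idx = max(0, min(len(xs) - 1, round(p / 100 * len(xs)) - 1))
        return xs[idx]

    total = len(xs) + failures
    return {
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
        "error_rate": failures / total if total else 0.0,
    }
```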
Visualization #
Fully generic plots adapt to scenario KPIs:
- KPI Trajectories: Per-metric line plots with threshold markers
- Objective Timeline: Step plot showing when objective became satisfied
- Improvement Bars: Baseline vs. final values per KPI
- Model Comparison: Grouped bars across multiple runs
Running Optimization Scenarios #
Prerequisites #
cd revelens
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements_revelens.txt
export OPENROUTER_API_KEY="your-key" # Set API key BEFORE running
kind create cluster
./scripts/llm/lib/deploy_open5gs.sh
Execute Optimization #
# Run with defaults
./scripts/llm/runners/run_llm_optimization.sh --scenario opt-1
# Override model and turn limit
./scripts/llm/runners/run_llm_optimization.sh \
--scenario opt-1 \
--llm-provider openrouter \
--llm-model deepseek-ai/DeepSeek-V3.2-TEE \
--max-turns 5
Output #
Results written to evaluation/generated/optimization/<scenario-id>/<timestamp>/:
- optimization-result.json: Complete session metadata, turn history, and final KPIs
- plots/: Auto-generated KPI trajectory, improvement, and comparison visualizations
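A downstream script could compute per-KPI improvement from the result file. The key names `baseline_kpis` and `final_kpis` are assumptions about the JSON schema, made for illustration:

```python
import json

def kpi_improvements(result_json: str) -> dict:
    """Return per-KPI relative improvement (positive = metric went down).
    Assumes hypothetical 'baseline_kpis' / 'final_kpis' keys."""
    r = json.loads(result_json)
    base, final = r["baseline_kpis"], r["final_kpis"]
    return {k: (base[k] - final[k]) / base[k] for k in base if base[k]}
```

For example, a p99 drop from 600 ms to 75 ms is an 87.5% improvement.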
Research Questions #
Optimization scenarios enable investigation of:
- Model Capability: Which LLMs produce the most efficient optimization strategies?
- Turn Efficiency: How many iterations does each model need to reach SLA?
- Constraint Awareness: Do LLMs respect resource limits while optimizing?
- Reasoning Quality: Can we extract interpretable optimization insights from LLM behavior?
- Cost-Performance Trade-offs: How do models balance multiple conflicting objectives?
Documentation #
For comprehensive details, see:
- README: github/revelens/README.md
- Scenario Guide: docs/optimization-scenarios.md
- Implementation Details: scripts/llm/engine/llm_optimization_engine.py
Next Steps: Phase 3 #
- Krkn-AI Integration: Import community chaos scenarios and evaluate LLM performance
- Multi-Metric Optimization: Scenarios with competing objectives (latency vs. cost, throughput vs. memory)
- Constraint Learning: LLMs that learn and refine constraint awareness over multiple runs
Related Research #
This work builds on foundational research in:
- Multi-agent RL for resource optimization
- Distributed systems performance tuning
- LLM reasoning and planning in dynamic environments
Contact: INTRIG, FEEC-UNICAMP
Last Updated: April 17, 2026
