Monitoring and operations
| Metric | Suggested target | Notes |
|---|---|---|
| CPU utilization | <70% sustained | >80% → scale out |
| RAM utilization | <75% | Primary OOM risk signal |
| Disk I/O utilization | <70% | Hot on ES / MinIO tiers |
| Disk capacity | <80% | Alert immediately above |
| Network bandwidth | <70% | Heavy east/west on ES & object store |
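
As a rough illustration, the targets in the table above could be checked against a utilization sample like this. The metric keys (`cpu`, `ram`, and so on) and the function name are assumptions made for this sketch, not product identifiers; only the percentages come from the table.

```python
# Targets (percent) copied from the table above; the keys are illustrative names.
TARGETS = {
    "cpu": 70.0,
    "ram": 75.0,
    "disk_io": 70.0,
    "disk_capacity": 80.0,
    "network": 70.0,
}


def over_target(sample: dict[str, float]) -> list[str]:
    """Return the names of metrics whose utilization is at or above its target."""
    return sorted(
        name
        for name, value in sample.items()
        if name in TARGETS and value >= TARGETS[name]
    )


if __name__ == "__main__":
    print(over_target({"cpu": 65.0, "ram": 80.0, "disk_capacity": 81.0}))
```

A single sample says nothing about "sustained" load, so in practice the check would run over a window of samples rather than one reading.
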
- Prometheus + Grafana — dashboards ship with the product
- Zabbix — common in mainland enterprises
- Domestic suites — Huawei and other AIOps platforms
Endpoints:
```
/metrics      # Prometheus scrape
/v1/health    # Liveness
/v1/ready     # Readiness (Kubernetes)
```
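
The liveness and readiness endpoints can be probed from a script as sketched below. This assumes both endpoints answer a plain `GET` with a 2xx status when healthy, which the text above does not state explicitly; the base URL and the `probe` helper are hypothetical.

```python
import urllib.error
import urllib.request


def probe(base_url: str, path: str, timeout: float = 2.0) -> bool:
    """Return True if GET base_url + path answers with an HTTP 2xx status."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    base = "http://localhost:8080"  # assumed address; adjust to the deployment
    for path in ("/v1/health", "/v1/ready"):
        print(path, "ok" if probe(base, path) else "failing")
```
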
| Metric | Meaning | Watch for |
|---|---|---|
| `session.active` | Concurrent sessions | Load pressure |
| `agent.step.duration` | Step latency (p50/p95/p99) | User-perceived slowness |
| `tool.call.success_rate` | Tool outcomes | Upstream system health |
| `compaction.triggered` | Context compaction events | Overlong sessions |
| `hook.execution.count` | Hooks fired | Misconfigured guardrails |
| Metric | Alert when |
|---|---|
| API 5xx rate | >1% for 5 minutes |
| LLM failure rate | >5% for 10 minutes |
| Per-tool failure rate | >10% for 15 minutes |
| Session timeouts | >3% |
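
The "rate above X for N minutes" conditions in the table above amount to a sustained-threshold check. The sketch below assumes one failure-ratio sample per minute; the rule keys and the `should_alert` helper are illustrative, and the session-timeout row is left out because the table gives it no duration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    threshold: float      # failure ratio, e.g. 0.01 for 1%
    window_minutes: int   # how long the breach must persist


# Thresholds and windows taken from the table above; the keys are made up.
RULES = {
    "api_5xx": Rule(0.01, 5),
    "llm_failure": Rule(0.05, 10),
    "tool_failure": Rule(0.10, 15),
}


def should_alert(rule: Rule, per_minute_rates: list[float]) -> bool:
    """True if each of the last `window_minutes` samples exceeds the threshold."""
    recent = per_minute_rates[-rule.window_minutes:]
    if len(recent) < rule.window_minutes:
        return False  # not enough history to call the breach sustained
    return all(rate > rule.threshold for rate in recent)
```

Requiring every sample in the window to breach avoids paging on a single bad minute, at the cost of reacting a little later.
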
LLM spend is the dominant variable cost; a built-in dashboard tracks:
| Signal | Granularity |
|---|---|
| Token consumption | Tenant / user / session / model / time window |
| Estimated cost (CNY) | Derived from per-model rate cards |
| Latency | Model + call type |
| Cache hit rate | Prompt cache effectiveness |
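
To make the cost and cache signals concrete, here is a small sketch of how an estimate could be derived from token counts and a rate card. The model names and prices are placeholders invented for the example, and pricing per million tokens is an assumption; real rate cards will differ.

```python
from decimal import Decimal

# Hypothetical rate card: CNY per one million tokens. Names and prices are placeholders.
RATE_CARD = {
    "model-a": {"input": Decimal("2.00"), "output": Decimal("8.00")},
    "model-b": {"input": Decimal("0.50"), "output": Decimal("1.50")},
}

TOKENS_PER_UNIT = Decimal(1_000_000)


def estimate_cost_cny(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of one call (or one aggregated window) in CNY."""
    rates = RATE_CARD[model]
    total = rates["input"] * input_tokens + rates["output"] * output_tokens
    return total / TOKENS_PER_UNIT


def cache_hit_rate(cached_prompt_tokens: int, total_prompt_tokens: int) -> float:
    """Share of prompt tokens served from the prompt cache (0.0 if no prompts)."""
    if total_prompt_tokens == 0:
        return 0.0
    return cached_prompt_tokens / total_prompt_tokens
```

`Decimal` is used so that summing many small amounts per tenant or per session does not accumulate floating-point error.
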
| Symptom | Action |
|---|---|
| High per-session tokens | Review compaction; split huge documents |
| Tenant spike | Check for runaway jobs / apply quotas |
| Elevated latency | Consider faster-tier models |
| Low cache utilization | Template prompts / reuse stable prefixes |
Configure daily / monthly thresholds:
- Soft: notify admins; processing continues
- Hard: auto-downgrade to cheaper models or pause usage
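
The soft and hard thresholds can be read as a two-step decision, sketched below. The enum, function name, and the choice to treat "at the limit" as a breach are assumptions for illustration; how the product actually downgrades or pauses is not described here.

```python
from enum import Enum


class BudgetAction(Enum):
    ALLOW = "allow"        # under both thresholds
    NOTIFY = "notify"      # soft threshold reached: notify admins, keep processing
    RESTRICT = "restrict"  # hard threshold reached: downgrade model or pause usage


def check_budget(spent_cny: float, soft_limit_cny: float, hard_limit_cny: float) -> BudgetAction:
    """Map current spend for a daily or monthly window to an action."""
    if soft_limit_cny > hard_limit_cny:
        raise ValueError("soft limit must not exceed hard limit")
    if spent_cny >= hard_limit_cny:
        return BudgetAction.RESTRICT
    if spent_cny >= soft_limit_cny:
        return BudgetAction.NOTIFY
    return BudgetAction.ALLOW
```

For example, with a soft limit of 800 and a hard limit of 1000, a spend of 850 yields `BudgetAction.NOTIFY`.
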
| Subsystem | Level |
|---|---|
| API backends | INFO |
| Agent runtime | INFO |
| Hooks / tools | INFO |
| Audit | Always INFO |
- Container logs → stdout/stderr
- Forward with Filebeat / Fluentd / Vector → ELK / Loki
- Audit feeds use separate pipelines from app logs
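
For a component written in Python, the logging layout above might look like the following sketch. The logger names (`api`, `agent`, `hooks`, `audit`) are hypothetical, and sending audit records to stderr so a forwarder can route them separately is an assumption of this example, not documented product behavior.

```python
import logging
import sys

# Levels from the table above; logger names are illustrative.
LEVELS = {
    "api": logging.INFO,
    "agent": logging.INFO,
    "hooks": logging.INFO,
}


def configure_logging() -> None:
    """Send application logs to stdout and audit logs to a separate stream."""
    logging.basicConfig(
        stream=sys.stdout,
        level=logging.INFO,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
        force=True,  # replace any handlers installed earlier
    )
    for name, level in LEVELS.items():
        logging.getLogger(name).setLevel(level)

    audit = logging.getLogger("audit")
    audit.handlers.clear()
    audit_handler = logging.StreamHandler(sys.stderr)
    audit_handler.setFormatter(logging.Formatter("AUDIT %(asctime)s %(message)s"))
    audit.addHandler(audit_handler)
    audit.setLevel(logging.INFO)  # audit stays at INFO regardless of app settings
    audit.propagate = False       # keep audit records out of the app log stream
```
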
```yaml
- name: high-error-rate
  expr: rate(http_5xx[5m]) > 0.01
  severity: critical
- name: llm-call-failure
  expr: rate(llm_error[10m]) > 0.05
  severity: warning
- name: disk-pressure
  expr: disk_used_percent > 85
  severity: critical
- name: tenant-token-spike
  expr: sum by (tenant) (rate(llm_tokens[1h])) > 1000000
  severity: warning
```
Channels: Feishu / WeCom / email / SMS / webhook.
| Tier | Technique | Typical trigger |
|---|---|---|
| API | More replicas behind Nginx/HAProxy | CPU >70% |
| Agent workers | Add worker pools | Queue depth thresholds |
| Search | Shard / add nodes | p95 query latency >500 ms |
| Object storage | Add nodes | Capacity >70% |
Prefer adding RAM/CPU to the database and ES nodes first; scaling up is often cheaper than scaling out until node or cluster limits are reached.
- Higher-concurrency vendors
- On-prem inference with vLLM / TensorRT-LLM
- SSE streaming improves perceived responsiveness
| Asset | Frequency | Mechanism |
|---|---|---|
| RDBMS | Daily full backup + WAL archiving every 5 min | pg_basebackup / equivalents |
| Search | Weekly snapshot | ES snapshot to MinIO/S3-compatible |
| Object store | Near real-time | MinIO replication |
| Config | Each change | Git-tracked infra-as-code |
| Tier | RTO | RPO |
|---|---|---|
| Mission-critical | ≤2 h | ≤15 min (data-loss budget) |
| Non-critical | ≤24 h | ≤24 h |
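
One way to sanity-check a backup schedule against these RPO budgets is to compare the backup interval with the budget, as in the sketch below. It assumes worst-case data loss equals the interval between recoverable points and ignores transfer or replication lag; the function name is illustrative.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case loss (one full interval) fits within the RPO budget."""
    return backup_interval <= rpo


if __name__ == "__main__":
    # 5-minute WAL archiving against the 15-minute mission-critical budget.
    print(meets_rpo(timedelta(minutes=5), timedelta(minutes=15)))
    # A weekly snapshot against the 24-hour non-critical budget.
    print(meets_rpo(timedelta(days=7), timedelta(hours=24)))
```

Under this simplified model a weekly snapshot alone would not satisfy a 24-hour budget, so an asset backed up that way presumably relies on being rebuildable from another source.
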
- Metro active/active — sub-2 min failover within region
- Geo async — ~1 h RPO cross-region DR
Scripts cover:
- One-click upgrade / rollback
- Daily health reports
- Archival / retention jobs
- Performance sampling & flamegraphs