Monitoring and operations

Three monitoring pillars

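The three pillars are system monitoring, application monitoring, and LLM cost monitoring; each has its own section on this page.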

System monitoring

KPIs

Metric	Suggested target	Notes
CPU utilization	<70% sustained	>80% → scale out
RAM utilization	<75%	Primary OOM risk signal
Disk I/O utilization	<70%	Hot on ES / MinIO tiers
Disk capacity	<80%	Alert immediately above 80%
Network bandwidth	<70%	Heavy east/west on ES & object store
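
The targets in the table can be checked mechanically against a utilization sample. The following is a minimal sketch, assuming utilization is reported as a fraction between 0 and 1; the metric keys and the `evaluate` function are hypothetical names chosen for illustration, not part of the product.

```python
# Targets from the KPI table, as fractions of capacity.
TARGETS = {
    "cpu": 0.70,
    "ram": 0.75,
    "disk_io": 0.70,
    "disk_capacity": 0.80,
    "network": 0.70,
}
CPU_SCALE_OUT = 0.80  # the table's ">80% -> scale out" note


def evaluate(sample):
    """Return a list of findings for one utilization sample (values in 0..1)."""
    findings = [
        f"{name} above target ({sample[name]:.0%} > {limit:.0%})"
        for name, limit in TARGETS.items()
        if sample.get(name, 0.0) > limit
    ]
    if sample.get("cpu", 0.0) > CPU_SCALE_OUT:
        findings.append("cpu above 80%: scale out")
    return findings
```

Metrics missing from the sample are treated as healthy here; a stricter check might flag them as unknown instead.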

Integrations

Prometheus + Grafana — dashboards ship with the product
Zabbix — common in mainland enterprises
Domestic suites — Huawei and other AIOps platforms

Endpoints:

/metrics       # Prometheus scrape
/v1/health     # Liveness
/v1/ready      # Readiness (Kubernetes)
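
A minimal sketch of probing the liveness and readiness endpoints from a script. The base URL, the timeout, and the assumption that a healthy endpoint answers with HTTP 200 are illustrative and not confirmed by the product; `fetch` is a hypothetical hook so the check can be exercised without a running server.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=3.0):
    """Return the HTTP status code for a GET, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with a 4xx/5xx
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, or timeout


def probe(base_url, fetch=http_status):
    """Check liveness and readiness; assumes 200 means healthy."""
    paths = {"live": "/v1/health", "ready": "/v1/ready"}
    return {
        name: fetch(base_url.rstrip("/") + path) == 200
        for name, path in paths.items()
    }
```

Calling `probe("http://localhost:8080")` (a placeholder address) returns a dict such as `{"live": True, "ready": False}` while the service is still warming up.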

Application monitoring

Agent runtime metrics

Metric	Meaning	Watch for
`session.active`	Concurrent sessions	Load pressure
`agent.step.duration`	Step latency (p50/p95/p99)	User-perceived slowness
`tool.call.success_rate`	Tool outcomes	Upstream system health
`compaction.triggered`	Context compaction events	Overlong sessions
`hook.execution.count`	Hooks fired	Misconfigured guardrails
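
To make the p50/p95/p99 view of `agent.step.duration` concrete, here is a minimal sketch that derives those percentiles from raw step durations using the standard library. The function name and the millisecond unit are assumptions; the product's own aggregation may differ.

```python
import statistics


def latency_percentiles(durations_ms):
    """Return p50/p95/p99 from a list of step durations (needs >= 2 samples)."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

A p99 far above the p50 points at a small share of very slow steps, which averages would hide.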

Error budgets

Metric	Alert when
API 5xx rate	>1% for 5 minutes
LLM failure rate	>5% for 10 minutes
Per-tool failure rate	>10% for 15 minutes
Session timeouts	>3%
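
The "rate above X for N minutes" conditions can be sketched as a check over per-minute failure ratios. This is an illustration only: the `BUDGETS` keys, the one-sample-per-minute input, and the rule that every sample in the window must exceed the threshold are assumptions. Session timeouts are left out because the table gives no window for them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    threshold: float  # failure ratio, e.g. 0.01 for 1%
    window_min: int   # how long the breach must persist, in minutes


BUDGETS = {
    "api_5xx": Budget(0.01, 5),
    "llm_failure": Budget(0.05, 10),
    "tool_failure": Budget(0.10, 15),
}


def breached(name, per_minute_ratios):
    """True if each of the last `window_min` samples exceeds the threshold.

    `per_minute_ratios` is a hypothetical list of failure ratios,
    one per minute, oldest first.
    """
    budget = BUDGETS[name]
    recent = per_minute_ratios[-budget.window_min:]
    return (len(recent) == budget.window_min
            and all(r > budget.threshold for r in recent))
```

Requiring the full window avoids paging on a single bad minute, at the cost of reacting a few minutes later.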

LLM cost monitoring

LLM spend is the dominant variable cost; a built-in dashboard tracks:

Signal	Granularity
Token consumption	Tenant / user / session / model / time window
Estimated cost (CNY)	Model rate cards
Latency	Model + call type
Cache hit rate	Prompt cache effectiveness
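
As a rough illustration of how an estimated cost can be derived from token counts and a model rate card, consider the sketch below. The model names and prices are placeholders, not real rates, and the record layout `(tenant, model, input_tokens, output_tokens)` is an assumption.

```python
from collections import defaultdict

# Hypothetical rate card: CNY per 1,000 tokens as (input, output). Not real prices.
RATE_CARD = {
    "model-a": (0.002, 0.006),
    "model-b": (0.010, 0.030),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Estimated CNY cost of one call under the rate card."""
    in_rate, out_rate = RATE_CARD[model]
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate


def cost_by_tenant(usage):
    """Sum estimated cost per tenant from (tenant, model, in, out) records."""
    totals = defaultdict(float)
    for tenant, model, tokens_in, tokens_out in usage:
        totals[tenant] += estimate_cost(model, tokens_in, tokens_out)
    return dict(totals)
```

The same grouping works for user, session, or time window by swapping the first field of each record.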

Savings levers

Symptom	Action
High per-session tokens	Review compaction; split huge documents
Tenant spike	Check for runaway jobs / apply quotas
Elevated latency	Consider faster-tier models
Low cache utilization	Template prompts / reuse stable prefixes

Budget alerts

Configure daily / monthly thresholds:

Soft: notify admins; processing continues
Hard: automatically downgrade to cheaper models or pause usage
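
The two threshold types can be sketched as a small decision function. The function name, the return values, and the use of "greater than or equal" at each limit are assumptions made for illustration.

```python
def budget_action(spend, soft_limit, hard_limit):
    """Map current spend (same currency and period as the limits) to an action."""
    if hard_limit < soft_limit:
        raise ValueError("hard_limit must be >= soft_limit")
    if spend >= hard_limit:
        return "hard"  # downgrade to a cheaper model or pause usage
    if spend >= soft_limit:
        return "soft"  # notify admins; processing continues
    return "ok"
```

Running the check once per billing period (daily and monthly) with separate limits keeps a single expensive day from silently consuming the monthly budget.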

Logging

Default verbosity

Subsystem	Level
API backends	INFO
Agent runtime	INFO
Hooks / tools	INFO
Audit	Always INFO

Shipping logs

Container logs → stdout/stderr
Forward with Filebeat / Fluentd / Vector → ELK / Loki
Audit feeds use separate pipelines from app logs
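
A minimal sketch of the pattern above for a Python service: application logs go to stdout as one JSON object per line, and audit records are routed to a separate stream so a shipper can forward them through their own pipeline. The logger names, the JSON fields, and the choice of stderr for audit are assumptions, not the product's actual layout.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line for log shippers."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })


def configure(app_stream=sys.stdout, audit_stream=sys.stderr):
    """Set up separate INFO-level loggers for application and audit records."""
    app = logging.getLogger("app")
    audit = logging.getLogger("audit")
    for logger, stream in ((app, app_stream), (audit, audit_stream)):
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter())
        logger.handlers = [handler]
        logger.setLevel(logging.INFO)
        logger.propagate = False  # keep audit records out of the app stream
    return app, audit
```

Structured lines like these can be parsed by Filebeat, Fluentd, or Vector without custom multiline rules.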

Sample alert rules

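# Metric names are illustrative; map them to the series your deployment exports.
- name: high-error-rate
  # Share of requests returning 5xx; http_requests is an assumed total-requests counter.
  expr: sum(rate(http_5xx[5m])) / sum(rate(http_requests[5m])) > 0.01
  severity: critical

- name: llm-call-failure
  # Share of failed LLM calls; llm_requests is an assumed total-calls counter.
  expr: sum(rate(llm_error[10m])) / sum(rate(llm_requests[10m])) > 0.05
  severity: warning

- name: disk-pressure
  expr: disk_used_percent > 85
  severity: critical

- name: tenant-token-spike
  # increase() counts tokens over the hour; rate() would yield tokens per second.
  expr: sum by (tenant) (increase(llm_tokens[1h])) > 1000000
  severity: warning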

Channels: Feishu / WeCom / email / SMS / webhook.

Performance tuning

Horizontal scaling

Tier	Technique	Typical trigger
API	More replicas behind Nginx/HAProxy	CPU >70%
Agent workers	Add worker pools	Queue depth thresholds
Search	Shard / add nodes	p95 queries >500 ms
Object storage	Add nodes	Capacity >70%
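
The triggers in the table can be expressed as a simple decision function. This is a sketch only: the metric keys and the `queue_depth_limit` value are hypothetical, since the table does not specify a queue depth number.

```python
def scaling_actions(m):
    """Return the tiers that should scale out, given a dict of current metrics.

    Utilization values are fractions in 0..1; latency is in milliseconds.
    """
    actions = []
    if m["api_cpu"] > 0.70:
        actions.append("api: add replicas")
    if m["queue_depth"] > m["queue_depth_limit"]:
        actions.append("agent workers: add worker pool")
    if m["search_p95_ms"] > 500:
        actions.append("search: add shards or nodes")
    if m["object_store_used"] > 0.70:
        actions.append("object storage: add nodes")
    return actions
```

In practice each trigger would also need a sustain period so a brief spike does not add capacity that is removed minutes later.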

Vertical scaling

Prefer adding RAM/CPU to database and ES nodes first; this is often cheaper than scaling out until the cluster hits its limits.

LLM acceleration

Vendors that offer higher concurrency limits
On-prem inference with vLLM / TensorRT-LLM
SSE streaming improves perceived responsiveness
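
SSE improves perceived responsiveness because the client can render each chunk as it arrives instead of waiting for the full completion. A minimal sketch of the wire format follows; the function name and the final `done` event are assumptions, not the product's actual protocol.

```python
def sse_frames(chunks):
    """Yield Server-Sent Events frames for an iterable of text chunks."""
    for chunk in chunks:
        # Each line of the payload needs its own "data:" field;
        # a blank line terminates the event.
        lines = chunk.split("\n")
        yield "".join(f"data: {line}\n" for line in lines) + "\n"
    # Hypothetical end-of-stream marker so clients know when to stop reading.
    yield "event: done\ndata: [DONE]\n\n"
```

A server would write these frames to a response with `Content-Type: text/event-stream` and flush after each one.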

Backup & disaster recovery

Backup cadence

Asset	Frequency	Mechanism
RDBMS	Daily full backup + WAL archiving every 5 min	pg_basebackup or equivalent
Search	Weekly snapshot	ES snapshot to MinIO/S3-compatible
Object store	Near real-time	MinIO replication
Config	Each change	Git-tracked infra-as-code

Targets

Tier	RTO	RPO
Mission-critical	≤2 h	≤15 min
Non-critical	≤24 h	≤24 h
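
Whether a backup cadence satisfies an RPO comes down to one comparison: the worst-case data loss is the gap between two consecutive recovery points. A minimal sketch, with the tier keys and function name chosen for illustration:

```python
from datetime import timedelta

# RPO targets from the table above.
RPO = {
    "mission-critical": timedelta(minutes=15),
    "non-critical": timedelta(hours=24),
}


def meets_rpo(tier, backup_interval):
    """True if the gap between recovery points fits within the tier's RPO."""
    return backup_interval <= RPO[tier]
```

For example, a 5-minute archiving interval fits the mission-critical budget, while a 1-hour interval does not.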

Topology options

Metro active/active — sub-2 min failover within region
Geo async — ~1 h RPO cross-region DR

Automation shipped with the product

Scripts cover:

One-click upgrade / rollback
Daily health reports
Archival / retention jobs
Performance sampling & flamegraphs
