Monitoring, Observability & Health KPIs

A production management plane must itself be observed. This page defines what “healthy” looks like for Meshery, the key performance indicators (KPIs) to track per component, and how to wire Meshery into your monitoring, tracing, logging, and alerting stack.

Health endpoints

Meshery Server exposes Kubernetes-compliant health endpoints that are the foundation of both self-healing and external monitoring:

Endpoint | Used by | Signals
/healthz/live | Liveness probe | Server is running and responsive; provider capabilities are loaded.
/healthz/ready | Readiness probe | Server is ready to accept traffic.

Query readiness with a verbose breakdown for detail:

kubectl exec --namespace meshery deployment/meshery -- \
  curl -s 'http://localhost:8080/healthz/ready?verbose=1'
[+]capabilities ok
[i]extension extension package found
healthz check passed

In the output, [+] marks a passed check, [-] a failed check (which marks the pod unhealthy), and [i] an informational entry. These endpoints are also the right target for uptime and synthetic checks from outside the cluster.
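If the verbose output needs to feed a script or synthetic check, the three markers can be parsed mechanically. The following is a minimal sketch, assuming the output keeps the line-per-check shape shown above; `HealthReport` and `parse_healthz` are hypothetical names for illustration, not part of Meshery.

```python
from dataclasses import dataclass, field

# Marker prefixes as described above: passed, failed, informational.
MARKERS = {"[+]": "passed", "[-]": "failed", "[i]": "info"}


@dataclass
class HealthReport:
    passed: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    info: list = field(default_factory=list)
    summary: str = ""

    @property
    def healthy(self) -> bool:
        # Any [-] line marks the pod unhealthy.
        return not self.failed


def parse_healthz(text: str) -> HealthReport:
    """Sort each line of the verbose output into its marker bucket."""
    report = HealthReport()
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        marker = line[:3]
        if marker in MARKERS:
            getattr(report, MARKERS[marker]).append(line[3:].strip())
        else:
            # Unmarked lines (e.g. "healthz check passed") are kept as the summary.
            report.summary = line
    return report


sample = """[+]capabilities ok
[i]extension extension package found
healthz check passed"""
report = parse_healthz(sample)
print(report.healthy, report.passed, report.summary)
```

A check script could exit non-zero whenever `report.healthy` is false and print `report.failed` for the on-call engineer.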

The KPIs of Meshery’s health

Track these indicators as the canonical signals of a healthy Meshery deployment. Group them by component so dashboards and alerts map to where action is taken.

Meshery Server

KPI | Why it matters | Healthy signal
Liveness/readiness status | Core availability | Both passing; readiness stable, not flapping
API/GraphQL latency & error rate | User-facing responsiveness | Low p95 latency; low 5xx/error rate
CPU utilization | Saturation under load/policy eval | Comfortably below limit
Memory utilization | Holds the MeshSync snapshot/registry; restart risk if exhausted | Comfortable headroom below limit
Restart count | Crash-loop / OOM detection | Stable; no recurring restarts
Database/cache size growth | Discovery scope and retention | Grows then plateaus; no unbounded climb

Meshery Operator, MeshSync, Broker (per cluster)

KPI | Why it matters | Healthy signal
Operator pod running & reconciling | Manages Broker/MeshSync lifecycle | Running; no reconcile errors
MeshSync running & syncing | Cluster snapshot freshness | Running; snapshot updates as the cluster changes
Broker pod running | Eventing path up | StatefulSet pod ready
Broker memory | In-memory message backlog | Stable; not climbing (a climb means the Server consumer is behind)
Connection/chip status (per cluster) | End-to-end connectivity | Connected; Broker/Operator/MeshSync following the connection
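"Stable; not climbing" can be made concrete by fitting a trend line to recent samples. The sketch below is one possible approach, not something Meshery ships: it fits a least-squares slope to evenly spaced memory samples and flags a climb when the fitted rise over the window exceeds a fraction of the mean. The function names, the 10% threshold, and the sample values are all illustrative assumptions.

```python
def slope_per_sample(values):
    """Least-squares slope of evenly spaced samples (units per sample)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_climbing(values, min_growth_fraction=0.10):
    """True if the fitted trend rises by more than min_growth_fraction
    of the window's mean value (threshold is an assumed default)."""
    if len(values) < 2:
        return False
    mean = sum(values) / len(values)
    if mean <= 0:
        return False
    fitted_rise = slope_per_sample(values) * (len(values) - 1)
    return fitted_rise / mean > min_growth_fraction


# Hypothetical Broker memory samples in MiB, oldest first.
stable = [100, 101, 99, 100, 102, 100]
climbing = [100, 120, 140, 160, 180, 200]
print(is_climbing(stable), is_climbing(climbing))
```

Fitting a slope rather than comparing first and last samples keeps a single noisy reading from tripping the check; the same function applies to the Server's database/cache size, where a plateau should eventually read as not climbing.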

Remote Provider

KPI | Why it matters | Healthy signal
Provider reachability (egress) | Login & durable state depend on it | Reachable over HTTPS; auth succeeding
Auth success rate | User access | High success; no spikes in login failures

Metrics with Prometheus and Grafana

Meshery integrates with Prometheus and Grafana, both for managing your infrastructure’s performance and for observing Meshery itself:

  • Scrape Kubernetes workload metrics (CPU, memory, restarts) for the Meshery namespace to drive the Server/Operator/MeshSync/Broker KPIs above.
  • The Broker exposes an HTTP monitoring endpoint on 8222/tcp (NATS monitoring) for Broker-level visibility.
  • Connect Prometheus and Grafana to Meshery to correlate management-plane health with the performance of the infrastructure under management. See the performance management guides.
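For Broker-level visibility without a full exporter, the NATS monitoring port can be polled directly. The sketch below assumes the Broker's monitoring listener serves the standard NATS `/varz` JSON document, which includes `mem` (bytes), `connections`, and `slow_consumers`; the base URL passed to `fetch_varz` is a placeholder you would replace with your Broker service address.

```python
import json
from urllib.request import urlopen


def fetch_varz(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the NATS /varz document, e.g. base_url='http://<broker-host>:8222'."""
    with urlopen(f"{base_url}/varz", timeout=timeout) as resp:
        return json.load(resp)


def summarize_varz(varz: dict) -> dict:
    """Reduce /varz to the fields relevant to the Broker KPIs above."""
    return {
        "mem_mib": round(varz.get("mem", 0) / (1024 * 1024), 1),
        "connections": varz.get("connections", 0),
        "slow_consumers": varz.get("slow_consumers", 0),
    }


# Offline example using a hand-written payload instead of a live Broker.
sample = {"mem": 52428800, "connections": 3, "slow_consumers": 0}
print(summarize_varz(sample))
```

A non-zero `slow_consumers` value alongside rising `mem_mib` would be consistent with the Server consumer falling behind.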

Build dashboards that put Server availability, resource saturation, per-cluster connection health, and provider reachability on one pane.

Distributed tracing with OpenTelemetry

Meshery Server supports OpenTelemetry tracing, configured via OTEL_CONFIG (inline YAML). When unset, tracing is disabled. Use it in production to trace request flows and diagnose latency:

# Example OTEL_CONFIG (set as an env var / Helm env value)
service_name: meshery-server
service_version: 1.0.0
endpoint: otel-collector.observability:4317
insecure: false

Point the endpoint at your collector and keep insecure: false with proper TLS in production. Avoid the insecure local-development settings on a real deployment.
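A pre-deployment lint can catch an insecure or incomplete OTEL_CONFIG before it reaches production. This is a rough sketch under two assumptions: the config stays a flat list of `key: value` pairs like the example above (it is not a general YAML parser), and requiring an explicit `insecure: false` is a policy choice of this sketch rather than a Meshery rule. Both function names are hypothetical.

```python
def parse_flat_config(text: str) -> dict:
    """Parse flat 'key: value' lines; comments and blank lines are skipped."""
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so 'host:port' values stay intact.
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip()
    return config


def lint_otel_config(text: str) -> list:
    """Return a list of human-readable problems; empty means the lint passed."""
    config = parse_flat_config(text)
    problems = []
    if not config.get("endpoint"):
        problems.append("endpoint is missing")
    if config.get("insecure", "").lower() != "false":
        problems.append("insecure must be explicitly set to false")
    return problems


example = """# Example OTEL_CONFIG
service_name: meshery-server
service_version: 1.0.0
endpoint: otel-collector.observability:4317
insecure: false
"""
print(lint_otel_config(example))
```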

Centralized logging

  • Aggregate stdout/stderr. Meshery components log to standard streams; ship them to your centralized logging stack (e.g. Loki, Elasticsearch/OpenSearch, or a cloud logging service) for retention, search, and correlation.
  • Control verbosity with LOG_LEVEL. Values are 0=panic, 1=fatal, 2=error, 3=warn, 4=info (default), 5=debug, 6=trace. Run at info in production; raise to debug/trace temporarily when investigating. DEBUG=true forces debug-level logging.
  • Correlate across components. Tie Server, Operator, MeshSync, and Broker logs together (by namespace/labels) so a discovery or connectivity issue can be traced across the path.
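The LOG_LEVEL numbering and the DEBUG override described above can be modelled in a few lines, which is handy when validating environment settings in CI. This sketch mirrors only what the text states; falling back to info for missing or invalid values is an assumption here, and `effective_log_level` is a hypothetical helper, not a Meshery function.

```python
# Numeric LOG_LEVEL values and their names, as listed above.
LOG_LEVELS = {
    0: "panic", 1: "fatal", 2: "error", 3: "warn",
    4: "info", 5: "debug", 6: "trace",
}


def effective_log_level(env: dict) -> str:
    """Resolve the level name from an environment-style mapping."""
    # DEBUG=true forces debug-level logging regardless of LOG_LEVEL.
    if env.get("DEBUG", "").lower() == "true":
        return "debug"
    try:
        level = int(env.get("LOG_LEVEL", "4"))
    except ValueError:
        level = 4  # assumed fallback: treat invalid values as the default
    return LOG_LEVELS.get(level, "info")


print(effective_log_level({"LOG_LEVEL": "4"}))
print(effective_log_level({"LOG_LEVEL": "2", "DEBUG": "true"}))
```

A CI gate for production manifests could fail whenever this resolves to debug or trace.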

Alerting

Turn the KPIs into actionable alerts. A solid baseline:

  • Availability: readiness failing or pod not ready for N minutes; recurring restarts (crash loop / OOM).
  • Saturation: Server CPU or memory sustained near limit; anomalous database/cache size growth.
  • Eventing: Broker memory climbing; Broker or MeshSync pod not running; Operator reconcile errors.
  • Connectivity: a cluster connection unhealthy; Broker endpoint unreachable from the Server.
  • Identity: Remote Provider unreachable; spike in authentication failures.

Route these to the team that operates Meshery, and include the relevant runbook links (troubleshooting, sizing, networking) in the alert.
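The "for N minutes" qualifier in the availability alert is what separates a sustained outage from a brief flap. Most alerting systems provide this natively; the sketch below only illustrates the logic, using hypothetical probe samples of (timestamp in seconds, ready flag) ordered oldest first.

```python
def not_ready_for(samples, now, for_seconds):
    """True if the current unbroken not-ready streak has lasted at least
    for_seconds. Any ready sample resets the streak."""
    streak_start = None
    for ts, ready in samples:
        if ready:
            streak_start = None
        elif streak_start is None:
            streak_start = ts
    return streak_start is not None and now - streak_start >= for_seconds


# Sustained failure: not ready since t=60, evaluated at t=360 with N=5 minutes.
sustained = [(0, True), (60, False), (120, False), (180, False), (360, False)]
# Flapping: readiness recovered at t=60, so the streak only starts at t=120.
flapping = [(0, False), (60, True), (120, False)]
print(not_ready_for(sustained, now=360, for_seconds=300))
print(not_ready_for(flapping, now=360, for_seconds=300))
```

Flapping is still worth tracking as its own signal, since the Server KPI table lists stable readiness as part of the healthy state.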

Synthetic and connectivity checks

  • mesheryctl system check runs pre- and post-deployment health checks, including connectivity, and is well suited to scheduled synthetic validation. See the reference.
  • External uptime checks against /healthz/ready validate the full ingress → Server path (TLS, routing, readiness) from a user’s perspective.
  • Per-connection checks in the UI provide on-demand validation of each cluster’s connectivity.
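An external uptime check against /healthz/ready can be as small as one HTTPS request. The following is a minimal sketch using only the standard library; treating any 2xx status as ready is an assumption, `check_ready` is a hypothetical helper, and the URL in the usage comment is a placeholder for your own ingress hostname. The `opener` parameter exists so the function can be exercised without network access.

```python
from urllib.error import URLError
from urllib.request import urlopen


def check_ready(url: str, timeout: float = 5.0, opener=urlopen) -> bool:
    """Return True if the readiness endpoint answers with a 2xx status.

    TLS failures, DNS errors, timeouts, and HTTP error statuses all count
    as not ready, since each one breaks the path a user would take.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError):
        # HTTPError (4xx/5xx) is a URLError subclass; timeouts are OSError.
        return False


# Usage (placeholder hostname):
#   check_ready("https://meshery.example.com/healthz/ready")
```

Because certificate verification is left at the library default, an expired or mismatched certificate on the ingress also fails the check, which matches the goal of validating TLS along with routing and readiness.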

Troubleshooting entry points

When a KPI trips, these guides are the fastest path to resolution:

Monitoring checklist

  • Liveness/readiness probes enabled; external uptime check on /healthz/ready.
  • Workload metrics scraped for the Meshery namespace; dashboards for Server/Operator/MeshSync/Broker KPIs.
  • Dedicated alerts on Server memory and Broker memory.
  • Per-cluster connection health monitored.
  • Remote Provider reachability and auth success monitored.
  • OpenTelemetry tracing configured to a real collector (insecure: false).
  • Logs centralized; LOG_LEVEL=4 (info) in steady state.
  • mesheryctl system check scheduled as a synthetic check.