Operational Readiness Checklist & Known Caveats

This page brings the whole set together: a single checklist you can work through before declaring a Meshery deployment production-ready, the upgrade strategy to adopt, and the known caveats to design around. Each item links back to the page where it is explained in full.

Upgrade strategy

Meshery is a composition of components—Server, UI, Operator, MeshSync, Broker, Adapters, and mesheryctl—some of which upgrade together and some independently. For Kubernetes production deployments:

  • Use Helm and pin versions. Upgrade with helm upgrade, and pin the image to an immutable version tag rather than tracking stable-latest, so upgrades are deliberate and reproducible (security hardening).

  • Use upgrade-friendly probe values. Capability reloading can make the Server briefly unavailable during an upgrade. The chart provides values-upgrade.yaml (startup probe, higher failure thresholds) so pods are not killed mid-initialization:

    helm upgrade meshery meshery/meshery --namespace meshery \
      -f https://raw.githubusercontent.com/meshery/meshery/master/install/kubernetes/helm/meshery/values-upgrade.yaml \
      --wait --timeout 10m
    

    For reproducible production upgrades, download values-upgrade.yaml, version-control it alongside your own values, and reference your local copy (with a pinned chart and image version) rather than fetching from master at upgrade time.

  • Choose a release channel deliberately. stable for production; edge only for testing. The channel is reflected by RELEASE_CHANNEL.

  • Mind component groupings. See the Upgrade Guide for which components move together (e.g., Server/UI/Load Generators/Database) versus independently (Operator and its controllers, Adapters, mesheryctl).

  • Rehearse and roll back. Because durable state lives with the Remote Provider and the local database is a cache, rolling back the deployment is low-risk for data—validate the rollback path anyway.

Known caveats and limitations

Plan around these known characteristics—they are by design and shape good production architecture:

  • The database is an ephemeral, single-writer cache. It is SQLite/Bitcask on local disk, not a shared multi-writer datastore. Durable state must come from a Remote Provider; this also shapes how Server replication behaves (HA & Resiliency).
  • The Broker persists messages in memory only. No persistent volume is used. Brief interruptions are bridged by NATS topic persistence, but a far-behind or unavailable Server consumer lets Broker memory climb (sizing).
  • The Local Provider has no authentication. Never run shared production on it; preselect a Remote Provider (identity).
  • WebSocket support is mandatory at the ingress. Without it the UI loads but never updates live (networking).
  • OAuth requires the external callback URL. Behind a proxy, an unset or wrong MESHERY_SERVER_CALLBACK_URL breaks login.
  • Out-of-cluster requires a reachable Broker endpoint. Cross-cloud reachability of the Broker is the most common multi-cloud failure (multi-cloud).
  • Discovery cost scales with cluster size and churn. Without MeshSync blacklisting, very large/churny clusters dominate Server memory and database growth.
  • Policy evaluation is time-boxed. Large designs may hit POLICY_EVAL_TIMEOUT (default 3m); tune CPU and the timeout together.
  • The chart ships permissive defaults for portability. No resource requests/limits, empty security contexts, and a stable-latest, Always-pull image are convenient for getting started but are not production settings—set them explicitly.

The consolidated production-readiness checklist

Work through these before going live. Each group links to its source page.

Deployment model

  • Topology chosen deliberately (in-cluster vs. out-of-cluster; Docker vs. Kubernetes), with Kubernetes + Helm for production. Deployment Models
  • One Operator/MeshSync/Broker per managed cluster (or embedded mode chosen deliberately).

Sizing & performance

  • Explicit CPU/memory requests and limits set for Server, Operator, MeshSync, and Broker. Sizing
  • Server memory headroom for the MeshSync snapshot; Broker memory headroom for bursts.
  • MeshSync informer_config blacklist tuned for large clusters.
  • Storage provisioned for the Server cache; autoscaling configured where appropriate.

High availability & resiliency

  • Liveness/readiness probes enabled and tuned. HA & Resiliency
  • Anti-affinity spreads replicas across nodes/zones.
  • Helm values, env config, and connection definitions in version control (your real backup).
  • Redeploy-from-config recovery drill validated.

Networking & connectivity

  • Ingress fronts Meshery with TLS; WebSocket upgrades verified. Networking
  • MESHERY_SERVER_CALLBACK_URL set to the external URL.
  • Broker endpoint reachable from the Server and restricted to its origin.
  • Egress to the Remote Provider and registries allowed; network policies applied.

Security & identity

  • Remote Provider preselected (PROVIDER, PROVIDER_BASE_URLS); Local Provider not used. Identity
  • Least-privilege RBAC; rbac.nodes only where required; per-cluster credentials.
  • Hardened securityContext (non-root, dropped caps, read-only root FS with writable data volume) on Server and Operator. Security Hardening
  • Secrets (kubeconfig, provider, pull) sourced from Secrets/external manager.
  • Images pinned to immutable tags from a trusted/mirrored registry.

Multi-cluster & multi-cloud

  • One least-privilege kubeconfig context per cluster. Multi-Cloud
  • Per-cloud node-watch RBAC and Broker endpoint reachability validated.
  • Private connectivity for private clusters; cross-region latency assessed.

Monitoring & observability

  • External uptime check on /healthz/ready; workload metrics scraped. Monitoring & KPIs
  • Dedicated alerts on Server memory and Broker memory; per-cluster connection health and provider reachability monitored.
  • OpenTelemetry tracing to a real collector; logs centralized at LOG_LEVEL=4.
  • mesheryctl system check scheduled as a synthetic check.

Operations

  • Upgrade procedure uses values-upgrade.yaml; stable release channel. Upgrade Guide
  • Runbooks reference the troubleshooting guides.
  • Ownership, on-call, and escalation defined for the Meshery deployment.

Where to go next