Operational Readiness Checklist & Known Caveats
This page brings the whole set together: a single checklist you can work through before declaring a Meshery deployment production-ready, the upgrade strategy to adopt, and the known caveats to design around. Each item links back to the page where it is explained in full.
Upgrade strategy
Meshery is a composition of components—Server, UI, Operator, MeshSync, Broker,
Adapters, and mesheryctl—some of which upgrade together and some
independently. For Kubernetes production deployments:
Use Helm and pin versions. Upgrade with
helm upgrade, and pin the image to an immutable version tag rather than trackingstable-latest, so upgrades are deliberate and reproducible (security hardening).Use upgrade-friendly probe values. Capability reloading can make the Server briefly unavailable during an upgrade. The chart provides
values-upgrade.yaml(startup probe, higher failure thresholds) so pods are not killed mid-initialization:helm upgrade meshery meshery/meshery --namespace meshery \ -f https://raw.githubusercontent.com/meshery/meshery/master/install/kubernetes/helm/meshery/values-upgrade.yaml \ --wait --timeout 10mFor reproducible production upgrades, download
values-upgrade.yaml, version-control it alongside your own values, and reference your local copy (with a pinned chart and image version) rather than fetching frommasterat upgrade time.Choose a release channel deliberately.
stablefor production;edgeonly for testing. The channel is reflected byRELEASE_CHANNEL.Mind component groupings. See the Upgrade Guide for which components move together (e.g., Server/UI/Load Generators/Database) versus independently (Operator and its controllers, Adapters,
mesheryctl).Rehearse and roll back. Because durable state lives with the Remote Provider and the local database is a cache, rolling back the deployment is low-risk for data—validate the rollback path anyway.
An upgrade that recreates the Server pod discards the local cache; this is expected and safe. The Remote Provider preserves durable state and MeshSync rebuilds the cluster snapshot. Do not block upgrades trying to preserve the ephemeral database.
Known caveats and limitations
Plan around these known characteristics—they are by design and shape good production architecture:
- The database is an ephemeral, single-writer cache. It is SQLite/Bitcask on local disk, not a shared multi-writer datastore. Durable state must come from a Remote Provider; this also shapes how Server replication behaves (HA & Resiliency).
- The Broker persists messages in memory only. No persistent volume is used. Brief interruptions are bridged by NATS topic persistence, but a far-behind or unavailable Server consumer lets Broker memory climb (sizing).
- The Local Provider has no authentication. Never run shared production on it; preselect a Remote Provider (identity).
- WebSocket support is mandatory at the ingress. Without it the UI loads but never updates live (networking).
- OAuth requires the external callback URL. Behind a proxy, an unset or wrong
MESHERY_SERVER_CALLBACK_URLbreaks login. - Out-of-cluster requires a reachable Broker endpoint. Cross-cloud reachability of the Broker is the most common multi-cloud failure (multi-cloud).
- Discovery cost scales with cluster size and churn. Without MeshSync blacklisting, very large/churny clusters dominate Server memory and database growth.
- Policy evaluation is time-boxed. Large designs may hit
POLICY_EVAL_TIMEOUT(default3m); tune CPU and the timeout together. - The chart ships permissive defaults for portability. No resource
requests/limits, empty security contexts, and a
stable-latest,Always-pull image are convenient for getting started but are not production settings—set them explicitly.
The consolidated production-readiness checklist
Work through these before going live. Each group links to its source page.
Deployment model
- Topology chosen deliberately (in-cluster vs. out-of-cluster; Docker vs. Kubernetes), with Kubernetes + Helm for production. Deployment Models
- One Operator/MeshSync/Broker per managed cluster (or embedded mode chosen deliberately).
Sizing & performance
- Explicit CPU/memory requests and limits set for Server, Operator, MeshSync, and Broker. Sizing
- Server memory headroom for the MeshSync snapshot; Broker memory headroom for bursts.
- MeshSync
informer_configblacklist tuned for large clusters. - Storage provisioned for the Server cache; autoscaling configured where appropriate.
High availability & resiliency
- Liveness/readiness probes enabled and tuned. HA & Resiliency
- Anti-affinity spreads replicas across nodes/zones.
- Helm values, env config, and connection definitions in version control (your real backup).
- Redeploy-from-config recovery drill validated.
Networking & connectivity
- Ingress fronts Meshery with TLS; WebSocket upgrades verified. Networking
-
MESHERY_SERVER_CALLBACK_URLset to the external URL. - Broker endpoint reachable from the Server and restricted to its origin.
- Egress to the Remote Provider and registries allowed; network policies applied.
Security & identity
- Remote Provider preselected (
PROVIDER,PROVIDER_BASE_URLS); Local Provider not used. Identity - Least-privilege RBAC;
rbac.nodesonly where required; per-cluster credentials. - Hardened
securityContext(non-root, dropped caps, read-only root FS with writable data volume) on Server and Operator. Security Hardening - Secrets (kubeconfig, provider, pull) sourced from Secrets/external manager.
- Images pinned to immutable tags from a trusted/mirrored registry.
Multi-cluster & multi-cloud
- One least-privilege kubeconfig context per cluster. Multi-Cloud
- Per-cloud node-watch RBAC and Broker endpoint reachability validated.
- Private connectivity for private clusters; cross-region latency assessed.
Monitoring & observability
- External uptime check on
/healthz/ready; workload metrics scraped. Monitoring & KPIs - Dedicated alerts on Server memory and Broker memory; per-cluster connection health and provider reachability monitored.
- OpenTelemetry tracing to a real collector; logs centralized at
LOG_LEVEL=4. -
mesheryctl system checkscheduled as a synthetic check.
Operations
- Upgrade procedure uses
values-upgrade.yaml;stablerelease channel. Upgrade Guide - Runbooks reference the troubleshooting guides.
- Ownership, on-call, and escalation defined for the Meshery deployment.
Where to go next
- Revisit any area above via its linked page in this set.
- For component internals, see the Meshery Architecture reference.
- For runtime configuration, see the Server Environment Variables reference.