LitmusChaos: Cloud-Native Chaos Engineering for Resilient Kubernetes Environments.
LitmusChaos is a CNCF Graduated project providing an end-to-end framework for cloud-native chaos engineering. Its technical architecture is built on a Kubernetes-native design, utilizing Custom Resource Definitions (CRDs) to manage chaos experiments as declarative code. By 2026, LitmusChaos has solidified its position as the industry standard for platform teams transitioning from reactive monitoring to proactive resilience. It enables SREs to orchestrate complex failure scenarios—ranging from pod kills and network latency to cloud-provider API failures—integrated directly into CI/CD pipelines. The platform features ChaosCenter, a unified control plane for multi-tenant experiment management, and ChaosHub, a public repository of pre-built experiments. Its architecture supports GitOps workflows, allowing teams to version control their resilience tests alongside application code. The 2026 market landscape sees LitmusChaos as the primary open-source alternative to proprietary solutions like Gremlin, favored for its deep integration with the Prometheus/Grafana stack and its ability to run entirely within air-gapped or highly regulated environments.
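As a rough illustration of this CRD-driven model, the sketch below builds a ChaosEngine resource as a plain dictionary and applies it with the official Kubernetes Python client. The target application, namespace, label selector, and service account are assumptions made for the example; the field names follow the documented litmuschaos.io/v1alpha1 schema, but verify them against the CRD version installed in your cluster.

    # Apply a declarative pod-delete experiment against an assumed "checkout"
    # deployment in the "demo" namespace. Requires the "kubernetes" package.
    from kubernetes import client, config

    config.load_kube_config()  # use load_incluster_config() when running in-cluster

    chaos_engine = {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "checkout-pod-delete", "namespace": "demo"},
        "spec": {
            "engineState": "active",
            "appinfo": {"appns": "demo", "applabel": "app=checkout", "appkind": "deployment"},
            "chaosServiceAccount": "litmus-admin",
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                {"name": "TOTAL_CHAOS_DURATION", "value": "60"},
                                {"name": "CHAOS_INTERVAL", "value": "10"},
                            ]
                        }
                    },
                }
            ],
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="litmuschaos.io",
        version="v1alpha1",
        namespace="demo",
        plural="chaosengines",
        body=chaos_engine,
    )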
Resilience Probes: Declarative checks that run before, during, and after chaos injection to validate steady state via HTTP, Kubernetes, or Prometheus queries.
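A hedged sketch of how such probes might be declared under an experiment's spec in a ChaosEngine follows; the service URL, Prometheus endpoint, query, and thresholds are illustrative assumptions, and exact key names and duration units vary between Litmus releases.

    # Two assumed probes: a continuous HTTP health check and an end-of-test
    # Prometheus query asserting that the 5xx rate stayed below a threshold.
    probes = [
        {
            "name": "frontend-responds",
            "type": "httpProbe",
            "mode": "Continuous",  # evaluated repeatedly while chaos is running
            "httpProbe/inputs": {
                "url": "http://frontend.demo.svc:8080/healthz",
                "method": {"get": {"criteria": "==", "responseCode": "200"}},
            },
            "runProperties": {"probeTimeout": "5s", "interval": "2s", "attempt": 3},
        },
        {
            "name": "error-rate-under-threshold",
            "type": "promProbe",
            "mode": "EOT",  # evaluated once, at the end of the test
            "promProbe/inputs": {
                "endpoint": "http://prometheus.monitoring.svc:9090",
                "query": 'sum(rate(http_requests_total{status=~"5.."}[1m]))',
                "comparator": {"criteria": "<=", "value": "5"},
            },
            "runProperties": {"probeTimeout": "10s", "interval": "5s", "attempt": 1},
        },
    ]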
ChaosHub: A centralized repository of reusable chaos experiments maintained by the community and vendors.
Multi-tenancy: ChaosCenter allows multiple teams to share a single installation with isolated projects and permissions.
GitOps workflows: Native integration with Git repositories to trigger experiments from code commits or deployment events.
Parallel fault injection: The ability to run multiple concurrent faults (e.g., CPU hog plus network latency) to simulate complex cascading failures.
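ChaosCenter composes individual faults into Chaos Workflows built on Argo Workflows, where entries placed in the same inner step group are launched concurrently. Below is a minimal sketch of that composition, assuming templates named run-pod-cpu-hog and run-pod-network-latency are defined elsewhere in the same workflow to apply their respective ChaosEngines.

    # Fragment of an Argo workflow template: both steps in the inner list
    # start in parallel, producing simultaneous CPU pressure and latency.
    parallel_fault_steps = {
        "name": "inject-parallel-faults",
        "steps": [
            [
                {"name": "cpu-hog", "template": "run-pod-cpu-hog"},
                {"name": "network-latency", "template": "run-pod-network-latency"},
            ]
        ],
    }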
Event-driven chaos: Triggering experiments based on specific Kubernetes events or Prometheus alerts.
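One way to wire this up is a small webhook receiver (a hypothetical glue script, not a built-in Litmus component) that listens for Alertmanager notifications and activates a pre-created, dormant ChaosEngine by patching its engineState. The alert name, engine name, and namespace below are assumptions.

    # Receive Alertmanager webhooks and flip a ChaosEngine to "active" when a
    # matching alert fires. Requires the "kubernetes" package.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    from kubernetes import client, config

    def activate_engine(name: str, namespace: str = "demo") -> None:
        """Patch a pre-created ChaosEngine so the chaos run starts."""
        config.load_incluster_config()  # assumes this runs inside the cluster
        client.CustomObjectsApi().patch_namespaced_custom_object(
            group="litmuschaos.io",
            version="v1alpha1",
            namespace=namespace,
            plural="chaosengines",
            name=name,
            body={"spec": {"engineState": "active"}},
        )

    class AlertHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            for alert in payload.get("alerts", []):
                if alert.get("labels", {}).get("alertname") == "HighErrorRate":
                    activate_engine("checkout-pod-delete")
            self.send_response(204)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), AlertHandler).serve_forever()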
Resilience Score: A Litmus-specific metric calculated from the success rate of the probes executed during a chaos run.
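The authoritative number is computed by ChaosCenter itself, but conceptually it reflects how much of the total probe weight passed during the run. A rough, illustrative approximation in Python, assuming each probe carries a weight and a pass/fail result:

    # Approximate a probe-weighted resilience score: the share of total probe
    # weight that passed, expressed as a percentage. Illustrative only.
    def resilience_score(probe_results: list[tuple[int, bool]]) -> float:
        """probe_results holds (weight, passed) pairs, one per probe."""
        total_weight = sum(weight for weight, _ in probe_results)
        if total_weight == 0:
            return 0.0
        passed_weight = sum(weight for weight, passed in probe_results if passed)
        return 100.0 * passed_weight / total_weight

    # Two probes pass (weights 10 and 5), one fails (weight 5) -> 75.0
    print(resilience_score([(10, True), (5, True), (5, False)]))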
Use case: Validating that microservices correctly implement retries and circuit breakers during network instability.
Use case: Simulating an AWS or GCP zone outage to verify cross-region failover automation.
Use case: Ensuring that stateful workloads (such as PostgreSQL or MongoDB clusters) elect a new leader without data loss when a pod crashes.