Next-generation Kaldi speech processing recipes powered by k2 and PyTorch.
Icefall is a core component of the Next-gen Kaldi project: a comprehensive collection of recipes for speech tasks including automatic speech recognition (ASR), speaker identification, and keyword spotting. Built on the k2 framework and PyTorch, Icefall recasts Kaldi's traditional finite state transducer (FST) approach in a modern, differentiable framework. Its architecture emphasizes efficiency through the pruned RNN-T loss and the Zipformer encoder, which significantly reduce computational overhead while maintaining state-of-the-art accuracy. By 2026, Icefall has established itself as a standard tool for researchers and engineers who need high-performance, customizable speech models deployable on both cloud infrastructure and edge devices via its companion inference engine, Sherpa. It bridges the gap between academic research and production-grade deployment by providing reproducible scripts for large datasets such as LibriSpeech, GigaSpeech, and WenetSpeech, and it supports both streaming (low-latency) and non-streaming applications.
A highly efficient Transformer variant that uses downsampling and upsampling to reduce sequence length, significantly speeding up computation.
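To see why downsampling helps, recall that self-attention cost grows quadratically with sequence length. The sketch below uses illustrative numbers (frame rate, model dimension, and downsampling factor are assumptions, not Zipformer's actual configuration) to show that halving the sequence length roughly quarters the attention cost:

```python
def attention_cost(seq_len: int, dim: int) -> int:
    """Rough FLOP count for one self-attention layer: O(T^2 * d)."""
    return seq_len * seq_len * dim

# Illustrative: a 10-second utterance at 100 frames/s, model dim 512.
full = attention_cost(1000, 512)

# Downsampling the sequence by 2x quarters the quadratic attention cost;
# upsampling afterwards restores the original frame rate for the output.
half = attention_cost(500, 512)

print(full // half)  # → 4
```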
A memory-efficient implementation of the Transducer loss that prunes the lattice during training.
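The memory saving comes from evaluating only a narrow band of symbol positions per encoder frame instead of the full T×(U+1) lattice. This back-of-the-envelope sketch (the function name and sizes are illustrative, not the k2 API) shows the reduction:

```python
def lattice_cells(T, U, prune_range=None):
    """Number of (frame, symbol) lattice nodes the transducer loss
    evaluates. Pruning keeps only `prune_range` symbol positions per
    frame instead of all U+1."""
    symbols_per_frame = (U + 1) if prune_range is None else prune_range
    return T * symbols_per_frame

# Illustrative sizes: 1000 encoder frames, a 100-token transcript.
full = lattice_cells(1000, 100)        # full lattice: 101,000 cells
pruned = lattice_cells(1000, 100, 5)   # pruned band:    5,000 cells
print(full / pruned)
```

Because the joiner network's output is by far the largest tensor in transducer training, shrinking the lattice by this factor is what makes large-batch RNN-T training practical.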
Uses k2's ragged tensors to perform finite-state automata operations directly in PyTorch.
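Ragged tensors let FSAs with varying arc counts per state live in flat, GPU-friendly arrays. The plain-Python sketch below mirrors the flat-values-plus-offsets layout that k2 uses, but it is a conceptual illustration, not the k2 API:

```python
class Ragged:
    """A minimal ragged array: a flat `values` list plus `row_splits`
    offsets marking where each row begins and ends."""

    def __init__(self, rows):
        self.values = [v for row in rows for v in row]
        self.row_splits = [0]
        for row in rows:
            self.row_splits.append(self.row_splits[-1] + len(row))

    def row(self, i):
        return self.values[self.row_splits[i]:self.row_splits[i + 1]]

# Arc labels per state of a tiny FSA: state 0 has two arcs,
# state 1 has one, state 2 (final) has none.
arcs = Ragged([[0, 1], [2], []])
print(arcs.row_splits)  # [0, 2, 3, 3]
print(arcs.row(0))      # [0, 1]
```

Storing automata this way means operations like composition and pruning can be expressed as parallel work over the flat arrays rather than per-state Python loops.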
Includes recipes for Emformer and streaming Zipformer models designed for real-time inference.
Integrated SentencePiece support for subword modeling across multiple languages.
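Subword modeling keeps the output vocabulary small while still covering unseen words. The greedy longest-match segmenter below illustrates the idea with a toy vocabulary; SentencePiece's actual BPE and unigram algorithms are more sophisticated, and the names here are illustrative:

```python
def greedy_subword(word, vocab):
    """Segment `word` into the longest subword pieces found in `vocab`,
    falling back to single characters for anything uncovered."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown region: emit one character
            i += 1
    return pieces

vocab = {"speech", "re", "cog", "nition", "s"}
print(greedy_subword("recognitions", vocab))  # ['re', 'cog', 'nition', 's']
```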
Seamless export path to Sherpa-ONNX and Sherpa-NCNN for cross-platform deployment.
Native support for Lhotse cuts and manifests for dynamic data augmentation and batching.
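Dynamic batching groups utterances by total audio duration rather than a fixed batch size, so GPU memory use stays roughly constant across batches. This is a minimal sketch of that idea; the function and field names are illustrative, not the Lhotse sampler API:

```python
def duration_batches(cuts, max_duration):
    """Group (cut_id, duration_seconds) pairs so each batch's total
    audio stays within `max_duration` seconds."""
    batches, current, total = [], [], 0.0
    for cut_id, dur in cuts:
        if current and total + dur > max_duration:
            batches.append(current)
            current, total = [], 0.0
        current.append(cut_id)
        total += dur
    if current:
        batches.append(current)
    return batches

cuts = [("a", 4.0), ("b", 7.5), ("c", 2.0), ("d", 9.0), ("e", 1.0)]
print(duration_batches(cuts, max_duration=10.0))
# [['a'], ['b', 'c'], ['d', 'e']]
```

Real samplers additionally bucket cuts by length to reduce padding waste; the cap-by-duration principle is the same.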
Providing real-time text overlays for live broadcasts with minimal delay.
Registry updated: 2/7/2026
Feeding a live PCM audio stream to the inference engine.
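Streaming clients typically slice the PCM stream into small fixed-duration chunks and feed them to the recognizer as they arrive. A minimal sketch, assuming an illustrative 16 kHz sample rate and 100 ms chunk size (neither is a requirement of any particular engine):

```python
def pcm_chunks(samples, sample_rate=16000, chunk_ms=100):
    """Yield fixed-duration slices of a PCM sample stream, the way a
    client would feed a streaming recognizer chunk by chunk."""
    step = sample_rate * chunk_ms // 1000
    for start in range(0, len(samples), step):
        yield samples[start:start + step]

# One second of audio at 16 kHz → ten 100 ms chunks of 1600 samples each.
audio = [0] * 16000
chunks = list(pcm_chunks(audio))
print(len(chunks), len(chunks[0]))  # 10 1600
```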
Running ASR on resource-constrained hardware without cloud dependency.
Transcribing high-volume call recordings in multiple languages for sentiment analysis.