ESPnet

End-to-End Speech Processing Toolkit for State-of-the-Art ASR, TTS, and Speech Translation.
ESPnet is an open-source end-to-end speech processing toolkit built primarily on PyTorch with Kaldi-style data preprocessing. It covers a wide array of speech tasks, including Automatic Speech Recognition (ASR), Text-to-Speech (TTS), Speech Translation (ST), Speech Enhancement (SE), and Speaker Diarization. Its design is built around the end-to-end (E2E) modeling philosophy, using architectures such as Transformers, Conformers, and the more recent E-Branchformer.

By 2026, ESPnet has solidified its position as the industry standard for researchers and enterprise developers who need fine-grained control over acoustic modeling and linguistic integration that commercial APIs cannot provide. Its unified training pipelines let users go from raw audio to deployable models within a single framework, and integrations such as Warp-CTC and the Hugging Face ecosystem make it highly interoperable with the broader AI landscape. It is particularly valued for its 'recipe' system: reproducible, step-by-step scripts for training on public and private datasets that deliver strong performance even in low-resource language scenarios.
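For orientation, the snippet below sketches a typical ESPnet2-style ASR inference flow. The model tag is a placeholder rather than a specific published checkpoint, and the exact API surface can differ between ESPnet releases.

```python
# Minimal ESPnet2-style ASR inference sketch. Assumes `espnet` and
# `espnet_model_zoo` are installed; the model tag below is a placeholder,
# not a specific published checkpoint.
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text

# Pull a pretrained model (e.g. from the Hugging Face Hub) and build a decoder.
speech2text = Speech2Text.from_pretrained(
    "espnet/example-asr-model",  # hypothetical model tag
    beam_size=10,
    ctc_weight=0.3,  # interpolate CTC and attention scores during decoding
)

speech, sample_rate = sf.read("utterance.wav")
results = speech2text(speech)
best_text, tokens, token_ids, hypothesis = results[0]  # n-best list, best first
print(best_text)
```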
Combines CTC, attention, and transducer loss functions in a multi-objective learning framework (a hybrid-loss sketch appears after this feature list).
Implementation of E-Branchformer (Enhanced Branchformer), which captures both local and global dependencies more efficiently than standard Transformers (see the two-branch sketch after this list).
Supports block-processing and chunk-based attention for real-time streaming inference (a minimal chunk-mask sketch follows this list).
Includes TTS models such as VITS, Tacotron2, and FastSpeech2, with d-vector speaker embeddings for multi-speaker synthesis.
Direct upload/download of models to and from the Hugging Face Model Hub (see the Hub round-trip sketch below).
Integrated Conv-TasNet and DPRNN models for noise reduction and speaker isolation.
Utilizes optimized CUDA kernels for CTC loss calculation and beam search decoding.
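The multi-objective framework in the feature list above is, in its common form, a weighted interpolation of a CTC loss and an attention (cross-entropy) loss. The sketch below illustrates that pattern with PyTorch's built-in CTC loss, which dispatches to optimized CUDA kernels for GPU tensors; the shapes, the -100 padding convention, and the 0.3 weight are illustrative assumptions, not ESPnet's actual training code.

```python
# Illustrative multi-objective ASR loss: L = w * L_ctc + (1 - w) * L_att.
# Generic sketch of the hybrid CTC/attention pattern; not ESPnet's code.
import torch
import torch.nn.functional as F

def hybrid_loss(log_probs, enc_lens, att_logits, targets, target_lens, ctc_weight=0.3):
    """log_probs:  (T, B, vocab) log-softmax encoder outputs for CTC
       att_logits: (B, L, vocab) decoder outputs for attention cross-entropy
       targets:    (B, L) padded label ids, with -100 marking padding"""
    # CTC branch: flatten the padded targets into the 1-D form F.ctc_loss accepts.
    flat_targets = targets[targets != -100]
    ctc = F.ctc_loss(log_probs, flat_targets, enc_lens, target_lens)
    # Attention branch: token-level cross-entropy, ignoring padded positions.
    att = F.cross_entropy(att_logits.transpose(1, 2), targets, ignore_index=-100)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att
```

The same interpolation idea typically reappears at decode time, where CTC and attention scores are combined during beam search.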
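The local/global claim for E-Branchformer comes from its two parallel branches: a self-attention branch for long-range context and a convolutional branch for local patterns, merged after each block. The module below is a simplified toy under assumed dimensions, not the published architecture.

```python
# Toy two-branch encoder block in the spirit of (E-)Branchformer: a global
# self-attention branch and a local depthwise-convolution branch run in
# parallel and are merged. Dimensions and the merge rule are simplifications.
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    def __init__(self, dim=256, heads=4, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Global branch: multi-head self-attention over the full sequence.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Local branch: depthwise convolution captures short-range context.
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        # Project the concatenated branches back to the model dimension.
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, x):                 # x: (batch, time, dim)
        residual = x
        x = self.norm(x)
        global_out, _ = self.attn(x, x, x)                        # long-range
        local_out = self.conv(x.transpose(1, 2)).transpose(1, 2)  # short-range
        return residual + self.merge(torch.cat([global_out, local_out], dim=-1))
```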
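Block-processing with chunk-based attention is usually realized as an attention mask: each frame may attend within its own chunk plus a bounded number of past chunks, so no future chunks are required and latency stays fixed. A minimal mask constructor under those assumptions:

```python
# Boolean mask for chunk-wise streaming self-attention: frame i may attend to
# frame j only if j's chunk is i's chunk or one of the `left_chunks` before it.
# Generic sketch, not ESPnet's exact implementation.
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int, left_chunks: int = 1) -> torch.Tensor:
    chunk_of = torch.arange(seq_len) // chunk_size    # chunk index per frame
    q, k = chunk_of.unsqueeze(1), chunk_of.unsqueeze(0)
    # Allowed: key chunk is not in the future and not too far in the past.
    return (k <= q) & (k >= q - left_chunks)          # (seq_len, seq_len), True = attend

# Example: 8 frames, chunks of 4, one chunk of left context.
print(chunk_attention_mask(8, 4, left_chunks=1).int())
```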
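For the Model Hub round-trip, the standard huggingface_hub client is sufficient; the repository ids and paths below are placeholders.

```python
# Minimal Hub round-trip with the huggingface_hub client; repo ids and local
# paths are placeholders, not real checkpoints.
from huggingface_hub import snapshot_download, upload_folder

# Download: fetch a model snapshot into the local cache.
local_dir = snapshot_download(repo_id="espnet/example-asr-model")  # hypothetical repo

# Upload: push a trained model directory to your own Hub repo (requires `huggingface-cli login`).
upload_folder(folder_path="exp/asr_train/model_dir", repo_id="your-name/my-espnet-asr")
```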
Off-the-shelf APIs fail to recognize niche legal terminology and cannot guarantee data privacy.
Reducing recognition and translation latency in live international conferences.
Building ASR for languages with less than 10 hours of labeled data.