CSS10
A Multilingual Single-Speaker Speech Corpus for High-Fidelity Text-to-Speech Synthesis.
CSS10 is a seminal open-source dataset designed for training single-speaker text-to-speech (TTS) models across ten diverse languages: German, Greek, Spanish, Finnish, French, Hungarian, Japanese, Dutch, Russian, and Chinese. Sourced from LibriVox audiobooks, the project gives researchers and developers in the speech-synthesis domain a consistent technical baseline. Each per-language subset pairs high-quality audio, ranging from several hours to roughly twenty hours depending on the language, with normalized transcriptions.

In the 2026 market, CSS10 remains a critical piece of infrastructure for edge-TTS applications and Small Language Models (SLMs). Its design supports efficient transfer learning, letting developers create localized voice assets without the massive compute requirements of foundation models. By providing a uniform, LJSpeech-style format, it simplifies the training pipeline for popular architectures such as FastSpeech 2, VITS, and Tacotron 2.

CSS10 is particularly valued in 2026 for fine-tuning on-device speech interfaces where privacy and low latency are prioritized over cloud-based synthesis, and its permissive licensing encourages both academic innovation and commercial prototyping in the rapidly expanding multilingual voice-interface market.
Standardizes all 10 languages into a single directory structure and metadata format compatible with almost all modern TTS frameworks.
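As a sketch of what that uniform format looks like in practice, the snippet below parses one row of pipe-separated, LJSpeech-style metadata. The sample line and the three-field layout are illustrative assumptions, not verbatim CSS10 data; consult the dataset's own transcript files for the exact fields in each release.

```python
# Hypothetical sample row in the LJSpeech pipe-separated style:
# <audio path>|<raw text>|<normalized text>
SAMPLE = "de/wavs/0001.wav|Zahl 42.|Zahl zweiundvierzig."

def parse_metadata_line(line: str) -> dict:
    """Split one metadata row into audio path, raw text, and normalized text."""
    audio_path, text, normalized = line.strip().split("|")
    return {"audio": audio_path, "text": text, "normalized": normalized}

row = parse_metadata_line(SAMPLE)
```

Because every language subset uses the same layout, a single loader like this can feed any of the ten languages into a training pipeline unchanged.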
Uses clean, professional-grade audiobook recordings that ensure consistent emotive quality and low background noise.
Metadata includes specific phoneme mappings for languages like Japanese and Chinese to handle logographic scripts.
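To make the idea of character-to-phoneme metadata concrete, here is a toy lookup-table grapheme-to-phoneme (G2P) pass for a logographic script. The mapping table below is a hypothetical illustration, not the actual CSS10 phoneme metadata, and real Japanese or Chinese G2P is far more context-dependent than a per-character lookup.

```python
# Illustrative G2P table: each character maps to a phoneme sequence.
# These entries are placeholders, not data shipped with CSS10.
G2P = {
    "水": ["m", "i", "z", "u"],  # one common reading, "mizu"
    "火": ["h", "i"],            # one common reading, "hi"
}

def to_phonemes(text: str) -> list:
    """Map each character through the table; pass unknown characters through."""
    phonemes = []
    for ch in text:
        phonemes.extend(G2P.get(ch, [ch]))
    return phonemes
```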
The single-speaker nature makes it an ideal base for training 'Teacher' models in Knowledge Distillation setups.
Provides datasets for languages such as Hungarian and Finnish, which are often overlooked by major tech providers.
Pre-built scripts to convert numbers, abbreviations, and symbols into spoken forms across all 10 languages.
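A minimal normalizer in the spirit of those scripts is sketched below. The abbreviation and number tables are hypothetical English-only stand-ins; the actual CSS10 scripts cover all ten languages and far more cases.

```python
import re

# Hypothetical lookup tables for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Saint"}
NUMBERS = {"1": "one", "2": "two", "3": "three", "10": "ten"}

def normalize(text: str) -> str:
    """Expand known abbreviations and digit tokens into spoken forms."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Replace whole digit runs the lookup table knows about; leave others as-is.
    return re.sub(r"\d+", lambda m: NUMBERS.get(m.group(), m.group()), text)
```

Running text through a pass like this before training keeps the model from ever seeing digits or abbreviations it would have to guess pronunciations for.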
Compatible with multi-speaker models that use language IDs to share phonetic information across the 10-language set.
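One way such language-ID conditioning can be wired up is sketched below. The language codes, ID assignment, and toy embedding are assumptions for illustration; they are not a CSS10 convention, and a real model would learn its embedding table during training.

```python
import random

# The ten CSS10 languages, here given arbitrary integer IDs by position.
LANGUAGES = ["de", "el", "es", "fi", "fr", "hu", "ja", "nl", "ru", "zh"]
LANG_TO_ID = {code: i for i, code in enumerate(LANGUAGES)}

def language_embedding(code: str, dim: int = 4) -> list:
    """Toy stand-in for a learned embedding: a fixed pseudo-random
    vector per language, seeded by the language ID so it is deterministic."""
    rng = random.Random(LANG_TO_ID[code])
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]
```

Concatenating such a vector onto the encoder input is a common way to let one model share phonetic knowledge across languages while still switching voices per language ID.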
Providing affordable, high-quality audio narration in multiple languages for digital textbooks.
Registry Updated: 2/7/2026
Enabling voice feedback on low-power devices without an internet connection.
Creating more natural-sounding voices for visually impaired users in non-English speaking regions.