CoVoST

Overview

CoVoST is a large-scale, multilingual speech-to-text translation (ST) corpus developed by Facebook Research to address the shortage of parallel data for training end-to-end ST models. Built on the Common Voice dataset, it covers translation from English into 15 languages and from 21 languages into English, and comprises approximately 2,880 hours of speech from 78,000 speakers. The corpus is diversified and openly licensed, with the aim of fostering ST research. Compared with cascaded systems, the end-to-end ST models it targets offer a simpler system, lower inference latency, and less error compounding. Data splitting scripts and Fairseq S2T examples are provided for model training.

Common tasks

Speech-to-text translation
Multilingual corpus generation
End-to-end model training
Data splitting

FAQ

What is CoVoST?

CoVoST is a large-scale multilingual speech-to-text translation corpus based on Common Voice, designed to foster ST research.

What licenses are used in CoVoST?

CoVoST data is licensed under CC0. Tatoeba sentences are licensed under CC BY 2.0 FR, and Tatoeba speech recordings under various CC licenses. Everything else is CC BY-NC 4.0.

How can I download the CoVoST data?

You can download the data from the GitHub repository, including Common Voice audio clips, transcripts, and CoVoST translations.

How do I generate data splits for training?

Use the get_covost_splits.py script with the appropriate version, language codes, root path, and Common Voice TSV path.
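As a rough illustration of that invocation, the sketch below assembles a command line from the four inputs named in the answer. The flag names (--version, --src-lang, --tgt-lang, --root, --cv-tsv), the helper build_split_command, and the example paths are assumptions made for illustration; check them against the script's own help output before relying on them.

```python
import shlex


def build_split_command(version, src_lang, tgt_lang, root, cv_tsv,
                        script="get_covost_splits.py"):
    """Assemble an argv list for the split script (hypothetical helper).

    The flag names are assumptions derived from the inputs the FAQ lists
    (version, language codes, root path, Common Voice TSV path); confirm
    them with the script's --help output.
    """
    return [
        "python", script,
        "--version", str(version),   # corpus version (assumed flag name)
        "--src-lang", src_lang,      # source language code (assumed flag name)
        "--tgt-lang", tgt_lang,      # target language code (assumed flag name)
        "--root", str(root),         # directory holding the translation TSVs
        "--cv-tsv", str(cv_tsv),     # path to the Common Voice TSV file
    ]


# Example values are placeholders, not paths shipped with the corpus.
cmd = build_split_command(2, "fr", "en", "data/covost", "data/cv/fr/validated.tsv")
print(shlex.join(cmd))
```

Once the paths point at real files, the list can be passed to subprocess.run(cmd, check=True). Building it as a list rather than a single string avoids shell quoting problems with paths that contain spaces.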
