Overview
CoVoST is a large-scale, multilingual speech-to-text translation corpus developed by Facebook Research to address the shortage of parallel data for training end-to-end speech translation (ST) models. Built on the Common Voice dataset, it covers translation from English into 15 languages and from 21 languages into English, and comprises approximately 2,880 hours of speech from 78,000 speakers. The corpus is intended to foster ST research by providing a diverse, openly licensed dataset. In particular, it supports training end-to-end ST models, which are simpler than cascaded ST systems, have lower inference latency, and reduce the compounding of errors across the recognition and translation stages of a cascade. Data splitting scripts and Fairseq S2T examples are provided to help with model training.
