Omnilingual ASR — the automatic speech recognition system launched by Meta AI

AI Tools updated 14h ago dongdong

What is Omnilingual ASR?

Omnilingual ASR is an automatic speech recognition (ASR) system released by Meta AI. It supports more than 1,600 languages, including 500 low-resource languages. The system scales the wav2vec 2.0 encoder to 7 billion parameters and pairs it with two decoder variants; 78% of the supported languages reach a character error rate (CER) below 10%.
The framework is designed to be community-driven: users can extend it to a new language by providing only a small number of paired audio and text samples. Meta has also open-sourced the Omnilingual ASR Corpus and Omnilingual wav2vec 2.0, a self-supervised multilingual speech representation model, with the aim of accelerating speech technology worldwide and promoting linguistic equality and cultural exchange.
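
To make the CER figure above concrete, here is a minimal sketch of how a character error rate is usually computed: the character-level edit distance between the reference transcript and the model output, divided by the reference length. The function names and the per-language scores are illustrative assumptions, not part of the Omnilingual ASR code or its published results.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def share_below(cer_by_language: dict[str, float], threshold: float = 0.10) -> float:
    """Fraction of languages whose CER is strictly below the threshold."""
    below = sum(1 for value in cer_by_language.values() if value < threshold)
    return below / len(cer_by_language)


if __name__ == "__main__":
    # One substitution over 11 reference characters.
    print(cer("hello world", "hallo world"))
    # Made-up scores, used only to show the aggregation step.
    toy_scores = {"lang_a": 0.05, "lang_b": 0.20, "lang_c": 0.09, "lang_d": 0.10}
    print(share_below(toy_scores))
```

A statement such as "78% of languages below 10% CER" is the second function applied to real per-language scores; evaluation pipelines typically also normalize text (case, punctuation) before scoring, which this sketch omits.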


Key Features of Omnilingual ASR

  • Multilingual speech transcription:
    Converts speech to text in more than 1,600 languages, including many low-resource and previously unsupported languages.

  • Community extensibility:
    Users can add a new language by supplying a small number of paired audio and text samples; no large dataset or specialized expertise is required.

  • High performance and low error rate:
    Achieves a character error rate (CER) below 10% for 78% of supported languages.

  • Multiple model options:
    Offers models ranging from a lightweight 300M-parameter version to a 7B-parameter version, so the same family can target different devices and use cases.

  • Open source and data sharing:
    Provides open access to the Omnilingual wav2vec 2.0 model and the Omnilingual ASR Corpus to support developers and researchers worldwide.
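
To give a rough sense of what the 300M-to-7B range means for deployment, the sketch below estimates the memory needed just to hold the model weights at common numeric precisions. This is a back-of-the-envelope calculation under stated assumptions (weights only, no activations or runtime overhead, 1 GB = 10^9 bytes); it is not an official hardware requirement.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str = "fp16") -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in [("300M", 300_000_000), ("7B", 7_000_000_000)]:
        for dtype in ("fp32", "fp16", "int8"):
            print(f"{name} @ {dtype}: ~{weight_memory_gb(params, dtype):.1f} GB")
```

At 16-bit precision the smallest model needs well under 1 GB for its weights, while the 7B model needs about 14 GB, which is why the smaller versions are the realistic choice for constrained devices.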


Technical Principles Behind Omnilingual ASR

  • wav2vec 2.0 scaling:
    Expands the wav2vec 2.0 encoder to 7 billion parameters, enabling extraction of rich multilingual semantic representations directly from raw audio.

  • Two decoder variants:
    Offers two decoder variants, one based on CTC (Connectionist Temporal Classification) and one using a Transformer decoder inspired by LLMs, which significantly improves performance on long-tail languages.

  • In-context learning capability:
    Inspired by large language models, the system can quickly adapt to new languages using only a few in-context samples, without large-scale retraining.

  • Large-scale multilingual dataset:
    The training corpus integrates public datasets and community-contributed audio, covering many low-resource languages to give the model broad linguistic grounding.
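
As an illustration of the CTC decoder mentioned in the list above, the following sketch shows standard greedy CTC decoding: take the most likely symbol in each audio frame, merge consecutive repeats, then drop the blank symbol. This is the textbook procedure, not Meta's implementation; the vocabulary, the blank index, and the frame scores are toy assumptions.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_collapse(frame_ids, vocab, blank=BLANK):
    """Merge consecutive repeated ids, then remove blanks."""
    merged = [idx for idx, _ in groupby(frame_ids)]
    return "".join(vocab[idx] for idx in merged if idx != blank)


def ctc_greedy_decode(frame_scores, vocab, blank=BLANK):
    """Pick the highest-scoring symbol per frame, then collapse."""
    best_ids = [max(range(len(scores)), key=scores.__getitem__)
                for scores in frame_scores]
    return ctc_collapse(best_ids, vocab, blank)


if __name__ == "__main__":
    vocab = ["<blank>", "c", "a", "t"]
    # Eight frames of toy scores whose per-frame argmax is 1,1,0,2,2,0,0,3.
    ids = [1, 1, 0, 2, 2, 0, 0, 3]
    scores = [[1.0 if k == i else 0.0 for k in range(len(vocab))] for i in ids]
    print(ctc_greedy_decode(scores, vocab))
```

The blank symbol is what lets CTC emit a doubled letter: the frame sequence a, blank, a decodes to "aa", whereas a, a collapses to a single "a".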



Application Scenarios of Omnilingual ASR

  • Cross-lingual communication:
    Provides the transcription step for voice communication across different languages, helping remove language barriers and foster global collaboration and cultural exchange.

  • Low-resource language preservation:
    Provides high-quality transcription tools for endangered or low-resource languages, supporting preservation and revitalization efforts.

  • Education and learning:
    Supports multilingual teaching and pronunciation practice, and can supply transcripts to translation tools for language learners.

  • Voice assistant expansion:
    Extends language support for intelligent voice assistants, enabling them to serve a broader global user base.

  • Content creation and media:
    Automatically transcribes multilingual audio and video content, improving productivity and supporting multilingual subtitle generation.
