[CP Highly Recommended] OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models
Paper link
Neural scaling laws offer valuable insights for designing robust sequence processing architectures. While these laws have been extensively characterized in other modalities, their behavior in speech remains comparatively underexplored. In this work, we introduce OWLS, an open-access, reproducible suite of multilingual speech recognition and translation models spanning 0.25B to 18B parameters; to the best of our knowledge, the 18B variant is the largest speech model to date. OWLS leverages up to 360K hours of public speech data across 150 languages, enabling a systematic investigation into how data, model, and compute scaling each influence performance in multilingual speech tasks. We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling. One of our key findings is that scaling enhances performance on low-resource languages/dialects, helping to mitigate bias and improve the accessibility of speech technologies. Finally, we show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models. Model checkpoints will be released on Hugging Face for future studies.
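As a sketch of how such scaling laws are typically fit and used for prediction: one common choice is a saturating power law, error(n) = a·n^(−b) + c, fit to (model size, error rate) pairs and then extrapolated. The functional form, model sizes, and all numbers below are illustrative assumptions, not the paper's actual coefficients or results.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Saturating power law: error(n) = a * n^(-b) + c,
    # where n is model size in billions of parameters and c is the
    # irreducible error floor.
    return a * n ** (-b) + c

# Synthetic error rates generated from a known law (a=8, b=0.4, c=6);
# illustrative stand-ins, NOT actual OWLS measurements.
sizes = np.array([0.25, 0.5, 1.0, 4.0, 9.0, 18.0])  # OWLS sizes (B params)
errors = power_law(sizes, 8.0, 0.4, 6.0)

# Recover the law's coefficients from the observed (size, error) pairs.
(a_fit, b_fit, c_fit), _ = curve_fit(
    power_law, sizes, errors, p0=[10.0, 0.5, 5.0]
)

# Extrapolate the fitted law to a hypothetical larger model.
predicted_36b = power_law(36.0, a_fit, b_fit, c_fit)
```

With a fitted law in hand, performance at an untrained scale can be predicted before committing compute, which is the practical payoff of deriving scaling laws from a model suite like OWLS.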
Conclusions:
- Larger models can mitigate bias and improve the fairness of speech technologies.
- While parameter scaling can significantly improve ST performance, it cannot overcome cases where there is inherently insufficient data to learn the task.
- Data scaling without additional diversity leads to quickly saturated performance.
- Small models already exhibit strong performance in phonetically mapping speech to text, while larger models exhibit significantly stronger orthographic capabilities.
- Scaling can lead to significant reductions in code-switched CER, but the benefits are unevenly distributed.
- Larger ASR models are indeed more capable of "mishearing" in a semantically sound manner.