Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications
Biel Tura Vecino, Adam Gabryś, Daniel Mątwicki, Andrzej Pomirski, Tom Iddon, Marius Cotescu, Jaime Lorenzo-Trueba · May 12, 2025
Summary
A lightweight end-to-end text-to-speech (E2E-TTS) model targets low-resource, real-time on-device applications, with 90% fewer parameters and a 10× faster real-time factor than current models. It combines LightSpeech and Multi-Band MelGAN into a single end-to-end model that outperforms two-stage approaches and achieves state-of-the-art quality, with a mean opinion score (MOS) of 3.79 on the LJSpeech dataset. This combination of efficiency and quality makes it well suited to real-time, offline TTS on device.
Introduction
Background
Overview of text-to-speech (TTS) technology
Importance of lightweight models in low-resource environments
Objective
To introduce a novel lightweight E2E-TTS model designed for low-resource, real-time on-device applications
Model Architecture
Combining LightSpeech and Multi-Band MelGAN
Description of LightSpeech and Multi-Band MelGAN components
How they are integrated to form the lightweight E2E-TTS model
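The data flow through such a combined model can be sketched as simple length bookkeeping. This is a minimal illustration, not the authors' implementation: it assumes a FastSpeech-style length regulator on the acoustic side (LightSpeech belongs to that family) and a Multi-Band MelGAN-style generator that emits several sub-band signals at a reduced rate before they are merged into the full-band waveform. The hop size of 256 samples, the 4 sub-bands, the phoneme list, and the durations are illustrative assumptions, and the function names are hypothetical.

```python
def length_regulate(phoneme_states, durations):
    """Repeat each phoneme-level state by its predicted duration (in frames)."""
    frames = []
    for state, duration in zip(phoneme_states, durations):
        frames.extend([state] * duration)
    return frames


def waveform_length(num_frames, hop_size=256, num_bands=4):
    """Return (samples per sub-band, samples in the merged full-band waveform).

    Assumes each frame corresponds to `hop_size` output samples and that the
    generator produces `num_bands` sub-band signals, each at 1/num_bands of
    the full sample rate.
    """
    samples_per_band = num_frames * hop_size // num_bands
    return samples_per_band, samples_per_band * num_bands


# Hypothetical input: four phonemes with made-up durations in frames.
phonemes = ["HH", "AH", "L", "OW"]
durations = [3, 5, 2, 6]

frames = length_regulate(phonemes, durations)
per_band, total = waveform_length(len(frames))
print(len(frames), per_band, total)
```

Generating sub-bands at a fraction of the output rate is what lets a multi-band vocoder do less work per output sample, which is one plausible source of the speed-up the summary reports.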
Key Features
90% fewer parameters
10× faster real-time factor than current models
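Both figures are ratios, and a short sketch makes the arithmetic explicit. Real-time factor (RTF) is commonly defined as synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The helper names, the dummy synthesizer, and all numbers below are hypothetical and are not taken from the paper.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize, text, sample_rate):
    """Time one call to `synthesize` (text -> sequence of samples) and return its RTF."""
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(waveform) / sample_rate)


def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` is smaller than `baseline`."""
    return 1.0 - candidate / baseline


# Hypothetical stand-in for a TTS model: returns one second of silence at 22.05 kHz.
def dummy_synthesize(text):
    return [0.0] * 22050


rtf = measure_rtf(dummy_synthesize, "hello world", sample_rate=22050)
print(f"measured RTF: {rtf:.6f}")

# Made-up parameter counts: a 10M-parameter baseline vs. a 1M-parameter model.
print(f"parameter reduction: {relative_reduction(10_000_000, 1_000_000):.0%}")
```

Under this definition, a "10× faster" model has an RTF one tenth of the baseline's when both are measured on the same hardware and the same inputs.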
Performance Evaluation
State-of-the-Art Performance
Evaluation on the LJSpeech dataset
MOS (Mean Opinion Score) of 3.79
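For reference, a MOS is the arithmetic mean of listener ratings, typically on a 1 to 5 scale, and is usually reported with a confidence interval. The sketch below uses a normal-approximation 95% interval; the ratings are made up for illustration and do not come from the paper's evaluation, and the function name is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Uses the normal approximation: z * sample standard deviation / sqrt(n).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mos = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical listener ratings on a 1-5 scale.
ratings = [4, 4, 3, 5, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

With so few ratings the interval is wide; listening tests normally collect many ratings per system so that differences between systems can be judged against the interval.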
Comparison with Two-Stage Approaches
Detailed comparison highlighting the superiority of the proposed model
Applications
Offline On-Device Applications
Suitability for offline on-device use cases
Real-time, high-quality TTS for on-device applications
Conclusion
Summary of the Lightweight E2E-TTS Model
Future Directions
Potential improvements and future research areas
Basic info
papers
sound
audio and speech processing
artificial intelligence
Advanced features
Insights
What are the key implementation strategies that enable the E2E-TTS model to achieve a 10× faster real-time factor?
How does the combination of LightSpeech and Multi-Band MelGAN contribute to the efficiency of the E2E-TTS model?
How does the E2E-TTS model's performance on the LJSpeech dataset compare to existing models in terms of MOS?
In what ways is the E2E-TTS model optimized for offline on-device applications?