Get Large Language Models Ready to Speak: A Late-fusion Approach for Speech Generation

Maohao Shen, Shun Zhang, Jilong Wu, Zhiping Xiu, Ehab AlBadawy, Yiting Lu, Mike Seltzer, Qing He · October 27, 2024

Summary

TTS-Llama and MoLE-Llama are two systems for adding speech generation to large language models. TTS-Llama is a text-to-speech system built on a fine-tuned Llama model, and it performs strongly on speech synthesis. MoLE-Llama is a text-and-speech multimodal language model that integrates the two modalities through late fusion. This design addresses catastrophic forgetting, and MoLE-Llama outperforms TTS-Llama on text QA tasks. Both models show potential for text-speech multimodal applications, with improvements in speech generation and text normalization.
