SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations

Youngjun Sim, Jinsung Yoon, Young-Joo Suh · November 25, 2024

Summary

The SKQVC paper presents a one-shot voice conversion model built on K-means quantization of self-supervised speech representations. It addresses a known weakness of quantization-based conversion: smaller codebooks discard fine-grained speaking variation, so the model explicitly separates that variation and reintroduces it, enabling high-fidelity conversion trained with simple reconstruction losses. The architecture combines a WavLM encoder, K-means quantization, and a disentangler that factors features into content, speaker, and speaking-variation embeddings. Across six evaluation metrics the model outperforms three baselines in naturalness, intelligibility, and speaker similarity, surpassing the alternatives on WER, CER, and EER, and it remains robust on unseen datasets.
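The core mechanism, quantizing continuous SSL features to K-means centroids so that the discrete codes carry content, can be sketched in a few lines. The sketch below is illustrative only: it assumes a WavLM backbone from Hugging Face transformers and a centroid file fit offline, and treating the per-frame residual as speaking variation is one reading of the summary, not the authors' released code. The file name `kmeans_centroids.pt` and the helper `quantize` are hypothetical.

```python
import torch
import torchaudio
from transformers import WavLMModel

model = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

# Load and resample to WavLM's expected 16 kHz mono input.
wav, sr = torchaudio.load("utterance.wav")          # (channels, samples)
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0, keepdim=True)

with torch.no_grad():
    feats = model(wav).last_hidden_state.squeeze(0)  # (T, 1024) frame features

# K-means centroids fit offline on SSL features; path and shape are assumptions.
codebook = torch.load("kmeans_centroids.pt")         # (K, 1024)

def quantize(feats: torch.Tensor, codebook: torch.Tensor):
    """Nearest-centroid quantization: discrete codes carry content,
    the per-frame residual carries speaking variation (assumed)."""
    dists = torch.cdist(feats, codebook)   # (T, K) Euclidean distances
    codes = dists.argmin(dim=1)            # (T,) discrete content indices
    content = codebook[codes]              # (T, 1024) quantized content
    variation = feats - content            # (T, 1024) residual variation
    return codes, content, variation

codes, content, variation = quantize(feats, codebook)
```

In a full pipeline, a speaker embedding from the target utterance would condition the decoder alongside these content and variation streams; that stage is omitted here.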
