DGMO: Training-Free Audio Source Separation through Diffusion-Guided Mask Optimization

Geonyoung Lee, Geonhee Han, Paul Hongsuck Seo · June 3, 2025

Summary

The paper introduces Diffusion-Guided Mask Optimization (DGMO), a training-free framework for zero-shot language-queried audio source separation (LASS) built on pretrained diffusion models. At test time, DGMO refines spectrogram masks under the guidance of a pretrained diffusion model, producing separations that are precise and aligned with the input mixture. This extends diffusion models beyond generation and achieves competitive performance without any task-specific supervision. The paper also reviews progress in universal sound source separation driven by vision, audio, label, and language queries, and highlights the success of diffusion models in text-to-audio tasks, where they are used for realistic audio synthesis, test-time optimization, editing, and signal refinement.
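To make the pipeline above concrete, the sketch below shows a minimal, hypothetical version of the reference-generation step, assuming PyTorch and a pretrained text-to-audio diffusion model loaded through Hugging Face diffusers. AudioLDM and the checkpoint name are stand-ins rather than the paper's confirmed backbone, the function name `diffusion_reference_mag` is invented for illustration, and the actual method additionally grounds the sampling in the input mixture rather than sampling from text alone.

```python
# Hypothetical reference-generation step for a DGMO-style pipeline (not the authors' code).
import torch
from diffusers import AudioLDMPipeline  # stand-in pretrained text-to-audio diffusion model


def diffusion_reference_mag(prompt: str, seconds: float = 5.0,
                            n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Sample a waveform conditioned on the language query and return its magnitude
    spectrogram, to be used as the target for mask optimization. DGMO additionally
    grounds this sampling in the mixture signal, which is omitted here for brevity."""
    pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")  # stand-in checkpoint
    wav = pipe(prompt, num_inference_steps=50, audio_length_in_s=seconds).audios[0]
    stft = torch.stft(torch.from_numpy(wav), n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return stft.abs()


# Illustrative usage: the reference guides the mask-refinement step sketched under Method.
# separated = optimize_mask(mixture_waveform, diffusion_reference_mag("a dog barking"))
```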

Introduction
Background
Overview of Language-queried Audio Source Separation (LASS)
Importance of zero-shot learning in LASS
Role of pretrained diffusion models in audio processing
Objective
Aim of the study: introducing DGMO for LASS
Focus on test-time optimization via Diffusion-Guided Mask Optimization (DGMO)
Method
Data Collection
Utilization of pretrained diffusion models for LASS
Data Preprocessing
Preparation of spectrogram masks for refinement
Test-time Optimization
Description of Diffusion-Guided Mask Optimization (DGMO)
Process of refining spectrogram masks for precise separation (a minimal sketch follows below)
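The sketch below illustrates the mask-refinement step named in the outline above: a sigmoid-parameterized mask over the mixture magnitude spectrogram is optimized by gradient descent so that the masked mixture matches a diffusion-generated reference, and the result is resynthesized with the mixture phase. This is a minimal reading of the summary, not the authors' implementation; the loss, hyperparameters, and function names (e.g. `optimize_mask`) are assumptions.

```python
# Minimal sketch of diffusion-guided mask refinement (illustrative, not the authors' code).
import torch


def optimize_mask(mix_wav: torch.Tensor, ref_mag: torch.Tensor,
                  n_fft: int = 1024, hop: int = 256,
                  steps: int = 300, lr: float = 1e-1) -> torch.Tensor:
    """Refine a [0, 1] spectrogram mask so the masked mixture matches the
    diffusion-generated reference magnitude, then resynthesize with the mixture phase."""
    window = torch.hann_window(n_fft)
    mix_stft = torch.stft(mix_wav, n_fft, hop_length=hop, window=window, return_complex=True)
    mix_mag, mix_phase = mix_stft.abs(), torch.angle(mix_stft)

    # Align time frames in case the reference and the mixture differ in length.
    frames = min(mix_mag.shape[-1], ref_mag.shape[-1])
    mix_mag_t, ref_mag_t = mix_mag[..., :frames], ref_mag[..., :frames]

    # Parameterize the mask through a sigmoid so it stays in [0, 1]; only the mask
    # receives gradients -- the diffusion model is frozen and merely supplies ref_mag.
    logits = torch.zeros_like(mix_mag_t, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(torch.sigmoid(logits) * mix_mag_t, ref_mag_t)
        loss.backward()
        opt.step()

    # Apply the refined mask and reconstruct with the mixture phase, which keeps the
    # separated signal time-aligned with the input mixture.
    sep_mag = torch.zeros_like(mix_mag)
    sep_mag[..., :frames] = torch.sigmoid(logits).detach() * mix_mag_t
    sep_stft = sep_mag * torch.exp(1j * mix_phase)
    return torch.istft(sep_stft, n_fft, hop_length=hop, window=window, length=mix_wav.numel())
```

Because gradients update only the mask while the diffusion model stays frozen, the procedure remains training-free, and reusing the mixture phase keeps the separated output aligned with the input signal.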
Advancements in Language-queried Audio Source Separation
Universal Sound Source Separation
Integration of vision, audio, labels, and language queries
Diffusion Models in Text-to-Audio Tasks
Success in realistic audio synthesis
Application in test-time optimization, editing, and signal refinement
Conclusion
Summary of DGMO's contributions
Future directions and implications
Comparison with existing methods
Basic info: papers, sound, artificial intelligence
Insights
How does DGMO expand the application of diffusion models beyond traditional generative tasks?
What is the core innovation of Diffusion-Guided Mask Optimization (DGMO) in the context of audio source separation?
What are the key components and processes involved in refining spectrogram masks within the DGMO framework?
How does DGMO leverage pretrained diffusion models for zero-shot language-queried audio source separation?