DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation
Hao Phung, Quan Dao, Trung Dao, Hoang Phan, Dimitris Metaxas, Anh Tran · November 06, 2024
Summary
DiMSUM introduces a state-space architecture for diffusion models that integrates a wavelet transform into Mamba. The wavelet branch strengthens awareness of local structure and captures long-range relations between frequency components. A cross-attention fusion layer then combines the wavelet-based (frequency) features with the original spatial Mamba features, which improves the order awareness of the state-space model. In addition, a globally shared transformer complements Mamba by capturing global relationships. On standard benchmarks, DiMSUM outperforms DiT and DIFFUSSM, converges faster during training, and produces high-quality outputs.
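To make the frequency side of the summary concrete, here is a minimal sketch of a one-level 2D Haar wavelet decomposition, which splits an image into one low-frequency subband and three high-frequency subbands. It is an illustration only, not the authors' implementation: the choice of the Haar wavelet, the function names `haar_dwt2` and `haar_idwt2`, and the subband labels are assumptions made for this example.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar transform of an (H, W) array with even H and W.

    Hypothetical helper for illustration; subband naming conventions
    (LH vs. HL) differ between libraries.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    lh = (a + b - c - d) / 2.0  # top row minus bottom row of each block
    hl = (a - b + c - d) / 2.0  # left column minus right column
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the (2H, 2W) array from four subbands."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

# Usage: decompose a toy 4x4 "image" and check that it is recovered exactly.
image = np.arange(16, dtype=float).reshape(4, 4)
subbands = haar_dwt2(image)
reconstructed = haar_idwt2(*subbands)
print([s.shape for s in subbands])
print(np.allclose(reconstructed, image))
```

Each subband has half the spatial resolution of the input, so a sequence model that scans a subband sees a coarser, frequency-specific view of the image. How DiMSUM orders, scans, and fuses these subbands with the spatial features is described in the paper and is not reproduced here.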