Disentanglement of Variations with Multimodal Generative Modeling

Yijie Zhang; Yiyang Shen; Weiran Wang

doi:10.48550/arxiv.2509.23548

Preprint

Open access

ArXiv.org

Cornell University

09/28/2025

Files and links (1)

url

https://doi.org/10.48550/arxiv.2509.23548

Preprint (Author's original). This preprint has not been evaluated by subject experts through peer review. Preprints may undergo extensive changes and/or become peer-reviewed journal articles. Open access.

Abstract

Multimodal data are prevalent across many domains, and learning robust representations of such data is key to improving generation quality and downstream task performance. To handle heterogeneity and interconnections among modalities, recent multimodal generative models extract shared and private (modality-specific) information with two separate variables. Despite attempts to enforce disentanglement between these two variables, these methods struggle on challenging datasets where the likelihood model is insufficient. In this paper, we propose Information-disentangled Multimodal VAE (IDMVAE) to explicitly address this issue with rigorous mutual-information-based regularizations, including cross-view mutual information maximization for extracting shared variables and a cycle-consistency-style loss for redundancy removal using generative augmentations. We further introduce diffusion models to improve the capacity of latent priors. These newly proposed components are complementary to each other. Compared to existing approaches, IDMVAE shows a clean separation between shared and private information, demonstrating superior generation quality and semantic coherence on challenging datasets.
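
The abstract does not say which estimator is used for cross-view mutual information maximization. As a rough illustration only, the sketch below uses the InfoNCE contrastive bound, a common choice for this kind of objective, computed on paired shared-variable embeddings from two views. The function name `info_nce_lower_bound`, the cosine-similarity critic, and the `temperature` parameter are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def info_nce_lower_bound(z_a, z_b, temperature=0.1):
    """InfoNCE lower bound on I(z_a; z_b), in nats.

    Hypothetical helper for illustration; not the IDMVAE objective.
    z_a, z_b: arrays of shape (n, d); row i of each comes from the same sample.
    """
    # L2-normalize rows so the dot product is a cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)

    # Pairwise similarity scores; positives sit on the diagonal.
    logits = z_a @ z_b.T / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Bound: log(n) minus the contrastive cross-entropy of the positives.
    n = z_a.shape[0]
    return np.log(n) + np.mean(np.diag(log_probs))


# Toy usage: two noisy "views" of the same underlying shared factor.
rng = np.random.default_rng(0)
shared = rng.normal(size=(64, 8))
view_a = shared + 0.05 * rng.normal(size=shared.shape)
view_b = shared + 0.05 * rng.normal(size=shared.shape)
print(info_nce_lower_bound(view_a, view_b))
```

Maximizing this quantity with respect to the encoders pulls the shared variables of paired views together. The bound can never exceed log(n) for a batch of n pairs, which is one known limitation of this estimator.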

Computer Science - Artificial Intelligence

Computer Science - Learning

Details

Title: Disentanglement of Variations with Multimodal Generative Modeling
Creators: Yijie Zhang; Yiyang Shen; Weiran Wang
Resource Type: Preprint
Publication Details: ArXiv.org
DOI: 10.48550/arxiv.2509.23548
ISSN: 2331-8422
Publisher: Cornell University; Ithaca, New York
Language: English
Date posted: 09/28/2025
Academic Unit: Computer Science
Record Identifier: 9984966797102771
