Creative visual concept generation often draws inspiration from specific concepts in a reference image to produce relevant outcomes. However, existing methods are typically constrained to single-aspect concept generation or are easily disrupted by irrelevant concepts in multi-aspect concept scenarios, leading to concept confusion and hindering creative generation. To address this, we propose OmniPrism, a visual concept disentangling approach for creative image generation. Our method learns disentangled concept representations guided by natural language and trains a diffusion model to incorporate these concepts. We utilize the rich semantic space of a multimodal extractor to achieve concept disentanglement from given images and concept guidance. To disentangle concepts with different semantics, we construct a paired concept disentangled dataset (PCD-200K), where each pair shares the same concept, such as content, style, or composition. We learn disentangled concept representations through our Contrastive Orthogonal Disentangled (COD) training pipeline; these representations are then injected into additional diffusion cross-attention layers for generation. A set of block embeddings is designed to adapt each block's concept domain in the diffusion model. Extensive experiments demonstrate that our method can generate high-quality, concept-disentangled results with high fidelity to text prompts and desired concepts. Our codes, models, and datasets will be available upon acceptance.
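The COD objective itself is not spelled out on this page. As a rough illustration only, a contrastive term over concept-sharing pairs combined with an orthogonality penalty against irrelevant concepts could be sketched as follows (the loss form, weighting, and all names are assumptions, not the paper's exact formulation):

import torch
import torch.nn.functional as F

def cod_loss(anchor, positive, irrelevant, temperature=0.07, ortho_weight=1.0):
    # anchor, positive: (B, D) concept embeddings from paired images sharing a concept.
    # irrelevant: (B, D) embeddings of concepts that should be suppressed.
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    r = F.normalize(irrelevant, dim=-1)

    # InfoNCE-style contrastive term: pull paired concepts together,
    # push apart concepts belonging to other pairs in the batch.
    logits = a @ p.t() / temperature                      # (B, B)
    labels = torch.arange(a.size(0), device=a.device)
    contrastive = F.cross_entropy(logits, labels)

    # Orthogonality term: drive the cosine similarity between the
    # disentangled concept and the irrelevant concept toward zero.
    ortho = (a * r).sum(dim=-1).pow(2).mean()

    return contrastive + ortho_weight * ortho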
Given a reference image, a text prompt, and concept guidance, our Concept Extractor extracts the specified concept from the image and passes it to the diffusion model to generate the corresponding concept. Contrastive Orthogonal Disentangled (COD) learning constrains the model to learn concept disentanglement.
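One plausible reading of this injection step, sketched below, adds an extra cross-attention branch for the extracted concept tokens alongside the text branch, with a learnable per-block embedding that shifts the tokens toward each block's concept domain. The module and its names are illustrative assumptions, not the released implementation (text tokens are assumed to be already projected to the hidden width):

import torch
import torch.nn as nn

class ConceptCrossAttention(nn.Module):
    def __init__(self, dim, concept_dim, num_blocks, block_idx, heads=8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.concept_attn = nn.MultiheadAttention(dim, heads, batch_first=True,
                                                  kdim=concept_dim, vdim=concept_dim)
        # One learnable embedding per diffusion block, added to the concept
        # tokens so each block can adapt them to its own concept domain.
        self.block_emb = nn.Parameter(torch.zeros(num_blocks, 1, concept_dim))
        self.block_idx = block_idx

    def forward(self, hidden, text_tokens, concept_tokens, scale=1.0):
        # Original text cross-attention branch.
        out_text, _ = self.text_attn(hidden, text_tokens, text_tokens)
        # Additional branch attending to the disentangled concept tokens.
        concept = concept_tokens + self.block_emb[self.block_idx]
        out_concept, _ = self.concept_attn(hidden, concept, concept)
        return hidden + out_text + scale * out_concept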
We design three data construction pipelines for the three concepts of content, style, and composition. Each pipeline uses GPT-4o to obtain reference prompts, target prompts, and concept guidance Tcg, and uses different models to generate the corresponding reference images Iref and target images Itar.
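A rough sketch of one such pipeline (here, the style concept) follows; the GPT-4o instruction, the choice of SDXL as the text-to-image model, and all function names are illustrative assumptions rather than the released data-construction code:

import json
import torch
from openai import OpenAI
from diffusers import StableDiffusionXLPipeline

client = OpenAI()
t2i = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def build_style_pair():
    # GPT-4o writes a reference prompt, a target prompt sharing the same
    # style, and the concept guidance describing that shared style.
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": ("Write JSON with keys reference_prompt, target_prompt, "
                        "concept_guidance. Both prompts must depict different "
                        "subjects rendered in the same visual style; "
                        "concept_guidance names that shared style."),
        }],
    )
    spec = json.loads(resp.choices[0].message.content)
    # A text-to-image model renders the paired reference and target images.
    ref_img = t2i(spec["reference_prompt"]).images[0]
    tar_img = t2i(spec["target_prompt"]).images[0]
    return ref_img, tar_img, spec["concept_guidance"]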
Given a reference image with various concepts, our OmniPrism achieves superior disentangled generation performance. It not only avoids introducing irrelevant concepts but also achieves the highest concept fidelity, prompt fidelity, and image quality.
OmniPrism achieves the highest Mask CLIP-I and CLIP-T scores, the best image quality, and the best balance between Style Similarity and the other metrics.
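For reference, CLIP-T measures image-prompt similarity and Mask CLIP-I compares CLIP image features restricted to the concept regions; one plausible way to compute them is sketched below (the paper's exact evaluation protocol may differ):

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_t(image: Image.Image, prompt: str) -> float:
    # CLIP-T: cosine similarity between image and prompt embeddings.
    inputs = proc(text=[prompt], images=image, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.t()).item()

@torch.no_grad()
def mask_clip_i(gen: Image.Image, ref: Image.Image,
                gen_mask: Image.Image, ref_mask: Image.Image) -> float:
    # Mask CLIP-I: CLIP image similarity with pixels outside the binary
    # concept masks zeroed out, so only the concept regions are compared.
    def apply_mask(img, mask):
        arr = np.array(img, dtype=np.float32)
        arr *= (np.array(mask)[..., None] > 0)
        return Image.fromarray(arr.astype("uint8"))
    inputs = proc(images=[apply_mask(gen, gen_mask), apply_mask(ref, ref_mask)],
                  return_tensors="pt")
    feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()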
We use latent masks to assign layouts to different concepts to prevent them from conflicting.
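A minimal sketch of such mask-based blending in latent space is shown below; the function name and the assumption of non-overlapping binary masks are illustrative:

import torch

def blend_latents(latents_per_concept, masks, background_latent):
    # latents_per_concept: list of (B, C, h, w) latents, one per concept.
    # masks: list of (B, 1, h, w) binary latent-space masks assigning each
    #        concept its own layout region.
    # background_latent: (B, C, h, w) latent used where no concept mask is active.
    covered = torch.zeros_like(masks[0])
    out = torch.zeros_like(background_latent)
    for lat, mask in zip(latents_per_concept, masks):
        out += lat * mask
        covered = torch.clamp(covered + mask, max=1.0)
    return out + background_latent * (1.0 - covered)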
@article{Li2024omni,
author = {Yangyang Li and Daqing Liu and Wu Liu and Allen He and Xinchen Liu and Yongdong Zhang and Guoqing Jin},
title = {OmniPrism: Learning Disentangled Visual Concept for Image Generation},
journal = {arXiv preprint arXiv:2412.12242},
year = {2024},
}