OmniPrism: Learning Disentangled Visual Concept for Image Generation

1University of Science and Technology of China, 2JD Explore Academy, JD.com Inc., 3State Key Laboratory of Communication Content Cognition, People’s Daily Online
Project lead, Corresponding author

We propose OmniPrism, which arbitrarily disentangles and combines visual concepts. (a) Disentangled visual concept generation. Given a reference image with multiple concepts, our method can disentangle the desired concept guided by natural language, such as content names (words in red in the prompts), “style”, or “composition” (e.g., relational or structural features such as pose), while remaining faithful to the prompts. (b) Multi-concept combination. Given two or more reference images with their corresponding concept guidance, our approach can combine all desired concepts in any combination without conflicts.

Based on OmniPrism, we can achieve multiple downstream applications such as:

  • content disentanglement, e.g., subject customization
  • style disentanglement, e.g., stylized generation
  • composition disentanglement, e.g., spatial control or relationship customization
  • content + style, e.g., subject stylization
  • content + composition, e.g., subject pose control
  • etc.
In addition to the flexible applications shown in Main Results, we also demonstrate the potential for more creative applications, such as multi-content combination, concept blending, and combination with ControlNet.

Abstract

Creative visual concept generation often draws inspiration from specific concepts in a reference image to produce relevant outcomes. However, existing methods are typically constrained to single-aspect concept generation or are easily disrupted by irrelevant concepts in multi-aspect concept scenarios, leading to concept confusion and hindering creative generation. To address this, we propose OmniPrism, a visual concept disentangling approach for creative image generation. Our method learns disentangled concept representations guided by natural language and trains a diffusion model to incorporate these concepts. We utilize the rich semantic space of a multimodal extractor to achieve concept disentanglement from given images and concept guidance. To disentangle concepts with different semantics, we construct a paired concept disentangled dataset (PCD-200K), where each pair shares the same concept, such as content, style, or composition. Disentangled concept representations are learned through our contrastive orthogonal disentangled (COD) training pipeline and then injected into additional diffusion cross-attention layers for generation. A set of block embeddings is designed to adapt each block's concept domain in the diffusion model. Extensive experiments demonstrate that our method can generate high-quality, concept-disentangled results with high fidelity to text prompts and desired concepts. Our code, models, and datasets will be available upon acceptance.
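
To make the COD objective more concrete, below is a minimal, hypothetical PyTorch sketch of a contrastive term over paired concept embeddings combined with an orthogonality penalty against irrelevant concepts. The function and variable names, loss weights, and the exact formulation are illustrative assumptions, not the released training code.

import torch
import torch.nn.functional as F

def cod_loss(z_ref, z_tar, z_irrelevant, temperature=0.07, ortho_weight=0.1):
    """Sketch of a contrastive orthogonal disentangled objective.
    z_ref, z_tar: (B, D) embeddings of the shared concept extracted from the
    reference and target image of each pair (positives).
    z_irrelevant: (B, K, D) embeddings of other concepts in the reference image
    that should be suppressed."""
    z_ref = F.normalize(z_ref, dim=-1)
    z_tar = F.normalize(z_tar, dim=-1)
    z_irr = F.normalize(z_irrelevant, dim=-1)

    # Contrastive term: each reference concept should match its own target
    # concept and not the target concepts of other pairs in the batch.
    logits = z_ref @ z_tar.t() / temperature          # (B, B)
    labels = torch.arange(z_ref.size(0), device=z_ref.device)
    contrastive = F.cross_entropy(logits, labels)

    # Orthogonality term: the disentangled concept should carry no component
    # of the irrelevant concepts from the same image.
    ortho = (torch.einsum('bd,bkd->bk', z_ref, z_irr) ** 2).mean()

    return contrastive + ortho_weight * ortho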

Method

Framework


Given a reference image, a text prompt, and concept guidance, we use the Concept Extractor to extract the specified concept from the image and then send it to the diffusion model to generate the corresponding concept. Contrastive Orthogonal Disentangled Learning constrains the model to learn concept disentanglement.
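
As a rough illustration of how the extracted concept could be injected into the diffusion model, the sketch below shows an extra cross-attention branch over concept tokens with a learnable per-block embedding, in the spirit of the block embeddings described in the abstract. All module and argument names are assumptions for illustration, not the released code.

import torch
import torch.nn as nn

class ConceptCrossAttention(nn.Module):
    def __init__(self, dim, concept_dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_tokens = nn.Linear(concept_dim, dim)
        # One learnable embedding per diffusion block, added to the concept
        # tokens so each block can adapt the concept to its own domain.
        self.block_embed = nn.Parameter(torch.zeros(1, 1, concept_dim))

    def forward(self, hidden_states, concept_tokens):
        # hidden_states: (B, N, dim) U-Net features
        # concept_tokens: (B, M, concept_dim) disentangled concept representation
        tokens = self.to_tokens(concept_tokens + self.block_embed)
        out, _ = self.attn(hidden_states, tokens, tokens)
        # Residual connection: the original text cross-attention stays untouched;
        # the concept branch only adds information on top of the features.
        return hidden_states + out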

Paired Concept Disentangled Dataset


We design three data construction pipelines for the three concepts of content, style, and composition. Each pipeline uses GPT-4o to obtain reference prompts Tref, target prompts Ttar, and concept guidance Tcg, and uses different models to generate the corresponding reference images Iref and target images Itar.
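
The sketch below illustrates how one record of such a paired dataset could be assembled. Here ask_gpt4o and generate_image are hypothetical callables standing in for the GPT-4o prompting step and the per-concept image generators; they are not real APIs of this project.

def build_pair(concept_type, ask_gpt4o, generate_image):
    """concept_type is one of "content", "style", or "composition"."""
    # GPT-4o produces a reference prompt and a target prompt that share exactly
    # one concept, plus the natural-language guidance naming that concept.
    t_ref, t_tar, t_cg = ask_gpt4o(concept_type)

    # A different generator is used per pipeline (e.g., a subject-driven model
    # for content pairs, a stylization model for style pairs).
    i_ref = generate_image(t_ref, concept_type)
    i_tar = generate_image(t_tar, concept_type)

    return {
        "reference_image": i_ref,    # image the concept is extracted from
        "reference_prompt": t_ref,
        "target_image": i_tar,       # training target sharing only that concept
        "target_prompt": t_tar,
        "concept_guidance": t_cg,    # e.g., a subject name, "style", or "composition"
    }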

Main Results

[Figure: main results.]

Comparisons with the State of the Art


Given a reference image with various concepts, our OmniPrism achieves superior disentangled generation performance. It not only avoids introducing irrelevant concepts but also achieves the highest concept fidelity, prompt fidelity, and image quality.


OmniPrism achieves the highest Mask CLIP-I and CLIP-T scores and the best image quality, along with the best balance between Style Similarity and the other metrics.

Applications

Multi-Content Combinations


We use latent masks to assign layouts to different concepts, preventing them from conflicting. A minimal sketch of this idea is given below.
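
The following is a hedged sketch of latent-mask composition: each concept is assigned a region mask in latent space, and the noise predictions conditioned on each concept are blended per region at every denoising step. The predict_noise callable is a hypothetical wrapper around the concept-conditioned diffusion U-Net; the exact mechanism in our implementation may differ.

import torch

def compose_noise(predict_noise, latents, t, concept_tokens, masks):
    """predict_noise(latents, t, tokens_or_None) -> noise prediction.
    concept_tokens: list of per-concept embeddings.
    masks: list of (1, 1, H, W) binary latent-space masks assigning a layout
    region to each concept."""
    # Base prediction without any injected concept (text prompt only).
    noise = predict_noise(latents, t, None)
    for tokens, mask in zip(concept_tokens, masks):
        noise_c = predict_noise(latents, t, tokens)
        # Keep the concept-conditioned prediction only inside its assigned
        # region, so different concepts never compete for the same location.
        noise = mask * noise_c + (1.0 - mask) * noise
    return noise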

Concept Blending

[Figure: concept blending results.]

OmniPrism with ControlNet

[Figure: OmniPrism combined with ControlNet.]

More Results

Different Base Models

[Figure: results with different base models.]

Content Disentanglement Results

[Figure: content disentanglement results.]

Style Disentanglement Results

[Figure: style disentanglement results.]

Composition Disentanglement Results

[Figure: composition disentanglement results.]

BibTeX

@article{Li2024omni,
  author    = {Yangyang Li and Daqing Liu and Wu Liu and Allen He and Xinchen Liu and Yongdong Zhang and Guoqing Jin},
  title     = {OmniPrism: Learning Disentangled Visual Concept for Image Generation},
  journal   = {arXiv preprint arXiv:2412.12242},
  year      = {2024},
}