Preprint · Models, code, and benchmarks will be released
BideDPO constructs bidirectionally decoupled preference pairs and balances their influence adaptively, enabling robust multi-constraint alignment across text and conditioning inputs.
Conditional image generation augments text-to-image synthesis with structural, spatial, or stylistic priors and is used in many domains. However, current methods struggle to harmonize guidance from both sources when conflicts arise: (1) input-level conflict, where the semantics of the conditioning image contradict the text prompt, and (2) model-bias conflict, where learned generative biases hinder alignment even when the condition and text are compatible.
Preference-based optimization techniques such as DPO offer a promising solution but remain limited: naive DPO suffers from gradient entanglement between the text and condition signals, and multi-constraint tasks lack disentangled, conflict-aware training data. We propose a self-driven, bidirectionally decoupled DPO framework (BideDPO). Our method constructs two disentangled preference pairs for each sample, one for the condition and one for the text, and balances their influence via Adaptive Loss Balancing. We introduce an automated data pipeline with VLM-based checks to generate disentangled, conflict-aware data, and embed the entire process in an iterative optimization strategy that progressively refines both the model and the data.
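To make the objective concrete, here is a minimal sketch of a decoupled DPO loss with adaptive balancing. The per-pair loss is the standard DPO formulation; the inverse-magnitude weighting is an illustrative assumption, not the paper's exact Adaptive Loss Balancing rule, and `balanced_bide_loss` is a hypothetical name.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO: -log sigmoid(beta * margin), where the margin is the
    # policy-vs-reference log-prob gap of the winner minus that of the loser.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def balanced_bide_loss(cond_pair, text_pair, beta=0.1, eps=1e-8):
    # cond_pair / text_pair: (logp_w, logp_l, ref_logp_w, ref_logp_l)
    # for the condition-only and text-only preference pairs.
    l_cond = dpo_loss(*cond_pair, beta=beta)
    l_text = dpo_loss(*text_pair, beta=beta)
    # Assumed balancing scheme: weight each loss by its inverse magnitude
    # so that neither the text nor the condition objective dominates.
    w_cond = 1.0 / (l_cond + eps)
    w_text = 1.0 / (l_text + eps)
    return (w_cond * l_cond + w_text * l_text) / (w_cond + w_text)
```

A zero margin gives the chance-level loss log 2; a larger winner margin lowers the loss, so each pair pushes the policy toward its own winning sample independently of the other constraint.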
We build a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments on common modalities show that BideDPO delivers substantial gains in text success rate (e.g., +35%) and condition adherence, with robustness validated on COCO. Code, models, and benchmarks will be released.
Overview of BideDPO: bidirectionally decoupled preference pairs for text and condition, Adaptive Loss Balancing, conflict-aware data pipeline with VLM checks, and iterative optimization.

The DualAlign benchmark is designed to evaluate conflict resolution between text and condition, measuring both text success rate and condition adherence.

Evaluation on the COCO dataset demonstrating robustness of BideDPO under multi-constraint alignment.

Additional cases achieving dual alignment across prompts and conditioning inputs, including abstract style conditions.

@article{bidedpo2025,
title = {BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment},
author = {To be updated},
journal = {arXiv preprint arXiv:xxxx.xxxxx},
year = {2025}
}