BideDPO

Conditional Image Generation with Simultaneous Text and Condition Alignment

Preprint · Models, code, and benchmarks will be released

BideDPO constructs bidirectionally decoupled preference pairs and balances their influence adaptively, enabling robust multi-constraint alignment across text and conditioning inputs.

View code on GitHub: limuloo/BideDPO
BideDPO teaser

Abstract

Conditional image generation augments text-to-image synthesis with structural, spatial, or stylistic priors and is used in many domains. However, current methods struggle to harmonize guidance from both sources when conflicts arise: (1) input-level conflict, where the semantics of the conditioning image contradict the text prompt, and (2) model-bias conflict, where learned generative biases hinder alignment even when the condition and text are compatible.

Preference-based optimization techniques, such as DPO, offer a promising solution but remain limited: naive DPO suffers from gradient entanglement between text and condition signals and lacks disentangled, conflict-aware training data for multi-constraint tasks. We propose a self-driven, bidirectionally decoupled DPO framework (BideDPO). Our method constructs two disentangled preference pairs for each sample—one for the condition and one for the text—and manages their influence via Adaptive Loss Balancing. We introduce an automated data pipeline with VLM checks to generate disentangled, conflict-aware data, and embed the entire process within an iterative optimization strategy that progressively refines both the model and the data.
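The decoupled objective above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: it assumes scalar sequence log-likelihoods per sample, and the names `dpo_term`, `bide_dpo_loss`, and the inverse-magnitude weighting used for Adaptive Loss Balancing are all assumptions introduced here for clarity.

```python
import math

def dpo_term(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO logistic loss on one preference pair.

    lp_w/lp_l: policy log-likelihoods of the preferred/dispreferred sample;
    ref_w/ref_l: the frozen reference model's log-likelihoods of the same pair.
    Returns -log sigmoid(beta * margin), computed in a numerically stable form.
    """
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return math.log1p(math.exp(-margin))  # == -log(sigmoid(margin))

def bide_dpo_loss(cond_pair, text_pair, beta=0.1, eps=1e-8):
    """Hypothetical sketch of a bidirectionally decoupled DPO loss.

    One DPO term per constraint (condition, text), combined with adaptive
    weights (assumed form: inverse-magnitude normalization) so that neither
    term's gradient overwhelms the other.
    """
    l_cond = dpo_term(*cond_pair, beta=beta)
    l_text = dpo_term(*text_pair, beta=beta)
    # Adaptive Loss Balancing (assumed form): weight each term by the
    # relative magnitude of the *other* term, equalizing their contributions.
    total = l_cond + l_text + eps
    w_cond = l_text / total
    w_text = l_cond / total
    return w_cond * l_cond + w_text * l_text
```

With this particular weighting, the two terms contribute equally to the combined loss (it reduces to the harmonic mean of the per-constraint losses), which is one simple way to prevent a single constraint from dominating the update; the actual balancing rule in BideDPO may differ.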

We build a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments on common conditioning modalities show that BideDPO delivers substantial gains in text success rate (e.g., +35%) and condition adherence, with robustness validated on COCO. Code, models, and benchmarks will be released.

Highlights


Qualitative comparison on conflicts between text prompts and conditioning inputs, highlighting dual alignment achieved by BideDPO.


Method Overview

Overview of BideDPO: bidirectionally decoupled preference pairs for text and condition, Adaptive Loss Balancing, conflict-aware data pipeline with VLM checks, and iterative optimization.


DualAlign Benchmark

A benchmark designed to evaluate conflict resolution between the text prompt and the conditioning input, measuring text success rate and condition adherence.


COCO Results

Evaluation on the COCO dataset demonstrating robustness of BideDPO under multi-constraint alignment.


More Examples

Additional cases achieving dual alignment across prompts and conditioning inputs, including abstract style conditions.


Citation

@article{bidedpo2025,
  title   = {BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment},
  author  = {To be updated},
  journal = {arXiv preprint arXiv:xxxx.xxxxx},
  year    = {2025}
}