Accepted by ICLR 2026

BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment

BideDPO constructs bidirectionally decoupled preference pairs and balances their influence adaptively, enabling robust multi-constraint alignment across text and conditioning inputs.

Dewei Zhou Mingwei Li Zongxin Yang Yu Lu Yunqiu Xu Zhizhong Wang Zeyi Huang Yi Yang
Figure: BideDPO teaser showing dual alignment under conflicts.

News

2026.01
BideDPO has been accepted by ICLR 2026! We will release all models, code, and benchmarks soon.

Abstract

Conditional image generation augments text-to-image synthesis with structural, spatial, or stylistic priors and is used in many domains. However, current methods struggle to harmonize guidance from both sources when conflicts arise: (1) input-level conflict, where the semantics of the conditioning image contradict the text prompt, and (2) model-bias conflict, where learned generative biases hinder alignment even when the condition and text are compatible.

Preference-based optimization techniques such as DPO offer a promising solution but remain limited: naive DPO entangles the gradients of the text and condition signals, and multi-constraint tasks lack disentangled, conflict-aware training data. We propose a self-driven, bidirectionally decoupled DPO framework (BideDPO). Our method constructs two disentangled preference pairs for each sample (one for the condition and one for the text) and manages their influence via Adaptive Loss Balancing.

We introduce an automated data pipeline with VLM checks to generate disentangled, conflict-aware data, and embed the entire process within an iterative optimization strategy that progressively refines both the model and the data. We build a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments on common modalities show that BideDPO delivers substantial gains in text success rate (+35%) and condition adherence, with robustness validated on COCO.
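The decoupled objective can be pictured as two standard DPO losses, one per constraint, combined with adaptive weights. The sketch below is a minimal illustration under our own assumptions (scalar log-probabilities, a simple inverse-magnitude weighting); the paper's actual Adaptive Loss Balancing scheme may differ.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) preference pair,
    given policy and reference log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

def bidedpo_loss(text_pair, cond_pair, beta=0.1):
    """Two decoupled DPO losses (one text pair, one condition pair),
    combined with hypothetical inverse-magnitude weights so that
    neither constraint's gradient dominates the update."""
    l_text = dpo_loss(*text_pair, beta=beta)
    l_cond = dpo_loss(*cond_pair, beta=beta)
    total = l_text + l_cond + 1e-8
    w_text, w_cond = l_cond / total, l_text / total
    return w_text * l_text + w_cond * l_cond
```

Because each pair isolates one constraint, the two loss terms pull on disjoint preference signals; the balancing weights are illustrative only.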

Highlights

Bidirectionally Decoupled preference pairs for text and condition signals.
Adaptive Loss Balancing to avoid gradient entanglement.
Self-driven data pipeline with VLM checks for conflict-aware pairs.
Iterative optimization that refines both model and data together.
DualAlign benchmark for evaluating text–condition conflict resolution.
+35% gain in text success rate, with robustness validated on the COCO dataset.

Method Overview

Overview of BideDPO: bidirectionally decoupled preference pairs for text and condition, Adaptive Loss Balancing, conflict-aware data pipeline with VLM checks, and iterative optimization.

Figure: BideDPO framework with bidirectionally decoupled preference optimization.
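The data side of the pipeline can be illustrated as follows: for each prompt/condition input, several candidates are generated and scored twice, once for text adherence and once for condition adherence, yielding one preference pair per constraint. This is a minimal sketch under our own assumptions; `text_score` and `cond_score` stand in for the VLM checks, and all helper names are hypothetical.

```python
def make_pair(samples, score):
    """Rank candidates by a scoring function and return (winner, loser)."""
    ranked = sorted(samples, key=score, reverse=True)
    return ranked[0], ranked[-1]

def build_decoupled_pairs(samples, prompt, cond, text_score, cond_score):
    """Construct the two disentangled preference pairs for one input:
    the text pair is ranked only by text adherence, the condition pair
    only by condition adherence, so each pair isolates one constraint.
    text_score / cond_score are stand-ins for the paper's VLM checks."""
    text_pair = make_pair(samples, lambda s: text_score(s, prompt))
    cond_pair = make_pair(samples, lambda s: cond_score(s, cond))
    return text_pair, cond_pair
```

In the iterative strategy described above, pairs built this way from the current model's own samples would feed the next round of DPO fine-tuning.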

Results

DualAlign Benchmark

The DualAlign benchmark evaluates conflict resolution between text and condition, measuring text success rate and condition adherence.

Table: Quantitative results on the DualAlign benchmark.

COCO Results

Evaluation on the COCO dataset demonstrating robustness of BideDPO under multi-constraint alignment.

Table: Performance comparison on the COCO dataset.

Visual Examples

Additional cases achieving dual alignment across prompts and conditioning inputs, including abstract style conditions.

Figure: Qualitative comparison showing improved text-condition alignment.

Citation

If you find BideDPO useful in your research, please consider citing our paper.

BibTeX
@misc{zhou2025bidedpo,
  title={BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment}, 
  author={Dewei Zhou and Mingwei Li and Zongxin Yang and Yu Lu and Yunqiu Xu and Zhizhong Wang and Zeyi Huang and Yi Yang},
  year={2025},
  eprint={2511.19268},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.19268}, 
}