BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment
BideDPO constructs bidirectionally decoupled preference pairs and balances their influence adaptively, enabling robust multi-constraint alignment across text and conditioning inputs.
Abstract
Conditional image generation augments text-to-image synthesis with structural, spatial, or stylistic priors and is used in many domains. However, current methods struggle to harmonize guidance from both sources when conflicts arise: (1) input-level conflict, where the semantics of the conditioning image contradict the text prompt, and (2) model-bias conflict, where learned generative biases hinder alignment even when the condition and text are compatible.
Preference-based optimization techniques such as DPO offer a promising solution but remain limited: naive DPO suffers from gradient entanglement between the text and condition signals, and multi-constraint tasks lack disentangled, conflict-aware training data. We propose a self-driven, bidirectionally decoupled DPO framework (BideDPO). Our method constructs two disentangled preference pairs for each sample—one for the condition and one for the text—and manages their relative influence via Adaptive Loss Balancing.
We introduce an automated data pipeline with VLM checks to generate disentangled, conflict-aware data, and embed the entire process within an iterative optimization strategy that progressively refines both the model and the data. We build a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments on common modalities show that BideDPO delivers substantial gains in text success rate (+35%) and condition adherence, with robustness validated on COCO.
Method Overview
Overview of BideDPO: bidirectionally decoupled preference pairs for text and condition, Adaptive Loss Balancing, conflict-aware data pipeline with VLM checks, and iterative optimization.
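The decoupling described above can be sketched as two standard DPO losses, one per preference pair, combined with an adaptive weighting. This is an illustrative reconstruction, not the paper's exact formulation: `dpo_loss`, `balanced_loss`, the inverse-magnitude weighting rule, and the value of `beta` are all assumptions made for the sketch.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def balanced_loss(loss_text, loss_cond, eps=1e-8):
    """Illustrative adaptive balancing: weight each branch inversely to its
    current magnitude so that neither the text objective nor the condition
    objective dominates the update (a hypothetical scheme, not the paper's)."""
    w_text = 1.0 / (loss_text + eps)
    w_cond = 1.0 / (loss_cond + eps)
    total = w_text + w_cond
    return (w_text / total) * loss_text + (w_cond / total) * loss_cond

# Each sample contributes two decoupled pairs: one where only the condition
# differs between winner and loser, and one where only the text alignment does.
loss_text = dpo_loss(logp_w=-1.0, logp_l=-2.0, ref_logp_w=-1.5, ref_logp_l=-1.5)
loss_cond = dpo_loss(logp_w=-0.5, logp_l=-3.0, ref_logp_w=-1.0, ref_logp_l=-1.0)
total = balanced_loss(loss_text, loss_cond)
```

Because the weights normalize to one, the combined loss always lies between the two branch losses, which keeps the weaker constraint from being ignored during training.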
Results
DualAlign Benchmark
A benchmark designed to evaluate conflict resolution between text and condition, measuring both text success rate and condition adherence.
COCO Results
Evaluation on the COCO dataset demonstrates the robustness of BideDPO under multi-constraint alignment.
Visual Examples
Additional cases achieving dual alignment across prompts and conditioning inputs, including abstract style conditions.
Citation
If you find BideDPO useful in your research, please consider citing our paper.
@misc{zhou2025bidedpo,
  title={BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment},
  author={Dewei Zhou and Mingwei Li and Zongxin Yang and Yu Lu and Yunqiu Xu and Zhizhong Wang and Zeyi Huang and Yi Yang},
  year={2025},
  eprint={2511.19268},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.19268},
}