Preprint · Models, code, and benchmarks will be released
BideDPO constructs bidirectionally decoupled preference pairs and balances their influence adaptively, enabling robust multi-constraint alignment across text and conditioning inputs.
Conditional image generation augments text-to-image synthesis with structural, spatial, or stylistic priors and is used in many domains. However, current methods struggle to harmonize guidance from both sources when conflicts arise: (1) input-level conflict, where the semantics of the conditioning image contradict the text prompt, and (2) model-bias conflict, where learned generative biases hinder alignment even when the condition and text are compatible.
Preference-based optimization techniques such as DPO offer a promising solution but remain limited: naive DPO suffers from gradient entanglement between the text and condition signals, and multi-constraint tasks lack disentangled, conflict-aware training data. We propose a self-driven, bidirectionally decoupled DPO framework (BideDPO). Our method constructs two disentangled preference pairs for each sample, one for the condition and one for the text, and balances their influence via Adaptive Loss Balancing. We introduce an automated data pipeline with VLM-based checks to generate disentangled, conflict-aware data, and embed the entire process in an iterative optimization strategy that progressively refines both the model and the data.
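To make the objective concrete, here is a minimal sketch of a decoupled DPO loss with adaptive balancing. The per-pair loss is the standard DPO formulation; the inverse-magnitude weighting is an illustrative assumption, not the paper's exact Adaptive Loss Balancing rule, and `balanced_bide_loss` is a hypothetical name.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO: -log sigmoid(beta * margin), where the margin is the
    # policy-vs-reference log-prob gap of the winner minus that of the loser.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def balanced_bide_loss(cond_pair, text_pair, beta=0.1, eps=1e-8):
    # cond_pair / text_pair: (logp_w, logp_l, ref_logp_w, ref_logp_l)
    # for the condition-only and text-only preference pairs.
    l_cond = dpo_loss(*cond_pair, beta=beta)
    l_text = dpo_loss(*text_pair, beta=beta)
    # Assumed balancing scheme: weight each loss by its inverse magnitude
    # so that neither the text nor the condition objective dominates.
    w_cond = 1.0 / (l_cond + eps)
    w_text = 1.0 / (l_text + eps)
    return (w_cond * l_cond + w_text * l_text) / (w_cond + w_text)
```

A zero margin gives the chance-level loss log 2; a larger winner margin lowers the loss, so each pair pushes the policy toward its own winning sample independently of the other constraint.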
We build a DualAlign benchmark to evaluate conflict resolution between text and condition. Experiments on common modalities show that BideDPO delivers substantial gains in text success rate (e.g., +35%) and condition adherence, with robustness validated on COCO. Code, models, and benchmarks will be released.
Overview of BideDPO: bidirectionally decoupled preference pairs for text and condition, Adaptive Loss Balancing, conflict-aware data pipeline with VLM checks, and iterative optimization.

The DualAlign benchmark is designed to evaluate conflict resolution between text and condition, measuring both text success rate and condition adherence.

Evaluation on the COCO dataset demonstrating robustness of BideDPO under multi-constraint alignment.

Additional cases achieving dual alignment across prompts and conditioning inputs, including abstract style conditions.

@article{bidedpo2025,
title = {BideDPO: Conditional Image Generation with Simultaneous Text and Condition Alignment},
author = {To be updated},
journal = {arXiv preprint arXiv:xxxx.xxxxx},
year = {2025}
}