Region control
Scribble masks or bounding boxes steer refinement to the intended local area.
Restore text, logos, and thin structures inside a user-specified region while keeping every pixel outside that region unchanged, with support for both reference-based and reference-free inputs.
We study region-specific image refinement: given an image and a user-specified region (e.g., a scribble mask or bounding box), the goal is to recover fine-grained detail while keeping non-edited pixels strictly unchanged. Modern generators still suffer from local detail collapse (distorted text, logos, and thin structures), and instruction-driven editors often drift in the background, especially when the target region is small at a fixed input resolution.
RefineAnything is a multimodal refinement model that supports reference-based and reference-free use. It improves fine-grained detail inside the specified region while preserving pixels outside that region, and compares favorably to strong baselines on fidelity and background consistency.
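The "strictly unchanged outside the region" requirement can be made concrete with a small NumPy sketch. This is an illustration of the constraint only, not RefineAnything's actual pipeline: the helper names `box_to_mask`, `composite`, and `background_max_diff` are hypothetical, the box is assumed to be `(x0, y0, x1, y1)` in pixel coordinates, and the "refined" image is random data standing in for a model output.

```python
import numpy as np

def box_to_mask(height, width, box):
    """Boolean (H, W) mask that is True inside box = (x0, y0, x1, y1); x1 and y1 are exclusive."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def composite(original, refined, mask):
    """Take refined pixels inside the mask and original pixels everywhere else."""
    return np.where(mask[..., None], refined, original)

def background_max_diff(original, output, mask):
    """Largest absolute per-channel change outside the mask (0 means strictly unchanged)."""
    diff = np.abs(original.astype(np.int64) - output.astype(np.int64))
    outside = diff[~mask]
    return int(outside.max()) if outside.size else 0

# Toy data: random images stand in for the input and for a hypothetical refined output.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
refined = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

mask = box_to_mask(64, 64, (10, 20, 30, 40))
result = composite(image, refined, mask)
print(background_max_diff(image, result, mask))
```

A hard composite like this guarantees the background constraint by construction; a scribble mask would be used the same way once rasterized to a boolean array. It says nothing about how natural the region boundary looks, which a refinement model would have to handle separately.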
Optional reference images enable guided recovery; the same framework supports reference-free refinement.
Edits stay localized; training emphasizes stable context outside the region and natural boundaries.
BibTeX:
@article{refineanything2026,
  title         = {RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details},
  author        = {TBD},
  year          = {2026},
  eprint        = {2604.06870},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2604.06870},
}