Research project

Multimodal region-specific refinement for perfect local details

Restore text, logos, and thin structures inside a user-specified region while keeping every pixel outside that region unchanged—supporting both reference-based and reference-free inputs.

Code on GitHub | arXiv paper

Abstract

We study region-specific image refinement: given an image and a user-specified region (e.g., scribble mask or box), the goal is to recover fine-grained detail while keeping non-edited pixels strictly unchanged. Modern generators still suffer from local detail collapse (distorted text, logos, thin structures), and instruction-driven editors often drift on the background—especially when the target region is small at a fixed input resolution.

RefineAnything is a multimodal refinement model that supports reference-based and reference-free use. It improves fine-grained detail inside the specified region while preserving pixels outside that region, and compares favorably to strong baselines on fidelity and background consistency.
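The "strictly unchanged" requirement can be made concrete by measuring the largest pixel difference outside the region. The following is a minimal Python sketch, not the paper's evaluation protocol. The function name `max_change_outside` and the array layout (H×W×3 `uint8` images, H×W mask with 1 inside the region) are assumptions made for illustration.

```python
import numpy as np

def max_change_outside(original, edited, mask):
    """Largest absolute per-channel difference outside the region.

    A return value of 0 means every pixel outside the mask is untouched.
    """
    outside = mask < 0.5  # True where the pixel is NOT in the edit region
    # Cast to a signed type so the subtraction cannot wrap around in uint8.
    diff = np.abs(original.astype(np.int32) - edited.astype(np.int32))
    return int(diff[outside].max()) if outside.any() else 0

# Toy data: a 4x4 black image with a one-pixel edit region at (1, 1).
original = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.float32)
mask[1, 1] = 1.0

localized = original.copy()
localized[1, 1] = 200          # change confined to the region

drifted = localized.copy()
drifted[0, 0] = 5              # small background drift outside the region

clean_score = max_change_outside(original, localized, mask)
drift_score = max_change_outside(original, drifted, mask)
```

Here `clean_score` is zero because the only change lies inside the mask, while `drift_score` is nonzero and flags the background drift.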

Highlights

Region control

Scribble masks or bounding boxes steer refinement to the intended local area.

With or without reference

Optional reference images enable guided recovery; the same framework supports reference-free refinement.

Background preservation

Edits stay localized; training emphasizes stable context outside the region and natural boundaries.
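The region control and background preservation ideas above can be illustrated with a generic mask-and-paste step. This is a hedged sketch only: the page does not say that RefineAnything composites its output this way, and the helpers `box_to_mask` and `composite`, the end-exclusive `(x0, y0, x1, y1)` box convention, and the stand-in `refined` array are all hypothetical.

```python
import numpy as np

def box_to_mask(height, width, box):
    """Binary mask: 1 inside the box, 0 outside.

    box = (x0, y0, x1, y1) in pixel coordinates, end-exclusive (assumed convention).
    """
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask

def composite(original, refined, mask):
    """Take refined pixels inside the mask and original pixels everywhere else."""
    inside = mask[..., None] > 0.5  # hard mask, broadcast over the channel axis
    return np.where(inside, refined, original)

# Toy example: `refined` stands in for the output of any refinement model.
original = np.zeros((8, 8, 3), dtype=np.uint8)
refined = np.full((8, 8, 3), 255, dtype=np.uint8)
mask = box_to_mask(8, 8, (2, 2, 6, 6))
out = composite(original, refined, mask)
```

A hard paste like this guarantees that pixels outside the mask are bit-identical to the input, but it can leave a visible seam at the region border. A soft (feathered) mask would trade that guarantee for smoother boundaries.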

Qualitative comparisons

Reference-free setting: comparisons against baselines.

Reference-based setting: comparisons including strong commercial generators.

Citation

BibTeX:

@article{refineanything2026,
  title        = {RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details},
  author       = {TBD},
  year         = {2026},
  eprint       = {2604.06870},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2604.06870},
}