LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model (CVPR2024)

School of Data Science, Fudan University

Abstract

This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis. As the name implies, LeftRefill horizontally stitches reference and target views together as a whole input. The reference image occupies the left side, while the target canvas is positioned on the right. Then, LeftRefill paints the right-side target canvas based on the left-side reference and specific task instructions. Such a task formulation shares some similarities with contextual inpainting, akin to the actions of a human painter.

This formulation efficiently learns both structural and textural correspondence between reference and target without additional image encoders or adapters. We inject task and view information through the cross-attention modules of the T2I model, and further enable multi-view references via re-arranged self-attention modules. Together, these allow LeftRefill to generate consistent results as a generalized model, without test-time fine-tuning or model modifications. LeftRefill can thus be seen as a simple yet unified framework for reference-guided synthesis. As exemplars, we apply LeftRefill to two different challenges, reference-guided inpainting and novel view synthesis, both built on the pre-trained Stable Diffusion.
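The left-right input described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function name `stitch_left_right`, the mask convention (1 marks pixels the model should paint), and the zero-filled canvas are assumptions made for illustration, and the sketch works on plain pixel arrays for clarity.

```python
import numpy as np


def stitch_left_right(reference, target=None, known_mask=None):
    """Build a side-by-side canvas and a paint mask (hypothetical helper).

    reference:  (H, W, C) array placed on the left half.
    target:     optional (H, W, C) array holding known target pixels.
    known_mask: optional (H, W) boolean array, True where target pixels are known.
    Returns (canvas, mask) with shapes (H, 2W, C) and (H, 2W).
    """
    h, w, c = reference.shape

    # The left half holds the reference; the right half starts as an empty canvas.
    canvas = np.zeros((h, 2 * w, c), dtype=reference.dtype)
    canvas[:, :w] = reference

    # Assumed convention: 1 = to be painted, 0 = kept as given.
    mask = np.zeros((h, 2 * w), dtype=np.uint8)
    mask[:, w:] = 1

    # For inpainting, copy known target pixels and exclude them from painting.
    if target is not None and known_mask is not None:
        keep = known_mask.astype(bool)
        canvas[:, w:][keep] = target[keep]
        mask[:, w:][keep] = 0

    return canvas, mask


reference = np.full((4, 3, 3), 255, dtype=np.uint8)
canvas, mask = stitch_left_right(reference)
print(canvas.shape, mask.shape)
```

With no target given, the whole right half is marked for painting, which loosely corresponds to novel view synthesis; passing a partially known target loosely corresponds to reference-guided inpainting.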

Overview


The formulation of LeftRefill in the multi-view setting.


We use block causal masking to train autoregressive novel view synthesis.
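As a rough illustration of block causal masking, the sketch below builds a boolean attention mask in which every token of a view may attend to all tokens of the same view and of earlier views, but not to later ones. The function name `block_causal_mask`, the equal token count per view, and the convention that `True` means "may attend" are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def block_causal_mask(num_views, tokens_per_view):
    """Return a (N, N) boolean mask with N = num_views * tokens_per_view.

    Entry [q, k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same view as q or to an earlier view.
    """
    # View index of every token, e.g. [0, 0, 1, 1] for 2 views of 2 tokens.
    view_ids = np.repeat(np.arange(num_views), tokens_per_view)

    # Compare the query's view (rows) against the key's view (columns).
    return view_ids[:, None] >= view_ids[None, :]


print(block_causal_mask(2, 2).astype(int))
# [[1 1 0 0]
#  [1 1 0 0]
#  [1 1 1 1]
#  [1 1 1 1]]
```

Unlike a token-level causal mask, attention inside each view stays fully bidirectional; only the ordering between views is causal.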


Video of reference-based inpainting

Results of reference-based inpainting

Results of novel view synthesis

Consistent autoregressive synthesis

Multi-view reference-based synthesis

Real-world generalization

BibTeX


@inproceedings{cao2024leftrefill,
  title={LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model},
  author={Chenjie Cao and Yunuo Cai and Qiaole Dong and Yikai Wang and Yanwei Fu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2024}
}