MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing

1Fudan University, 2Alibaba DAMO Academy, 3Hupan Lab

Abstract

Novel View Synthesis (NVS) and 3D generation have recently achieved prominent improvements. However, these works mainly focus on confined categories or synthetic 3D assets, so they generalize poorly to challenging in-the-wild scenes and cannot be directly combined with 2D synthesis. Moreover, these methods depend heavily on camera poses, limiting their real-world applications. To overcome these issues, we propose MVInpainter, which re-formulates 3D editing as a multi-view 2D inpainting task. Specifically, MVInpainter partially inpaints multi-view images under reference guidance rather than intractably generating an entirely novel view from scratch, which largely reduces the difficulty of in-the-wild NVS and leverages unmasked clues instead of explicit pose conditions. To ensure cross-view consistency, MVInpainter is enhanced by video priors from motion components and appearance guidance from concatenated reference key&value attention. Furthermore, MVInpainter incorporates slot attention to aggregate high-level optical-flow features from unmasked regions, controlling camera movement with pose-free training and inference. Extensive scene-level experiments on both object-centric and forward-facing datasets verify the effectiveness of MVInpainter on diverse tasks, including multi-view object removal, synthesis, insertion, and replacement.

Overview



(a) Overview of the proposed MVInpainter. MVInpainter-O is trained on object-centric data, while MVInpainter-F is trained on forward-facing data; both share an SD-inpainting backbone with different LoRA/motion weights and masking strategies. The object-centric MVInpainter focuses on object-level NVS, while the forward-facing one is devoted to object removal and scene-level inpainting. (b) The Ref-KV is used in the spatial self-attention blocks of the denoising U-Net. (c) The slot-attention-based flow grouping module is used to learn implicit pose features. Dashed boxes in (b) and (c) denote feature concatenation.
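The Ref-KV mechanism in (b) can be illustrated with a minimal NumPy sketch. This is a simplified, single-head illustration (not the released implementation): queries come only from the target view's tokens, while keys and values concatenate tokens from the reference view and the target view, so masked target regions can attend to unmasked reference appearance. The function name and weight matrices here are hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ref_kv_attention(x_tgt, x_ref, Wq, Wk, Wv):
    """Spatial self-attention with concatenated reference key&value (Ref-KV).

    x_tgt: (N_t, d) target-view tokens; x_ref: (N_r, d) reference-view tokens.
    Queries are computed from the target view only, while keys/values are
    concatenated from reference and target tokens, letting the target view
    borrow appearance guidance from the reference frame.
    """
    q = x_tgt @ Wq                                   # (N_t, d)
    kv_in = np.concatenate([x_ref, x_tgt], axis=0)   # (N_r + N_t, d)
    k = kv_in @ Wk
    v = kv_in @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N_t, N_r + N_t)
    return attn @ v                                  # (N_t, d)
```

In a diffusion U-Net this substitution happens inside the existing self-attention blocks, so no extra cross-attention layers are needed; only the key/value sequence length grows.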

Masking Adaption



To determine the mask shape for inference, we propose the masking adaption, which starts from a simple 4-point bottom face of the object. We then apply perspective warping, driven by dense matching of this base plane, to warp the mask into the correct shape for each view.
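The 4-point perspective warp above can be sketched with a direct linear transform (DLT): from four matched corners of the base plane we estimate a homography, then map the mask polygon into the new view. This is a self-contained illustration of the geometry only (in practice a dense matcher supplies the correspondences); the helper names are hypothetical.

```python
import numpy as np

def homography_from_4pts(src, dst):
    """Estimate the 3x3 homography mapping 4 src points to 4 dst points (DLT).

    Each correspondence (x, y) -> (u, v) contributes two linear constraints
    on the flattened homography h; the solution is the null vector of A.
    """
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp_points(H, pts):
    """Apply homography H to an (N, 2) array of points (homogeneous divide)."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]
```

Warping the full mask polygon (rather than re-drawing it per view) keeps the inpainting region consistent with the object's footprint across viewpoints.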


Results of Scene Editing

Results of multi-view inpainted images

Results of 3DGS

Results of multi-view inpainting (object removal)


Results of object-level novel view synthesis



Results of replacement



Generalization of mask adaption
