MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Preprint

1Fudan University, 2Alibaba DAMO Academy, 3Hupan Lab

Abstract

We introduce MVGenMaster, a multi-view diffusion model enhanced with 3D priors to address versatile Novel View Synthesis (NVS) tasks. MVGenMaster leverages 3D priors that are warped using metric depth and camera poses, significantly enhancing both generalization and 3D consistency in NVS. Our model features a simple yet effective pipeline that can generate up to 100 novel views conditioned on variable reference views and camera poses with a single forward process. Additionally, we have developed a comprehensive large-scale multi-view image dataset called MvD-1M, comprising up to 1.6 million scenes, equipped with well-aligned metric depth to train MVGenMaster. Moreover, we present several training and model modifications to strengthen the model with scaled-up datasets. Extensive evaluations across in- and out-of-domain benchmarks demonstrate the effectiveness of our proposed method and data formulation.

Overview



Overall pipeline of MVGenMaster. Inputs fall into two groups: reference views (reference images with their camera poses) and target views (camera poses only). During training, we extract monocular depth from the reference views and align it with SfM, then warp CCM and RGB pixels into the target views as 3D priors. During inference, we use Depth-Pro (single-view) or DUSt3R (multi-view) to obtain metric depth.
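The depth alignment and warping steps described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes a pinhole camera shared by both views, a scale-only least-squares alignment between monocular and SfM depth (the actual alignment may also include a shift), and a nearest-pixel forward warp with a z-buffer. All function names (`align_scale`, `unproject`, `transform`, `project`, `warp_reference_to_target`) are hypothetical.

```python
from typing import Any, List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat = Sequence[Sequence[float]]


def align_scale(mono: Sequence[float], sfm: Sequence[float]) -> float:
    """Scale s minimizing sum((s * mono - sfm)^2) at sparse SfM points.
    Assumption: scale-only alignment, chosen here for brevity."""
    num = sum(m * d for m, d in zip(mono, sfm))
    den = sum(m * m for m in mono)
    return num / den


def unproject(u: float, v: float, depth: float, K: Mat) -> Vec3:
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(p: Vec3, T: Mat) -> Vec3:
    """Apply a 4x4 rigid transform (reference camera -> target camera)."""
    x, y, z = p
    return (
        T[0][0] * x + T[0][1] * y + T[0][2] * z + T[0][3],
        T[1][0] * x + T[1][1] * y + T[1][2] * z + T[1][3],
        T[2][0] * x + T[2][1] * y + T[2][2] * z + T[2][3],
    )


def project(p: Vec3, K: Mat) -> Optional[Tuple[float, float]]:
    """Project a camera-frame point to pixel coordinates; None if behind camera."""
    x, y, z = p
    if z <= 0:
        return None
    return (K[0][0] * x / z + K[0][2], K[1][1] * y / z + K[1][2])


def warp_reference_to_target(
    payload: Sequence[Sequence[Any]],   # per-pixel values, e.g. RGB or 3D coordinates
    depth: Sequence[Sequence[float]],   # metric depth of the reference view
    K: Mat,                             # 3x3 intrinsics (assumed shared by both views)
    ref_to_tgt: Mat,                    # 4x4 relative pose
) -> List[List[Any]]:
    """Forward-warp reference pixels into the target view.
    Unfilled target pixels stay None; a z-buffer keeps the nearest surface."""
    h, w = len(depth), len(depth[0])
    warped: List[List[Any]] = [[None] * w for _ in range(h)]
    zbuf = [[float("inf")] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            p_tgt = transform(unproject(u, v, depth[v][u], K), ref_to_tgt)
            uv = project(p_tgt, K)
            if uv is None:
                continue
            tu, tv = round(uv[0]), round(uv[1])
            if 0 <= tu < w and 0 <= tv < h and p_tgt[2] < zbuf[tv][tu]:
                zbuf[tv][tu] = p_tgt[2]
                warped[tv][tu] = payload[v][u]
    return warped
```

In this sketch the same routine can carry either RGB values or per-pixel 3D coordinates as `payload`, so both kinds of prior would be warped with one pass each. The `None` entries mark target pixels that no reference pixel reaches, which is the region a generative model would have to fill in.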

MvD-1M Dataset with 3D Priors



Results of Novel View Synthesis

1 view to 25 views. CAT3D* is our re-implementation under our settings.

DL3DV validation


CO3Dv2 and MVImgNet validation


Zero-shot validation (DTU, MipNeRF360, Tanks-and-Temples)


More NVS Applications

Text-to-image to multi-view

Interpolation

Results of 3DGS Reconstruction (3-view)