MuSHRoom: dataset used for joint 3D Reconstruction and Novel View Synthesis

Abstract

Metaverse technologies demand accurate, real-time, and immersive modeling on consumer-grade hardware for both non-human perception (e.g., drone/robot/autonomous car navigation) and immersive technologies like AR/VR, requiring both structural accuracy and photorealism. However, there exists a knowledge gap in how to apply geometric reconstruction and photorealism modeling (novel view synthesis) in a unified framework. To address this gap and promote the development of robust and immersive modeling and rendering with consumer-grade devices, we propose a real-world Multi-Sensor Hybrid Room Dataset (MuSHRoom). Our dataset presents exciting challenges: it requires state-of-the-art methods to be cost-effective, to be robust to noisy data and devices, and to jointly learn 3D reconstruction and novel view synthesis instead of treating them as separate tasks, which would make such methods well suited to real-world applications. We benchmark several well-known pipelines on our dataset for joint 3D mesh reconstruction and novel view synthesis. Our dataset and benchmark show great potential for promoting improvements in fusing 3D reconstruction and high-quality rendering in a robust and computationally efficient end-to-end fashion.

Dataset Overview and download

The dataset can be downloaded from here. We also provide code for processing the dataset.

Kinect Spectacular AI Point Cloud

iPhone PolyCam Point Cloud

Reference mesh

Our dataset contains 10 rooms captured with a Kinect, an iPhone, and a Faro scanner. For each keyframe we provide the extracted RGB image, raw depth, completed depth, Spectacular AI/Polycam pose, and Spectacular AI/Polycam point cloud. We also provide the raw video of each capture and a ground truth mesh for each room.

The dataset structure is as follows:

<room_name>
  | —— kinect
      | —— long_capture
          — images/ # extracted RGB images of keyframes
          — depth/ # extracted depth images of keyframes
          — intrinsic/ # intrinsic parameters
          — PointCloud/ # Spectacular AI point cloud of each keyframe
          — pose/ # Spectacular AI pose of each keyframe. These poses are aligned with the metric scale of the depth.
          — calibration.json; data.jsonl; data.mkv; data2.mkv; vio_config.yaml # raw videos and parameters from the Spectacular AI SDK
          — test.txt # image names used for testing within a single sequence
          — transformations_colmap.json # globally optimized COLMAP poses, used for testing with a different sequence
          — transformations.json # Spectacular AI poses saved as JSON. Poses use the OpenGL coordinate convention.
      | —— short_capture
          — images/ # same as long capture
          — depth/ # same as long capture
          — PointCloud/ # same as long capture
          — pose/ # same as long capture
          — intrinsic/ # same as long capture
          — calibration.json; data.jsonl; data.mkv; data2.mkv; vio_config.yaml # same as long capture
          — transformations_colmap.json # same as long capture
          — transformations.json # same as long capture
  | —— iphone
      | —— long_capture
          — images/ # same as Kinect
          — depth/ # same as Kinect
          — polycam_mesh/ # mesh provided by Polycam
          — polycam_pointcloud.ply # point cloud provided by Polycam
          — test.txt # same as Kinect
          — transformations_colmap.json # same as Kinect
          — transformations.json # Polycam poses
      | —— short_capture
          — images/ # same as Kinect
          — depth/ # same as Kinect
          — transformations_colmap.json # same as Kinect
          — transformations.json # same as long capture
  —— gt_mesh.ply # reference mesh used for geometry comparison
  —— gt_pd.ply # reference point cloud used for geometry comparison
  —— icp_iphone.json # alignment transformation matrix for the iPhone sequences
  —— icp_kinect.json # alignment transformation matrix for the Kinect sequences
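
Because the poses in transformations.json use the OpenGL convention, they usually need converting before they can be combined with the depth maps in OpenCV-style code. The sketch below shows one way to do this. It assumes, without confirmation from this page, that the file follows a Nerfstudio-style schema: a "frames" list whose entries hold a "file_path" and a 4x4 camera-to-world "transform_matrix". The function names are hypothetical and should be adapted after inspecting a real file.

```python
import json


def opengl_to_opencv(c2w):
    """Convert a 4x4 camera-to-world matrix from OpenGL to OpenCV axes.

    OpenGL cameras look along -z with +y up; OpenCV cameras look along +z
    with +y down. Negating the second and third columns is equivalent to
    right-multiplying by diag(1, -1, -1, 1).
    """
    return [[row[0], -row[1], -row[2], row[3]] for row in c2w]


def load_poses(meta, convert=True):
    """Map each frame's file path to its camera-to-world matrix.

    ASSUMPTION: `meta` uses the keys "frames", "file_path", and
    "transform_matrix"; the page itself does not document the schema.
    """
    poses = {}
    for frame in meta["frames"]:
        c2w = frame["transform_matrix"]
        poses[frame["file_path"]] = opengl_to_opencv(c2w) if convert else c2w
    return poses


def load_poses_from_file(path, convert=True):
    """Read a transformations.json-style file and return its poses."""
    with open(path) as f:
        return load_poses(json.load(f), convert)


# Tiny in-memory example standing in for a real transformations.json.
example_meta = {
    "frames": [
        {
            "file_path": "images/0.png",
            "transform_matrix": [
                [1, 0, 0, 1],
                [0, 1, 0, 2],
                [0, 0, 1, 3],
                [0, 0, 0, 1],
            ],
        }
    ]
}
example_poses = load_poses(example_meta)
```

The conversion is its own inverse, so applying `opengl_to_opencv` twice returns the original matrix. The translation column is left untouched because only the camera axes change, not the camera position.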

Data processing pipeline

The procedures for recording real-world indoor room data using the Kinect, iPhone, and Faro scanner.

The Kinect and iPhone are used to obtain RGB-D images as inputs, and the Faro scanner is used to capture the reference 3D point cloud for geometry evaluation. In real-world VR/AR applications, users scan an entire room with a device and then wear AR glasses to interact with the environment from arbitrary positions and directions. To mimic this real-life scenario, each room is recorded with a long sequence for training and a shorter sequence for testing, using both the Kinect and the iPhone.
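
Besides testing on the separate short sequence, the test.txt file in each long capture lists images held out for testing within that single sequence. A minimal sketch of such a split follows. It assumes that test.txt contains one image name per line and that names may appear with or without a file extension; neither detail is confirmed on this page, and `split_frames` is a hypothetical helper.

```python
import os


def _stem(name):
    """Return the file name without directory or extension."""
    return os.path.splitext(os.path.basename(name.strip()))[0]


def split_frames(image_names, test_list_text):
    """Split image names into (train, test) using the contents of test.txt.

    ASSUMPTION: `test_list_text` holds one image name per line. Names are
    compared by stem so that "0002" and "images/0002.png" match.
    """
    test_stems = {
        _stem(line) for line in test_list_text.splitlines() if line.strip()
    }
    train = [n for n in image_names if _stem(n) not in test_stems]
    test = [n for n in image_names if _stem(n) in test_stems]
    return train, test


# Example with made-up image names.
example_train, example_test = split_frames(
    ["images/0001.png", "images/0002.png", "images/0003.png"],
    "0002\n",
)
```

Matching by stem is a convenience for this sketch; if the real file stores full relative paths, an exact string comparison would be stricter and safer.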

Novel view synthesis and 3D Reconstruction Results

We show our 3D reconstruction and novel view synthesis results compared with NeuS-facto (Yu et al. 2022) and Nerfacto (Tancik et al. 2023). Our method provides a trade-off between the two tasks.
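
For readers who want to score rendered views themselves, peak signal-to-noise ratio (PSNR) is a common image-quality metric for novel view synthesis. The sketch below computes it over flattened pixel values in [0, 1]. This page does not state which metrics the benchmark reports, so treat PSNR here as an illustrative assumption rather than the official evaluation protocol.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio, in dB, between two flat pixel sequences.

    PSNR = 10 * log10(max_value^2 / MSE), where MSE is the mean squared
    error over all pixel values. Identical images give infinity.
    """
    if len(rendered) != len(reference) or len(rendered) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
example_psnr = psnr([0.0, 0.0, 0.0], [0.1, 0.1, 0.1])
```

Higher values indicate a rendering closer to the reference image. For 8-bit images, pass `max_value=255` instead of rescaling the pixels.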

Novel view synthesis visualization

VR room with Kinect sequence

Kokko with Kinect sequence

Mesh visualization

Our reconstruction with Kinect sequence of coffee room

Our reconstruction with iPhone sequence of coffee room

Conclusion

We have proposed a real-world dataset and a new benchmark with multiple sensors for evaluating pipelines on both 3D reconstruction accuracy and novel view synthesis quality. The new dataset poses more realistic challenges and supports more practical evaluation. Because the inputs are collected with consumer-grade devices, pipelines are encouraged to be robust, generalizable, and computationally efficient. We also propose a new method and evaluate it alongside several popular pipelines; the results show that achieving both geometric accuracy and immersion still has a long way to go. Our dataset can serve as a foundation for developing a unified framework trained in an end-to-end fashion.

Acknowledgement

This research was supported by the Academy of Finland. This work was also carried out with the support of the Centre for Immersive Visual Technologies (CIVIT) research infrastructure, Tampere University, Finland. We thank Valtteri Kaatrasalo for his guidance in using the Spectacular AI SDK.

BibTeX

@misc{ren2023mushroom,
      title={MuSHRoom: Multi-Sensor Hybrid Room Dataset for Joint 3D Reconstruction and Novel View Synthesis}, 
      author={Xuqian Ren and Wenjia Wang and Dingding Cai and Tuuli Tuominen and Juho Kannala and Esa Rahtu},
      year={2023},
      eprint={2311.02778},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

MuSHRoom: Multi-Sensor Hybrid Room Dataset for Joint 3D Reconstruction and Novel View Synthesis

WACV 2024