TL;DR: Our work is driven by the question "Is holistic 3D scene modeling from a single-view real-world image possible using foundation models?" To answer it, we present Diorama: a modular zero-shot open-world system that models synthetic holistic 3D scenes given an image and requires no end-to-end training.
Reconstructing structured 3D scenes from RGB images using CAD objects unlocks efficient and compact scene representations that maintain compositionality and interactability. Existing works propose training-heavy methods that rely either on expensive yet inaccurate real-world annotations or on controllable yet monotonous synthetic data, and thus generalize poorly to unseen objects and domains.
We present Diorama, the first zero-shot open-world system that holistically models 3D scenes from single-view RGB observations without requiring end-to-end training or human annotations. We show the feasibility of our approach by decomposing the problem into subtasks and introducing robust, generalizable solutions to each: architecture reconstruction, 3D shape retrieval, object pose estimation, and scene layout optimization. We evaluate our system on both synthetic and real-world data to show that it significantly outperforms baselines from prior work. We also demonstrate generalization to internet images and the text-to-scene task.
Our system has two major components: (1) Open-world perception for holistic scene understanding of the input image, including object recognition and localization, depth and normal estimation, architecture reconstruction, and scene graph generation (orange box). (2) CAD-based scene modeling for assembly of a clean and compact 3D scene representation through CAD model retrieval, 9-DoF pose estimation, and semantic-aware scene layout optimization (green box).
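The two-stage structure above can be sketched in code. This is a minimal illustration of the modular design only, not the authors' actual implementation: every function, class, and field name here is a hypothetical placeholder, and both stages are stubbed.

```python
# Sketch of Diorama's two-stage modular structure (hypothetical names, stubbed logic).
from dataclasses import dataclass

@dataclass
class SceneObject:
    label: str
    cad_id: str   # identifier of the retrieved CAD model (placeholder)
    pose: tuple   # 9-DoF pose: 3 translation + 3 rotation + 3 scale

def perceive(image):
    """Stage 1, open-world perception (stubbed). A real system would run
    open-vocabulary object recognition and localization, monocular depth and
    normal estimation, architecture reconstruction, and scene graph generation."""
    return {
        "objects": [{"label": "chair"}, {"label": "desk"}],
        "relations": [("chair", "in_front_of", "desk")],
        "architecture": ["floor", "wall"],
    }

def model_scene(percept):
    """Stage 2, CAD-based scene modeling (stubbed): retrieve a CAD shape per
    detected object, estimate its 9-DoF pose, then refine the layout jointly."""
    scene = []
    for obj in percept["objects"]:
        cad_id = f"cad/{obj['label']}"        # placeholder for shape retrieval
        pose = (0, 0, 0, 0, 0, 0, 1, 1, 1)   # placeholder for pose estimation
        scene.append(SceneObject(obj["label"], cad_id, pose))
    return scene

def diorama(image):
    """Full pipeline: perception feeds CAD-based modeling."""
    return model_scene(perceive(image))
```

Because each stage only exchanges plain data (a perception dict, a list of posed objects), either stage can be swapped for a stronger component without touching the other.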
As a modular system, Diorama can be naturally improved by swapping any individual component for a stronger dedicated approach. We choose components according to three principles: open vocabulary, category agnosticism, and robustness to non-exact object matches. The reconstructed scenes benefit from several design choices: a scene graph that maintains spatial relationships among objects, shape retrieval that yields multiple semantically similar arrangements, planar architecture that provides physically plausible support for objects, and layout optimization that refines object poses based on spatial relationships.
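To make the last design choice concrete, here is a toy sketch of semantic-aware layout refinement. It is not the paper's actual solver: it optimizes only 1D positions with plain gradient descent, balancing fidelity to the initial single-view estimates against soft penalties for violated pairwise relations from the scene graph. All names and weights are illustrative assumptions.

```python
# Toy layout refinement (not the paper's solver): adjust object x-positions so
# "left-of" relations from the scene graph hold, while staying close to the
# initial pose estimates. Hinge-squared penalties, fixed-step gradient descent.

def optimize_layout(positions, left_of, margin=0.1, w_rel=5.0, w_init=1.0,
                    lr=0.05, steps=500):
    """positions: {name: x}; left_of: [(a, b)] meaning a should sit left of b
    by at least `margin`. Returns refined positions."""
    init = dict(positions)
    pos = dict(positions)
    for _ in range(steps):
        grad = {k: 0.0 for k in pos}
        # Quadratic pull toward the initial single-view estimate.
        for k in pos:
            grad[k] += 2.0 * w_init * (pos[k] - init[k])
        # Hinge-squared penalty, active only when pos[a] + margin > pos[b].
        for a, b in left_of:
            v = pos[a] + margin - pos[b]
            if v > 0:
                grad[a] += 2.0 * w_rel * v
                grad[b] -= 2.0 * w_rel * v
        for k in pos:
            pos[k] -= lr * grad[k]
    return pos
```

For example, if a mug is detected slightly to the right of the laptop it should be left of, the optimizer pulls the pair apart until the relation is (approximately) satisfied while keeping both objects near their estimated positions. Because the penalty is soft, a small residual violation can remain; a real system would also handle support surfaces and full 9-DoF poses.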
We demonstrate the feasibility of our system on synthetic images for evaluation, and also show generalization to in-the-wild internet images and to the text-to-scene task, where the image input is generated from a text prompt.
Specifically, we evaluate on scenes from the Stanford Scene Database, which contain highly cluttered arrangements ranging from office desk setups to chemistry laboratories. We show similar semantic arrangements using different retrieved 3D shapes. Refer to the paper for detailed quantitative results.
@article{wu2024diorama,
  title={Diorama: Unleashing Zero-shot Single-view 3D Scene Modeling},
  author={Wu, Qirui and Iliash, Denys and Ritchie, Daniel and Savva, Manolis and Chang, Angel X},
  journal={arXiv preprint arXiv:2411.19492},
  year={2024}
}