SceneEval: Evaluating Semantic Coherence in
Text-Conditioned 3D Indoor Scene Synthesis

Simon Fraser University, Alberta Machine Intelligence Institute (Amii)

Abstract

Despite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics primarily assess the realism of generated scenes by comparing them to a set of ground-truth scenes, often overlooking alignment with the input text — a critical factor in determining how effectively a method meets user requirements.

We present SceneEval, an evaluation framework designed to address this limitation. SceneEval includes metrics for both explicit user requirements, such as the presence of specific objects and their attributes described in the input text, and implicit expectations, like the absence of object collisions, providing a comprehensive assessment of scene quality. To facilitate evaluation, we introduce SceneEval-500, a dataset of scene descriptions with annotated ground-truth scene properties.

We evaluate recent scene generation methods using SceneEval and demonstrate its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions. Our results show that current methods struggle to generate scenes that meet user requirements, underscoring the need for further research in this direction.

Challenges

Scene synthesis faces two primary challenges: adhering to explicit user requirements and meeting implicit expectations, such as physical plausibility, which users often assume but do not explicitly specify. As shown in the figure below, both are crucial for practical applications.

Figure: Explicit vs. implicit requirements.

Overview

Given a generated scene and its corresponding annotated properties, SceneEval first matches object instances in the scene to the annotated categories. It then evaluates the scene on a comprehensive set of fidelity and plausibility metrics.

Figure: Overview of the SceneEval evaluation pipeline.
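To make the two stages above concrete, below is a minimal sketch of how such a pipeline could be wired up. All class and function names are illustrative assumptions rather than the actual SceneEval implementation, and the metrics shown (object counts and bounding-box collisions) are simplified stand-ins for the full set of fidelity and plausibility metrics.

# Minimal sketch of the two-stage evaluation flow described above.
# All names and data structures are illustrative assumptions, not the actual SceneEval API;
# the metrics shown (object counts, bounding-box collisions) are simplified stand-ins.
from dataclasses import dataclass

@dataclass
class SceneObject:
    category: str                          # predicted category label, e.g., "bed"
    aabb_min: tuple[float, float, float]   # axis-aligned bounding box corners
    aabb_max: tuple[float, float, float]

@dataclass
class Annotation:
    expected_counts: dict[str, int]        # e.g., {"bed": 1, "nightstand": 2}

def match_instances(scene: list[SceneObject], annotation: Annotation) -> dict[str, int]:
    """Stage 1: count the scene objects that match each annotated category."""
    counts = {cat: 0 for cat in annotation.expected_counts}
    for obj in scene:
        if obj.category in counts:
            counts[obj.category] += 1
    return counts

def object_count_fidelity(matched: dict[str, int], annotation: Annotation) -> float:
    """Fidelity example: fraction of annotated categories with enough instances."""
    hits = sum(matched[cat] >= n for cat, n in annotation.expected_counts.items())
    return hits / max(len(annotation.expected_counts), 1)

def aabb_overlap(a: SceneObject, b: SceneObject) -> bool:
    """Plausibility example: do two axis-aligned bounding boxes intersect?"""
    return all(a.aabb_min[i] < b.aabb_max[i] and b.aabb_min[i] < a.aabb_max[i] for i in range(3))

def collision_rate(scene: list[SceneObject]) -> float:
    """Fraction of object pairs whose bounding boxes collide (lower is better)."""
    pairs = [(i, j) for i in range(len(scene)) for j in range(i + 1, len(scene))]
    if not pairs:
        return 0.0
    return sum(aabb_overlap(scene[i], scene[j]) for i, j in pairs) / len(pairs)

# Usage on a toy scene and annotation:
scene = [SceneObject("bed", (0.0, 0.0, 0.0), (2.0, 1.0, 2.0)),
         SceneObject("nightstand", (1.9, 0.0, 0.0), (2.5, 0.6, 0.6))]
annotation = Annotation(expected_counts={"bed": 1, "nightstand": 2})
matched = match_instances(scene, annotation)
print(object_count_fidelity(matched, annotation))  # 0.5 -- one nightstand is missing
print(collision_rate(scene))                       # 1.0 -- the two boxes overlap

In the actual framework, instance matching and the metrics are considerably richer, but the overall flow mirrors the two stages described above: match object instances to the annotation first, then score the scene.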

Dataset: SceneEval-500

To facilitate evaluation, we introduce SceneEval-500, a dataset of scene descriptions annotated with the expected scene properties. SceneEval-500 contains 500 scene descriptions covering ten room types: bedroom, living room, dining room, playroom, gaming room, kitchen, bathroom, basement, den, and office. We define three difficulty levels (easy, medium, and hard) based on the complexity of each description, measured by the number of objects it specifies. The figure below shows an example entry of medium difficulty in SceneEval-500. The description specifies a basement, a room type rarely seen in existing datasets, and the annotation lists the expected scene properties stated in the text, such as the number of objects.

Figure: Example annotated entry from SceneEval-500.
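For illustration only, a single entry could be represented along the lines below. The field names, the description text, and the object list are invented for this sketch and do not reflect the released annotation schema.

# Hypothetical sketch of a single SceneEval-500 entry; the schema, description,
# and object list are invented for illustration and are not the released format.
entry = {
    "id": "basement_042",           # made-up identifier
    "room_type": "basement",
    "difficulty": "medium",         # one of: easy, medium, hard
    "description": "A basement with a pool table, a bar counter, and two bar stools.",
    "expected_objects": {           # categories and counts stated in the description
        "pool table": 1,
        "bar counter": 1,
        "bar stool": 2,
    },
}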

Results

Below are examples of scenes generated from text descriptions in SceneEval-500, together with the corresponding evaluation results from SceneEval. The descriptions span all three difficulty levels (easy, medium, and hard), and SceneEval provides a comprehensive evaluation of the generated scenes on both fidelity and plausibility.

Figure: Example generated scenes and their SceneEval evaluation results.

BibTeX


      @article{tam2025sceneeval,
          title = {{SceneEval}: Evaluating Semantic Coherence in Text-Conditioned {3D} Indoor Scene Synthesis},
          author = {Tam, Hou In Ivan and Pun, Hou In Derek and Wang, Austin T. and Chang, Angel X. and Savva, Manolis},
          year = {2025},
          eprint = {2503.14756},
          archivePrefix = {arXiv}
      }
      

Acknowledgements

This work was funded in part by the Sony Research Award Program, a CIFAR AI Chair, a Canada Research Chair, NSERC Discovery Grants, and enabled by support from the Digital Research Alliance of Canada. We thank Nao Yamato, Yotaro Shimose, and other members of the Sony team for their feedback. We also thank Qirui Wu, Xiaohao Sun, and Han-Hung Lee for helpful discussions.