Despite advances in indoor 3D scene layout generation, synthesizing scenes with dense object arrangements remains challenging. Existing methods focus on large furniture while neglecting smaller objects, resulting in unrealistically empty scenes. Methods that do place small objects typically ignore arrangement specifications, scattering objects largely at random instead of following the text description.
We present Hierarchical Scene Motifs (HSM): a hierarchical framework for indoor scene generation with dense object arrangements across spatial scales. Indoor scenes are inherently hierarchical, with surfaces supporting objects at different scales, from large furniture on floors to smaller objects on tables and shelves. HSM embraces this hierarchy and exploits recurring cross-scale spatial patterns to generate complex and realistic scenes in a unified manner. Our experiments show that HSM outperforms existing methods by generating scenes that better conform to user input across room types and spatial configurations.
Given a room description and an optional room boundary as input, HSM decomposes the indoor scene hierarchically and identifies valid support regions (highlighted in pink boxes) at each level of the hierarchy. The system then populates these regions by generating and optimizing scene motifs in a unified manner across scales, producing scenes with dense object arrangements.
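For concreteness, the following is a minimal Python sketch of this recursive support-region population loop. All names here (SupportRegion, Obj, generate_motif, optimize_layout, populate) are illustrative placeholders rather than the authors' actual API: the real system conditions motif generation on the text description and performs geometric layout optimization, both of which are stubbed out below.

from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Obj:
    # A placed object; some objects expose new support surfaces
    # (e.g. a table exposes its tabletop).
    name: str
    surfaces: list[SupportRegion] = field(default_factory=list)


@dataclass
class SupportRegion:
    # A horizontal region that can hold objects: a floor, tabletop, or shelf.
    name: str
    objects: list[Obj] = field(default_factory=list)


def generate_motif(description: str, region: SupportRegion) -> list[Obj]:
    # Stand-in for motif generation: propose an arrangement for this region.
    # The real system would condition on the input description here.
    if region.name == "floor":
        return [Obj("table", surfaces=[SupportRegion("tabletop")]), Obj("chair")]
    if region.name == "tabletop":
        return [Obj("lamp"), Obj("book")]
    return []


def optimize_layout(objects: list[Obj], region: SupportRegion) -> None:
    # Stand-in for layout optimization: the real system would resolve
    # collisions and enforce the described spatial relations here.
    pass


def populate(region: SupportRegion, description: str) -> None:
    # Recursively fill a support region, then descend into the new
    # support surfaces exposed by the objects just placed.
    objects = generate_motif(description, region)
    optimize_layout(objects, region)
    region.objects.extend(objects)
    for obj in objects:
        for surface in obj.surfaces:
            populate(surface, description)


room = SupportRegion("floor")
populate(room, "a study with a table and chair, and a lamp and book on the table")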
Objects in the input text are highlighted with colors and underlines, and spatial relationships are emphasized with boxes. HSM produces more coherent spatial arrangements and aligns better with the input text than existing approaches.
Each row shows close-up views of object arrangements from the input description. HSM better follows the spatial relationships and placement instructions specified in the input text.
@article{pun2025hsm,
title = {{HSM}: Hierarchical Scene Motifs for Multi-Scale Indoor Scene Generation},
author = {Pun, Hou In Derek and Tam, Hou In Ivan and Wang, Austin T. and Huo, Xiaoliang and Chang, Angel X. and Savva, Manolis},
year = {2025},
eprint = {2503.16848},
archivePrefix = {arXiv}
}
This work was funded in part by a CIFAR AI Chair, a Canada Research Chair, NSERC Discovery Grants, and enabled by support from the Digital Research Alliance of Canada. We also thank Jiayi Liu, Weikun Peng, and Qirui Wu for helpful discussions.