We represent the scene in two parts: the 2D top‑down semantic map with a fixed physical unit per pixel (middle) and the 3D bounding boxes with orientations for object‑level attributes (right).
Note: The left image is the corresponding rendered scene.
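To make the representation concrete, here is a minimal Python sketch of the two-part scene representation described above. The class and field names and the 0.05 m/pixel resolution are illustrative assumptions, not the data format used in the paper.

from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class OrientedBox3D:
    # Object-level attributes: category plus an oriented 3D bounding box.
    category: str
    size: np.ndarray       # (3,) box extents in meters
    position: np.ndarray   # (3,) box center in world coordinates
    orientation: float     # rotation about the vertical axis, in radians

@dataclass
class SceneLayout:
    # 2D top-down semantic map plus the 3D oriented boxes.
    semantic_map: np.ndarray        # (H, W) integer category labels
    meters_per_pixel: float = 0.05  # fixed physical unit per pixel (assumed value)
    objects: List[OrientedBox3D] = field(default_factory=list)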
From left: (a) the unified diffusion model, conditioned on the room mask and room type c_room. During the denoising process, the arch mask or floor mask embedding is added to the noise input embedding, and the room type embedding is added to the timestep embedding. (b) the object attribute prediction model, which takes a semantic layout map as input; s_i, p_i, and r_i denote the i-th instance's size, position, and orientation. At training time, we use ground-truth instance masks; during inference, the semantic layout map is split into instance masks via connected component analysis (see the sketch below). The layout feature and the mask feature are passed to a cross-attention layer to produce the final object instance feature, which is used to predict attributes. (c) During inference, objects are retrieved to match the category c_i and size s_i, and arranged using the position p_i and orientation r_i.

Our SemLayoutDiff generates scenes with fewer errors and respects architectural constraints, keeping furniture within room boundaries and maintaining clear space around doors (red) and windows (pink), whereas DiffuScene and MiDiffusion do not.
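As a minimal sketch of the instance-mask extraction step in (b), the following splits a semantic layout map into per-category connected components using scipy.ndimage.label. The function name and the background-label convention are our own assumptions, and the downstream attribute prediction network is not shown.

import numpy as np
from scipy import ndimage

def split_into_instance_masks(semantic_map: np.ndarray, background_label: int = 0):
    # Return (category, binary mask) pairs, one per connected component.
    instances = []
    for category in np.unique(semantic_map):
        if category == background_label:
            continue  # skip floor / background pixels
        # Label connected components of this category's pixels.
        labeled, num_components = ndimage.label(semantic_map == category)
        for component_id in range(1, num_components + 1):
            instances.append((int(category), labeled == component_id))
    return instances

Each resulting mask would then be paired with the layout feature and passed through the cross-attention layer to predict that instance's size, position, and orientation.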
Examples of generated scenes produced by our SemLayoutDiff mixed-condition model under different condition types, shown from a top-down view and rendered with Blender. The conditioning room mask is shown at the top-left of each generated scene (for the arch mask, doors are red and windows are pink).
Sample results showing diverse textured scene layouts generated by our mixed-condition model.
@article{sun2025semlayoutdiff,
title={{SemLayoutDiff}: Semantic Layout Generation with Diffusion Model for Indoor Scene Synthesis},
author={Xiaohao Sun and Divyam Goel and Angel X. Chang},
year={2025},
eprint={2508.18597},
archivePrefix={arXiv},
}
This work was funded in part by a CIFAR AI Chair and NSERC Discovery Grants, and enabled by support from the Digital Research Alliance of Canada and a CFI/BCKDF JELF. We thank Ivan Tam for help with running SceneEval; Yiming Zhang and Jiayi Liu for suggestions on figures; Derek Pun, Dongchen Yang, Xingguang Yan, and Manolis Savva for discussions, proofreading, and paper suggestions. We also thank the anonymous reviewers for their feedback.