SINGAPO: Single Image Controlled Generation of Articulated Parts in Objects

Simon Fraser University, Canada-CIFAR AI Chair, Amii
TL;DR: We present a generative method that reconstructs 3D articulated objects from a single image observing the object in its resting state from a random view. Our goal is to generate objects that are visually consistent with the input image and kinematically plausible.
Our method can generate objects with varying part geometry and motion to account for the ambiguity inherent in a single-image input.
Teaser.

Abstract

We address the challenge of creating 3D assets for household articulated objects from a single image. Prior work on articulated object creation either requires multi-view, multi-state input or allows only coarse control over the generation process. These limitations hinder the scalability and practicality of articulated object modeling.

In this work, we propose a method that generates articulated objects from a single image. Given an observation of the object in a resting state from an arbitrary view, our method generates an articulated object that is visually consistent with the input image. To capture the ambiguity in part shape and motion posed by a single view of the object, we design a diffusion model that learns the plausible variations of objects in terms of geometry and kinematics.

To tackle the complexity of generating structured data with attributes in multiple domains, we design a pipeline that produces articulated objects from high-level structure down to geometric detail in a coarse-to-fine manner, using a part connectivity graph and part abstractions as intermediate proxies.

Overview of the pipeline.

Our experiments show that our method outperforms the state of the art in articulated object creation by a large margin in terms of the realism of the generated objects, their resemblance to the input image, and reconstruction quality.

Methods Overview

Method pipeline.

Our method takes an object image as input and generates the attributes of its articulated parts, which are then used to assemble the object via part mesh retrieval. We design a DDPM-based model for part generation, where each part is represented by a set of shape and motion attributes. The generation is guided by (1) a part connectivity graph extracted using GPT-4o and (2) patch features of the input image encoded by DINOv2.
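
To make this flow concrete, the following is a minimal sketch of the generation loop. All helper names and sizes here are hypothetical placeholders, not the authors' API; in practice the stubs would call GPT-4o, DINOv2, the trained denoiser, and a part mesh database, respectively.

    import torch

    NUM_PARTS, ATTR_DIM, STEPS = 8, 6, 1000      # illustrative sizes

    def extract_part_graph(image):               # stand-in for GPT-4o parsing
        return torch.zeros(NUM_PARTS, NUM_PARTS, dtype=torch.bool)

    def encode_patches(image):                   # stand-in for DINOv2 encoding
        return torch.randn(256, 768)             # (num_patches, feat_dim)

    def denoise_step(x, t, graph, feats):        # stand-in for one reverse
        return x                                 # DDPM step of the denoiser

    def retrieve_part_meshes(attrs):             # stand-in for mesh retrieval
        return [f"mesh_{i}" for i in range(len(attrs))]

    def generate(image):
        graph = extract_part_graph(image)        # (1) part connectivity graph
        feats = encode_patches(image)            # (2) image patch features
        x = torch.randn(NUM_PARTS, ATTR_DIM)     # (3) noisy part attributes
        for t in reversed(range(STEPS)):         #     DDPM reverse process,
            x = denoise_step(x, t, graph, feats) #     conditioned on (1) + (2)
        return retrieve_part_meshes(x)           # (4) assemble via retrieval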

Our denoising network is built from layers of attention blocks, each consisting of three self-attentions with different masking strategies and one cross-attention module. The graph constraint is injected into the graph relation module by converting the graph to an adjacency matrix that serves as the attention mask. The patch features act as the keys and values in the image cross-attention (ICA) to condition the part arrangement.
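
As a concrete illustration, below is a minimal PyTorch sketch of one such attention block. Only the use of the adjacency matrix as the attention mask in the graph relation module comes from the description above; the other two masking strategies are left as unmasked placeholders, and all names and dimensions are assumptions.

    import torch
    import torch.nn as nn

    class AttnBlock(nn.Module):
        # One block: three masked self-attentions over part tokens, then an
        # image cross-attention (ICA) where the DINOv2 patch features provide
        # the keys and values. Names and sizes are illustrative.
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.self_attns = nn.ModuleList(
                nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3))
            self.ica = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))

        def forward(self, parts, masks, patches):
            # parts: (B, P, dim) part tokens; patches: (B, N, dim) DINOv2
            # features; masks: three (P, P) bool masks, True = "may not attend".
            for attn, norm, mask in zip(self.self_attns, self.norms, masks):
                h, _ = attn(parts, parts, parts, attn_mask=mask)
                parts = norm(parts + h)
            h, _ = self.ica(parts, patches, patches)  # parts query the image
            return self.norms[3](parts + h)

    def graph_mask(adj):
        # Part connectivity graph -> attention mask: each part may attend to
        # itself and its neighbours; True entries are blocked in PyTorch.
        return ~(adj | torch.eye(adj.shape[0], dtype=torch.bool))

    B, P, N, dim = 2, 6, 256, 256
    adj = torch.zeros(P, P, dtype=torch.bool)
    adj[0, 1:] = True; adj[1:, 0] = True         # star graph rooted at part 0
    masks = [None, graph_mask(adj), None]        # None = unmasked placeholder
    block = AttnBlock(dim)
    parts = torch.randn(B, P, dim)
    patches = torch.randn(B, N, dim)
    out = block(parts, masks, patches)           # (B, P, dim)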

Interestingly, each part learns to focus on its relevant patches in the image during cross-attention (as shown in the visualization below), indicating that each part anchors a specific region of the image during generation. This part-patch correspondence is learned without any explicit supervision during training, eliminating the need for part detection or segmentation as input.
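
Continuing the sketch above, such a visualization can be produced by reading out the ICA attention weights and reshaping each part's row onto the image patch grid. The 16x16 grid assumes 256 patches (e.g. a 224x224 input to a DINOv2 model with patch size 14); all variable names come from the previous sketch, not the authors' code.

    # Weights averaged over heads: (B, P, N), one row of patch scores per part.
    _, attn_w = block.ica(out, patches, patches, need_weights=True)
    grid = attn_w[0, 2].reshape(16, 16)          # part 2 on the 16x16 patch grid

    import matplotlib.pyplot as plt
    plt.imshow(grid.detach().numpy())
    plt.title("attention of part 2 over image patches")
    plt.show()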

Attention visualization.


Qualitative Comparison

We compare our method with (1) URDFormer and (2) NAP-ICA (NAP with our ICA module plugged in to give it image-conditioning capability). All methods are trained on the same data, collected from the PartNet-Mobility dataset with several augmentation strategies (see the paper appendix for details).

Here we show the comparison on the PartNet-Mobility dataset. Incorrectly predicted part connectivity graphs are marked with a red box. The colors of the graph nodes correspond to the colors of the parts in the object.

Qualitative comparison on PartNet-Mobility


We also show a comparison on the ACD dataset in a zero-shot setting.

Qualitative comparison on ACD

BibTeX


      @article{jiayi2024singapo,
          author    = {Liu, Jiayi and Iliash, Denys and Chang, Angel X. and Savva, Manolis and Mahdavi-Amiri, Ali},
          title     = {{SINGAPO}: Single Image Controlled Generation of Articulated Parts in Objects},
          year      = {2024},
          journal   = {arXiv preprint arXiv:2410.16499}
      }