Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos

1Simon Fraser University, 2Shanghai Jiao Tong University

Abstract

Articulated objects are prevalent in daily life. Understanding their kinematic structure and reconstructing them have numerous applications in embodied AI and robotics. However, current methods require carefully captured data for training or inference, preventing practical, scalable, and generalizable reconstruction of articulated objects. We focus on reconstructing an articulated object from a casually captured RGBD video shot with a hand-held camera. A casually captured video of an interaction with an articulated object is easy to acquire at scale using smartphones. However, this setting is quite challenging, as the object and camera move simultaneously and there are significant occlusions as the person interacts with the object. To tackle these challenges, we introduce a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video. To evaluate our method under this new setting, we build a synthetic dataset of 784 videos containing 284 objects across 11 categories, 20× larger than datasets used in prior work. We compare our approach with existing methods that also take video as input. Experiments show that our method can reconstruct synthetic and real articulated objects across different categories from dynamic RGBD videos, outperforming existing methods significantly.

Video Summary

Pipeline Overview

Given a casually captured RGBD video, our pipeline first estimates joint parameters and a movable part segmentation via image feature matching. Then, a gradient-based optimization framework refines these initial estimates against a surface point cloud representation of the object acquired through 3D reconstruction. Unlike some existing work that models articulated objects with implicit representations, our pipeline explicitly parameterizes the articulation joint and other relevant parameters. Moreover, our pipeline neither relies on an external library nor requires additional data for fine-tuning; it only uses pretrained models.
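For concreteness, here is a minimal sketch of what such an explicit joint parameterization can look like in Python: a revolute or prismatic joint described by an axis direction, a pivot point, and a scalar joint state, from which a rigid part transform is computed in closed form. The class and field names are hypothetical illustrations, not the paper's exact interface.

from dataclasses import dataclass
import numpy as np

@dataclass
class JointParams:
    # Hypothetical explicit articulation parameterization (illustrative field names).
    joint_type: str      # "revolute" or "prismatic"
    axis: np.ndarray     # unit 3-vector: rotation axis or translation direction
    pivot: np.ndarray    # a point on the axis (ignored for prismatic joints)
    state: float         # joint angle in radians, or translation in meters

    def transform(self) -> np.ndarray:
        # 4x4 rigid transform that moves the part from its initial state by `state`.
        T = np.eye(4)
        if self.joint_type == "prismatic":
            T[:3, 3] = self.state * self.axis
            return T
        # Rodrigues' formula for a rotation about `axis` passing through `pivot`.
        k = self.axis / np.linalg.norm(self.axis)
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + np.sin(self.state) * K + (1 - np.cos(self.state)) * (K @ K)
        T[:3, :3] = R
        T[:3, 3] = self.pivot - R @ self.pivot
        return T

Because the part transform is an explicit function of a handful of parameters, the same parameterization (written in a differentiable framework such as PyTorch) is what a refinement stage can optimize directly.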

Coarse Prediction

An overview of our coarse prediction pipeline. We first use feature matching in the static regions to estimate relative camera poses and align all observations to the same coordinate frame. Then, for each pair of frames, we use feature matching in the dynamic regions to compute the transformation of the moving part and estimate the joint parameters. Finally, we average the per-pair results to produce the joint parameter estimate.
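The two geometric subroutines this stage relies on can be sketched as follows for a single frame pair: a least-squares rigid transform from matched 3D feature points (the Kabsch algorithm), and the recovery of a revolute axis, pivot, and angle from that transform. This is a generic reconstruction under those assumptions, not the paper's exact implementation, and the function names are ours.

import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    # Least-squares rigid transform (Kabsch) mapping matched 3D points src -> dst, both (N, 3).
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ S @ U.T
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = dst_c - R @ src_c
    return T

def revolute_from_transform(T: np.ndarray):
    # Recover the axis direction, a point on the axis, and the rotation angle from a 4x4 rigid transform.
    R, t = T[:3, :3], T[:3, 3]
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    w, v = np.linalg.eig(R)
    axis = np.real(v[:, np.argmin(np.abs(w - 1.0))])   # eigenvector for eigenvalue 1
    axis /= np.linalg.norm(axis)
    # A fixed point p of the motion satisfies (I - R) p = t; solve in the least-squares sense.
    pivot, *_ = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)
    return axis, pivot, angle

Applying rigid_transform to static-region matches gives the relative camera pose; applying it to dynamic-region matches (after removing camera motion) gives the part motion, and revolute_from_transform turns that motion into a per-pair joint estimate that can be averaged across frame pairs.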

Refinement

An overview of our refinement pipeline. We transform the observations in the video back to the initial state using the camera poses and joint parameters. We then compute the Chamfer distance from the transformed observations to the object surface as a loss function and optimize the relevant parameters.
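A minimal PyTorch sketch of this idea, assuming a single revolute joint: each frame's moving-part points are rotated back to the initial state with the current axis, pivot, and per-frame angle, and a one-directional Chamfer distance to the reconstructed surface point cloud is minimized with Adam. The function and variable names are ours, and details such as loss weighting and which parameters are frozen are assumptions rather than the paper's exact procedure.

import torch

def chamfer_to_surface(points: torch.Tensor, surface: torch.Tensor) -> torch.Tensor:
    # One-directional Chamfer: mean squared distance from each point to its nearest surface point.
    d = torch.cdist(points, surface)            # (N, M) pairwise distances
    return (d.min(dim=1).values ** 2).mean()

def refine(obs_frames, surface, axis, pivot, angles, steps=200, lr=1e-2):
    # Refine a revolute joint (axis, pivot) and per-frame angles by gradient descent.
    # obs_frames: list of (N_i, 3) moving-part point clouds already in a common world frame.
    axis = axis.clone().requires_grad_(True)
    pivot = pivot.clone().requires_grad_(True)
    angles = angles.clone().requires_grad_(True)
    opt = torch.optim.Adam([axis, pivot, angles], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        a = axis / axis.norm()
        zero = torch.zeros_like(a[0])
        K = torch.stack([torch.stack([zero, -a[2], a[1]]),
                         torch.stack([a[2], zero, -a[0]]),
                         torch.stack([-a[1], a[0], zero])])
        loss = torch.zeros_like(a[0])
        for pts, ang in zip(obs_frames, angles):
            # Rotate this frame's points back to the initial state (i.e., by -ang about the axis).
            R = torch.eye(3, dtype=a.dtype) + torch.sin(-ang) * K + (1 - torch.cos(-ang)) * (K @ K)
            undone = (pts - pivot) @ R.T + pivot
            loss = loss + chamfer_to_surface(undone, surface)
        loss.backward()
        opt.step()
    return axis.detach(), pivot.detach(), angles.detach()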

Dataset Overview

Since the problem setting and task setup we described have not been previously addressed, we build a new dataset for evaluation. We select 284 synthetic objects from 11 categories in the PartNet-Mobility dataset to generate dynamic videos as input and evaluate the performance of our method. Compared to datasets used in prior work, our dataset contains 20× more objects and thus provides a more robust evaluation of different methods for reconstructing articulated objects from dynamic videos.

Data Generation Setup

Scene illustration

We build a simple cuboid environment in the SAPIEN simulator with realistic textures and place the object inside to manipulate it.

We first render multi-view RGBD images of the object at its initial state to generate the object surface point cloud shown on the right. Then, to increase the diversity of the videos, we place two cameras in front of the moving part of the object and record the interaction from two different views.
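For reference, the standard way to fuse such multi-view RGBD renders into one surface point cloud is to back-project each depth map with the camera intrinsics and move it to world space with the known camera pose. A minimal NumPy sketch is below; it assumes a pinhole camera with depth along +z, which may differ from the simulator's native camera convention.

import numpy as np

def backproject(depth: np.ndarray, K: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    # Lift an (H, W) depth map in meters to world-space 3D points using intrinsics K (3x3)
    # and a 4x4 camera-to-world pose. Returns an (N, 3) array of valid points.
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    valid = z > 0
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)[valid]
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]

# Merge all rendered views into one object surface point cloud:
# surface = np.concatenate([backproject(d, K, pose) for d, pose in zip(depths, poses)], axis=0)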

Bar charts: data distribution across object categories and across joint types.

We finally generate 784 videos of 284 objects across 11 categories. The data distribution across different categories is shown in the bar chart above.

Data Samples

Experiment Results

Synthetic Data

Qualitative comparison videos on synthetic data: Input Video, Ground Truth, Articulate-Anything, RSRD, and Ours (failure cases are marked as failed).

Real Data

Qualitative comparison videos on real data: Input Video, Articulate-Anything, RSRD, and Ours (failure cases are marked as failed).

Acknowledgments

This work was funded in part by a Canada Research Chair and an NSERC Discovery Grant, and enabled by support from the Digital Research Alliance of Canada. The authors would like to thank Jiayi Liu, Xingguang Yan, Austin T. Wang, Hou In Ivan Tam, and Morteza Badali for valuable discussions, and Yi Shi for proofreading.

BibTeX


      @article{peng2025videoarticulation,
        author  = {Weikun Peng and Jun Lv and Cewu Lu and Manolis Savva},
        title   = {Generalizable Articulated Object Reconstruction from Casually Captured RGBD Videos},
        journal = {arXiv preprint arXiv:2506.08334},
        year    = {2025},
      }