EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates

Simon Fraser University
arXiv 2026

We present EgoFun3D: a coordinated task, dataset, and benchmark for modeling interactive 3D objects from egocentric videos. Given an egocentric video as input, the output is a simulation-ready interactive object (e.g., a faucet handle that starts water flow from the faucet spout). We break the task down into 4 steps and propose a baseline approach built from off-the-shelf components. Our function template representation, produced by the proposed system, can be compiled into executable code for a simulator of choice.

Abstract

We present EgoFun3D, a coordinated task formulation, dataset, and benchmark for modeling interactive 3D objects from egocentric videos. Interactive objects are of high interest for embodied AI but scarce, making modeling from readily available real-world videos valuable. Our task focuses on obtaining simulation-ready interactive 3D objects from egocentric video input. While prior work largely focuses on articulations, we capture general cross-part functional mappings (e.g., rotation of stove knob controls stove burner temperature) through function templates, a structured computational representation. Function templates enable precise evaluation and direct compilation into executable code across simulation platforms. To enable comprehensive benchmarking, we introduce a dataset of 271 egocentric videos featuring challenging real-world interactions with paired 3D geometry, segmentation over 2D and 3D, articulation and function template annotations. To tackle the task, we propose a 4-stage pipeline consisting of: 2D part segmentation, reconstruction, articulation estimation, and function template inference. Comprehensive benchmarking shows that the task is challenging for off-the-shelf methods, highlighting avenues for future work.

Part Functionality


Illustration of a typical form of human-object interaction. An agent interacts with a receptor, changing its state. Part functionality defines how the state change of the receptor maps to the state change of the effector. On the right, we provide an example of human interacting with a knob of the stove. The part function triggers the temperature change of the burner after knob actuation.
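As a concrete illustration, the numerical mapping families used in our function templates (e.g., linear, step, binary) could be sketched as below. The exact parameterization (scale, offset, thresholds) is an assumption for illustration, not the paper's precise template definition.

```python
# Hypothetical sketch of function template mappings from receptor state
# (e.g., knob rotation) to effector state (e.g., burner temperature).

def linear_map(x, scale=1.0, offset=0.0):
    """Effector state varies proportionally with receptor state."""
    return scale * x + offset

def step_map(x, thresholds=(0.25, 0.5, 0.75)):
    """Effector state jumps between discrete levels (e.g., stove settings)."""
    return sum(1 for t in thresholds if x >= t)

def binary_map(x, threshold=0.5):
    """Effector is either on or off (e.g., a light switch)."""
    return 1 if x >= threshold else 0
```

Because each mapping is a plain function of a scalar state, a template can be compiled directly into simulator callbacks that update the effector whenever the receptor is actuated.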

Baseline Method


Our baseline framework. We break down the task into 4 steps that are individually targeted with off-the-shelf components. First, a VLM generates part descriptions which are used to segment the parts in the video. Then, the geometry of the receptor and the effector are reconstructed, articulation parameters are estimated, and the function template is inferred. These outputs are combined to build the interactive object.
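The 4-stage structure of the baseline can be sketched as the following orchestration, where every stage function is a hypothetical stub standing in for the off-the-shelf component used at that step (not the authors' actual code):

```python
# Sketch of the 4-stage baseline pipeline; all stage bodies are placeholder
# stubs whose return values are illustrative assumptions.

def segment_parts(frames):
    # Stage 1: VLM-generated part descriptions drive 2D part segmentation.
    return {"receptor": "knob_mask", "effector": "burner_mask"}

def reconstruct(frames, parts):
    # Stage 2: reconstruct receptor and effector geometry from the video.
    return {"receptor_mesh": None, "effector_mesh": None}

def estimate_articulation(geometry):
    # Stage 3: estimate joint type and parameters for the receptor.
    return {"type": "revolute", "axis": (0.0, 0.0, 1.0)}

def infer_function_template(frames, parts):
    # Stage 4: pick a physical effect and a numerical mapping.
    return {"effect": "geometry_change", "mapping": "linear"}

def build_interactive_object(frames):
    parts = segment_parts(frames)
    geometry = reconstruct(frames, parts)
    joint = estimate_articulation(geometry)
    template = infer_function_template(frames, parts)
    # Combine all stage outputs into one simulation-ready object description.
    return {"parts": parts, "geometry": geometry,
            "joint": joint, "template": template}
```

Keeping the stages decoupled like this is what lets each one be swapped for a different off-the-shelf method and benchmarked in isolation.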

EgoFun3D Dataset

Figure panels: 2D Segmentation; 3D Segmentation and Articulation; Function Template (physical effect: fluid change, mapping: linear; physical effect: geometry change, mapping: step; physical effect: illumination change, mapping: binary); Interactive Object. Data sources: Ego-Exo4D, FunGraph3D, self-captured.

Examples of annotations in our dataset. We provide 2D segmentation masks for hands, receptor (in teal), effector (in orange), and the whole object. We annotate part segmentation for receptor and effector on reconstructed 3D meshes. For articulation, we annotate revolute and prismatic joints, shown as red and green arrows respectively. For the function template, we pick one of four physical effects and one of four numerical expressions. Finally, we show concrete instantiations of interactive objects in different simulators: Genesis (left), Isaac Sim (middle), BEHAVIOR (right).

Object category distributions for physical effects (left) and mappings (right).

Distributions of our egocentric video dataset across object categories. There are prominent long-tail distributions across categories, physical effects, and function mappings, primarily due to biases inherited from source datasets such as Ego-Exo4D.

Benchmark Results

2D Segmentation

Model | Receptor IoU (%) | Effector IoU (%) | Avg. IoU (%) | Receptor Success (%) | Effector Success (%) | Avg. Runtime (s)
VisionReasoner | 14.6 | 34.9 | 24.7 | 7.4 | 33.9 | 452
SAM3 & Qwen3-VL | 30.0 | 47.9 | 38.8 | 23.2 | 55.0 | 2012
SAM3 & Molmo2 | 14.0 | 29.4 | 21.7 | 6.6 | 20.7 | 392
X-SAM | 2.5 | 15.0 | 8.7 | 0.4 | 9.2 | 255
Sa2VA | 14.8 | 43.4 | 29.1 | 1.8 | 33.2 | 2006

As downstream pipeline steps rely on segmentation, we pass only parts with average IoU greater than 50% to the downstream modules for further evaluation. We count such cases as segmentation successes and additionally report success rates. We find that SAM3 & Qwen3-VL outperforms the other methods by a large margin, but its runtime is far higher than the alternatives.
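The success criterion above can be sketched as follows. The 50% average-IoU threshold is from the text; the per-frame averaging details are an assumption for illustration.

```python
import numpy as np

# Sketch of the segmentation success criterion: a part counts as a
# success if its average mask IoU over the video exceeds 50%.

def mask_iou(pred, gt):
    """IoU between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def is_success(pred_masks, gt_masks, threshold=0.5):
    """Average IoU across frames must exceed the threshold."""
    ious = [mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)]
    return sum(ious) / len(ious) > threshold
```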

Ground Truth | SAM3 & Qwen3-VL

Example 2D segmentation results. We find that SAM3 with Qwen3-VL provides the best segmentation. The main challenges in this subtask are segmentation of incorrect parts (left) and confusion between part instances across frames (middle). Performance on videos featuring more static viewpoints and no part instance ambiguity is better, though such videos are rare (right).

Reconstruction

Method | Receptor CD (m^2) | Effector CD (m^2) | Total CD (m^2) | Camera Rot. Err. (rad) | Camera Tr. Err. (m)
MapAnything | 0.380 | 0.953 | 0.580 | 1.033 | 0.742
Depth Anything 3 | 0.026 | 0.014 | 0.016 | 0.045 | 0.049
ViPE | 0.034 | 0.021 | 0.025 | 0.046 | 0.058

Evaluating reconstruction. We report the median Chamfer distance and the mean camera pose prediction error. Depth Anything 3 performs best among the methods we benchmark. MapAnything severely underperforms due to camera prediction errors.
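For reference, a symmetric squared Chamfer distance (consistent with the m^2 units in the table, though the exact variant used in the benchmark is our assumption) can be computed as:

```python
import numpy as np

# Sketch of a symmetric squared Chamfer distance between two point sets,
# each given as an (N, 3) array. This is one common variant; the exact
# formulation used in the benchmark is an assumption.

def chamfer_distance(a, b):
    """Sum of mean squared nearest-neighbor distances in both directions."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

The brute-force pairwise computation is O(N*M); for large reconstructed point clouds a KD-tree nearest-neighbor query is the usual substitute.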

Reconstruction examples (two scenes): Ground Truth | Depth Anything 3 | MapAnything | ViPE

Example results for reconstruction. MapAnything exhibits severe drifting issues as predicted camera poses for different video frames are inaccurate. Other approaches also exhibit significant artifacts. Overall, reconstruction from our egocentric video data is highly challenging for all methods.

Articulation Estimation

Method | Joint Axis Err. (rad) | Joint Origin Err. (m) | Joint Type Acc. (%) | Failure Rate (%)
ArtiPoint | 1.057 | 0.346 | 74.2 | 46.4
iTACO | 1.022 | 0.665 | 26.8 | 5.6

Evaluating articulation parameter estimation. We report the mean error across videos that successfully pass through the whole pipeline. We find that ArtiPoint is more accurate than iTACO, but less robust. Overall performance for both methods is low, indicating that articulation estimation is one of the pipeline's bottlenecks.
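The joint axis error reported above is an angle in radians between the predicted and ground-truth joint axes; a minimal sketch is below. Treating the axis as orientation-free (ignoring its sign via the absolute dot product) is our assumption.

```python
import numpy as np

# Sketch of a joint-axis angular error metric: the angle (in radians)
# between predicted and ground-truth joint axes, ignoring axis sign.

def axis_angle_error(pred_axis, gt_axis):
    p = pred_axis / np.linalg.norm(pred_axis)
    g = gt_axis / np.linalg.norm(gt_axis)
    cos = np.clip(abs(np.dot(p, g)), 0.0, 1.0)  # clip guards arccos domain
    return float(np.arccos(cos))
```

Under this convention the error ranges from 0 to pi/2, so the ~1.0 rad mean errors in the table sit near the worst case.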

Articulation examples (two scenes): Ground Truth | ArtiPoint | iTACO

Example results for articulation estimation. Red arrows refer to revolute joints and green arrows refer to prismatic joints. In the left example, iTACO predicts incorrect joint types, whereas ArtiPoint is correct. However, both methods struggle with small parts such as the stove knob shown here.

Function Prediction

Method | Physical Effect Acc. (%) | Mapping Acc. (%) | Overall Acc. (%)
Gemini 3 Flash | 95.2 | 97.6 | 92.9
GPT-5 mini | 88.1 | 90.5 | 83.3
Molmo2 8B | 90.5 | 31.0 | 28.6
Qwen3-VL 8B | 97.6 | 76.2 | 76.2

Evaluation of function template inference accuracy. We report prediction accuracy for the physical effect, the mapping, and overall; a function template is correct only if both effect and mapping are correct. We report accuracy only on videos where both receptor and effector segmentation IoUs are larger than 0.5. Among the four VLMs we benchmark on this task, Gemini 3 Flash performs best.
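The overall-accuracy criterion described above (both fields must match) can be sketched as below; the field names are assumptions for illustration.

```python
# Sketch of the overall-accuracy criterion: a predicted function template
# counts as correct only if BOTH the physical effect and the mapping match
# ground truth. Field names ("effect", "mapping") are hypothetical.

def template_accuracy(preds, gts):
    effect_ok = [p["effect"] == g["effect"] for p, g in zip(preds, gts)]
    mapping_ok = [p["mapping"] == g["mapping"] for p, g in zip(preds, gts)]
    both_ok = [e and m for e, m in zip(effect_ok, mapping_ok)]
    n = len(preds)
    return sum(effect_ok) / n, sum(mapping_ok) / n, sum(both_ok) / n
```

Since both fields must match, overall accuracy is bounded above by the smaller of the two per-field accuracies, which is why Molmo2 8B's low mapping accuracy caps its overall score.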

Final Interactive Object

Qualitative results of the final outputs of our system. For each pair of videos, the left video shows the ground truth and the right video shows the prediction. The faucet results are from Genesis, the stove and lamp results are from BEHAVIOR, and the fridge door result is from Isaac Sim. We use teal to indicate receptors and orange to indicate effectors.

Acknowledgments

This work was funded in part by a Canada Research Chair and an NSERC Discovery Grant, and enabled by support from the Digital Research Alliance of Canada. The authors would like to thank Tianrun Hu from the National University of Singapore for collecting data, and Jiayi Liu, Xingguang Yan, Austin T. Wang, and Morteza Badali for valuable discussions and proofreading.

BibTeX


@article{peng2026egofun3d,
  title={{EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates}},
  author={Peng, Weikun and Iliash, Denys and Savva, Manolis},
  journal={arXiv preprint arXiv:2604.11038},
  year={2026}
}