MultiScan

MultiScan provides a dataset of 3D scans annotated with object parts and their articulation parameters.
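
The exact annotation schema is not described here, so the following is a hypothetical sketch of what a part articulation record could look like; all field names are illustrative assumptions, not the actual MultiScan format.

```python
# Hypothetical articulation annotation record; field names are assumptions,
# not the real MultiScan schema.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ArticulationAnnotation:
    part_id: int                        # id of the annotated part in the scan
    motion_type: str                    # "rotation" (e.g. door) or "translation" (e.g. drawer)
    axis: Tuple[float, float, float]    # unit direction of the motion axis
    origin: Tuple[float, float, float]  # a point the axis passes through
    motion_range: Tuple[float, float]   # (min, max) extent, in radians or meters

door = ArticulationAnnotation(
    part_id=3, motion_type="rotation",
    axis=(0.0, 0.0, 1.0), origin=(0.5, 0.1, 0.0),
    motion_range=(0.0, 1.57),           # opens up to ~90 degrees
)
```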

Openable Part Detection

We address the task of predicting which parts of an object can open and how they move when they do. The input is a single image of an object; as output, we detect the openable parts and estimate the motion parameters describing the articulation of each.
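
To make the output concrete, here is a minimal sketch (not the OPD codebase) of how predicted motion parameters, a motion type, axis, and origin, can be turned into a rigid transform that articulates a part by a given amount:

```python
# Minimal sketch: turn predicted motion parameters into a 4x4 rigid transform.
import numpy as np

def motion_to_transform(motion_type, axis, origin, amount):
    """Transform for articulating a part by `amount`
    (radians for rotation, meters for translation)."""
    axis = np.asarray(axis, dtype=float)
    axis /= np.linalg.norm(axis)
    T = np.eye(4)
    if motion_type == "translation":
        T[:3, 3] = amount * axis            # slide along the axis (e.g. a drawer)
    else:                                   # rotation about an axis through `origin`
        x, y, z = axis
        K = np.array([[0, -z, y], [z, 0, -x], [-y, x, 0]])  # cross-product matrix
        R = np.eye(3) + np.sin(amount) * K + (1 - np.cos(amount)) * (K @ K)  # Rodrigues
        T[:3, :3] = R
        T[:3, 3] = np.asarray(origin) - R @ np.asarray(origin)
    return T

# Open a hinged door part by 45 degrees:
T = motion_to_transform("rotation", axis=(0, 0, 1), origin=(0.5, 0.1, 0.0), amount=np.pi / 4)
```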

Related projects: OPD, OPDMulti

3DHOI

Human-object interactions with articulated objects are common in everyday life, but it is challenging to infer an articulated 3D object model from an RGB video showing a person manipulating the object. We canonicalize the task of articulated 3D human-object interaction reconstruction from RGB video, and carry out a systematic benchmark of four methods for this task: 3D plane estimation, 3D cuboid estimation, CAD model fitting, and free-form mesh fitting.
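
All four methods can be reduced to the same kind of output; a rough sketch of such a common representation, with illustrative field names that are assumptions rather than the paper's data format:

```python
# Illustrative common output format for the four baselines: a two-part
# articulated model plus a per-frame articulation state.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ArticulatedReconstruction:
    base_geometry: str                  # static part (plane / cuboid / CAD model / mesh)
    moving_geometry: str                # articulated part
    axis: Tuple[float, float, float]    # hinge direction in the object frame
    origin: Tuple[float, float, float]  # a point on the hinge
    angles: List[float] = field(default_factory=list)  # articulation angle per video frame
```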

TriCoLo

TriCoLo investigates the use of a contrastive loss to build a trimodal embedding space over text, images, and voxels for text-to-shape retrieval.
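
As an illustration of the idea (not the TriCoLo implementation), a trimodal contrastive objective can be sketched as pairwise InfoNCE-style losses between the three embedding spaces:

```python
# Sketch of a trimodal contrastive objective, assuming three encoders that map
# text, images, and voxels into a shared embedding space.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss between two batches of embeddings (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def trimodal_loss(text_emb, image_emb, voxel_emb):
    # Pull each pair of modalities together; the sum ties all three spaces.
    return (info_nce(text_emb, image_emb)
            + info_nce(text_emb, voxel_emb)
            + info_nce(image_emb, voxel_emb))
```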

3D VQA

We study VQA in the 3D domain, with our input being point clouds of real-world 3D scenes instead of 2D images. We believe this 3D data modality provides richer spatial relation information, which is of particular interest in the VQA task. In this paper, we introduce the 3DVQA-ScanNet dataset, the first VQA dataset in 3D, and we investigate the performance of a spectrum of baseline approaches on the 3D VQA task.
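
For intuition, the simplest kind of 3D VQA baseline can be sketched as pooling per-point features and fusing them with a question embedding; the encoder choices below are placeholder assumptions, not the paper's baselines:

```python
# Hedged sketch of a minimal 3D VQA baseline: pool per-point features,
# fuse with a question embedding, classify over a fixed answer set.
import torch
import torch.nn as nn

class Simple3DVQA(nn.Module):
    def __init__(self, point_dim=6, q_dim=300, hidden=256, num_answers=100):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(point_dim, hidden), nn.ReLU())
        self.q_proj = nn.Linear(q_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, points, question):
        # points: (B, N, point_dim) xyz+rgb; question: (B, q_dim) sentence embedding
        scene = self.point_mlp(points).max(dim=1).values  # global max-pool over points
        fused = scene * self.q_proj(question)             # simple multiplicative fusion
        return self.classifier(fused)                     # logits over candidate answers
```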

LAW-VLNCE

In the Vision-and-Language Navigation (VLN) task an embodied agent navigates a 3D environment, following natural language instructions. A challenge in this task is how to handle ‘off the path’ scenarios where an agent veers from a reference path. Prior work supervises the agent with actions based on the shortest path from the agent’s location to the goal, but such goal-oriented supervision is often not in alignment with the instruction. Furthermore, the evaluation metrics employed by prior work do not measure how much of a language instruction the agent is able to follow. In this work, we propose a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.
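
In the spirit of that metric, here is an illustrative sketch that counts sub-instructions whose reference waypoint the agent passed within a success radius; the radius and the waypoint representation are assumptions, not the paper's exact definition:

```python
# Sketch of a sub-instruction coverage metric: a sub-instruction counts as
# completed if the agent's path comes within `radius` of its reference waypoint.
import numpy as np

def subinstruction_coverage(agent_path, subinstruction_waypoints, radius=3.0):
    """agent_path: (T, 3) positions; subinstruction_waypoints: (K, 3),
    one reference waypoint per sub-instruction."""
    path = np.asarray(agent_path)
    completed = 0
    for wp in np.asarray(subinstruction_waypoints):
        dists = np.linalg.norm(path - wp, axis=1)
        if dists.min() <= radius:
            completed += 1
    return completed / len(subinstruction_waypoints)
```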

Multi-Object Navigation

We propose and benchmark the Multi-Object Navigation (MultiON) task, where an agent needs to navigate to multiple objects in a given sequence.
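
A minimal sketch of the episode bookkeeping this implies, assuming the agent signals "found" at each goal; the success radius and interface are illustrative assumptions:

```python
# Sketch of MultiON-style success: the agent must reach each goal object
# in the given order, signalling "found" at each one.
import numpy as np

def episode_success(found_positions, goal_sequence, radius=1.5):
    """found_positions: agent position at each 'found' action, in order;
    goal_sequence: goal object positions, in the order they must be visited."""
    if len(found_positions) != len(goal_sequence):
        return False
    for pos, goal in zip(found_positions, goal_sequence):
        if np.linalg.norm(np.asarray(pos) - np.asarray(goal)) > radius:
            return False  # wrong object, or 'found' called too far from the goal
    return True
```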

Mirror3D

Despite recent progress in depth sensing and 3D reconstruction, mirror surfaces remain a significant source of errors. To address this problem, we create the Mirror3D dataset: a 3D mirror plane dataset based on three RGBD datasets (Matterport3D, NYUv2, and ScanNet) containing 7,011 mirror instance masks and 3D planes. We then develop Mirror3DNet: a module that refines raw sensor depth or estimated depth to correct errors on mirror surfaces.
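
The core geometric correction can be sketched as follows (a simplified stand-in for Mirror3DNet, not its actual architecture): given a 3D mirror plane and the camera intrinsics, depth inside the mirror mask is replaced by the ray-plane intersection depth.

```python
# Sketch: overwrite sensor depth inside a mirror mask with the depth
# induced by a 3D mirror plane.
import numpy as np

def plane_depth_refine(depth, mask, plane, K):
    """depth: (H, W) raw depth; mask: (H, W) bool mirror mask;
    plane: (a, b, c, d) with ax + by + cz + d = 0 in camera coords;
    K: 3x3 camera intrinsics. Assumes the plane is not edge-on to the camera."""
    n, d = np.asarray(plane[:3], dtype=float), float(plane[3])
    vs, us = np.nonzero(mask)
    pix = np.stack([us, vs, np.ones_like(us)], axis=0).astype(float)  # homogeneous pixels
    rays = np.linalg.inv(K) @ pix       # back-projected rays, one per mirror pixel
    denom = n @ rays                    # n . ray for each pixel
    z = -d / denom                      # ray-plane intersection depth
    refined = depth.copy()
    refined[vs, us] = z
    return refined
```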

Plan2Scene

We address the task of converting a floorplan and a set of associated photos of a residence into a textured 3D mesh model, a task which we call Plan2Scene. Our system 1) lifts a floorplan image to a 3D mesh model; 2) synthesizes surface textures based on the input photos; and 3) infers textures for unobserved surfaces using a graph neural network architecture.
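
A high-level, runnable sketch of this three-stage pipeline; the stage functions below are stubs standing in for the actual Plan2Scene components:

```python
# Pipeline sketch only: each stage is a stub, not the Plan2Scene implementation.

def lift_floorplan_to_mesh(floorplan_image):
    # Stage 1: extrude the floorplan into a 3D mesh (stub).
    return {"surfaces": ["wall_0", "wall_1", "floor_0"]}

def synthesize_textures(photos):
    # Stage 2: synthesize a texture per surface observed in the photos (stub).
    return {"wall_0": "texture_A", "floor_0": "texture_B"}

def propagate_textures(mesh, observed):
    # Stage 3: a graph neural network would infer textures for unobserved
    # surfaces; here we just copy an observed texture as a placeholder.
    textures = dict(observed)
    for surface in mesh["surfaces"]:
        textures.setdefault(surface, next(iter(observed.values())))
    return textures

mesh = lift_floorplan_to_mesh("plan.png")
textured = propagate_textures(mesh, synthesize_textures(["photo.jpg"]))
```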