BIOSCAN

Monitoring and understanding the biodiversity of our world is becoming increasingly critical. BIOSCAN is a large, interdisciplinary effort led by the International Barcode of Life (iBOL) Consortium to develop a global biodiversity monitoring system. As part of this larger project, we have ongoing collaborations with the University of Guelph and the University of Waterloo to explore how recent developments in machine learning can assist with biodiversity monitoring. As a first step, we have introduced datasets (BIOSCAN-1M, BIOSCAN-5M) and developed models for taxonomic classification.

BIOSCAN-1M  BIOSCAN-5M  BIOSCAN cropping tool  BarcodeBERT  CLIBD  

Articulated object understanding and generation

Everyday indoor environments are filled with interactable, articulated objects, and we aim to create such interactive environments. To better understand the types of articulated objects found in the real world, we introduce MultiScan, a dataset of 3D scans with annotated parts and articulation parameters; a minimal sketch of such a part-and-joint representation appears after the project links below. We also work on reconstructing articulated objects from two views (PARIS) and on generative models for creating new articulated objects (CAGE).

MultiScan  PARIS  CAGE  SINGAPO  S2O  
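Below is a minimal, illustrative sketch of how an articulated part and its joint parameters could be represented in code. The class and field names are assumptions made for illustration; they do not reflect the actual MultiScan or CAGE data formats.

```python
# Illustrative representation of an articulated part and its joint parameters.
# Field names are assumptions, not the MultiScan / CAGE data formats.
from dataclasses import dataclass
from typing import Literal, Tuple

@dataclass
class Joint:
    joint_type: Literal["revolute", "prismatic", "fixed"]  # rotation, translation, or static
    axis: Tuple[float, float, float]     # joint axis direction in object coordinates
    origin: Tuple[float, float, float]   # a point the axis passes through
    limits: Tuple[float, float]          # motion range (radians or meters)

@dataclass
class ArticulatedPart:
    part_id: int
    label: str        # e.g. "door", "drawer", "lid"
    mesh_path: str    # geometry of this part
    joint: Joint      # how the part moves relative to its parent

# Example: a cabinet door that rotates up to 90 degrees about a vertical hinge.
door = ArticulatedPart(
    part_id=1,
    label="door",
    mesh_path="cabinet/door.obj",
    joint=Joint("revolute", axis=(0.0, 0.0, 1.0), origin=(0.4, 0.0, 0.0), limits=(0.0, 1.57)),
)
```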

Openable Part Detection

We address the task of predicting which parts of an object can open and how they move when they do. The input is a single image of an object; the output is the set of openable parts together with motion parameters describing the articulation of each part (how these parameters can be used is sketched after the project links below). We introduce the task of Openable-Part Detection (OPD) and extend it to images with multiple objects in OPDMulti.

OPD  OPDMulti  
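As a rough illustration of what the predicted motion parameters encode, the sketch below turns a predicted rotation axis and origin into a 4x4 transform that opens a part by a given angle. This is a generic geometric helper, not part of the released OPD or OPDMulti code.

```python
# Turn predicted motion parameters (rotation axis + origin) into a 4x4 transform
# that "opens" a rotating part by a given angle. Generic geometry, not OPD code.
import numpy as np

def rotation_about_axis(axis, origin, angle):
    """4x4 transform rotating by `angle` (radians) about `axis` through `origin`."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    x, y, z = a
    c, s = np.cos(angle), np.sin(angle)
    # Rodrigues' rotation formula in matrix form.
    R = np.array([
        [c + x * x * (1 - c),     x * y * (1 - c) - z * s, x * z * (1 - c) + y * s],
        [y * x * (1 - c) + z * s, c + y * y * (1 - c),     y * z * (1 - c) - x * s],
        [z * x * (1 - c) - y * s, z * y * (1 - c) + x * s, c + z * z * (1 - c)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    # Rotating about a line through `origin`: translate to origin, rotate, translate back.
    T[:3, 3] = np.asarray(origin, dtype=float) - R @ np.asarray(origin, dtype=float)
    return T

# Open a predicted door part by 60 degrees about a vertical hinge.
T_open = rotation_about_axis(axis=(0.0, 1.0, 0.0), origin=(0.3, 0.0, 1.2), angle=np.deg2rad(60))
```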

3DHOI

Human-object interactions with articulated objects are common in everyday life, but it is challenging to infer an articulated 3D object model from an RGB video showing a person manipulating the object. We canonicalize the task of articulated 3D human-object interaction reconstruction from RGB video, and carry out a systematic benchmark of four methods for this task: 3D plane estimation, 3D cuboid estimation, CAD model fitting, and free-form mesh fitting.

Text-image-shape embeddings

We have several projects that investigate the use of contrastive losses to build a trimodal embedding space over text, images, and 3D shapes for text-to-shape retrieval; a minimal sketch of such an objective follows the project links below.

TriCoLo  DuoduoCLIP  
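The snippet below is a minimal sketch of a symmetric InfoNCE-style contrastive objective applied across three modalities. It illustrates the general idea only and is not the exact loss used in TriCoLo or DuoduoCLIP.

```python
# Minimal sketch of a trimodal contrastive objective: pull matching
# (text, image, shape) embeddings together and push non-matching ones apart.
# General idea only, not the exact TriCoLo / DuoduoCLIP losses.
import torch
import torch.nn.functional as F

def pairwise_infonce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings with matching rows."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature            # cosine similarities of all pairs
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def trimodal_loss(text_emb, image_emb, shape_emb):
    """Sum the pairwise losses over all three modality pairs."""
    return (pairwise_infonce(text_emb, image_emb)
            + pairwise_infonce(text_emb, shape_emb)
            + pairwise_infonce(image_emb, shape_emb))

# Toy usage with random embeddings (batch of 8, dimension 128).
loss = trimodal_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
```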

Understanding language in 3D scenes

We propose the tasks of visual grounding (ScanRefer) and dense captioning (Scan2Cap) in 3D scenes and introduce accompanying datasets. To study these tasks, we propose baselines as well as models that combine the two tasks (D3Net and UniT3D). In Multi3DRefer, we extend the visual grounding task to the more realistic setting where each language description can correspond to multiple objects. We also address 3D visual question answering in 3D VQA.

ScanRefer  Scan2Cap  D3Net  UniT3D  Multi3DRefer  3D VQA  

LAW-VLNCE

In the Vision-and-Language Navigation (VLN) task, an embodied agent navigates a 3D environment by following natural language instructions. A challenge in this task is handling ‘off the path’ scenarios where an agent veers from the reference path. Prior work supervises the agent with actions based on the shortest path from the agent’s location to the goal, but such goal-oriented supervision often does not align with the instruction. Furthermore, the evaluation metrics employed by prior work do not measure how much of a language instruction the agent is able to follow. In this work, we propose a simple and effective language-aligned supervision scheme, and a new metric that measures the number of sub-instructions the agent has completed during navigation.
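The sketch below illustrates the two ideas conceptually: choosing a supervision target on the annotated reference path rather than on the shortest path to the goal, and counting how many sub-instructions the agent completed. The helper names and the nearest-waypoint heuristic are illustrative assumptions, not the LAW-VLNCE implementation.

```python
# Conceptual sketch of language-aligned supervision and a sub-instruction metric.
# Helper names and heuristics are illustrative, not the LAW-VLNCE code.
import numpy as np

def language_aligned_target(agent_pos, reference_path):
    """Return the next reference-path waypoint the agent should head toward."""
    dists = [np.linalg.norm(np.asarray(agent_pos) - np.asarray(w)) for w in reference_path]
    nearest = int(np.argmin(dists))
    # Aim at the waypoint after the nearest one (or the last waypoint at the end).
    return reference_path[min(nearest + 1, len(reference_path) - 1)]

def sub_instructions_completed(visited_positions, sub_goal_positions, radius=1.0):
    """Count sub-instructions whose associated waypoint the agent passed within `radius`."""
    completed = 0
    for goal in sub_goal_positions:
        if any(np.linalg.norm(np.asarray(p) - np.asarray(goal)) <= radius for p in visited_positions):
            completed += 1
    return completed
```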

Multi-Object Navigation

We propose and benchmark the Multi-Object Navigation (MultiON) task, where an agent needs to navigate to multiple objects in a given sequence.

task  challenge  modular approach  
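As a rough illustration of the MultiON success criterion, the sketch below checks whether an agent reached each goal in order and declared it found while nearby. The function, its arguments, and the distance threshold are simplifications, not the benchmark's actual evaluation code.

```python
# Simplified sketch of MultiON-style episode evaluation: reach each goal in the
# given order and declare "found" while close to it. Rules and thresholds here
# are simplifications of the benchmark.
import numpy as np

def evaluate_episode(goal_positions, found_positions, success_radius=1.5):
    """goal_positions: ordered goal coordinates; found_positions: agent position at each 'found' call."""
    next_goal = 0
    for agent_pos in found_positions:
        if next_goal >= len(goal_positions):
            break
        dist = np.linalg.norm(np.asarray(agent_pos) - np.asarray(goal_positions[next_goal]))
        if dist > success_radius:
            return False, next_goal          # a wrong 'found' call ends the episode
        next_goal += 1
    return next_goal == len(goal_positions), next_goal
```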

Mirror3D

Despite recent progress in depth sensing and 3D reconstruction, mirror surfaces are a significant source of errors. To address this problem, we create the Mirror3D dataset: a 3D mirror plane dataset based on three RGBD datasets (Matterport3D, NYUv2, and ScanNet) containing 7,011 mirror instance masks and 3D planes. We then develop Mirror3DNet: a module that refines raw sensor depth or estimated depth to correct errors on mirror surfaces.
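The sketch below shows only the underlying geometric fix (not the learned Mirror3DNet module): given a mirror mask and an estimated 3D mirror plane, depth on mirror pixels is replaced with the depth at which each camera ray intersects the plane. The function signature and intrinsics handling are assumptions for illustration.

```python
# Geometric sketch: overwrite depth on mirror pixels with the depth at which each
# camera ray intersects the estimated mirror plane. Illustration only, not Mirror3DNet.
import numpy as np

def correct_mirror_depth(depth, mask, plane, fx, fy, cx, cy):
    """depth: HxW depth map (meters); mask: HxW bool mirror mask;
    plane: (a, b, c, d) with a*X + b*Y + c*Z + d = 0 in camera coordinates."""
    a, b, c, d = plane
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Ray direction for each pixel (unnormalized), with Z = 1.
    rx, ry, rz = (u - cx) / fx, (v - cy) / fy, np.ones_like(depth)
    denom = a * rx + b * ry + c * rz
    # Depth t such that t * ray lies on the plane: t = -d / (a*rx + b*ry + c*rz).
    t = -d / np.where(np.abs(denom) > 1e-8, denom, np.nan)
    corrected = depth.copy()
    corrected[mask] = (t * rz)[mask]          # rz == 1, so t is the Z-depth
    return corrected
```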

Plan2Scene

We address the task of converting a floorplan and a set of associated photos of a residence into a textured 3D mesh model, a task which we call Plan2Scene. Our system 1) lifts a floorplan image to a 3D mesh model; 2) synthesizes surface textures based on the input photos; and 3) infers textures for unobserved surfaces using a graph neural network architecture.
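A high-level sketch of such a three-stage pipeline is shown below; the function names and signatures are illustrative placeholders rather than the released Plan2Scene API.

```python
# High-level sketch of a Plan2Scene-style pipeline as three stages.
# Function names and signatures are illustrative placeholders, not the Plan2Scene API.
def lift_floorplan_to_mesh(floorplan_image):
    """Stage 1: extrude rooms and walls from the floorplan into an untextured 3D mesh."""
    raise NotImplementedError

def synthesize_observed_textures(mesh, photos):
    """Stage 2: synthesize surface textures for surfaces observed in the input photos."""
    raise NotImplementedError

def propagate_unobserved_textures(mesh):
    """Stage 3: infer textures for unobserved surfaces (e.g., with a graph neural network)."""
    raise NotImplementedError

def plan2scene(floorplan_image, photos):
    mesh = lift_floorplan_to_mesh(floorplan_image)
    mesh = synthesize_observed_textures(mesh, photos)
    return propagate_unobserved_textures(mesh)
```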