3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using natural language descriptions. While recent works have focused on LLM-based scaling of 3DVG datasets, these datasets do not capture the full range of prompts that could be specified in the English language. To ensure that we are scaling up and testing against a useful and representative set of prompts, we propose a framework for linguistically analyzing 3DVG prompts and introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns. We evaluate existing open-vocabulary 3DVG methods and demonstrate that they are not yet proficient at understanding and identifying the targets of more challenging, out-of-distribution prompts, which limits their readiness for real-world applications.
While recent efforts have attempted to scale up visual grounding datasets, many of these datasets remain insufficiently diverse in the range of linguistic patterns they capture. We analyze the prompts of prior 3DVG datasets according to 6 count-based and 24 binary metrics characterizing the targets, anchors, attributes, relationships, and other language patterns. We propose an automated pipeline based on LLMs and other NLP tools to analyze each prompt, and run it against 1000 sampled prompts from each dataset to identify key linguistic patterns that are absent in most existing datasets.
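As a rough illustration of how such an LLM-based analysis might be implemented, the sketch below asks a chat model to label a single prompt against a handful of binary metrics. The metric names, model choice, and use of the OpenAI Python client are our assumptions for the sketch, not the exact tools or schema used in our pipeline.

import json
from openai import OpenAI  # assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set

# A small, hypothetical subset of binary metrics; the full analysis uses 24 binary and 6 count-based metrics.
BINARY_METRICS = ["uses_negation", "uses_coreference", "references_text_label", "has_generic_target"]

def analyze_prompt(prompt: str, model: str = "gpt-4o-mini") -> dict:
    """Ask an LLM to label one 3DVG prompt against a few binary linguistic metrics."""
    client = OpenAI()
    instructions = (
        "You label prompts for 3D visual grounding. For the prompt below, return a JSON object "
        f"with boolean values for exactly these keys: {', '.join(BINARY_METRICS)}."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Example: analyze_prompt("Find the chair that is not next to the whiteboard.")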
We release ViGiL3D, a human-annotated dataset of 350 prompts from 26 ScanNet and ScanNet++ scenes with diverse linguistic patterns for benchmarking 3DVG methods. Each prompt describes zero, one, or multiple target objects in a scene using a variety of language criteria, including anchor object references, a range of attribute and relationship types such as text labels and arrangements, coreferences, and negation. We show a series of example prompts from ViGiL3D below, illustrating a few of the many represented patterns.
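To make the annotation structure concrete, here is a minimal, hypothetical sketch of how a single ViGiL3D-style prompt and its targets could be represented in code; the field names are illustrative and not the dataset's actual schema.

from dataclasses import dataclass, field
from typing import List

@dataclass
class GroundingPrompt:
    """Illustrative record for one ViGiL3D-style prompt (field names are hypothetical)."""
    scene_id: str                      # e.g., a ScanNet or ScanNet++ scene identifier
    text: str                          # the natural language description
    target_object_ids: List[int] = field(default_factory=list)  # zero, one, or multiple targets

# A prompt may legitimately have no matching object in the scene (empty target list),
# or several matching objects; this example has exactly one target.
example = GroundingPrompt(
    scene_id="scene0000_00",
    text="the monitor that is turned off, not the one showing a login screen",
    target_object_ids=[12],
)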
Evaluating accuracy against both ground truth and Mask3D-predicted boxes, we find that all methods perform significantly worse on ViGiL3D than on ScanRefer, demonstrating that our prompts are overall more challenging than those of prior benchmarks. Furthermore, our dataset enables more fine-grained analysis of 3DVG methods across different linguistic patterns, identifying a need for further improvement on patterns such as text labels, generic references, and negations.
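For reference, grounding accuracy is commonly reported as the fraction of prompts whose predicted box overlaps the ground truth above an IoU threshold. The sketch below computes this for single-target prompts with axis-aligned 3D boxes; it is a simplification and does not reflect the full protocol, which also accounts for zero- and multi-target prompts.

import numpy as np

def box_iou_3d(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, a_min=0.0, a_max=None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return float(inter / (vol_a + vol_b - inter + 1e-8))

def accuracy_at_iou(pred_boxes, gt_boxes, threshold: float = 0.25) -> float:
    """Fraction of single-target prompts whose predicted box matches ground truth at the given IoU threshold."""
    hits = [box_iou_3d(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes)]
    return float(np.mean(hits)) if hits else 0.0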
@article{wang2024vigil3d,
author={Wang, Austin T. and Gong, ZeMing and Chang, Angel X.},
title={{ViGiL3D}: A Linguistically Diverse Dataset for 3D Visual Grounding},
journal={arXiv preprint},
year={2024},
eprint={2501.01366},
archivePrefix={arXiv},
primaryClass={cs.CV},
doi={10.48550/arxiv.2501.01366},
}
This work was funded in part by a CIFAR AI Chair and an NSERC Discovery Grant. We thank Yiming Zhang, Hou In Ivan Tam, Hou In Derek Pun, Xingguang Yan, and Karen Yeh for helpful discussions and feedback.