Validity Pitfalls in VSI-Bench

The prior spatial intelligence benchmark VSI-Bench becomes systematically unreliable under the settings in which modern vision-language models actually operate. We highlight two key pitfalls below.

Pitfall 1

3D Annotation-to-Video Drift

VSI-Bench derives ground-truth answers for question-answer pairs from low-quality 3D reconstructions and noisy annotations in scene datasets (e.g., ScanNet v2, ScanNet++ v2, ARKitScenes), leading to a substantial portion of incorrect ground-truth answers across tasks.

[Figure: VSI-Bench ground-truth quality, diagnosed on the object counting and object size tasks. Left: GT correctness on object counting (565 questions, labeled correct, wrong, or ambiguous); the task shows notable incorrectness and ambiguity (e.g., ill-defined object criteria such as shoes). Right: "door" height distribution (cm) as a proxy for object size, revealing systematic size errors; red bins denote physically implausible heights. Statistics are derived from human annotations.]
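
To make the size diagnostic concrete, below is a minimal sketch of the kind of plausibility check that flags the red bins; the 170–250 cm band is our illustrative assumption, not a threshold taken from the benchmark.

```python
import numpy as np

# Illustrative plausibility band for interior door heights (cm); this
# range is our assumption for the sketch, not a value from VSI-Bench.
PLAUSIBLE_DOOR_HEIGHT_CM = (170.0, 250.0)

def flag_implausible_doors(heights_cm: np.ndarray) -> np.ndarray:
    """Return a boolean mask marking physically implausible door heights."""
    lo, hi = PLAUSIBLE_DOOR_HEIGHT_CM
    return (heights_cm < lo) | (heights_cm > hi)

# The two VSI-Bench door answers from the examples below fall outside
# the plausible band, while the corrected values do not.
heights = np.array([151.0, 148.0, 198.0, 210.0])
print(flag_implausible_doors(heights))  # [ True  True False False]
```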

Example scenes with corrected ground truth (each shown as a 3D reconstructed mesh alongside the raw video):

Scene 1 · VSI-Bench QA Pairs
• How many chair(s) are in this room? VSI-Bench GT: 23 · Actual GT: 27
• How many table(s) are in this room? VSI-Bench GT: 5 · Actual GT: 8
• What is the length of the longest dimension (length, width, or height) of the door, measured in centimeters? VSI-Bench GT: 151 · Actual GT: > 195
• What is the size of this room (in square meters)? If multiple rooms are shown, estimate the size of the combined space. VSI-Bench GT: 24.9 · Actual GT: 40.0

Scene 2 · VSI-Bench QA Pairs
• How many chair(s) are in this room? VSI-Bench GT: 18 · Actual GT: 28
• What is the length of the longest dimension (length, width, or height) of the door, measured in centimeters? VSI-Bench GT: 148 · Actual GT: > 195
• What is the size of this room (in square meters)? If multiple rooms are shown, estimate the size of the combined space. VSI-Bench GT: 18.7 · Actual GT: 29.2
Pitfall 2

Video Sampling Matters!

Prior evaluations assume full-scene access, whereas most vision-language models operate on sparsely sampled frames (e.g., 16–64 per video). This mismatch means objects and geometry can go unseen, rendering a large portion of questions unanswerable or incorrect.

Questions are unanswerable when queried objects are absent, and incorrect when frame-based answers deviate from full-scene ground truth.

[Figure: VSI-Bench question answerability and correctness by frame budget. Statistics are derived from human annotations; video frames are uniformly sampled using np.linspace.]

[Figure: An example scene shown as a 3D reconstructed mesh, the raw video, and uniform 32-frame and 16-frame samplings of that video.]
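
For reference, here is a minimal sketch of the uniform sampling scheme; the np.linspace indexing follows the description above, while the helper name is ours.

```python
import numpy as np

def sample_frame_indices(num_frames: int, budget: int) -> np.ndarray:
    """Uniformly pick `budget` frame indices from a `num_frames`-frame
    video via np.linspace, as in the answerability study above."""
    return np.linspace(0, num_frames - 1, num=budget).round().astype(int)

# A hypothetical 1,800-frame walkthrough under two frame budgets:
print(sample_frame_indices(1800, 16))  # 16 indices spread over the video
print(sample_frame_indices(1800, 32))  # 32 indices, denser coverage
```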

Example QA pairs under different frame budgets:

• Measuring from the closest point of each object, what is the distance between the telephone and the tv (in meters)? VSI-Bench GT: 9.3 · Actual GT: 9.3 (32-frame), N/A (16-frame). The telephone is invisible under 16-frame sampling.
• How many suitcase(s) are in this room? VSI-Bench GT: 3 · Actual GT: 2 (32-frame), 1 (16-frame). The visible suitcase count varies with the frame sampling rate.
• What will be the first-time appearance order of the following categories in the video: sofa, door, suitcase, kettle? VSI-Bench GT: door, sofa, suitcase, kettle · Actual GT: N/A (32-frame), N/A (16-frame). The suitcase and sofa first appear in the same frame, so no unique order exists.

ReVSI

• Rethink • Rebuild • Reevaluate

High Quality — Expert-Level Annotations

Every ReVSI scene is fully annotated by 3D experts — without any heuristic or model-predicted shortcuts — and passes multiple rounds of video-aware verification, ensuring accurate object names, instance identity, physical size, and faithful room-area boundaries.

[Figure: Annotation quality comparison. Left: object annotations on scene bde1e479ad (ScanNet++ v2), VSI-Bench vs. ReVSI. Right: room-size annotations over the floor plan of scene0441_00 (ScanNet v2), where VSI-Bench measures 6.9 m² vs. 9.9 m² for ReVSI.]

Great Diversity — Broader Scenes, Richer Objects & Varied Question Templates

ReVSI covers more scenes, more object instances, and a substantially larger object vocabulary than VSI-Bench, with open-vocabulary support that frees evaluation from a fixed label set.

[Figure: Benchmark scale. ReVSI uses an open vocabulary of 504 object labels versus the 65 fixed labels of VSI-Bench, alongside a broader set of question templates.]

Balanced Answer Distribution Mitigates Frequency-Based Shortcuts

Answers are spread evenly across the value range for both numerical and multiple-choice questions, preventing models from scoring high by exploiting skewed answer frequencies.

[Figure: Answer distributions for the object counting and absolute distance tasks.]
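
To see the shortcut this balancing removes, consider a frequency-prior baseline in the spirit of the Chance (Frequency) rows in the tables below: it ignores the video entirely and always predicts the most common answer. A minimal sketch, with exact match as the metric and all names ours:

```python
from collections import Counter

def frequency_baseline_accuracy(train_answers, test_answers):
    """Exact-match accuracy of a 'model' that ignores the video and
    always predicts the most frequent answer for this question type."""
    most_common_answer, _ = Counter(train_answers).most_common(1)[0]
    hits = sum(ans == most_common_answer for ans in test_answers)
    return hits / len(test_answers)

# On a skewed benchmark the blind baseline scores far above chance;
# a balanced answer distribution pushes it back toward chance level.
skewed = ["1", "1", "1", "2", "1", "3"]
print(frequency_baseline_accuracy(skewed, skewed))      # ~0.67
balanced = ["1", "2", "3", "1", "2", "3"]
print(frequency_baseline_accuracy(balanced, balanced))  # ~0.33
```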

Frame-Adaptive — What We Ask, What the Model Sees

Ground-truth answers adapt to the input frame budget, reflecting what the model actually sees.

Same Question · Frame-Specific Ground Truth

Question: How many chair(s) are in this room?

| Frame budget | All frames | 64 frames | 32 frames | 16 frames |
| --- | --- | --- | --- | --- |
| Answer | 12 | 11 | 9 | 6 |
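
A minimal sketch of how frame-adaptive grading can work, assuming ground truth is stored per frame budget; the data layout and names are our illustration, not the released ReVSI format.

```python
# Hypothetical per-budget ground truth for the chair-counting example
# above; the dict layout is our illustration, not the ReVSI data format.
gt_by_budget = {"all": 12, 64: 11, 32: 9, 16: 6}

def grade(gt: dict, model_answer: int, frame_budget) -> bool:
    """Grade against the ground truth matching the frame budget the
    model actually saw, rather than the full-scene answer."""
    return model_answer == gt[frame_budget]

# A model that correctly counts the 6 chairs visible in its 16 frames
# is marked right under frame-adaptive grading...
print(grade(gt_by_budget, 6, 16))     # True
# ...but would be marked wrong against the full-scene answer.
print(grade(gt_by_budget, 6, "all"))  # False
```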

Experiments

Finding 1

Open-Source Models Are Systematically Overestimated

Evaluations on ReVSI, with the corresponding VSI-Bench scores shown in parentheses. For ReVSI, each model is evaluated against frame-sampling-specific ground-truth answers corresponding to its inference frames (shown in the Frames column), whereas VSI-Bench uses a single set of ground-truth answers shared across all frame settings. ReVSI scores that exceed their VSI-Bench counterparts are highlighted in green.

Scores are reported as "ReVSI (VSI-Bench)". Obj. Cnt., Abs. Dist., Obj. Size, and Room Size are numerical questions; Rel. Dist., Rel. Dir., and Route Plan are multiple-choice.

| Method | Frames | Obj. Cnt. | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | | | | | | | | | |
| Chance (Random) | ALL | - | - | - | - | 23.7 (25.0) | 26.8 (36.1) | 26.0 (28.3) | - |
| Chance (Frequency) | ALL | 52.2 (62.1) | 40.1 (32.0) | 17.4 (29.9) | 20.9 (33.1) | 25.8 (25.1) | 31.9 (47.9) | 30.2 (28.4) | 31.4 (34.0) |
| Proprietary Models (API) | | | | | | | | | |
| GPT-5.2 | 64 | 56.2 (57.1) | 41.5 (33.4) | 73.9 (64.6) | 63.0 (59.0) | 48.4 (48.0) | 34.9 (33.3) | 38.2 (36.7) | 50.9 (49.2) |
| Gemini 3 Flash | 1 FPS | 65.7 (45.6) | 53.1 (36.3) | 77.6 (74.9) | 52.8 (47.4) | 64.6 (54.3) | 47.9 (52.4) | 41.8 (50.0) | 57.6 (55.9) |
| Gemini 3 Pro | 1 FPS | 60.1 (45.3) | 54.7 (38.3) | 79.3 (73.0) | 51.9 (47.4) | 68.1 (70.0) | 56.0 (60.8) | 56.4 (65.3) | 60.9 (60.5) |
| Open-Source Models | | | | | | | | | |
| Qwen3-VL-8B-Instruct | 64 | 40.4 (70.0) | 52.3 (50.5) | 69.0 (74.7) | 45.1 (63.3) | 57.1 (57.3) | 39.5 (52.3) | 40.5 (33.5) | 49.1 (57.4) |
| Qwen3-VL-32B-Instruct | 64 | 46.9 (74.0) | 65.0 (57.0) | 70.4 (76.6) | 55.8 (70.8) | 53.8 (55.6) | 34.0 (59.1) | 47.3 (39.7) | 53.3 (61.8) |
| InternVL3.5-8B | 64 | 43.3 (72.7) | 54.6 (40.3) | 64.2 (68.4) | 47.6 (65.3) | 45.0 (57.0) | 36.3 (48.6) | 44.4 (35.6) | 47.9 (55.4) |
| InternVL3.5-38B | 64 | 43.8 (73.9) | 60.6 (39.2) | 70.2 (73.0) | 58.4 (65.0) | 57.4 (66.2) | 45.9 (72.0) | 42.7 (36.1) | 54.1 (60.8) |
| LLaVA-Video-7B-Qwen2 | 64 | 31.3 (48.5) | 1.4 (14.0) | 52.5 (47.8) | 16.7 (24.2) | 38.3 (43.5) | 33.3 (42.4) | 38.4 (34.0) | 30.3 (36.3) |
| LLaVA-Video-72B-Qwen2 | 64 | 40.1 (48.9) | 29.6 (22.8) | 59.3 (57.4) | 27.9 (35.3) | 39.6 (42.4) | 24.8 (36.7) | 43.0 (35.0) | 37.8 (40.9) |
Finding 2

Spatial Fine-Tuning Does Not Reliably Generalize

Performance of specialized 3D VLMs and their base models on ReVSI, with the corresponding VSI-Bench scores shown in parentheses. ReVSI evaluates each model under its native inference frame setting using frame-adaptive ground-truth answers. Scores lower than those of the base model are highlighted in red.

Scores are reported as "ReVSI (VSI-Bench)"; each fine-tuned model is listed below its base model. Obj. Cnt., Abs. Dist., Obj. Size, and Room Size are numerical questions; Rel. Dist., Rel. Dir., and Route Plan are multiple-choice.

| Method | Frames | Obj. Cnt. | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct + SigLIP2 | - | - | - | - | - | - | - | - | - |
| Cambrian-S-7B | 128 | 48.4 (73.2) | 60.5 (50.5) | 65.5 (74.9) | 46.7 (72.2) | 37.1 (71.1) | 48.5 (76.2) | 37.0 (41.8) | 49.1 (67.5) |
| Qwen2.5-VL-7B-Instruct | 4 FPS | 36.9 (36.8) | 15.0 (17.6) | 49.7 (51.0) | 29.0 (29.2) | 31.5 (35.4) | 29.5 (38.4) | 36.7 (33.5) | 32.6 (32.6) |
| VST-7B-SFT | 4 FPS | 35.4 (72.0) | 52.6 (44.4) | 67.9 (74.3) | 47.2 (68.3) | 49.2 (59.7) | 36.9 (55.8) | 35.4 (44.9) | 46.4 (65.2) |
| Qwen2.5-VL-7B-Instruct | 32 | 34.3 (43.7) | 21.7 (22.3) | 45.5 (49.2) | 35.1 (37.5) | 32.6 (40.1) | 33.7 (38.9) | 34.1 (32.0) | 33.9 (37.7) |
| SpaceR-7B (SG-RLVR) | 32 | 30.7 (61.9) | 34.5 (28.6) | 52.0 (60.9) | 18.6 (35.2) | 22.8 (38.2) | 34.5 (46.0) | 20.2 (31.4) | 30.5 (43.5) |
| Qwen2.5-VL-3B-Instruct | 16 | 18.7 (24.3) | 15.6 (24.7) | 16.8 (31.7) | -- (22.6) | 33.2 (38.3) | 34.3 (41.6) | -- (26.3) | 23.7 (30.6) |
| Spatial-MLLM-4B-135k | 16 | 40.7 (65.8) | 45.3 (40.7) | 46.8 (58.3) | -- (55.6) | 32.3 (43.2) | 37.4 (55.5) | -- (36.1) | 40.5 (50.7) |
| Spatial-MLLM-4B-820k | 16 | 41.5 (66.7) | 40.0 (37.9) | 53.1 (69.7) | -- (55.7) | 30.7 (52.0) | 39.2 (54.9) | -- (39.7) | 40.9 (53.8) |
| LLaVA-Video-7B-Qwen2 | 32 | 29.9 (48.5) | 1.5 (14.0) | 53.0 (47.8) | 19.3 (24.2) | 39.1 (43.5) | 33.8 (42.4) | 38.8 (34.0) | 30.8 (36.3) |
| VLM3R-7B | 32 | 41.6 (70.2) | 61.6 (49.4) | 64.8 (69.2) | 52.5 (67.1) | 46.5 (65.4) | 49.5 (80.5) | 34.1 (45.4) | 50.1 (60.9) |
Finding 3

Vision-Language Models Can Overlook Visual Inputs

Object counting results on dummy videos where the ground-truth answer is always 0. Query-Drop removes frames containing the queried object while preserving the surrounding scene and other objects. First-Frame repeats the first frame of the Query-Drop video across all frames. Black uses pure black frames for the whole video. Exact Match accuracy is reported over 997 questions, with 0 as the only correct answer. Fine-tuned models tend to lean on strong priors learned from training data when answering, often overlooking the actual visual input.

[Figure: A real video and the three dummy-video constructions (Query-Drop, First-Frame, Black) used to probe whether models rely on visual evidence when answering object-counting questions.]
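
A minimal sketch of how such probe videos can be built from a sampled frame list; the function and variable names are our illustration, not the paper's released tooling.

```python
import numpy as np

def make_probe_videos(frames, shows_query):
    """Build the three probe videos described above.

    frames: list of HxWx3 uint8 arrays (the sampled real video)
    shows_query: per-frame booleans, True where the queried object is
        visible (assumes at least one frame does not show it)
    """
    # Query-Drop: remove frames containing the queried object, keeping
    # the surrounding scene and other objects.
    query_drop = [f for f, visible in zip(frames, shows_query) if not visible]
    # First-Frame: repeat the first Query-Drop frame across all frames.
    first_frame = [query_drop[0]] * len(query_drop)
    # Black: pure black frames for the whole video.
    black = [np.zeros_like(frames[0])] * len(query_drop)
    return query_drop, first_frame, black
```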

Exact Match accuracy (%):

| Method | Query-Drop | First-Frame | Black |
| --- | --- | --- | --- |
| Zero-shot | | | |
| Human | 100.0 | 100.0 | 100.0 |
| GPT-5.2 | 74.0 | 89.6 | 99.2 |
| Gemini 3 Pro | 62.3 | 85.0 | 94.0 |
| Qwen2.5-VL-7B-Instruct | 55.8 | 79.9 | 99.8 |
| Qwen3-VL-8B-Instruct | 34.7 | 80.9 | 99.8 |
| Qwen3-VL-32B-Instruct | 50.5 | 92.7 | 100.0 |
| InternVL3.5-8B | 14.7 | 52.5 | 17.7 |
| InternVL3.5-38B | 9.1 | 45.0 | 1.2 |
| LLaVA-Video-7B-Qwen2 | 47.7 | 49.9 | 0.0 |
| LLaVA-Video-72B-Qwen2 | 45.0 | 65.6 | 15.2 |
| Fine-tuned | | | |
| Cambrian-S-7B | 1.1 | 2.8 | 0.0 |
| VST-7B-SFT | 1.1 | 8.7 | 0.4 |
| SpaceR-7B (SG-RLVR) | 8.1 | 24.3 | 14.6 |
| Spatial-MLLM-4B-135k | 0.2 | 0.9 | 0.0 |
| Spatial-MLLM-4B-820k | 0.0 | 0.0 | 0.0 |
| VLM3R-7B | 4.1 | 2.5 | 0.2 |

Resources

ReVSI supports the following inference and evaluation frameworks:

Citation

If you find this work useful, please consider citing our paper:

@article{zhang2026revsi,
  title={ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning},
  author={Zhang, Yiming and Chen, Jiacheng and Tan, Jiaqi and Mao, Yongsen and Chen, Wenhu and Chang, Angel X.},
  journal={arXiv preprint arXiv:2604.24300},
  year={2026}
}

Acknowledgment

This work was funded in part by a CIFAR AI Chair and NSERC Discovery Grants, and enabled by support from the Digital Research Alliance of Canada and a CFI/BCKDF JELF grant. We thank Jiayi Liu for help with benchmark data verification and discussion.