Paper Accepted at ICLR 2026 on Spatial Reasoning in Vision–Language Models
Our paper, SpatialLab: Can Vision–Language Models Perform Spatial Reasoning in the Wild?, has been accepted to ICLR 2026. The work is a collaboration with QCRI, Monash University, and CIOL.
In this work, we introduce SpatialLab, a diagnostic benchmark for evaluating spatial reasoning in Vision–Language Models under real-world settings. The benchmark probes fine-grained spatial skills (such as relative position, depth, containment, orientation, and compositional spatial relations) and is designed to reduce shortcut cues common in synthetic or templated datasets. Our evaluation shows that while modern VLMs perform well on surface-level spatial perception, they consistently struggle with multi-step and relational spatial reasoning, revealing a gap between visual recognition and grounded spatial understanding.