SpatialStack

Layered geometry-language fusion for 3D VLM spatial reasoning.

SpatialStack is a hierarchical framework that fuses geometric features with language for spatial reasoning in 3D vision-language models (VLMs). Instead of injecting a single geometry feature at a late stage, it stacks geometric features from multiple levels and fuses them progressively into the language decoder, which improves both local geometric precision and high-level spatial reasoning.
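To make the contrast between single late-stage injection and layered fusion concrete, here is a minimal toy sketch in plain Python. It is not the SpatialStack implementation: the names `decoder_layer`, `late_fusion`, and `stacked_fusion` are hypothetical, fusion is modeled as simple addition, the decoder layer is a stand-in that halves its input, and pairing geometry level i with decoder layer i is an assumption made only for illustration.

```python
from typing import List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, used here as a stand-in for feature fusion."""
    return [x + y for x, y in zip(a, b)]


def decoder_layer(hidden: Vector) -> Vector:
    """Hypothetical stand-in for one language-decoder layer: halves the state."""
    return [0.5 * x for x in hidden]


def late_fusion(hidden: Vector, geo_levels: List[Vector], num_layers: int) -> Vector:
    """Baseline: inject only the final geometry feature, once, then decode."""
    hidden = add(hidden, geo_levels[-1])
    for _ in range(num_layers):
        hidden = decoder_layer(hidden)
    return hidden


def stacked_fusion(hidden: Vector, geo_levels: List[Vector], num_layers: int) -> Vector:
    """Layered: inject geometry level i just before decoder layer i (assumed pairing)."""
    for i in range(num_layers):
        if i < len(geo_levels):
            hidden = add(hidden, geo_levels[i])
        hidden = decoder_layer(hidden)
    return hidden


# Toy inputs: three geometry levels, ordered from low-level to high-level.
geo_levels = [[8.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
text_hidden = [0.0, 0.0]

late_out = late_fusion(text_hidden, geo_levels, num_layers=3)
stacked_out = stacked_fusion(text_hidden, geo_levels, num_layers=3)
print("late fusion:   ", late_out)
print("stacked fusion:", stacked_out)
```

In this toy setting the late-fusion output depends only on the last geometry level, while the stacked output carries a contribution from every level. That is the intuition behind layered fusion; the toy says nothing about how the real model combines or weights its features.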

The project page includes the paper, code, model, and dataset, along with qualitative results and benchmark comparisons on VSI-Bench and CV-Bench.
