SpatialStack
Layered geometry-language fusion for 3D VLM spatial reasoning.
SpatialStack is a hierarchical geometry-language fusion framework for spatial reasoning in 3D vision-language models (VLMs). Instead of injecting a single late-stage geometry feature, it stacks geometric features from multiple levels and fuses them progressively into the language decoder, improving both local geometric precision and high-level spatial reasoning.
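To make the contrast between single-feature and stacked fusion concrete, here is a minimal toy sketch. It is an illustration under stated assumptions and not SpatialStack's actual implementation: the function names (`add_scaled`, `single_feature_fusion`, `stacked_fusion`), the additive gated fusion rule, the one-level-per-decoder-layer schedule, and the `double` stand-in layer are all hypothetical and chosen only to show the idea.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add_scaled(hidden: Vector, geom: Vector, gate: float) -> Vector:
    """Fuse one geometry feature into the hidden state as h + gate * g.

    Additive fusion is an assumption made for this sketch.
    """
    if len(hidden) != len(geom):
        raise ValueError("hidden state and geometry feature must share a dimension")
    return [h + gate * g for h, g in zip(hidden, geom)]


def single_feature_fusion(hidden: Vector, layers: List[Layer],
                          geom: Vector, gate: float = 1.0) -> Vector:
    """Baseline: inject one geometry feature once, then run every decoder layer."""
    hidden = add_scaled(hidden, geom, gate)
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def stacked_fusion(hidden: Vector, layers: List[Layer],
                   geom_levels: List[Vector], gate: float = 1.0) -> Vector:
    """Stacked variant: inject geometry level i just before decoder layer i.

    Pairing level i with layer i is a hypothetical schedule for illustration.
    """
    if len(geom_levels) > len(layers):
        raise ValueError("need at least one decoder layer per geometry level")
    for i, layer in enumerate(layers):
        if i < len(geom_levels):
            hidden = add_scaled(hidden, geom_levels[i], gate)
        hidden = layer(hidden)
    return hidden


def double(hidden: Vector) -> Vector:
    """Toy stand-in for a decoder layer: scales every entry by 2."""
    return [2.0 * h for h in hidden]


if __name__ == "__main__":
    layers = [double, double]
    shallow, deep = [1.0, 1.0], [10.0, 10.0]  # toy low-level and high-level geometry features
    print(single_feature_fusion([0.0, 0.0], layers, deep))
    print(stacked_fusion([0.0, 0.0], layers, [shallow, deep]))
```

In the stacked variant every geometry level reaches the decoder at its own depth, whereas the baseline sees only the single feature it was given; a real system would replace the toy layers with transformer blocks and the additive rule with whatever fusion operator the model defines.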
The project page includes the paper, code, model, and dataset, along with qualitative results and benchmark comparisons on VSI-Bench and CV-Bench.