Jian Zhang 张舰

Hello, I am Jian Zhang (张舰), and I also go by Dylan. Over the past two years, I have had impactful collaborations with Yue Huang and Xinghao Ding, through which I developed core research skills and a clear long-term goal: building systems that can perceive, decide, and act in the physical world as humans do. I believe this direction can fundamentally reshape society. During this period, I also had the opportunity to collaborate with Dr. Zhiwen Fan. I plan to start my PhD at Texas A&M University in Fall 2026.

My early work focused on faster 3D reconstruction and semantic 3D representation. I am now increasingly focused on intelligence for embodied systems in the physical world.

I am currently seeking internship opportunities. Feel free to contact me by email.

selected publications

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

Jian Zhang^*, Shijie Zhou^*, Bangya Liu^*, Achuta Kadambi, and Zhiwen Fan

In CVPR, 2026

Large vision-language models still struggle with reliable 3D spatial reasoning because conventional geometry fusion discards rich hierarchical signals from the geometry encoder. SpatialStack progressively aligns vision, geometry, and language representations across the model hierarchy by stacking multi-level geometry features and injecting them into the language decoder. Building on this framework, VLM-SpatialStack achieves state-of-the-art results on multiple 3D spatial reasoning benchmarks and demonstrates strong generalization across diverse spatial understanding tasks.

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan^*, Jian Zhang^*, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, and Rakesh Ranjan

In CVPR, 2026

The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth sensors for geometry capture or utilize off-the-shelf algorithms for pre-constructing 3D maps, thereby limiting their scalability, especially with prevalent monocular video inputs and for time-sensitive applications. In this work, we introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames by employing a geometry encoder to derive implicit 3D tokens that represent spatial understanding. Leveraging our Spatial-Visual-View Fusion and over 200K curated 3D reconstructive instruction tuning question-answer (QA) pairs, VLM-3R effectively aligns real-world spatial context with language instructions. This enables monocular 3D spatial assistance and embodied reasoning. To facilitate the evaluation of temporal reasoning, we introduce the Vision-Spatial-Temporal Intelligence benchmark, featuring over 138.6K QA pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments demonstrate that our model, VLM-3R, not only facilitates robust visual-spatial reasoning but also enables the understanding of temporal 3D context changes, excelling in both accuracy and scalability.

Large Spatial Model: End-to-end Unposed Images to Semantic 3D

Zhiwen Fan^*, Jian Zhang^*, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, Boris Ivanovic, Marco Pavone, and Yue Wang

In NeurIPS, 2024

Reconstructing and understanding 3D structures from a limited number of images is a well-established problem in computer vision. Traditional methods usually break this task into multiple subtasks, each requiring complex transformations between different data representations. For instance, dense reconstruction through Structure-from-Motion (SfM) involves converting images into key points, optimizing camera parameters, and estimating structures. Afterward, accurate sparse reconstructions are required for further dense modeling, which is subsequently fed into task-specific neural networks. This multi-step process results in considerable processing time and increased engineering complexity. In this work, we present the Large Spatial Model (LSM), which processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation, and it can generate versatile label maps by interacting with language at novel viewpoints. Leveraging a Transformer-based architecture, LSM integrates global geometry through pixel-aligned point maps. To enhance spatial attribute regression, we incorporate local context aggregation with multi-scale fusion, improving the accuracy of fine local details. To tackle the scarcity of labeled 3D semantic data and enable natural language-driven scene manipulation, we incorporate a pre-trained 2D language-based segmentation model into a 3D-consistent semantic feature field. An efficient decoder then parameterizes a set of semantic anisotropic Gaussians, facilitating supervised end-to-end learning. Extensive experiments across various tasks show that LSM unifies multiple 3D vision tasks directly from unposed images, achieving real-time semantic 3D reconstruction for the first time.