Research direction

3D Vision

Spatial intelligence for the physical world—reconstruction, understanding, generation, and embodied interaction.

3D Vision is a foundational direction for building AI systems that understand, reconstruct, generate, and interact with the physical world. While 2D vision focuses on images and videos as visual observations, 3D vision asks a deeper question: how can machines recover the structure of the world behind visual signals, reason about space and geometry, and use that understanding to support perception, creation, simulation, and action? This makes 3D Vision a strategic research area that connects computer vision, graphics, robotics, embodied AI, augmented reality, digital twins, and world modeling.

The importance of 3D Vision comes from the fact that intelligence in the real world is inherently spatial. Humans do not merely recognize objects; we also understand where things are, how they are shaped, how they relate to each other, how they can be manipulated, and how scenes change when we move through them. Future AI systems will need similar spatial competence. They must build persistent representations of environments, reason across viewpoints, infer hidden structure from partial observations, understand object geometry and scene layout, and support interaction in both physical and virtual spaces. In this sense, 3D Vision provides the geometric foundation for visual intelligence.

Our vision is to develop 3D visual systems that move beyond passive reconstruction toward spatial understanding, generative modeling, and embodied interaction. A strong 3D system should not only reconstruct what is visible, but also infer what is occluded, align perception across time and viewpoints, understand semantic and functional relationships, and represent the world in a way that can support reasoning and action. This requires models that combine geometry, appearance, semantics, dynamics, and physical structure in a unified spatial representation. At a strategic level, our work connects reconstruction and representation, scene understanding over time, controllable 3D creation, and embodied world models into one research program.

Reconstruction and spatial representation

The first goal of 3D Vision is to recover the structure of scenes, objects, and environments from visual observations. This includes reasoning about geometry, depth, surfaces, volumes, camera motion, and multi-view consistency. More broadly, it asks how visual systems should represent the world so that the representation is accurate, efficient, editable, and useful for downstream reasoning. We study implicit surfaces, point clouds, meshes, neural fields, and Gaussian splatting as complementary scene representations, with emphasis on scalability, photorealism, and interfaces that support editing, querying, and fusion across views and modalities.

Scene understanding and dynamic 3D perception

Geometry alone is not enough. AI systems must understand what objects are present, how they are arranged, which parts belong together, what functions they serve, and how they interact with people and other objects. 3D scene understanding connects spatial structure with semantics, affordances, physical constraints, and task-level reasoning. The real world is not static: objects move, people act, cameras change viewpoint, and environments evolve over time. A central challenge is to build 3D systems that model dynamic scenes, maintain temporal consistency, track changes, and reason about motion and interaction in space—linking reconstruction with recognition, grounding, and long-horizon spatial memory.

Generation, embodiment, and world modeling

3D content is becoming a core medium for design, simulation, games, virtual production, augmented reality, and digital environments. Future systems should generate and edit 3D assets, objects, scenes, and worlds with controllable geometry, appearance, semantics, and physical plausibility. 3D Vision is equally essential for agents that act in the world: robots, embodied assistants, and interactive systems need spatial maps, object-level understanding, navigation, manipulation awareness, and prediction of how actions change the environment. Long-term world models must capture not only appearance but also structure, evolution, and interaction. Such models support simulation, planning, digital twins, and physical reasoning, as well as human-centered applications ranging from AR and spatial computing to cultural heritage, medical visualization, and assistive systems.

More concretely, our 3D Vision agenda includes the following technical directions:

  • 3D reconstruction and spatial representation — Recovering scene, object, and environment structure from visual observations; geometry, depth, surfaces, volumes, camera motion, and multi-view consistency; representations that are accurate, efficient, editable, and useful for downstream reasoning.
  • 3D scene understanding — Objects, layout, parts, functions, and interactions; connecting spatial structure with semantics, affordances, physical constraints, and task-level reasoning.
  • Dynamic 3D perception — Modeling non-static scenes, temporal consistency, tracking, and motion or interaction reasoning as viewpoints and environments evolve.
  • 3D generation and controllable creation — Generating and editing assets, objects, scenes, and worlds with controllable geometry, appearance, semantics, and physical plausibility; links to generative AI and creative tools.
  • Embodied AI and robotics — Spatial maps, object understanding, navigation, manipulation awareness, and action-conditioned prediction as a bridge between perception and action.
  • World modeling and simulation — Representations that capture structure, evolution, and agent interaction for simulation, planning, digital twins, and interactive AI.
  • Human-centered spatial intelligence — Interpretable, controllable spatial representations for AR, spatial computing, immersive communication, cultural heritage, medical visualization, architecture, education, and assistive systems.

Our long-term ambition is to build AI systems that can perceive and reason about the world in space. Such systems should understand geometry, semantics, motion, interaction, and physical structure as parts of a unified representation. They should reconstruct the real world, generate new 3D content, simulate possible futures, and support agents that act in physical and virtual environments. In this framing, 3D Vision is not merely a collection of reconstruction tasks. It is a strategic foundation for spatial intelligence, connecting geometry, perception, generation, simulation, and embodied action so that the next generation of AI systems not only sees the world but also understands where things are, how they relate, how they change, and how intelligent agents can operate within it.

Projects & Demo

COM4D

Compositional 4D scene reconstruction from monocular video — CVPR 2026.

Jun 15, 2026

SceneSplat

Open-vocabulary 3DGS scene understanding with vision-language pretraining — ICCV 2025 Oral.

Oct 13, 2025

Publications

2026 · CVPR

SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, Danda Pani Paudel

2026 · CVPR

ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction

Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Ruiping Liu, Fei Teng, Kai Luo, Zhiyong Li, Kailun Yang

2026 · CVPR

Inferring Compositional 4D Scenes without Ever Seeing One

Ahmet Berke Gokmen, Ajad Chhatkuli, Luc Van Gool, Danda Pani Paudel

2026 · CVPR

ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors

Liming Kuang, Dani Velikova, Mahdi Saleh, Jan-Nico Zaech, Danda Pani Paudel, Benjamin Busam

2026 · CVPR

Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding

Yue Li, Qi Ma, Runyi Yang, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Theo Gevers, Luc Van Gool, Danda Pani Paudel, Martin R. Oswald

2025 · NeurIPS

GaussianWorld: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting

Mengjiao Ma, Qi Ma, Yue Li, Jiahuan Cheng, Runyi Yang, Bin Ren, Nikola Popovic, Mingqiang Wei, Nicu Sebe, Ender Konukoglu, Luc Van Gool, Theo Gevers, Martin R. Oswald, Danda Pani Paudel

2025 · BMVC

Occam’s LGS: An Efficient Approach for Language Gaussian Splatting

Jiahuan Cheng, Jan-Nico Zaech, Luc Van Gool, Danda Pani Paudel

2025 · ICCV

SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining

Yue Li, Qi Ma, Runyi Yang, Huapeng Li, Mengjiao Ma, Bin Ren, Nikola Popovic, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, Martin R. Oswald, Danda Pani Paudel

2025 · ICCV

Articulate3D: Holistic Understanding of 3D Scenes as Universal Scene Description

Anna-Maria Halacheva, Yang Miao, Jan-Nico Zaech, Xi Wang, Luc Van Gool, Danda Pani Paudel

2025 · ICCV

3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

Yung-Hsu Yang, Luigi Piccinelli, Mattia Segu, Siyuan Li, Rui Huang, Yuqian Fu, Marc Pollefeys, Hermann Blum, Zuria Bauer

2025 · CVPR

UniK3D: Universal Camera Monocular 3D Estimation

Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, Luc Van Gool

2025 · CVPRW

Splat-SLAM: Globally Optimized RGB-only SLAM with 3D Gaussians

Erik Sandström, Ganlin Zhang, Keisuke Tateno, Michael Oechsle, Michael Niemeyer, Youmin Zhang, Manthan Patel, Luc Van Gool, Martin R. Oswald, Federico Tombari

2025 · CVPR

PBR-NeRF: Inverse Rendering with Physics-Based Neural Fields

Sean Wu, Shamik Basu, Tim Broedermann, Luc Van Gool, Christos Sakaridis

2025 · CVPR

One2Any: One-Reference 6D Pose Estimation for Any Object

Mengya Liu, Siyuan Li, Ajad Chhatkuli, Prune Truong, Luc Van Gool, Federico Tombari

2025 · CVPR

GaussianVLM: Scene-centric 3D Vision-Language Models using Language-aligned Gaussian Splats for Embodied Reasoning and Beyond

Anna-Maria Halacheva, Jan-Nico Zaech, Xi Wang, Danda Pani Paudel, Luc Van Gool

2025 · CVPRW

Camera-Only 3D Panoptic Scene Completion for Autonomous Driving through Differentiable Object Shapes

Nicola Marinello, Simen Cassiman, Jonas Heylen, Marc Proesmans, Luc Van Gool

2025 · 3DV

EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

Daiwei Zhang, Gengyan Li, Jiajie Li, Mickaël Bressieux, Otmar Hilliges, Marc Pollefeys, Luc Van Gool, Xi Wang

2025 · 3DV

Mipmap-GS: Let Gaussians Deform with Scale-Specific Mipmap for Anti-Aliasing Rendering

Jiameng Li, Yue Shi, Jiezhang Cao, Bingbing Ni, Wenjun Zhang, Kai Zhang, Luc Van Gool

2025 · 3DV

A Large-Scale Dataset of Gaussian Splats and Their Self-Supervised Pretraining

Qi Ma, Yue Li, Bin Ren, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, Danda Pani Paudel

2025 · WACV

Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding

Sombit Dey, Ozan Unal, Christos Sakaridis, Luc Van Gool

2024 · NeurIPS

Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes

Qi Ma, Danda Pani Paudel, Ender Konukoglu, Luc Van Gool

2024 · IJCV

Neural Vector Fields for Implicit Surface Representation and Inference

Edoardo Mello Rella, Ajad Chhatkuli, Ender Konukoglu, Luc Van Gool

2024 · ACCV

Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-supervised Learning

Bin Ren, Guofeng Mei, Danda Pani Paudel, Weijie Wang, Yawei Li, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Nicu Sebe

2024 · IROS

Ternary-type Opacity and Hybrid Odometry for RGB-only NeRF-SLAM

Junru Lin, Asen Nachkov, Songyou Peng, Luc Van Gool, Danda Pani Paudel

2024 · ECCV

Self-supervised Shape Completion via Involution and Implicit Correspondences

Mengya Liu, Ajad Chhatkuli, Janis Postels, Luc Van Gool, Federico Tombari

2024 · ECCVW

ROMEO: Revisiting Optimization Methods for Reconstructing 3D Human-Object Interaction Models From Images

Alexey Gavryushin, Yifei Liu, Daoji Huang, Yen-Ling Kuo, Julien Valentin, Luc Van Gool, Otmar Hilliges, Xi Wang

2024 · ECCV

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Ozan Unal, Christos Sakaridis, Suman Saha, Luc Van Gool

2024 · ECCV

Bayesian Self-Training for Semi-Supervised 3D Segmentation

Ozan Unal, Christos Sakaridis, Luc Van Gool

2024 · ICML

Stereo Risk: A Continuous Modeling Approach to Stereo Matching

Ce Liu, Suryansh Kumar, Shuhang Gu, Radu Timofte, Yao Yao, Luc Van Gool

2024 · CVPR

UniDepth: Universal Monocular Metric Depth Estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, Fisher Yu

2024 · CVPR

Rethinking Few-shot 3D Point Cloud Semantic Segmentation

Zhaochong An, Guolei Sun, Yun Liu, Fayao Liu, Zongwei Wu, Dan Wang, Luc Van Gool, Serge Belongie

2024 · CVPR

Loopy-SLAM: Dense Neural SLAM with Loop Closures

Lorenzo Liso, Erik Sandström, Vladimir Yugay, Luc Van Gool, Martin R. Oswald

2024 · CVPR

Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning

Rui Li, Tobias Fischer, Mattia Segu, Marc Pollefeys, Luc Van Gool, Federico Tombari

2024 · CVPR

HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud

Wencan Cheng, Hao Tang, Luc Van Gool, Jong Hwan Ko

2024 · CVPR

Continuous Pose for Monocular Cameras in Neural Implicit Representation

Qi Ma, Danda Pani Paudel, Ajad Chhatkuli, Luc Van Gool

2024 · WACVW

2D Feature Distillation for Weakly- and Semi-Supervised 3D Semantic Segmentation

Ozan Unal, Dengxin Dai, Lukas Hoyer, Yigit Baran Can, Luc Van Gool

2023 · NeurIPS

LART: Neural Correspondence Learning with Latent Regularization Transformer for 3D Motion Transfer

Haoyu Chen, Hao Tang, Radu Timofte, Luc Van Gool, Guoying Zhao

2023 · NeurIPS

Autodecoding Latent 3D Diffusion Models

Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc Van Gool, Sergey Tulyakov

2023 · ICCV

Surface Normal Clustering for Implicit Representation of Manhattan Scenes

Nikola Popovic, Danda Pani Paudel, Luc Van Gool

2023 · ICCV

Deformable Neural Radiance Fields using RGB and Event Cameras

Qi Ma, Danda Pani Paudel, Ajad Chhatkuli, Luc Van Gool