3D Vision
Spatial intelligence for the physical world—reconstruction, understanding, generation, and embodied interaction.
3D Vision is a foundational direction for building AI systems that understand, reconstruct, generate, and interact with the physical world. While 2D vision focuses on images and videos as visual observations, 3D vision asks a deeper question: how can machines recover the structure of the world behind visual signals, reason about space and geometry, and use that understanding to support perception, creation, simulation, and action? This makes 3D Vision a strategic research area that connects computer vision, graphics, robotics, embodied AI, augmented reality, digital twins, and world modeling.
The importance of 3D Vision comes from the fact that intelligence in the real world is inherently spatial. Humans do more than recognize objects: we understand where things are, how they are shaped, how they relate to each other, how they can be manipulated, and how scenes change when we move through them. Future AI systems will need similar spatial competence. They must build persistent representations of environments, reason across viewpoints, infer hidden structure from partial observations, understand object geometry and scene layout, and support interaction in both physical and virtual spaces. In this sense, 3D Vision provides the geometric foundation for visual intelligence.
Our vision is to develop 3D visual systems that move beyond passive reconstruction toward spatial understanding, generative modeling, and embodied interaction. A strong 3D system should not only reconstruct what is visible, but also infer what is occluded, align perception across time and viewpoints, understand semantic and functional relationships, and represent the world in a way that can support reasoning and action. This requires models that combine geometry, appearance, semantics, dynamics, and physical structure in a unified spatial representation. At a strategic level, our work connects reconstruction and representation, scene understanding over time, controllable 3D creation, and embodied world models into one research program.
Reconstruction and spatial representation
The first goal of 3D Vision is to recover the structure of scenes, objects, and environments from visual observations. This includes reasoning about geometry, depth, surfaces, volumes, camera motion, and multi-view consistency. More broadly, it asks how visual systems should represent the world so that the representation is accurate, efficient, editable, and useful for downstream reasoning. We study implicit surfaces, point clouds, meshes, neural fields, and Gaussian splatting as complementary scene representations, with emphasis on scalability, photorealism, and interfaces that support editing, querying, and fusion across views and modalities.
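As a small illustration of how depth, camera geometry, and point clouds relate, the sketch below lifts a depth map into camera-frame 3D points and projects them back to pixels. It assumes an ideal pinhole camera with no lens distortion. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical choices made for this example, not part of any specific system described here.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def backproject(depth: Sequence[Sequence[float]],
                fx: float, fy: float, cx: float, cy: float) -> List[Point3]:
    """Lift a depth map (rows of depths along the optical axis) to 3D points.

    Assumes a pinhole camera; pixels with non-positive depth are treated
    as missing measurements and skipped.
    """
    points: List[Point3] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def project(point: Point3,
            fx: float, fy: float, cx: float, cy: float) -> Pixel:
    """Project a camera-frame 3D point (with z > 0) to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_error(point: Point3, pixel: Pixel,
                       fx: float, fy: float, cx: float, cy: float) -> float:
    """Euclidean pixel distance between a projected point and an observation."""
    u, v = project(point, fx, fy, cx, cy)
    return ((u - pixel[0]) ** 2 + (v - pixel[1]) ** 2) ** 0.5
```

Back-projecting a pixel and projecting the result should return the original pixel, so the reprojection error is zero up to floating-point rounding. Evaluated against observations from other views (after transforming the point into each camera's frame, which is omitted here), the same error is a common way to measure multi-view consistency.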
Scene understanding and dynamic 3D perception
Geometry alone is not enough. AI systems must understand what objects are present, how they are arranged, which parts belong together, what functions they serve, and how they interact with people and other objects. 3D scene understanding connects spatial structure with semantics, affordances, physical constraints, and task-level reasoning. The real world is not static: objects move, people act, cameras change viewpoint, and environments evolve over time. A central challenge is to build 3D systems that model dynamic scenes, maintain temporal consistency, track changes, and reason about motion and interaction in space—linking reconstruction with recognition, grounding, and long-horizon spatial memory.
Generation, embodiment, and world modeling
3D content is becoming a core medium for design, simulation, games, virtual production, augmented reality, and digital environments. Future systems should generate and edit 3D assets, objects, scenes, and worlds with controllable geometry, appearance, semantics, and physical plausibility. 3D Vision is equally essential for agents that act in the world: robots, embodied assistants, and interactive systems need spatial maps, object-level understanding, navigation, manipulation awareness, and prediction of how actions change the environment. Long-term world models must capture not only appearance but also structure, evolution, and interaction, supporting simulation, planning, digital twins, physical reasoning, and human-centered applications from AR and spatial computing to cultural heritage, medical visualization, and assistive systems.
More concretely, our 3D Vision agenda includes the following technical directions:
- 3D reconstruction and spatial representation — Recovering scene, object, and environment structure from visual observations; geometry, depth, surfaces, volumes, camera motion, and multi-view consistency; representations that are accurate, efficient, editable, and useful for downstream reasoning.
- 3D scene understanding — Objects, layout, parts, functions, and interactions; connecting spatial structure with semantics, affordances, physical constraints, and task-level reasoning.
- Dynamic 3D perception — Modeling non-static scenes, temporal consistency, tracking, and motion or interaction reasoning as viewpoints and environments evolve.
- 3D generation and controllable creation — Generating and editing assets, objects, scenes, and worlds with controllable geometry, appearance, semantics, and physical plausibility; links to generative AI and creative tools.
- Embodied AI and robotics — Spatial maps, object understanding, navigation, manipulation awareness, and action-conditioned prediction as a bridge between perception and action.
- World modeling and simulation — Representations that capture structure, evolution, and agent interaction for simulation, planning, digital twins, and interactive AI.
- Human-centered spatial intelligence — Interpretable, controllable spatial representations for AR, spatial computing, immersive communication, cultural heritage, medical visualization, architecture, education, and assistive systems.
Our long-term ambition is to build AI systems that can perceive and reason about the world in space. Such systems should understand geometry, semantics, motion, interaction, and physical structure as parts of a unified representation. They should reconstruct the real world, generate new 3D content, simulate possible futures, and support agents that act in physical and virtual environments. In this framing, 3D Vision is not merely a collection of reconstruction tasks. It is a strategic foundation for spatial intelligence, connecting geometry, perception, generation, simulation, and embodied action so that the next generation of AI systems not only sees the world but also understands where things are, how they relate, how they change, and how intelligent agents can operate within it.