Egocentric Vision

Egocentric Vision studies visual intelligence from the viewpoint of an active observer. Unlike conventional third-person vision, which observes the world from the outside, egocentric vision captures what a person sees, touches, attends to, remembers, and acts upon. This first-person perspective makes it a uniquely important direction for the next generation of AI systems, where perception is no longer separated from action, context, and human intention.

As computing moves from screens to wearable devices, augmented reality, personal assistants, robotics, and embodied agents, visual intelligence must become more situated, continuous, and personalized. Future AI systems will not only analyze isolated images or short videos. They will need to understand a user’s ongoing experience: where the user is, what objects are nearby, what actions are taking place, what has happened before, what the user may want next, and how an intelligent system can assist without disrupting natural activity. Egocentric Vision provides the perceptual foundation for this shift.

The strategic value of Egocentric Vision lies in its ability to connect perception, memory, action, and assistance. A first-person visual stream naturally contains information about human attention, hand-object interaction, physical affordances, spatial layout, temporal continuity, and task context. These signals are essential for building AI systems that can understand daily activities, support real-world decision making, assist users in complex environments, and eventually operate as embodied or wearable agents. In this sense, egocentric vision is not only a subfield of video understanding. It is a foundation for human-centered and action-aware AI.

Our research vision is to develop egocentric visual systems that can understand the world as an intelligent assistant would: continuously, multimodally, contextually, and with awareness of human goals. This requires models that go beyond recognizing objects or actions in individual frames. They must build persistent memory over long time horizons, reason about the relationship between people and objects, infer intention from motion and attention, connect language with first-person experience, and support interaction with users and environments.

We organize Egocentric Vision around several high-level research themes:

First-person perception and multimodal grounding: Egocentric systems must understand visual scenes from a moving, partial, and embodied viewpoint. This includes recognizing objects, hands, actions, spatial layout, and environmental context, while grounding language, audio, gaze, motion, and other signals in the first-person visual stream.
Long-horizon memory and temporal understanding: First-person experience is continuous. A useful egocentric AI system must remember what has happened, retrieve relevant past events, track object locations over time, understand long activities, and reason over temporal structure beyond short clips.
Human-object interaction and action understanding: Egocentric video is especially rich in interaction. It captures how people manipulate objects, perform tasks, make choices, and respond to the environment. Understanding these interactions is central to building systems that can recognize intent, anticipate future actions, and provide timely assistance.
Wearable and assistive intelligence: Egocentric Vision is closely tied to wearable computing and assistive AI. The goal is to support systems that can help users navigate environments, retrieve information, understand surroundings, follow procedures, recall past events, and interact naturally through language and visual context.
Embodied AI and world modeling: First-person vision provides a natural bridge between visual perception and embodied action. It can help AI systems learn how the world changes through interaction, how objects afford actions, how tasks unfold over time, and how an agent should act in a dynamic environment.
Privacy, personalization, and responsible deployment: Because egocentric systems are close to human life, privacy and trust are not secondary concerns. Future systems must learn from personal context while respecting bystanders, minimizing unnecessary data exposure, supporting user control, and enabling safe deployment in real environments.

The long-term ambition of Egocentric Vision is to build AI systems that understand human experience from the inside. Such systems should not merely observe the world, but understand the relationship between the observer and the world: what the user sees, what the user is doing, what the user remembers, and what the user may need next. By connecting first-person perception with multimodal understanding, temporal memory, human-object interaction, and embodied reasoning, Egocentric Vision can become a foundation for wearable assistants, augmented reality, robotics, and human-centered Vision Agents.

In this framing, Egocentric Vision is a strategic research direction because it addresses one of the central challenges of future AI: how to move from passive visual recognition to situated intelligence that can understand, assist, and act in the real world from the human point of view.

Projects & Demo

EgoNight

ObjectRelator

Publications

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

ObjectRelator: Enabling Cross-View Object Relation Understanding in Ego-Centric and Exo-Centric Videos

EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting

Predicting Actions through Language Models @ Ego4D Long-Term Action Anticipation Challenge 2023