Vision Agent

Vision Agent: from understanding to action, from perception to execution

Vision Agent is a core direction for building the next generation of visual intelligence systems. While traditional computer vision has focused on perception, recognition, generation, or editing as individual capabilities, real-world visual tasks often require much more than a single model prediction. They require a system that can understand multimodal inputs, reason about user intent, decompose a complex goal into executable steps, call external tools, inspect intermediate results, recover from errors, and iteratively improve its output. This is the central motivation for Vision Agent: to move from models that can see, describe, and generate, toward systems that can plan, act, verify, and complete complex visual tasks.

Multimodal Foundation Models

The foundation of Vision Agent is the multimodal foundation model. A strong visual agent must first be able to understand and connect language, images, videos, audio, spatial layouts, temporal events, and user feedback. Multimodal models provide the perceptual and reasoning backbone: they parse visual scenes, track objects and actions over time, interpret instructions, ground language in visual evidence, and maintain context across multiple turns. In this sense, multimodality is not a separate topic from agent research. It is the substrate on which visual agents are built. Without robust multimodal understanding, an agent cannot reliably decide what to do, which tool to call, whether a result is correct, or how to revise its plan.

Vision Agent as an Execution Layer

Built on top of this foundation, Vision Agent introduces an execution layer for complex visual tasks. A Vision Agent is not simply a vision-language model wrapped in an interface. It is a closed-loop system that integrates perception, planning, tool use, memory, verification, and refinement. Given a high-level instruction, the agent should be able to analyze the visual context, break the task into subtasks, select appropriate models or software tools, execute each step, evaluate the intermediate and final outputs, and revise the workflow when necessary. This agentic formulation is especially important for tasks that cannot be solved by a single forward pass, such as long-video analysis, multi-step visual reasoning, image and video editing, visual media production, user-interface operation, robotics simulation, and interactive environment understanding.
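As an illustration only, the closed loop described above (plan, execute, verify, retry, remember) can be sketched in a few lines of Python. Every name here (Step, AgentState, run_agent, and the toy planner, tools, and verifier) is hypothetical and stands in for real models and software; this is a minimal sketch under those assumptions, not a description of an actual Vision Agent implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    tool: str  # name of the tool this step calls (hypothetical)
    arg: str   # argument passed to the tool


@dataclass
class AgentState:
    memory: List[str] = field(default_factory=list)   # trace of every attempt
    outputs: List[str] = field(default_factory=list)  # results that passed verification


def run_agent(
    goal: str,
    plan: Callable[[str], List[Step]],
    tools: Dict[str, Callable[[str, int], str]],
    verify: Callable[[Step, str], bool],
    max_retries: int = 2,
) -> AgentState:
    """Plan once, then execute each step, verifying and retrying on failure."""
    state = AgentState()
    for step in plan(goal):
        for attempt in range(max_retries + 1):
            result = tools[step.tool](step.arg, attempt)
            ok = verify(step, result)
            state.memory.append(f"{step.tool}:{attempt}:{'ok' if ok else 'fail'}")
            if ok:
                state.outputs.append(result)
                break
        else:
            # No attempt passed verification: record it and stop the workflow.
            state.memory.append(f"{step.tool}:gave_up")
            break
    return state


# Toy stand-ins (assumptions): a fixed two-step plan, a detector that always
# succeeds, an editor that fails on its first attempt, and a trivial verifier.
def toy_plan(goal: str) -> List[Step]:
    return [Step("detect", goal), Step("edit", goal)]


def detect(arg: str, attempt: int) -> str:
    return f"boxes[{arg}]"


def edit(arg: str, attempt: int) -> str:
    return "" if attempt == 0 else f"edited[{arg}]"


def toy_verify(step: Step, result: str) -> bool:
    return bool(result)


state = run_agent(
    "remove the sign", toy_plan, {"detect": detect, "edit": edit}, toy_verify
)
print(state.memory)   # ['detect:0:ok', 'edit:0:fail', 'edit:1:ok']
print(state.outputs)  # ['boxes[remove the sign]', 'edited[remove the sign]']
```

In this sketch the plan is fixed up front and only individual steps are retried. A fuller system would presumably also let the verifier's feedback trigger replanning.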

Agentic AI for Visual Media

A major strategic direction within Vision Agent is Agentic AI for Visual Media. The goal is to construct autonomous creative agents that can perform image and video processing, editing, generation, evaluation, and refinement as one end-to-end workflow. Such agents should be able to understand a creative goal, choose among restoration, editing, generation, quality assessment, and layout tools, and iteratively improve the output until it satisfies both visual quality and user intent. This shifts the focus of visual media research from building isolated models to building intelligent creative systems. In the long run, these systems may become professional visual assistants for filmmaking, design, advertising, education, digital content creation, virtual production, and interactive media.
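The idea of assessing an output, choosing a tool, and repeating until the quality bar is met can be sketched as a small loop. The example below is a toy under loud assumptions: an "image" is just a dictionary of integer quality scores from 0 to 100, the assessor simply reads those scores, and the denoise and sharpen functions are hypothetical placeholders for real restoration and editing tools.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-in for an image: quality attributes scored 0..100 (assumption).
Image = Dict[str, int]


def denoise(img: Image) -> Image:
    # Hypothetical restoration tool: improves cleanliness by a fixed amount.
    return {**img, "cleanliness": min(100, img["cleanliness"] + 30)}


def sharpen(img: Image) -> Image:
    # Hypothetical editing tool: improves sharpness by a fixed amount.
    return {**img, "sharpness": min(100, img["sharpness"] + 25)}


# Which tool addresses which quality criterion.
TOOLS: Dict[str, Callable[[Image], Image]] = {
    "cleanliness": denoise,
    "sharpness": sharpen,
}


def assess(img: Image) -> Dict[str, int]:
    # Hypothetical quality assessor; a real one would be a learned model.
    return dict(img)


def refine(img: Image, threshold: int = 80, max_steps: int = 10) -> Tuple[Image, List[str]]:
    """Repeatedly fix the weakest criterion until all pass or the budget runs out."""
    history: List[str] = []
    for _ in range(max_steps):
        scores = assess(img)
        worst = min(scores, key=scores.get)
        if scores[worst] >= threshold:
            break  # every criterion meets the bar
        img = TOOLS[worst](img)
        history.append(worst)
    return img, history


final, history = refine({"cleanliness": 40, "sharpness": 50})
print(history)  # ['cleanliness', 'sharpness', 'cleanliness', 'sharpness']
print(final)    # {'cleanliness': 100, 'sharpness': 100}
```

The step budget (max_steps) matters in practice: without it, a tool that cannot actually improve the weakest score would loop forever.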

More concretely, our Vision Agent agenda includes the following technical directions:

Multimodal foundation models — Unified models for language, images, videos, audio, and interaction, with emphasis on visual grounding, long-video understanding, temporal reasoning, spatial structure, and instruction following.
Visual perception and spatiotemporal understanding — Methods that allow agents to understand objects, actions, scenes, camera motion, temporal events, object permanence, and cross-frame consistency in images, videos, and interactive environments.
Task decomposition and planning — Agent systems that transform high-level user goals into executable workflows, including step-by-step planning, subtask decomposition, plan revision, failure recovery, and long-horizon visual reasoning.
Tool use and visual workflow execution — Agents that can call external tools such as detectors, segmenters, editors, generators, quality assessors, search systems, video tools, GUI tools, and domain-specific software APIs.
Agentic AI for Visual Media — Autonomous creative agents for image and video processing, editing, generation, evaluation, and refinement, with applications in visual content creation, post-production, design, and interactive media.
Verification, evaluation, and self-refinement — Evaluator and verifier models that judge fidelity, quality, temporal consistency, semantic correctness, instruction alignment, and human preference, enabling agents to inspect and improve their own outputs.
Memory and interaction — Multi-turn agent memory for preserving user intent, visual context, editing history, intermediate results, and feedback, enabling interactive and human-in-the-loop visual workflows.
World models and embodied visual agents — Agents that operate not only on static images or videos, but also in simulated, interactive, or physical environments, connecting visual reasoning with action, prediction, and environment dynamics.
Deployment and reliable systems — Engineering frameworks for scalable, observable, reproducible, and safe visual agents, including sandboxed tool execution, logging, workflow replay, latency optimization, and robust failure handling.
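
To make the tool-use and deployment directions above more concrete, here is a minimal sketch of a tool registry that logs every call and can replay a saved workflow. The ToolRegistry class and the two toy tools are hypothetical, invented for illustration. Note that this is not a sandbox: it only shows logging, failure capture, and replay, and assumes tool arguments and results are JSON-serializable.

```python
import json
from typing import Any, Callable, Dict, List


class ToolRegistry:
    """Hypothetical registry that records each tool call for later replay."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self.log: List[Dict[str, Any]] = []

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        record: Dict[str, Any] = {"tool": name, "args": kwargs}
        try:
            record["result"] = self._tools[name](**kwargs)
            record["status"] = "ok"
        except Exception as exc:
            # A failing tool is logged instead of crashing the whole workflow.
            record["result"] = None
            record["status"] = f"error: {exc}"
        self.log.append(record)
        return record["result"]

    def replay(self, log: List[Dict[str, Any]]) -> List[Any]:
        # Re-run a previously saved sequence of calls, in order.
        return [self.call(r["tool"], **r["args"]) for r in list(log)]


def make_registry() -> ToolRegistry:
    # Toy tools (assumptions) standing in for real editors and video utilities.
    reg = ToolRegistry()
    reg.register("resize", lambda width, height, scale: [width * scale, height * scale])
    reg.register("count_frames", lambda seconds, fps: seconds * fps)
    return reg


registry = make_registry()
registry.call("resize", width=640, height=480, scale=2)
registry.call("count_frames", seconds=3, fps=24)

saved = json.dumps(registry.log)  # persist the workflow as JSON

fresh = make_registry()
replayed = fresh.replay(json.loads(saved))
print(replayed)  # [[1280, 960], 72]
```

Replay as sketched here only reproduces results when the tools are deterministic. For generative tools, a real system would also need to log seeds and model versions.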

Our long-term ambition is to build Vision Agents that are perceptually grounded, multimodally intelligent, tool-using, self-verifying, and capable of executing complex visual workflows with minimal human supervision. By starting from multimodal foundation models and extending them into planning, tool use, verification, and autonomous execution, we aim to develop visual intelligence systems that do not merely understand or generate visual content, but can actively operate on the visual world.