Visual Media

Images and videos are no longer only records of the physical world. They are becoming the primary interface through which people communicate, create, simulate, learn, and interact with intelligent systems. At the same time, visual content is entering a new stage: it must be captured under imperfect real-world conditions, restored with high fidelity, edited with precise human control, generated with semantic and physical consistency, and eventually used as a medium for modeling dynamic worlds. This makes Visual Media a foundational research area that connects low-level vision, generative modeling, multimodal intelligence, creative tools, and real-world AI applications.

Our vision is to build the next generation of visual media technologies along three tightly connected pillars: Media Processing, Media Editing, and Media Generation. These are not separate topics, but different levels of the same problem. Media Processing asks how to recover, enhance, and understand imperfect visual signals. Media Editing asks how to modify existing visual content while preserving identity, structure, style, and user intent. Media Generation asks how to synthesize new visual worlds from language, reference images, layouts, videos, and multimodal conditions. Together, they form a complete pipeline from visual signal to controllable creation.

Media Processing

Media Processing remains a fundamental layer of Visual Media research. Real-world images and videos are often degraded by noise, blur, compression, low resolution, poor lighting, motion, weather, sensor limitations, and temporal instability. These degradations are not merely cosmetic problems. They affect downstream perception, creative reuse, scientific analysis, cultural preservation, autonomous systems, and human communication. High-quality restoration and enhancement therefore serve as the entry point for trustworthy visual intelligence. Our work in this direction studies how to combine classical signal fidelity, learned generative priors, temporal coherence, perceptual quality, and real-world robustness. The goal is to move beyond benchmark-specific restoration toward systems that are fast, faithful, controllable, and useful in real deployment scenarios, from mobile imaging and video streaming to archival restoration and professional production.

INPUT RESTORED

Media Editing

Media Editing is the bridge between understanding and creation. Editing is generation under strong constraints: the system must understand what the user wants to change, what must remain untouched, and how the edit should respect the original image or video. This requires precise control over geometry, identity, lighting, motion, texture, style, and temporal consistency. It also requires interfaces that allow humans to specify intent through language, examples, sketches, masks, timelines, or multimodal instructions. We view editing as a key step toward practical visual intelligence because most real creative workflows are not purely generative. They are iterative, conditional, and human-in-the-loop. A powerful visual media system should not only produce beautiful content from scratch, but also revise, repair, extend, localize, and refine existing content with professional-level control.

INPUT + MASK EDITED

Media Generation

Media Generation represents a deeper transformation. Image and video generation are becoming a new form of media, not just a tool for producing assets. Video generation, in particular, is moving toward the ability to synthesize temporally coherent scenes, persistent characters, realistic motion, physical interactions, camera movements, and eventually interactive environments. This changes the role of video from a passive recording medium to an active generative medium. Future video models may support filmmaking, advertising, education, simulation, gaming, robotics, virtual production, digital twins, and scientific visualization. More importantly, they may serve as a basis for learning world models: systems that do not merely generate pixels, but learn how objects move, how agents act, how scenes evolve, and how physical and social dynamics unfold over time.

More concretely, our Visual Media agenda includes the following technical directions:

Image and video processing — Restoration, super-resolution, deblurring, denoising, low-light enhancement, and video quality improvement, with emphasis on input fidelity, perceptual quality, temporal stability, and real-world robustness.
Image and video generation — Diffusion, flow-based, transformer, and latent generative models for realistic, diverse, temporally coherent, and efficient visual synthesis.
Controllable generation — Synthesis conditioned on masks, sketches, depth, pose, layouts, camera paths, motion trajectories, reference images, identity, and style, toward precise and composable creative workflows.
Multimodal generation and editing — Joint reasoning over language, images, video, audio, and user feedback, including language-guided editing, reference-based generation, and interactive refinement.
Quality assessment and perceptual evaluation - Image, video, and multimodal quality assessment for restoration, enhancement, generation, and editing, covering fidelity, realism, aesthetics, temporal consistency, semantic correctness, instruction alignment, and human preference modeling.
Agentic media creation — AI systems that decompose visual tasks, select tools, plan editing steps, verify results, and iteratively improve outputs as creative assistants.
World models and generative simulation — Models that capture object permanence, spatial structure, physical interaction, causal dynamics, camera motion, and long-horizon scene evolution.
Diffusion theory and generative modeling foundations — Theory and empirics of sampling, controllability, stability, failure modes, and the interplay between generative priors and input fidelity.
Real-world deployment and creative applications — Mobile imaging, content creation, virtual production, education, cultural heritage, robotics simulation, and interactive AI tools, with attention to speed, latency, evaluation, and pipeline integration.

Our long-term ambition is to develop visual media systems that are faithful to real-world signals, controllable by humans, coherent over time, grounded in physical and semantic structure, and creative enough to expand the boundary of visual communication. By connecting media processing, editing, generation, multimodal intelligence, diffusion theory, agentic creation, and world modeling, we aim to build a unified research program for the next generation of visual intelligence.

Media Processing

Media Editing

Media Generation

In Cooperation With

Projects & Demo

People

Publications