Research direction

Visual Media

Processing, Editing, and Generating the Visual World


Images and videos are no longer only records of the physical world. They are becoming the primary interface through which people communicate, create, simulate, learn, and interact with intelligent systems. At the same time, visual content is entering a new stage: it must be captured under imperfect real-world conditions, restored with high fidelity, edited with precise human control, generated with semantic and physical consistency, and eventually used as a medium for modeling dynamic worlds. This makes Visual Media a foundational research area that connects low-level vision, generative modeling, multimodal intelligence, creative tools, and real-world AI applications.

Our vision is to build the next generation of visual media technologies along three tightly connected pillars: Media Processing, Media Editing, and Media Generation. These are not separate topics, but different levels of the same problem. Media Processing asks how to recover, enhance, and understand imperfect visual signals. Media Editing asks how to modify existing visual content while preserving identity, structure, style, and user intent. Media Generation asks how to synthesize new visual worlds from language, reference images, layouts, videos, and multimodal conditions. Together, they form a complete pipeline from visual signal to controllable creation.

Media Processing

Media Processing remains a fundamental layer of Visual Media research. Real-world images and videos are often degraded by noise, blur, compression, low resolution, poor lighting, motion, weather, sensor limitations, and temporal instability. These degradations are not merely cosmetic problems. They affect downstream perception, creative reuse, scientific analysis, cultural preservation, autonomous systems, and human communication. High-quality restoration and enhancement therefore serve as the entry point for trustworthy visual intelligence. Our work in this direction studies how to combine classical signal fidelity, learned generative priors, temporal coherence, perceptual quality, and real-world robustness. The goal is to move beyond benchmark-specific restoration toward systems that are fast, faithful, controllable, and useful in real deployment scenarios, from mobile imaging and video streaming to archival restoration and professional production.

[Figure: degraded input (INPUT) and restored output (RESTORED)]

Media Editing

Media Editing is the bridge between understanding and creation. Editing is generation under strong constraints: the system must understand what the user wants to change, what must remain untouched, and how the edit should respect the original image or video. This requires precise control over geometry, identity, lighting, motion, texture, style, and temporal consistency. It also requires interfaces that allow humans to specify intent through language, examples, sketches, masks, timelines, or multimodal instructions. We view editing as a key step toward practical visual intelligence because most real creative workflows are not purely generative. They are iterative, conditional, and human-in-the-loop. A powerful visual media system should not only produce beautiful content from scratch, but also revise, repair, extend, localize, and refine existing content with professional-level control.
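The constraint that an edit must leave the rest of the image untouched can be illustrated with simple mask-based compositing. The sketch below is only an illustration under stated assumptions, not a description of any system from this page: images are small grayscale grids stored as nested lists of floats in [0, 1], the mask holds 1 where the edit applies and 0 where the original must be preserved, and the function name `composite` is hypothetical.

```python
def composite(original, edited, mask):
    """Blend `edited` into `original`: out = mask * edited + (1 - mask) * original.

    All three arguments are equally sized 2D lists of floats (hypothetical layout).
    """
    if not (len(original) == len(edited) == len(mask)):
        raise ValueError("original, edited, and mask must have the same height")
    out = []
    for o_row, e_row, m_row in zip(original, edited, mask):
        if not (len(o_row) == len(e_row) == len(m_row)):
            raise ValueError("original, edited, and mask must have the same width")
        # Per-pixel linear blend; mask values between 0 and 1 give a soft boundary.
        out.append([m * e + (1.0 - m) * o for o, e, m in zip(o_row, e_row, m_row)])
    return out


original = [[0.2, 0.4], [0.6, 0.8]]
edited = [[1.0, 1.0], [1.0, 1.0]]
mask = [[0.0, 1.0], [0.0, 0.5]]
result = composite(original, edited, mask)
```

Pixels with a mask value of 0 keep their original value exactly, pixels with a mask value of 1 take the edited value, and intermediate values blend the two. Practical editing systems need far more than this (for example, harmonizing lighting and texture across the boundary), which is part of what makes editing a research problem.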

[Figure: input with mask (INPUT + MASK) and edited result (EDITED)]

Media Generation

Media Generation represents a deeper transformation. Image and video generation are becoming a new form of media, not just a tool for producing assets. Video generation, in particular, is moving toward the ability to synthesize temporally coherent scenes, persistent characters, realistic motion, physical interactions, camera movements, and eventually interactive environments. This changes the role of video from a passive recording medium to an active generative medium. Future video models may support filmmaking, advertising, education, simulation, gaming, robotics, virtual production, digital twins, and scientific visualization. More importantly, they may serve as a basis for learning world models: systems that do not merely generate pixels, but learn how objects move, how agents act, how scenes evolve, and how physical and social dynamics unfold over time.

[Figure: image generation example]

More concretely, our Visual Media agenda includes the following technical directions:

  • Image and video processing — Restoration, super-resolution, deblurring, denoising, low-light enhancement, and video quality improvement, with emphasis on input fidelity, perceptual quality, temporal stability, and real-world robustness.
  • Image and video generation — Diffusion, flow-based, transformer, and latent generative models for realistic, diverse, temporally coherent, and efficient visual synthesis.
  • Controllable generation — Synthesis conditioned on masks, sketches, depth, pose, layouts, camera paths, motion trajectories, reference images, identity, and style, toward precise and composable creative workflows.
  • Multimodal generation and editing — Joint reasoning over language, images, video, audio, and user feedback, including language-guided editing, reference-based generation, and interactive refinement.
  • Quality assessment and perceptual evaluation — Image, video, and multimodal quality assessment for restoration, enhancement, generation, and editing, covering fidelity, realism, aesthetics, temporal consistency, semantic correctness, instruction alignment, and human preference modeling.
  • Agentic media creation — AI systems that decompose visual tasks, select tools, plan editing steps, verify results, and iteratively improve outputs as creative assistants.
  • World models and generative simulation — Models that capture object permanence, spatial structure, physical interaction, causal dynamics, camera motion, and long-horizon scene evolution.
  • Diffusion theory and generative modeling foundations — Theory and empirics of sampling, controllability, stability, failure modes, and the interplay between generative priors and input fidelity.
  • Real-world deployment and creative applications — Mobile imaging, content creation, virtual production, education, cultural heritage, robotics simulation, and interactive AI tools, with attention to speed, latency, evaluation, and pipeline integration.
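Several of the directions above refer to input fidelity. One classical way to quantify it is peak signal-to-noise ratio (PSNR), computed from the mean squared error between a reference and a restored signal. The sketch below is a minimal illustration, assuming pixel values are flattened into lists of floats in [0, 1]. The function name and sample values are hypothetical, and PSNR captures only signal fidelity, not realism, aesthetics, or human preference.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if not reference or len(reference) != len(restored):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [0.0, 0.0, 0.0, 0.0]
restored = [0.1, 0.1, 0.1, 0.1]
score = psnr(reference, restored)
```

Here every pixel is off by 0.1, so the mean squared error is about 0.01 and the score is approximately 20 dB. A higher score means the restored signal is numerically closer to the reference, which is why fidelity metrics like this are usually reported alongside perceptual measures rather than in place of them.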

Our long-term ambition is to develop visual media systems that are faithful to real-world signals, controllable by humans, coherent over time, grounded in physical and semantic structure, and creative enough to expand the boundary of visual communication. By connecting media processing, editing, generation, multimodal intelligence, diffusion theory, agentic creation, and world modeling, we aim to build a unified research program for the next generation of visual intelligence.

Projects & Demo

VOID

Video object and interaction deletion — INSAIT × Netflix.

Apr 8, 2026

Publications

2026 · ICML

Enhanced Latent-Space Adversarial Training for Super-Resolution

Liangbin Xie, Zheyuan Li, Fanghua Yu, Xinqi Lin, Jun-hao Zhuang, Jinfan Hu, Jinjin Gu, Jiantao Zhou, Chao Dong

2026 · CVPR

Universal Computational Aberration Correction: A Comprehensive Benchmark Analysis

Xiaolong Qian, Qi Jiang, Yao Gao, Lei Sun, Zhonghua Yi, Kailun Yang, Luc Van Gool, Kaiwei Wang

2026 · CVPR

Rethinking Image Evaluation in Super-Resolution

Shaolin Su, Josep M. Rocafort, David Serrano-Lozano, Lei Sun, Danna Xue, Javier Vazquez-Corral

2026 · CVPR

PhotoFramer: Multi-modal Image Composition Instruction

Zhiyuan You, Ke Wang, He Zhang, Xin Cai, Jinjin Gu, Tianfan Xue, Chao Dong, Zhoutong Zhang

2026 · CVPR

Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal

Xiaolong Qian, Qi Jiang, Lei Sun, Zongxi Yu, Kailun Yang, Peixuan Wu, Jiacheng Zhou, Yao Gao, Yaoguang Ma, Ming-Hsuan Yang, Kaiwei Wang

2026 · CVPR

Intelligent Photo Retouching with Language Model-Based Artist Agents

Haoyu Chen, Keda Tao, Yizao Wang, Xinlei Wang, Lei Zhu, Jinjin Gu

2026 · CVPR

How Far Have We Gone in Generative Image Restoration? A Study on Its Capability, Limitations and Evaluation Practices

Xiang Yin, Jinfan Hu, Zhiyuan You, Kainan Yan, Yu Tang, Chao Dong, Jinjin Gu

2026 · ICLR

Rethinking Expressivity and Degradation-Awareness in Attention for All-in-One Blind Image Restoration

Bin Ren, Runyi Yang, Qi Ma, Xu Zheng, Mengyuan Liu, Danda Pani Paudel, Luc Van Gool, Rita Cucchiara, Nicu Sebe

2026 · ICLR

Efficient Degradation-agnostic Image Restoration via Channel-Wise Functional Decomposition and Manifold Regularization

Bin Ren, Yawei Li, Xu Zheng, Yuqian Fu, Danda Pani Paudel, Hong Liu, Ming-Hsuan Yang, Luc Van Gool, Nicu Sebe

2026 · WACV

Do generative video models understand physical principles?

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, Robert Geirhos

2026 · PR

Revisiting the Generalization Problem of Low-level Vision Models Through the Lens of Image Deraining

Jinfan Hu, Zhiyuan You, Jinjin Gu, Kaiwen Zhu, Tianfan Xue, Chao Dong

2026 · TMM

Density-Aware Video Desnowing with Robust Alignment on a Large-Scale Dataset

Haoyu Chen, Jingjing Ren, Jiaxing Shen, Sixiang Chen, Jinjin Gu, Ping Tan, Lei Zhu

2025 · CVIU

When super-resolution meets camouflaged object detection: A comparison study

Juan Wen, Shupeng Cheng, Weiyan Hou, Luc Van Gool, Radu Timofte

2025 · TPAMI

Test-Time Training for Hyperspectral Image Super-Resolution

Ke Li, Luc Van Gool, Dengxin Dai

2025 · TPAMI

Spatial-Temporal Graph Mamba for Music-Guided Dance Video Synthesis

Hao Tang, Ling Shao, Zhenyu Zhang, Luc Van Gool, Nicu Sebe

2025 · SIGGRAPH Asia

Harnessing Diffusion-Yielded Score Priors for Image Restoration

Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S. Ren, Jinjin Gu, Chao Dong

2025 · TPAMI

Enhanced Multi-Scale Cross-Attention for Person Image Generation

Hao Tang, Ling Shao, Nicu Sebe, Luc Van Gool

2025 · ICCV

Low-Light Image Enhancement using Event-Based Illumination Estimation

Lei Sun, Yuhan Bao, Jiajun Zhai, Jingyun Liang, Yulun Zhang, Kaiwei Wang, Danda Pani Paudel, Luc Van Gool

2025 · CVPRW

The Tenth NTIRE 2025 Image Denoising Challenge Report

Lei Sun, Hang Guo, Bin Ren, Luc Van Gool, Radu Timofte, Yawei Li

2025 · CVPR

Complexity Experts are Task-Discriminative Learners for Any Image Restoration

Eduard Zamfir, Zongwei Wu, Nancy Mehta, Yuedong Tan, Danda Pani Paudel, Yulun Zhang, Radu Timofte

2024 · NeurIPS

Sharing Key Semantics in Transformer Makes Efficient Image Restoration

Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Ming-Hsuan Yang, Nicu Sebe

2024 · TPAMI

A Unified Framework for Event-Based Frame Interpolation With Ad-Hoc Deblurring in the Wild

Lei Sun, Daniel Gehrig, Christos Sakaridis, Mathias Gehrig, Jingyun Liang, Peng Sun, Zhijie Xu, Kaiwei Wang, Luc Van Gool

2024 · ECCV

MoVideo: Motion-Aware Video Generation with Diffusion Models

Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, Rakesh Ranjan

2024 · ICML

Lightweight Image Super-Resolution via Flexible Meta Pruning

Yulun Zhang, Kai Zhang, Luc Van Gool, Martin Danelljan, Fisher Yu

2024 · CVPRW

Towards Online Real-Time Memory-based Video Inpainting Transformers

Guillaume Thiry, Hao Tang, Radu Timofte, Luc Van Gool

2024 · CVPR

Real-World Mobile Image Denoising Dataset with Efficient Baselines

Roman Flepp, Andrey Ignatov, Radu Timofte, Luc Van Gool

2024 · CVPR

ExtDM: Dual Distribution Extrapolation Diffusion Model for Video Prediction

Zhicheng Zhang, Junyao Hu, Wentao Cheng, Danda Paudel, Jufeng Yang

2024 · CVPR

Deep Equilibrium Diffusion Restoration with Parallel Sampling

Jiezhang Cao, Yue Shi, Kai Zhang, Yulun Zhang, Radu Timofte, Luc Van Gool

2023 · ICCV

PATMAT: Person Aware Tuning of Mask-Aware Transformer for Face inpainting

Saman Motamed, Jianjin Xu, Chen Henry Wu, Christian Hane, Jean-Charles Bazin, Fernando de la Torre