OASIS

Towards Redundancy Reduction in Diffusion Models
for Efficient Video Super-Resolution

Jinpei Guo1,2,   Yifei Ji2,   Zheng Chen2,   Yufei Wang3,   Sizhuo Ma3,   Yong Guo4,   Yulun Zhang2*†,   Jian Wang3†
1Carnegie Mellon University, 2Shanghai Jiao Tong University, 3Snap Inc., 4South China University of Technology
*Corresponding author,   †Equal advising

Real-World Degradations (upscale ×4)

Synthetic Degradations (upscale ×4)


OASIS is an efficient one-step diffusion model with attention
specialization for real-world video super-resolution.

Abstract

Diffusion models have recently shown promising results for video super-resolution (VSR). However, directly adapting generative diffusion models to VSR can result in redundancy, since low-quality videos already preserve substantial content information. Such redundancy leads to increased computational overhead and learning burden, as the model performs superfluous operations and must learn to filter out irrelevant information. To address this problem, we propose OASIS, an efficient one-step diffusion model with attention specialization for real-world video super-resolution. OASIS incorporates an attention specialization routing that assigns attention heads to different patterns according to their intrinsic behaviors. This routing mitigates redundancy while effectively preserving pretrained knowledge, allowing diffusion models to better adapt to VSR and achieve stronger performance. Moreover, we propose a simple yet effective progressive training strategy, which starts with temporally consistent degradations and then shifts to inconsistent settings. This strategy facilitates learning under complex degradations. Extensive experiments demonstrate that OASIS achieves state-of-the-art performance on both synthetic and real-world datasets. OASIS also provides superior inference speed, offering a 6.2× speedup over one-step diffusion baselines such as SeedVR2.
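To make the progressive training strategy concrete, here is a minimal, unofficial PyTorch sketch. The degradation model, parameter ranges, and the switch_step schedule are illustrative assumptions; only the idea of sharing one degradation across a clip early in training and re-sampling it per frame later comes from the description above.

import random
import torch
import torch.nn.functional as F

def sample_degradation_params():
    # Illustrative parameter ranges; a real pipeline uses a richer degradation model.
    return {"noise_std": random.uniform(0.01, 0.1),
            "downscale": random.choice([2, 4])}

def apply_degradation(frame, params):
    # Toy degradation: bicubic downscaling followed by additive Gaussian noise.
    # frame: (C, H, W) tensor with values in [0, 1].
    lr = F.interpolate(frame.unsqueeze(0),
                       scale_factor=1.0 / params["downscale"],
                       mode="bicubic", align_corners=False).squeeze(0)
    return (lr + params["noise_std"] * torch.randn_like(lr)).clamp(0.0, 1.0)

def degrade_clip(frames, step, switch_step=20_000):
    # Progressive schedule: before switch_step, every frame in the clip shares one
    # parameter set (temporally consistent); afterwards, parameters are re-sampled
    # per frame (temporally inconsistent), which is harder to restore consistently.
    if step < switch_step:
        shared = sample_degradation_params()
        return [apply_degradation(f, shared) for f in frames]
    return [apply_degradation(f, sample_degradation_params()) for f in frames]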

Method

OASIS incorporates an attention specialization routing (ASR) to mitigate redundancy while improving performance. Given an input low-resolution video, we first map it into the latent space via pixel-unshuffle and then process it with a diffusion transformer equipped with ASR. ASR divides attention heads into global, intra-frame, and window groups to capture complementary contexts, with the grouping determined by the KL divergence between each head's localized and global attention distributions. The group outputs are concatenated into an aggregated feature, and a VAE decoder reconstructs the video from the restored latent.
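The head grouping can be illustrated with a small, unofficial sketch (shapes, the threshold tau, and helper names such as assign_heads are assumptions, not from the paper): for each head, attention restricted to a local window or to its own frame is compared against the head's full global attention via KL divergence, and heads whose localized attention loses little information are routed to the cheaper window or intra-frame pattern.

import torch
import torch.nn.functional as F

def attn_probs(q, k, mask=None):
    # Softmax attention distribution for one head; q, k: (tokens, dim).
    logits = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
    if mask is not None:
        logits = logits.masked_fill(~mask, float("-inf"))
    return F.softmax(logits, dim=-1)

def kl_to_global(p_local, p_global, eps=1e-8):
    # Mean KL(local || global) over query tokens; a small value means the head's
    # behavior is already well captured by the restricted attention pattern.
    return (p_local * ((p_local + eps).log() - (p_global + eps).log())).sum(-1).mean()

def assign_heads(q, k, frame_mask, window_mask, tau=0.1):
    # q, k: (heads, tokens, dim); masks: (tokens, tokens) booleans marking which key
    # tokens each query may attend to (same frame, or a local spatio-temporal window).
    groups = []
    for h in range(q.shape[0]):
        p_global = attn_probs(q[h], k[h])
        kl_window = kl_to_global(attn_probs(q[h], k[h], window_mask), p_global)
        kl_frame = kl_to_global(attn_probs(q[h], k[h], frame_mask), p_global)
        if kl_window < tau:          # a local window already explains this head
            groups.append("window")
        elif kl_frame < tau:         # intra-frame attention suffices
            groups.append("frame")
        else:                        # head genuinely needs global context
            groups.append("global")
    return groups

In a setup like this, the assignment could be computed once from the pretrained model's attention statistics and then fixed, so most heads run the cheaper localized attention at inference.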

Comparison with SOTA

Qualitative Comparison

Qualitative comparison across five example slides: LQ input, VEnhancer, STAR, SeedVR, SeedVR2, and OASIS (ours).

BibTeX

@article{guo2025towards,
    title={Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution},
    author={Guo, Jinpei and Ji, Yifei and Chen, Zheng and Wang, Yufei and Ma, Sizhuo and Guo, Yong and Zhang, Yulun and Wang, Jian},
    journal={arXiv preprint arXiv:2509.23980},
    year={2025}
}