
Context Forcing: Consistent Autoregressive Video Generation with Long Context

Paper | Project Page

Official implementation of Context Forcing: Consistent Autoregressive Video Generation with Long Context

Shuo Chen*, Cong Wei*, Sun Sun, Ping Nie, Kai Zhou, Ge Zhang, Ming-Hsuan Yang, Wenhu Chen

Abstract: Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical student-teacher mismatch: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose Context Forcing, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a Slow-Fast Memory architecture, significantly reducing visual redundancy. Extensive results demonstrate that our method enables effective context lengths exceeding 20 seconds—$2\text{--}10\times$ longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.

Training paradigms for AR video diffusion models

(a) Self-Forcing: the student matches a teacher that can generate only 5s of video, using a 5s self-rollout. (b) LongLive: the student performs long rollouts supervised by a memoryless 5s teacher on random chunks; the teacher's inability to see beyond its 5s window creates a student-teacher mismatch. (c) Context Forcing (ours): the student is supervised by a long-context teacher aware of the full generation history, resolving the mismatch in (b).
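To make the contrast between (b) and (c) concrete, here is a minimal sketch of the two supervision signals. It is not taken from this repository: `teacher.denoise`, `rollout.frames_before`, and `rollout.frames_between` are hypothetical placeholder names, and the actual contextual DMD objective is defined in the paper.

```python
import torch

CHUNK_SECONDS = 5  # supervision window, matching the 5s teachers above

def memoryless_supervision(teacher, rollout):
    """(b) LongLive-style: the teacher scores a random 5s chunk with no history."""
    t0 = torch.randint(0, rollout.num_seconds - CHUNK_SECONDS + 1, (1,)).item()
    chunk = rollout.frames_between(t0, t0 + CHUNK_SECONDS)
    # The teacher never sees anything generated before t0, capping what it can teach.
    return teacher.denoise(chunk, context=None)

def long_context_supervision(teacher, rollout, t0):
    """(c) Context Forcing: the teacher conditions on the full generation history."""
    history = rollout.frames_before(t0)  # everything the student has generated so far
    chunk = rollout.frames_between(t0, t0 + CHUNK_SECONDS)
    # Teacher and student now share the same long context, removing the mismatch in (b).
    return teacher.denoise(chunk, context=history)
```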

Context Forcing and Context Management System

We use the KV cache as the context memory and organize it into three parts: a sink, slow memory, and fast memory. During contextual DMD training, the long teacher supervises the long student through the same context memory mechanism.
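The following is only a minimal sketch of this sink / slow / fast layout under illustrative assumptions: the chunk budgets and the strided token subsampling used when a chunk is demoted from fast to slow memory are placeholders, and `SlowFastKVCache` is a hypothetical name rather than the released API.

```python
from collections import deque

import torch

class SlowFastKVCache:
    """Toy context memory: sink + slow memory + fast memory (illustrative only)."""

    def __init__(self, sink_chunks=1, fast_chunks=8, slow_stride=4):
        self.sink = []          # earliest chunk(s), kept at full resolution
        self.slow = []          # compressed long-range history
        self.fast = deque()     # most recent chunks, kept at full resolution
        self.sink_chunks = sink_chunks
        self.fast_chunks = fast_chunks
        self.slow_stride = slow_stride  # keep 1 of every `slow_stride` tokens

    def append(self, k, v):
        """Add the KV pair of one newly generated chunk; k, v: (batch, tokens, dim)."""
        if len(self.sink) < self.sink_chunks:
            self.sink.append((k, v))
            return
        self.fast.append((k, v))
        while len(self.fast) > self.fast_chunks:
            old_k, old_v = self.fast.popleft()
            # Reduce visual redundancy before demoting the chunk to slow memory,
            # so the context grows sublinearly instead of linearly.
            self.slow.append((old_k[:, ::self.slow_stride],
                              old_v[:, ::self.slow_stride]))

    def context(self):
        """Concatenate sink + slow + fast along the token axis for attention."""
        parts = self.sink + self.slow + list(self.fast)
        ks = torch.cat([k for k, _ in parts], dim=1)
        vs = torch.cat([v for _, v in parts], dim=1)
        return ks, vs
```

Because the teacher and the student read from the same cache during contextual DMD training, supervision and generation condition on identical context.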

Project Updates

  • 🔥🔥 News: 2026/2/5: arXiv paper and project page released.

Todo List

  • [ ] Open-source inference code and checkpoints. Refactoring the codebase; it will be open-sourced shortly.
  • [ ] Open-source training code.

Acknowledgement

We would like to thank the following works for their exceptional efforts.

Citation

If you find this codebase useful for your research, please kindly cite our paper:

@misc{chen2026contextforcingconsistentautoregressive,
      title={Context Forcing: Consistent Autoregressive Video Generation with Long Context}, 
      author={Shuo Chen and Cong Wei and Sun Sun and Ping Nie and Kai Zhou and Ge Zhang and Ming-Hsuan Yang and Wenhu Chen},
      year={2026},
      eprint={2602.06028},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.06028}, 
}
