Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

https://news.ycombinator.com/rss Hits: 26
Summary

A next-frame (or next-frame-section) prediction model takes many input frames and diffuses some new frames. The idea is to encode the input frames into a GPU memory layout in which each frame is allotted a different context length. (The original post illustrates this with a chart; the chart shows the logical GPU memory layout only, and the frame images are not actually stitched together.) Each frame is encoded with a different patchifying kernel to achieve this. For example, in HunyuanVideo, a 480p frame is roughly 1536 tokens with a (1, 2, 2) patchifying kernel; switching to a (2, 4, 4) kernel reduces the same frame to 192 tokens. In this way, the context length of each frame can be varied, and the "more important" frames are given more GPU resources (context length) - in this example, F0 is the most important frame because it is the nearest one to the "next-frame prediction" target. The result is O(1) computation complexity for streaming - a constant, not even O(n log n) or O(n).
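To make the token arithmetic concrete, below is a minimal Python sketch (not from the original post) that computes per-frame token counts for different patchifying kernels and shows why a decaying context schedule can keep total context bounded. The latent shape (2, 48, 64) is a hypothetical value chosen only because it reproduces the 1536- and 192-token figures quoted above; the real latent dimensions depend on the VAE and the resolution, and the kernel schedule shown is only one illustrative choice.

```python
import math

def tokens_per_frame(latent_shape, kernel):
    """Number of transformer tokens one frame contributes after patchifying.

    latent_shape: (t, h, w) of the frame's latent (hypothetical values below).
    kernel:       (pt, ph, pw) patchifying kernel; each patch becomes one token.
    """
    return math.prod(math.ceil(d / k) for d, k in zip(latent_shape, kernel))

# Hypothetical 480p latent, picked so the numbers match the summary above.
latent = (2, 48, 64)

print(tokens_per_frame(latent, (1, 2, 2)))  # 1536 tokens (most important frame)
print(tokens_per_frame(latent, (2, 4, 4)))  # 192 tokens (an older, compressed frame)

# Illustrative schedule: kernel volume doubles for each older frame, so the
# per-frame token counts form a geometric series and the total context stays
# below a constant bound no matter how many past frames are packed in.
kernels = [(1, 2, 2), (1, 2, 4), (1, 4, 4), (2, 4, 4), (2, 4, 8), (2, 8, 8)]
total = sum(tokens_per_frame(latent, k) for k in kernels)
print(total)  # 1536 + 768 + 384 + 192 + 96 + 48 = 3024, less than 2 * 1536
```

Under a schedule like this, the summed context length converges to at most about twice the cost of a single full-resolution frame, which is what makes the streaming cost a constant rather than growing with the number of input frames.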

First seen: 2025-04-19 14:20

Last seen: 2025-04-20 15:24