Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

https://news.ycombinator.com/rss Hits: 26
Summary

A next-frame (or next-frame-section) prediction model takes many input frames and diffuses some new frames. The idea is to encode the input frames into a GPU memory layout in which each frame is allotted a different context length. (The original post illustrates this with a chart; the chart shows the logical GPU memory layout only, and the frame images are not actually stitched together.) Each frame is encoded with a different patchifying kernel to achieve this. For example, in HunyuanVideo, a 480p frame is roughly 1536 tokens with a (1, 2, 2) patchifying kernel; switching to a (2, 4, 4) kernel reduces the same frame to 192 tokens. In this way, the context length of each frame can be varied, and the "more important" frames are given more GPU resources (context length) - in this example, F0 is the most important frame because it is the nearest one to the "next-frame prediction" target. The result is O(1) computation complexity for streaming - a constant, not even O(n log n) or O(n).
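To make the token arithmetic concrete, below is a minimal Python sketch (not from the original post) that computes per-frame token counts for different patchifying kernels and shows why a decaying context schedule can keep total context bounded. The latent shape (2, 48, 64) is a hypothetical value chosen only because it reproduces the 1536- and 192-token figures quoted above; the real latent dimensions depend on the VAE and the resolution, and the kernel schedule shown is only one illustrative choice.

```python
import math

def tokens_per_frame(latent_shape, kernel):
    """Number of transformer tokens one frame contributes after patchifying.

    latent_shape: (t, h, w) of the frame's latent (hypothetical values below).
    kernel:       (pt, ph, pw) patchifying kernel; each patch becomes one token.
    """
    return math.prod(math.ceil(d / k) for d, k in zip(latent_shape, kernel))

# Hypothetical 480p latent, picked so the numbers match the summary above.
latent = (2, 48, 64)

print(tokens_per_frame(latent, (1, 2, 2)))  # 1536 tokens (most important frame)
print(tokens_per_frame(latent, (2, 4, 4)))  # 192 tokens (an older, compressed frame)

# Illustrative schedule: kernel volume doubles for each older frame, so the
# per-frame token counts form a geometric series and the total context stays
# below a constant bound no matter how many past frames are packed in.
kernels = [(1, 2, 2), (1, 2, 4), (1, 4, 4), (2, 4, 4), (2, 4, 8), (2, 8, 8)]
total = sum(tokens_per_frame(latent, k) for k in kernels)
print(total)  # 1536 + 768 + 384 + 192 + 96 + 48 = 3024, less than 2 * 1536
```

Under a schedule like this, the summed context length converges to at most about twice the cost of a single full-resolution frame, which is what makes the streaming cost a constant rather than growing with the number of input frames.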

First seen: 2025-04-19 14:20

Last seen: 2025-04-20 15:24