Native Sparse Attention

Source: https://news.ycombinator.com/rss (Hits: 8)
Summary

@inproceedings{yuan-etal-2025-native,
    title = "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention",
    author = "Yuan, Jingyang and Gao, Huazuo and Dai, Damai and Luo, Junyu and Zhao, Liang and Zhang, Zhengyan and Xie, Zhenda and Wei, Yuxing and Wang, Lean and Xiao, Zhiping and Wang, Yuqing and Ruan, Chong and Zhang, Ming and Liang, Wenfeng and Zeng, Wangding",
    editor = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1126/",
    pages = "23078--23097",
    ISBN = "979-8-89176-251-0",
    abstract = "Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trained Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, experiments show the model pretrained with NSA maintains or exceeds Full Attention models across general benchmarks, long-context tasks, and i..."
}
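
The "dynamic hierarchical sparse strategy" in the abstract pairs coarse-grained token compression (scoring blocks of the context cheaply) with fine-grained token selection (exact attention over only the tokens of the highest-scoring blocks). The following is a minimal, single-head PyTorch sketch of that idea only: it is not the paper's hardware-aligned kernel, it omits NSA's sliding-window branch and the gated combination of branches, and names such as block_size and top_k_blocks are illustrative assumptions rather than the paper's hyperparameters.

# Toy sketch of compression-then-selection sparse attention (not NSA's kernel).
import torch
import torch.nn.functional as F


def hierarchical_sparse_attention(q, k, v, block_size=16, top_k_blocks=4):
    """Sparse attention for one query vector q (d,) over context k, v (T, d)."""
    T, d = k.shape
    n_blocks = (T + block_size - 1) // block_size

    # Coarse-grained compression: mean-pool each key block into one coarse token.
    # (Zero-padding the last block slightly dilutes its mean; fine for a sketch.)
    pad = n_blocks * block_size - T
    k_pad = F.pad(k, (0, 0, 0, pad))
    k_blocks = k_pad.view(n_blocks, block_size, d).mean(dim=1)    # (n_blocks, d)

    # Score blocks against the query and keep the most relevant ones.
    block_scores = k_blocks @ q / d**0.5                          # (n_blocks,)
    top = torch.topk(block_scores, k=min(top_k_blocks, n_blocks)).indices

    # Fine-grained selection: gather the original tokens of the chosen blocks.
    token_idx = (top[:, None] * block_size + torch.arange(block_size)).flatten()
    token_idx = token_idx[token_idx < T]                          # drop padding slots
    k_sel, v_sel = k[token_idx], v[token_idx]

    # Exact attention restricted to the selected tokens.
    attn = torch.softmax(k_sel @ q / d**0.5, dim=0)               # (n_selected,)
    return attn @ v_sel                                           # (d,)


if __name__ == "__main__":
    torch.manual_seed(0)
    T, d = 128, 64
    q, k, v = torch.randn(d), torch.randn(T, d), torch.randn(T, d)
    print(hierarchical_sparse_attention(q, k, v).shape)  # torch.Size([64])

In the sketch, block relevance is scored by dotting the query with mean-pooled keys; the paper derives block importance differently (from its compression branch) and fuses the branches with learned gates, so this is only meant to make the coarse-to-fine structure concrete.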

First seen: 2025-08-02 03:12

Last seen: 2025-08-02 10:13