Coding Neon Kernels for the Cortex-A53

Summary

Some weeks ago, I presented at FOSDEM my work-in-progress high performance SDR runtime qsdr. I showed a hand-written NEON assembly implementation of a kernel that computes \(y[n] = ax[n] + b\), which I used as the basic math block for benchmarks on a Kria KV260 board (which has a quad-core ARM Cortex-A53 at 1.33 GHz). In that talk I glossed over the details of how I implemented this NEON kernel. There are enough tricks and considerations that I could make a full talk just out of explaining how to write this kernel. This will be the topic for this post.

Note: this post assumes familiarity with the aarch64 assembly syntax, particularly with the way that NEON registers are denoted depending on the context. For example, you should understand that v0.4s and q0 refer to the same 128-bit NEON register, and that v0.2s and d0 denote the 64 LSBs of this same register. It might be worth reviewing the syntax if this doesn't immediately make sense to you.

Cortex-A53 characteristics

Something peculiar about the Cortex-A53 is that documentation explaining its instruction timing is not publicly available. There is, for instance, a Cortex-A57 Software Optimization Guide, but the equivalent document for the Cortex-A53 is not available. I believe that it exists, but that it is only available under an NDA. Therefore, all the wisdom about how to optimize code for the Cortex-A53 is folklore coming from people who have either reverse-engineered instruction timings by performing micro-benchmarks or had access to this NDA documentation.

Here are some links that I have found quite useful when starting to understand how the Cortex-A53 works and what tricks can be used:

Nevertheless, this documentation has not been sufficient to fully understand how this CPU works. I have written many instruction timing micro-benchmarks to verify the claims in these links and to test how other combinations of instructions behave.
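As a point of reference for the operation itself, here is a minimal C sketch of \(y[n] = ax[n] + b\). It is not the hand-written assembly kernel this post is about. The function name `affine_f32` is hypothetical, and single-precision `float` samples are an assumption (chosen so that four samples fill one 128-bit register, as in the v0.4s notation). On aarch64 it uses NEON intrinsics from `arm_neon.h`, and elsewhere it falls back to a scalar loop, which is handy as a correctness check for a hand-written kernel.

```c
#include <stddef.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical reference helper (not from the post): y[n] = a * x[n] + b.
 * Assumes float samples; x and y must each hold at least len elements. */
void affine_f32(float *y, const float *x, size_t len, float a, float b)
{
    size_t n = 0;

#if defined(__aarch64__) && defined(__ARM_NEON)
    /* Broadcast the scalars a and b to all four 32-bit lanes. */
    const float32x4_t va = vdupq_n_f32(a);
    const float32x4_t vb = vdupq_n_f32(b);

    /* Process four samples (one 128-bit register) per iteration. */
    for (; n + 4 <= len; n += 4) {
        float32x4_t vx = vld1q_f32(x + n);
        /* vfmaq_f32(acc, p, q) computes acc + p * q in each lane. */
        vst1q_f32(y + n, vfmaq_f32(vb, va, vx));
    }
#endif

    /* Scalar path: handles the tail on aarch64, and everything elsewhere. */
    for (; n < len; n++) {
        y[n] = a * x[n] + b;
    }
}
```

For example, calling `affine_f32(y, x, 10, 2.0f, 1.0f)` with x[n] = n fills y with 2n + 1. A compiler-generated loop like this says nothing about how instructions are scheduled on the Cortex-A53, which is where the hand-written kernel comes in.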
The Cortex-A53 is an in-order execution CPU with partial dual-issue capabil...

First seen: 2025-04-21 17:36

Last seen: 2025-04-21 18:36