Captain's Log: Stardate 79317.7

In part 1 of this series of posts, I mentioned that I was surprised to find that naively running my GPU code on the CPU was only 5x slower, when I thought it would be 100x slower. In this post I will explain how I ended up making the CPU implementation much faster than the GPU one.

First approach: spot-vectorization

As mentioned in part 1, I got the original GPU code compiled for the CPU, and then wrote a simple driver to call into this code and run the simulation (in lieu of the code that set up and invoked the GPU kernel).

As you might imagine, Anukari, being a 3D physics simulation, does a lot of arithmetic on float3 vectors of the form {x, y, z} — that is, vectors of three 32-bit floats. So the first optimization I did was the simplest and most naive thing I could think of: implementing all of the float3 operations using SIMD intrinsics. I knew this wouldn't be optimal, but figured it would give me a sense of whether it was worth investing more work to design a CPU-specific solution.

Note that most of the time when one is dealing with float3 vectors, they are aligned in memory as if they were float4, in other words to 16-byte boundaries. So really you're working with vectors like {x, y, z, w}, even if the w component is not actually used.

For this experiment I used the 128-bit SIMD instructions offered by SSE on x86_64 processors and by NEON on arm64 processors. Because Anukari's float3 vectors are really float4 vectors with an ignored w component, it's extremely simple to implement basic arithmetic operations using SSE/NEON. In both cases, there's an instruction to load the float4 into a SIMD register, an instruction to do the arithmetic operation (such as add), and then an instruction to store the float4 register back into memory.
Thus, the Float3Add() function might look like this using SSE:

```cpp
__m128 p1 = _mm_load_ps(&position1.x);  // aligned load of {x, y, z, w}
__m128 p2 = _mm_load_ps(&position2.x);
__m128 d = _mm_add_ps(p2, p1);          // adds all four lanes at once
_mm_store_ps(&delta.x, d);              // aligned store back to memory
```
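For context, here is a fuller, compilable sketch of the same idea. The `Float3` struct and `Float3Add` signature below are illustrative assumptions, not Anukari's actual types; the point is that padding float3 to a 16-byte-aligned float4 lets a single SSE load/add/store handle the whole vector:

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86_64)

// Hypothetical float3 type, padded to four floats and 16-byte aligned
// so it maps directly onto a 128-bit SIMD register.
struct alignas(16) Float3 {
  float x, y, z, w;  // w is carried along but ignored by the simulation
};

Float3 Float3Add(const Float3& a, const Float3& b) {
  __m128 va = _mm_load_ps(&a.x);  // one aligned load per operand
  __m128 vb = _mm_load_ps(&b.x);
  Float3 out;
  _mm_store_ps(&out.x, _mm_add_ps(va, vb));  // add x, y, z (and w) at once
  return out;
}
```

Note that `_mm_load_ps`/`_mm_store_ps` require 16-byte alignment, which is exactly what the float4 padding guarantees; the unaligned variants (`_mm_loadu_ps`/`_mm_storeu_ps`) would work without it but can be slower.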