Captain's Log: Stardate 79317.7

In part 1 of this series of posts, I mentioned that I was surprised to find that naively running my GPU code on the CPU was only 5x slower, when I thought it would be 100x slower. In this post I will explain how I ended up making the CPU implementation much faster than the GPU one.

First approach: spot-vectorization

As mentioned in part 1, I got the original GPU code compiled for the CPU, and then wrote a simple driver to call into this code and run the simulation (in lieu of the code that set up and invoked the GPU kernel).

As you might imagine, Anukari, being a 3D physics simulation, does a lot of arithmetic on float3 vectors of the form {x, y, z} — that is, vectors of three 32-bit floats. So the first optimization I did was the simplest and most naive thing I could think of: implementing all of the float3 operations using SIMD intrinsics. I knew this wouldn't be optimal, but figured it would give me a sense of whether it was worth investing more work to design a CPU-specific solution.

Note that most of the time when one is dealing with float3 vectors, they are aligned in memory as if they were float4, in other words to 16-byte boundaries. So really you're working with vectors like {x, y, z, w}, even if the w component is not actually used.

For this experiment I used the 128-bit SIMD instructions offered by SSE on x86_64 processors and by NEON on arm64 processors. Because Anukari's float3 vectors are really float4 vectors with an ignored w component, it's extremely simple to implement basic arithmetic operations using SSE/NEON. In both cases, there's an instruction to load the float4 into a SIMD register, an instruction to do the arithmetic operation (such as add), and then an instruction to store the float4 register back into memory.
Thus, the Float3Add() function might look like this using SSE:

```cpp
__m128 p1 = _mm_load_ps(&position1.x);  // aligned load of {x, y, z, w}
__m128 p2 = _mm_load_ps(&position2.x);
__m128 d = _mm_add_ps(p2, p1);          // adds all four lanes at once
_mm_store_ps(&delta.x, d);              // aligned store back to memory
```
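For context, here is a fuller, compilable sketch of the same idea. The `Float3` struct and `Float3Add` signature below are illustrative assumptions, not Anukari's actual types; the point is that padding float3 to a 16-byte-aligned float4 lets a single SSE load/add/store handle the whole vector:

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86_64)

// Hypothetical float3 type, padded to four floats and 16-byte aligned
// so it maps directly onto a 128-bit SIMD register.
struct alignas(16) Float3 {
  float x, y, z, w;  // w is carried along but ignored by the simulation
};

Float3 Float3Add(const Float3& a, const Float3& b) {
  __m128 va = _mm_load_ps(&a.x);  // one aligned load per operand
  __m128 vb = _mm_load_ps(&b.x);
  Float3 out;
  _mm_store_ps(&out.x, _mm_add_ps(va, vb));  // add x, y, z (and w) at once
  return out;
}
```

Note that `_mm_load_ps`/`_mm_store_ps` require 16-byte alignment, which is exactly what the float4 padding guarantees; the unaligned variants (`_mm_loadu_ps`/`_mm_storeu_ps`) would work without it but can be slower.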