Going Faster Than Memcpy

https://news.ycombinator.com/rss
Summary

While profiling Shadesmar a couple of weeks ago, I noticed that for large binary unserialized messages (>512kB), most of the execution time is spent copying the message (using memcpy) from process memory to shared memory and back. I had a few hours to kill last weekend, so I tried to implement a faster way to do memory copies.

Autopsy of memcpy

Here’s the perf dump when running pub-sub for messages of sizes between 512kB and 2MB:

      Children    Self     Shared Object    Symbol
    + 99.86%      0.00%    libc-2.27.so     [.] __libc_start_main
    + 99.86%      0.00%    [unknown]        [k] 0x4426258d4c544155
    + 99.84%      0.02%    raw_benchmark    [.] main
    + 98.13%      97.12%   libc-2.27.so     [.] __memmove_avx_unaligned_erms
    + 51.99%      0.00%    raw_benchmark    [.] shm::PublisherBin<16u>::publish
    + 51.98%      0.01%    raw_benchmark    [.] shm::Topic<16u>::write
    + 47.64%      0.01%    raw_benchmark    [.] shm::Topic<16u>::read

__memmove_avx_unaligned_erms is an implementation of memcpy for unaligned memory blocks that uses AVX to copy 32 bytes at a time. Digging into the glibc source code, I found this:

    #if IS_IN (libc)
    # define VEC_SIZE 32
    # define VEC(i) ymm##i
    # define VMOVNT vmovntdq
    # define VMOVU vmovdqu
    # define VMOVA vmovdqa
    # define SECTION(p) p##.avx
    # define MEMMOVE_SYMBOL(p,s) p##_avx_##s
    # include "memmove-vec-unaligned-erms.S"
    #endif

Breaking down this function:

memmove: glibc implements memcpy as a memmove instead. Here’s the relevant source code:

    # define SYMBOL_NAME memcpy
    # include "ifunc-memmove.h"
    libc_ifunc_redirected (__redirect_memcpy, __new_memcpy, IFUNC_SELECTOR ());

Here’s the difference between the two: with memcpy, the destination cannot overlap the source at all; with memmove, it can. Initially, I wasn’t sure why it was implemented as memmove. The reason for this will become clearer as the post proceeds.

erms: Enhanced Rep Movs is a hardware optimization for a loop that does a simple copy.
In simple pseudo-code, this is what the loop implementation looks like for copying a single byte at a tim...
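
A byte-at-a-time copy loop of the kind ERMS is said to accelerate could be sketched in C as follows. This is an illustrative sketch, not the listing from the original post; the name copy_bytes is hypothetical, and like memcpy it assumes the two regions do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from glibc or the post): copy n bytes from
 * src to dst one byte per iteration. Assumes the regions do not overlap,
 * which is the same contract as memcpy. */
void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Null pointers are only tolerated when there is nothing to copy. */
    assert(n == 0 || (d != NULL && s != NULL));

    for (size_t i = 0; i < n; i++)
        d[i] = s[i];

    return dst;
}
```

On x86, the rep movsb instruction expresses this same loop as a single instruction, which is what gives the hardware room to optimize it.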
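
The memcpy/memmove distinction mentioned earlier can be made concrete with a small C sketch. The helper shift_right and its parameters are hypothetical names chosen for illustration. It shifts data inside a single buffer, so source and destination overlap, which is the case where only memmove is well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: move the first `len` bytes of `buf` right by
 * `shift` positions, in place. The source range [0, len) and the
 * destination range [shift, shift + len) overlap whenever shift < len,
 * so memmove is required; calling memcpy here would be undefined
 * behavior. `cap` is the total capacity of `buf`. */
void shift_right(char *buf, size_t len, size_t shift, size_t cap)
{
    assert(len + shift <= cap);      /* destination must fit in the buffer */
    memmove(buf + shift, buf, len);  /* overlap-safe copy */
}
```

For example, with a 16-byte buffer holding "abcdef", calling shift_right(buf, 6, 2, 16) leaves the first two bytes in place and writes the original six bytes starting at offset 2, so the buffer begins with "ababcdef".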

First seen: 2025-08-11 05:47

Last seen: 2025-08-11 08:48