I'm following the benchmark from this GitHub project.
I added two other methods found here, plus one scalar method that uses memcpy() instead of _mm_store_ps(), to the test, but none of them could beat the scalar method introduced there.
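For readers who don't have the sources open, here is a rough sketch of the three kinds of horizontal sum being compared. The function names and bodies are illustrative assumptions, not the benchmark's actual code: one variant stores the vector with _mm_store_ps() and adds the lanes, one copies it out with memcpy(), and one uses the common SSE1 shuffle pattern. It assumes an x86 target with SSE available.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics (__m128, _mm_store_ps, _mm_shuffle_ps, ...)
#include <cstring>      // std::memcpy

// Hypothetical scalar variant: spill the vector with _mm_store_ps, then add the lanes.
static float sum_scalar_store(__m128 v) {
    alignas(16) float lanes[4];          // _mm_store_ps needs 16-byte alignment
    _mm_store_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

// Hypothetical scalar variant using memcpy instead of _mm_store_ps.
static float sum_scalar_memcpy(__m128 v) {
    float lanes[4];
    std::memcpy(lanes, &v, sizeof lanes); // byte copy of the 16-byte vector
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

// Hypothetical SSE1 shuffle variant (no SSE3 instructions needed).
static float sum_shuffle_sse1(__m128 v) {
    // Swap neighbours within each pair: [v1, v0, v3, v2]
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    // Pairwise sums: [v0+v1, v0+v1, v2+v3, v2+v3]
    __m128 sums = _mm_add_ps(v, shuf);
    // Move the high pair of sums into the low lanes of shuf
    shuf = _mm_movehl_ps(shuf, sums);
    // Lane 0 now holds (v0+v1) + (v2+v3)
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
```

Note that the shuffle variant adds the lanes in a different order than the scalar ones, so results can differ in the last bits for inputs that are not exactly representable.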
Here are the timings I get using the settings given in the source file (testsum128.cc), built with g++ without optimizations on an old i5 at 2.67 GHz:
Function                  Total time (ms)   Ignore
scalar                    433.251           7.03693e+13
vector via _mm_hadd_ps    515.603           7.03693e+13
vector via shuffles1      587.389           7.03693e+13
vector via shuffles2      683.393           7.03693e+13
hsum_ps_sse1              477.442           7.03693e+13
hsum_ps_sse3              464.465           7.03693e+13
scalar_memcpy             885.835           7.03693e+13