KVR Audio

juha_p · Post by **juha_p** » Mon Dec 10, 2018 9:18 am

What would be the best alternative to an _mm_hsum_* intrinsic?

I'm following the benchmark from this GitHub project.
I added two other methods found here, plus one scalar method that uses memcpy() instead of _mm_store_ps(), but none of them beats the scalar method from the original benchmark.
Here are the timings I get with the settings given in the source file (testsum128.cc), using g++ with no optimizations on an old i5 at 2.67 GHz:

Function                Total time (ms)        Ignore
scalar                    433.251           7.03693e+13
vector via _mm_hadd_ps    515.603           7.03693e+13
vector via shuffles1      587.389           7.03693e+13
vector via shuffles2      683.393           7.03693e+13
hsum_ps_sse1              477.442           7.03693e+13
hsum_ps_sse3              464.465           7.03693e+13
scalar_memcpy             885.835           7.03693e+13

Are there faster/better methods to do this hsum task?

2DaT · Post by **2DaT** » Mon Dec 10, 2018 11:20 am

Test code is kinda misleading. I doubt it can measure anything.

template <float (*sum_func)(__m256)> void do_test(const char* func_name, unsigned N, unsigned repQty)

Most likely sum_func will not be inlined.

for (unsigned k = 0; k < N; ++k) {
    __m256 inp = _mm256_set_ps(k, k + 1, k + 2, k + 3, k + 4, k + 5, k + 6, k + 7);
    float res = sum_func(inp);
    sum_total += res;
}

The _mm256_set_ps intrinsic will skew the results heavily. It's not free.
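One way to address both points is to build the inputs before the timed loop and to pass the sum function as a functor type rather than a function pointer. The following is only a sketch under assumptions: `HsumScalar`, `BenchResult` and `bench_hsum` are hypothetical names, it uses 128-bit SSE registers instead of the `__m256` from the quoted test so that it also runs on CPUs without AVX, and it only helps if the benchmark is built with optimizations enabled.

```cpp
#include <xmmintrin.h>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one of the hsum variants under test.
struct HsumScalar {
    float operator()(__m128 v) const {
        float t[4];
        _mm_storeu_ps(t, v);
        return t[0] + t[1] + t[2] + t[3];
    }
};

struct BenchResult {
    double ms;       // elapsed wall-clock time of the timed loop
    float checksum;  // returned so the compiler cannot discard the work
};

// SumFunc is a functor type, so the call can be inlined by an optimizing build.
template <class SumFunc>
BenchResult bench_hsum(const std::vector<float>& data, SumFunc sum_func = SumFunc()) {
    const std::size_t n = data.size() / 4 * 4;  // ignore a partial tail
    float total = 0.0f;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; i += 4) {
        // Only a load happens here; the data was generated outside the timed loop.
        total += sum_func(_mm_loadu_ps(data.data() + i));
    }
    const auto stop = std::chrono::steady_clock::now();
    return {std::chrono::duration<double, std::milli>(stop - start).count(), total};
}
```

It would be called as `bench_hsum<HsumScalar>(data)` with a pre-filled `std::vector<float>`. The scalar accumulation into `total` is still a dependency chain, so this measures the hsum in one particular context rather than in isolation.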

Usually, you want to use separate accumulators and do a single horizontal sum at the end.

s0=s1=s2=s3=0;
for(..)
{
	s0+= load(a+i);
	s1+= load(a+i+4);
	s2+= load(a+i+8);
	s3+= load(a+i+12);
}
s0+=s1;
s2+=s3;
s0+=s2;
return vector_horizontal(s0);

The efficiency of vector_horizontal does not matter much when the array a is large, because it runs only once per call.
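For reference, the pseudocode above could be written out with SSE intrinsics roughly like this. It is a minimal sketch, not a tuned implementation: `sum_array` and `hsum_ps` are hypothetical names, unaligned loads are assumed so that any `float` pointer works, and a scalar tail loop handles lengths that are not a multiple of 16.

```cpp
#include <xmmintrin.h>
#include <cstddef>

// Horizontal sum of one register (SSE1 only). It runs once per call,
// so its cost is amortized over the whole array.
static inline float hsum_ps(__m128 v) {
    __m128 hi = _mm_movehl_ps(v, v);    // [v2, v3, v2, v3]
    __m128 pair = _mm_add_ps(v, hi);    // [v0+v2, v1+v3, ...]
    __m128 odd = _mm_shuffle_ps(pair, pair, _MM_SHUFFLE(1, 1, 1, 1));  // broadcast v1+v3
    return _mm_cvtss_f32(_mm_add_ss(pair, odd));
}

float sum_array(const float* a, std::size_t n) {
    // Four independent accumulators hide the latency of the vector add.
    __m128 s0 = _mm_setzero_ps();
    __m128 s1 = s0, s2 = s0, s3 = s0;
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        s0 = _mm_add_ps(s0, _mm_loadu_ps(a + i));
        s1 = _mm_add_ps(s1, _mm_loadu_ps(a + i + 4));
        s2 = _mm_add_ps(s2, _mm_loadu_ps(a + i + 8));
        s3 = _mm_add_ps(s3, _mm_loadu_ps(a + i + 12));
    }
    s0 = _mm_add_ps(s0, s1);
    s2 = _mm_add_ps(s2, s3);
    s0 = _mm_add_ps(s0, s2);
    float total = hsum_ps(s0);
    for (; i < n; ++i) total += a[i];   // scalar tail for the leftover elements
    return total;
}
```

Note that the summation order differs from a plain scalar loop, so results can differ in the last bits for general floating-point data.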

There is no efficient method for a horizontal sum on a single register, but if you need 4 separate horizontal sums, you can use the code below (useful for FIR filters).

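// Requires AVX (<immintrin.h>). FORCE_INLINE is not defined in this post; it is
// presumably a user macro for a compiler-specific "always inline" attribute.
// Returns [hsum(a0), hsum(a1), hsum(a2), hsum(a3)]
FORCE_INLINE __m128 quad_hsum(__m256 a0, __m256 a1, __m256 a2, __m256 a3)
{
	// Interleave pairs of inputs within each 128-bit lane.
	__m256 a01lo = _mm256_unpacklo_ps(a0, a1);
	__m256 a01hi = _mm256_unpackhi_ps(a0, a1);
	__m256 a23lo = _mm256_unpacklo_ps(a2, a3);
	__m256 a23hi = _mm256_unpackhi_ps(a2, a3);
	// Per lane: [x0+x2, y0+y2, x1+x3, y1+y3] for each input pair (x, y).
	__m256 a01psum = _mm256_add_ps(a01lo, a01hi);
	__m256 a23psum = _mm256_add_ps(a23lo, a23hi);
	// Gather the (0+2) and (1+3) partial sums of all four inputs.
	__m256 a0123_02 = _mm256_shuffle_ps(a01psum, a23psum, _MM_SHUFFLE(1, 0, 1, 0));
	__m256 a0123_13 = _mm256_shuffle_ps(a01psum, a23psum, _MM_SHUFFLE(3, 2, 3, 2));
	// Each 128-bit lane now holds the four per-lane sums.
	__m256 sum = _mm256_add_ps(a0123_02, a0123_13);
	// Add the low and high lanes to finish the 8-element sums.
	__m128 vlow = _mm256_castps256_ps128(sum);
	__m128 vhigh = _mm256_extractf128_ps(sum, 1);
	__m128 res = _mm_add_ps(vlow, vhigh);
	return res;
}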
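The same unpack/add/shuffle pattern also works on 128-bit registers, which may matter on CPUs without AVX. This is a sketch of that analog, not something posted in the thread: `quad_hsum_sse` is a hypothetical name, and it uses only SSE1 intrinsics.

```cpp
#include <xmmintrin.h>

// Returns [hsum(a0), hsum(a1), hsum(a2), hsum(a3)] for four 4-float registers.
static inline __m128 quad_hsum_sse(__m128 a0, __m128 a1, __m128 a2, __m128 a3) {
    // [x0+x2, y0+y2, x1+x3, y1+y3] for each input pair (x, y).
    __m128 s01 = _mm_add_ps(_mm_unpacklo_ps(a0, a1), _mm_unpackhi_ps(a0, a1));
    __m128 s23 = _mm_add_ps(_mm_unpacklo_ps(a2, a3), _mm_unpackhi_ps(a2, a3));
    // Gather the (0+2) and (1+3) partial sums of all four inputs.
    __m128 even = _mm_shuffle_ps(s01, s23, _MM_SHUFFLE(1, 0, 1, 0));
    __m128 odd  = _mm_shuffle_ps(s01, s23, _MM_SHUFFLE(3, 2, 3, 2));
    return _mm_add_ps(even, odd);
}
```

For a 4-tap-wide FIR inner loop, one would typically keep four product accumulators (one per output sample) and call this once to collapse them into four outputs.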

Relevant discussions on SO:
https://stackoverflow.com/questions/699 ... sum-on-x86
https://stackoverflow.com/questions/138 ... bit-floats

SSE, hsum speed