My b^x approximation

DSP, Plugin and Host development discussion.

Post

juha_p wrote: MonoDevelop in Ubuntu, g++ as compiler w/o optimizations ... also, remember that I don't know much about implementing SSE code, so many things can be wrong there, and the test process is what it is (just look at the code linked in one of my posts earlier in this thread).
Well, debug builds are awfully inefficient, so it makes little sense to do performance comparisons on those.

Post

2DaT wrote:
juha_p wrote: MonoDevelop in Ubuntu, g++ as compiler w/o optimizations ... also, remember that I don't know much about implementing SSE code, so many things can be wrong there, and the test process is what it is (just look at the code linked in one of my posts earlier in this thread).
Well, debug builds are awfully inefficient, so it makes little sense to do performance comparisons on those.
Yes, the difference is huge between debug and release builds. Can you link some better performance test tools for function testing (C++)?

Post

juha_p wrote: Yes, the difference is huge between debug and release builds. Can you link some better performance test tools for function testing (C++)?
Test code:

Code: Select all

#include <iostream>
#include <limits>
#include <cstdint>   // uint64_t, UINT64_MAX
#include <cstdlib>   // rand
#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif
#include <algorithm>
#include <xmmintrin.h>
#include "TestProcedures.h"

int main()
{
	using namespace std;
	//_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
	//_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
	size_t dataSize_Kb = 2*1024;
	size_t dataSize_B = dataSize_Kb * 1024;
	size_t dataSize = dataSize_B / sizeof(float);
	float* input_data = (float*) _mm_malloc(dataSize_B, 4096);
	float* output_data =  (float*)_mm_malloc(dataSize_B, 4096);

	for (size_t i = 0; i < dataSize; i++)
	{
		input_data[i] = rand() * 0.0001f; // fill the input buffer with test data
	}

	uint64_t minimum1=UINT64_MAX, minimum2= UINT64_MAX;
	for (int i = 0; i < 50; i++)
	{
		auto ticks = __rdtsc();
		testProc1(input_data, output_data, dataSize);
		auto ticks1 = __rdtsc();
		testProc2(input_data, output_data, dataSize);
		auto ticks2 = __rdtsc();
		ticks2 -= ticks1;
		ticks1 -= ticks;

		minimum1 = std::min(minimum1, (uint64_t)ticks1);
		minimum2 = std::min(minimum2, (uint64_t)ticks2);
	}
	cout << "Approx. rdtsc/val...." << endl;
	cout << "First : " << (double)minimum1 / dataSize << "\nSecond : " << (double)minimum2 / dataSize << endl;
	_mm_free(input_data);
	_mm_free(output_data);
	return 0;
}


Header:

Code: Select all

#pragma once
void testProc1(const float * __restrict in, float * __restrict out, size_t size);
void testProc2(const float * __restrict in, float * __restrict out, size_t size);
Test functions:

Code: Select all

#include "TestProcedures.h"
#include <emmintrin.h>
#include <cmath>
void testProc1(const float * __restrict in, float * __restrict out, size_t size)
{
	for (size_t i = 0; i < size; i++)
	{
		out[i] = std::exp(in[i]);
	}
}

__m128 exp_mp(const __m128& x)
{
	//[-0.5,0.5] 2^x approx polynomial ~ 2.4 ulp
	const __m128 c0 = _mm_set1_ps(1.000000119e+00f);
	const __m128 c1 = _mm_set1_ps(6.931469440e-01f);
	const __m128 c2 = _mm_set1_ps(2.402212024e-01f);
	const __m128 c3 = _mm_set1_ps(5.550713092e-02f);
	const __m128 c4 = _mm_set1_ps(9.675540961e-03f);
	const __m128 c5 = _mm_set1_ps(1.327647245e-03f);

	const __m128 l2e = _mm_set1_ps(1.442695041f);

	//f = log2(e) *x
	__m128 f = _mm_mul_ps(x, l2e);

	//i = round(f)
	//f = f - i   
	__m128i i = _mm_cvtps_epi32(f);
	f = _mm_sub_ps(f, _mm_cvtepi32_ps(i));

	//estrin-horner evaluation scheme
	__m128 f2 = _mm_mul_ps(f, f);
	__m128 p = _mm_add_ps(_mm_mul_ps(c5, f), c4);
	p = _mm_mul_ps(p, f2);
	p = _mm_add_ps(p, _mm_add_ps(_mm_mul_ps(c3, f), c2));
	p = _mm_mul_ps(p, f2);
	p = _mm_add_ps(p, _mm_add_ps(_mm_mul_ps(c1, f), c0));

	// i << 23
	i = _mm_slli_epi32(i, 23);
	//r = (2^i)*(2^f)
	//directly in floating point exponent
	return _mm_castsi128_ps(_mm_add_epi32(i, _mm_castps_si128(p)));
}

void testProc2(const float * __restrict in, float * __restrict out, size_t size)
{
	for (size_t i = 0; i < size; i+=4)
	{
		_mm_store_ps(out + i, exp_mp(_mm_load_ps(in + i)));
	}
}
You want those files separated to prevent compiler cross-optimization. Also, don't use LTO. Didn't check on compilers other than MSVC, though.
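For example (my illustration, not from the thread; file names assumed), with g++ one might compile the two translation units separately and link without LTO:

Code: Select all

g++ -O2 -msse2 -c main.cpp TestProcedures.cpp
g++ main.o TestProcedures.o -o exptest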

Post

Thanks!

I have a few versions of VS installed, so no trouble getting it working.

The reason I ran my tests in debug mode was that in release mode there were no differences, even when I made the test run many times longer than what's in the code (tried the inner loop with j<10000000 --> solving std::exp() and std::exp() would have taken a long time to get a result for comparison use...).

EDIT: After converting my easyEXP_sse methods to the float data type, a quick test with your code gave these results:

Code: Select all

Approx. rdtsc/val....
std::exp() : 26.6975
exp_mp() : 3.13861
easyEXP_sse() : 2.91754
easyEXP_sse_tweaked() : 3.15795
I still have to make the test accept __m128d and float/double types.

EDIT:
Made a faster version of the float MacLaurinEXP_SSE() by rewriting the code to calculate 4 divisions and additions at a time (isn't this exactly what SSE is best at?) and then adding the horizontal sums (see the sketch below) ... result:

Code: Select all

Approx. rdtsc/val....
std::exp() : 26.6983
MacLaurinEXP_SSE() : 0.579626
Quite fast already :D
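For reference, here is one common way to write that horizontal sum in plain SSE (a generic shuffle-based sketch, not the exact code used here):

Code: Select all

#include <xmmintrin.h>

// Sum the four lanes of an __m128 using shuffles (SSE1 only).
inline float hsum_ps(__m128 v)
{
	__m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // lanes [1,0,3,2]
	__m128 sums = _mm_add_ps(v, shuf);                           // (0+1, ., 2+3, .)
	shuf = _mm_movehl_ps(shuf, sums);                            // bring (2+3) to lane 0
	sums = _mm_add_ss(sums, shuf);                               // (0+1) + (2+3)
	return _mm_cvtss_f32(sums);
}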
Last edited by juha_p on Sun Sep 16, 2018 6:16 am, edited 3 times in total.

Post

juha_p wrote:
2DaT wrote:
juha_p wrote: MonoDevelop in Ubuntu, g++ as compiler w/o optimizations ... also, remember that I don't know much about implementing SSE code, so many things can be wrong there, and the test process is what it is (just look at the code linked in one of my posts earlier in this thread).
Well, debug builds are awfully inefficient, so it makes little sense to do performance comparisons on those.
Yes, the difference is huge between debug and release builds. Can you link some better performance test tools for function testing (C++)?
There is like ZERO point in ever benchmarking debug builds, because those are specifically designed to be easy for a debugger to follow, which involves intentionally producing really inefficient binary code to make sure all the variables are always accessible. Compilers might literally do things like store every temporary to memory, which takes orders of magnitude more time than the actual code you wrote and completely screws up even the relative performance of different algorithms.

They might even throw in some extra code to do additional checking in order to give you helpful run-time diagnostics. Debug builds are literally only ever useful for debugging errors, and even then only when you have an error that's actually predictable enough to show up in a debug build.

ps. in general enabling debug symbols in release builds is fine, those are just data embedded for the debugger to try to figure out what's going on, although success is quite variable when optimisations are enabled... but you really REALLY need at least basic optimisations (often even things like register allocation are part of the optimiser) to produce anything that's even remotely useful for measuring performance

Post

It's useful to use fgets() and the stdin pipe so the compiler cannot "over-optimise" a basic test program. If the compiler can resolve that something is constant, the result will not reflect the same build as when it's actually run on arbitrary data. fgets() gives you a basic input which feeds the input variables of your tested functions, so they are absolutely unknown at compile time.
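For illustration (the helper name is mine), feeding a value through stdin so it cannot be known at compile time:

Code: Select all

#include <cstdio>
#include <cstdlib>

// Read one float from stdin; piping the value in at run time keeps the
// compiler from constant-folding whatever is computed from it.
float read_unknown_float()
{
	char buf[64];
	if (!fgets(buf, sizeof buf, stdin))
		return 0.0f; // nothing piped in
	return strtof(buf, nullptr);
}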

I also found that if the compiler cannot resolve something, it will not optimize the constants into the program. For whatever reason, this incrementer I wrote does not resolve to a constant, possibly because the loop it's used in is too large, or because it accesses the heap, or because of the punned types, or possibly the conditional statements...

Code: Select all

AT(inline) void loop_inc(test_state* ptr_ts)
{
    test_state ts = *ptr_ts;
    union { float f; uint32_t u; }pun;

    ts.c1--;  // dec loop counter

    // input increment logic
    pun.f = ts.low; ts.low2 = pun.u;
    if (ts.low2 >= 2147483648u) { ts.low2--; }  // choose inc or dec to iterate towards 2147483648u (-0.f)
    if (ts.low2 < 2147483648u) { ts.low2++; }
    if (ts.low2 == 2147483648u) { ts.low2 = 0u; } // filter -0.f
    pun.u = ts.low2; ts.low = pun.f;

    *ptr_ts = ts;
}

Post

EDIT: Simplified the question: how are these two lines translated to SSE intrinsics?

/* multiply by power of 2 */
z *= uint32_to_sp((n+0x7f)<<23);

and

// Build 2^n in double.
x *= uint642dp( (((uint64_t)n) + 1023) << 52);

Original source code.

Post

juha_p wrote: EDIT: Simplified the question: how are these two lines translated to SSE intrinsics?
/* multiply by power of 2 */
z *= uint32_to_sp((n+0x7f)<<23);
Correct way: handles every p correctly, but only when -127 < i < 128. Same as the original.

Code: Select all

i = _mm_add_epi32(i,_mm_set1_epi32(0x7f));
i = _mm_slli_epi32(i, 23);
return _mm_mul_ps(p,_mm_castsi128_ps(i));
"smart way": will produce garbage if either p or resulting number is bad (NaNs, infinities, denormals, +-zero), but will handle some cases when exponent is out of range. Slightly faster. Use with caution :D

Code: Select all

i = _mm_slli_epi32(i, 23);
return _mm_castsi128_ps(_mm_add_epi32(i, _mm_castps_si128(p)));
You need more operations for the case i==-127 (denormals). But I think you shouldn't worry much about denormals in your SSE code because they perform badly. You want DAZ and FTZ modes on.
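For reference, DAZ/FTZ can be enabled per thread through MXCSR with the standard helper macros (they also appear, commented out, in the test harness above):

Code: Select all

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

// FTZ flushes denormal results to zero; DAZ treats denormal inputs as zero.
// Both avoid the slow microcoded paths that denormals trigger.
void enable_daz_ftz()
{
	_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
	_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}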

Same with double but with different constants/instructions.
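A sketch of the double-precision "correct way" (mirroring the scalar uint642dp line quoted above; the function name is mine):

Code: Select all

#include <emmintrin.h>

// Scale p by 2^i per lane: bias the exponent by 1023 and shift it into
// bits 62..52 of each 64-bit lane. Assumes i holds one 64-bit integer
// exponent per double lane, with -1023 < i < 1024.
__m128d scale_pow2_pd(__m128d p, __m128i i)
{
	i = _mm_add_epi64(i, _mm_set1_epi64x(1023));
	i = _mm_slli_epi64(i, 52);
	return _mm_mul_pd(p, _mm_castsi128_pd(i));
}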

Post

Thanks again!
Yes, it looks like the implementation works in the range [-87.6, 87.9] ... .

Got it working!

BTW, that double-precision version of the "Pade based" exp approximation is excellent ... as accurate as the "Remez13 based" one listed @ here.

EDIT:
For some data and operation types, we see the expected vector speedup with respect to the used vectorization type (4 times with SSE and 8 times with AVX in case of 4-byte operands; 2 times with SSE and 4 times with AVX in case of 8-byte operands). But others (such as SQRT() operations with AVX or integer operations with AVX) don't show the expected vectorization speedup. For some data and operation types such as all integer operations, no additional speedup with AVX over SSE is observed. Some operations (for example, EXP()) show worse performance with vectorization.
Source: http://www.nersc.gov/users/computationa ... orization/

Post

Made a little comparison between a few methods (Lagrange & Newton, Maclaurin, Padé, minimax, equally spaced interpolation, Chebyshev, Taylor) of degree 8 (i.e., 9 coefficients) for approximating exp() over the interval [-1,1].

https://www.desmos.com/calculator/pougvl2dpb

Is there ready-to-use source code available to calculate the ULP error for these?
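Not from the thread, but the usual trick is short enough to write yourself; a minimal sketch (names are mine) using the monotonic reinterpretation of the float bit pattern:

Code: Select all

#include <cstdint>
#include <cstring>
#include <cstdlib>

// Map a float's bits to an integer scale where consecutive floats map to
// consecutive integers (Bruce Dawson's ordering trick).
int32_t float_to_ordered(float f)
{
	int32_t i;
	std::memcpy(&i, &f, sizeof i);
	return (i < 0) ? INT32_MIN - i : i;
}

// ULP distance between an approximation and a reference rounded to float.
int64_t ulp_distance(float approx, float reference)
{
	return std::llabs((int64_t)float_to_ordered(approx)
	                - (int64_t)float_to_ordered(reference));
}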
Last edited by juha_p on Thu Jan 31, 2019 11:47 am, edited 3 times in total.

Post

2DaT wrote: Wed Sep 26, 2018 3:35 pm You need more operations for the case i==-127 (denormals). But I think you shouldn't worry much about denormals in your SSE code because they perform badly. You want DAZ and FTZ modes on.
Sometimes it might actually be nice to get zeroes for underflows even if you don't care for the denormals. As an example, tanh(x)=(exp(2*x)-1)/(exp(2*x)+1) is normally safe for large negative values (and one can handle the positive values by symmetry to avoid overflowing exp()), it just saturates to -1/1 as exp(2*x) becomes zero.

I suppose in most audio situations one would approximate tanh() separately, but there might be a few other similar situations where underflow is actually perfectly fine as long as exp() correctly returns zero.
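A sketch of that identity on top of the exp_mp() posted earlier in this thread (my wrapper; assumes DAZ/FTZ so exp underflows to zero, and that the caller handles positive x by symmetry to avoid overflow):

Code: Select all

#include <xmmintrin.h>

__m128 exp_mp(const __m128& x); // from earlier in the thread

// tanh(x) = (exp(2x) - 1) / (exp(2x) + 1); safe for large negative x,
// where exp(2x) goes to zero and the quotient saturates to -1.
__m128 tanh_via_exp(__m128 x)
{
	const __m128 one = _mm_set1_ps(1.0f);
	__m128 e = exp_mp(_mm_add_ps(x, x)); // exp(2x)
	return _mm_div_ps(_mm_sub_ps(e, one), _mm_add_ps(e, one));
}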

Post

mystran wrote: Wed Jan 30, 2019 4:15 pm
Sometimes it might actually be nice to get zeroes for underflows even if you don't care for the denormals. As an example, tanh(x)=(exp(2*x)-1)/(exp(2*x)+1) is normally safe for large negative values (and one can handle the positive values by symmetry to avoid overflowing exp()), it just saturates to -1/1 as exp(2*x) becomes zero.

I suppose in most audio situations one would approximate tanh() separately, but there might be a few other similar situations where underflow is actually perfectly fine as long as exp() correctly returns zero.
Well, you don't need a full range exp for a good tanh either. I think for all x > 15, tanh(x) is exactly 1 in floating point arithmetic.

It's HARD to beat a vectorized MSVC tanh which takes about 5 cycles/value.
tanh(x/2) = (exp(x)-1)/(exp(x)+1)
This variant is not bad, but you need to take extra steps to avoid cancellation in the numerator (for x near zero, exp(x)-1 loses most of its significant bits). I think it should be around 3-4 c/v, depending on how accurate the exp approximation is.

Another promising approach would be a straight minimax rational with a carefully chosen interval. I don't know of software that can do a reliable relative minimax rational fit, though.

I "almost" made a rational remez code, but the equations are extremely ill-conditioned and the 64-bit floating point doesn't cut it. Well, a 1024-bit boost::multiprecision float will do it, but you will have to pay over 9000 in performance costs. Not an exaggeration :D

Post

2DaT wrote: Wed Jan 30, 2019 7:47 pm
mystran wrote: Wed Jan 30, 2019 4:15 pm
Sometimes it might actually be nice to get zeroes for underflows even if you don't care for the denormals. As an example, tanh(x)=(exp(2*x)-1)/(exp(2*x)+1) is normally safe for large negative values (and one can handle the positive values by symmetry to avoid overflowing exp()), it just saturates to -1/1 as exp(2*x) becomes zero.

I suppose in most audio situations one would approximate tanh() separately, but there might be a few other similar situations where underflow is actually perfectly fine as long as exp() correctly returns zero.
Well, you don't need a full range exp for a good tanh either. I think for all x > 15, tanh(x) is exactly 1 in floating point arithmetic.
Python says:

>>> math.tanh(19.)
0.9999999999999999

I guess by "floating point" arithmetic you mean "single precision" arithmetic. ;)

Anyway, the point I was trying to make is that sometimes it's nice if exp() underflows into exact zero, so you save the trouble of having to insert a guard branch... but then again if that means that exp() itself needs such a branch, that doesn't really get us anywhere.

Post

Hmm... is tanh(x) hard to approximate well (accuracy & speed)?
I tried a Padé approximant, and it looks like you need a 16th-order function to keep the error below the 10^-5 level over the interval [-20, 20].

EDIT: Noticed that the Taylor approximation of exp() in the plots linked in my previous post works well only for x > 0 ... is that OK?

Post

juha_p wrote: Thu Jan 31, 2019 11:38 am Hmm... is tanh(x) hard to approximate well (accuracy & speed)?
Have you tried doing a Pade approximation on t(x)=tanh(sqrt(x))/sqrt(x), such that tanh(x)=x*t(x*x)?

I've only really used these for waveshapers, so I never really cared about the exact errors... but if I remember correctly, it should converge faster than a direct approximation of tanh.
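For concreteness (my sketch, not mystran's code): truncating the continued fraction of tanh gives rationals already in the x*t(x*x) form, e.g. the [5/4] Padé approximant below, which is accurate near 0 only; a fitted minimax version would do better over a wide interval:

Code: Select all

// tanh(x) ~ x * (945 + 105*y + y^2) / (945 + 420*y + 15*y^2), y = x*x,
// the [5/4] Pade approximant from the continued fraction of tanh.
float tanh_pade54(float x)
{
	float y = x * x;
	float num = 945.0f + y * (105.0f + y);
	float den = 945.0f + y * (420.0f + 15.0f * y);
	return x * (num / den);
}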
