My b^x approximation

DSP, Plugin and Host development discussion.

Post

This is an approximation function for 2^x which uses a raised hyperbola to mimic an exponential over a small range. I'm posting it here because I think it could be useful, especially to help avoid the slow and bulky pow(), which prioritizes accuracy over speed. The error of this approximation is small enough for non-critical accuracy, and it can be reduced somewhat at the expense of a few more operations and added quantization issues.

Some things could be added and improved, for example:
A method to translate an arbitrary base input into the necessary magic constant.
The "result_scaler" could use a larger range; it was done here with a bit shift for speed.
A well-constructed loop that can be unrolled at compile time, to allow a speed/accuracy tradeoff.

About the magic constant: with a constant of 1 this approximates e^x instead. At the same time, the error within the [-.5, .5) domain calls for a small adjustment of the constant to minimize the total error.
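To make the hyperbola connection explicit (my sketch, not part of the original post): with h = 1 the core of the code computes ((x+16)/(x-16))^8, the sign of the ratio vanishing under the even power. Expanding the logarithm of the ratio shows why this tracks e^x near zero, and the leading x^3/768 error term is also why the magic constant is nudged slightly below ln(2):

```latex
\left(\frac{x+16}{x-16}\right)^{8}
  = \left(\frac{1+x/16}{1-x/16}\right)^{8}
  = \exp\!\left(16u + \tfrac{16}{3}u^{3} + \cdots\right)\Big|_{u=x/16}
  = \exp\!\left(x + \tfrac{x^{3}}{768} + \cdots\right) \approx e^{x}
```

using ln((1+u)/(1-u)) = 2(u + u^3/3 + u^5/5 + ...).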

Using a base that isn't a power of two requires a modulo in the equation.

Code: Select all

#include <stdio.h>
#include <math.h>
#include <stdint.h>

int main()
{

    double in = 1.0;  // input should be positive, compensate for input domain with offset
    // input domain is (-.5, 63.5)
    double in_rounded = round(in);
    double result_scaler = (double)( ((uint64_t)1) << ((uint64_t)in_rounded) );

    double h = 0.693038794439242;  // magic multiplier for the base (this is base2)

    double x = (in - in_rounded) * h;


    double dividend = x+16.0;
    double divisor = x-16.0;
    double result = (dividend / divisor);
    result *= result;
    result *= result;
    result *= result;
    result *= result_scaler;

// multiply by the 1/2^offset that compensates the positive input

    printf("\nresult = %.15f", result);
    return 0;
}
Last edited by camsr on Sat Sep 08, 2018 10:08 pm, edited 2 times in total.

Post

I've compared your approximation to the MSVC SSE2 expf implementation on a Skylake processor.

Your version : 65.1511 (cycles/value)
MSVC : 14.9906 (cycles/value)

std::round compiles to some ancient function call.
I made some "improvements" to make this a fair battle (maybe not correct, but I'm comparing from a performance point of view):

Code: Select all

float myround(float x)
{
	return _mm_cvtss_f32(_mm_cvtepi32_ps(_mm_cvtps_epi32(_mm_set_ss(x))));
}
float ap_test(float x2)
{
	float in = x2;  // input should be positive, compensate for input domain with offset
					  // input domain is (-.5,63.5)
	//float in_rounded = roundf(in);
	float in_rounded = myround(in);
	float result_scaler = (float)(((uint32_t)1) << ((uint32_t)in_rounded));

	float h = 0.693038794439242f;  // magic multiplier for the base (this is base2)

	float x = (in - in_rounded) * h;


	float dividend = x + 16.0f;
	float divisor = x - 16.0f;
	float result = (dividend / divisor);
	result *= result;
	result *= result;
	result *= result;
	result *= result_scaler;

	// multiply by the 1/2^offset that compensates the positive input

	return result;
}
Benchmarking after:
Your : 10.5513
MSVC: 14.9947

Which is okay, I guess. But the precision is awful compared to what you can get for the same cost.
I've plotted some error graphs (compared to a minimax polynomial).
https://www.desmos.com/calculator/ix85wrsjyw
And here is what i would use:

Code: Select all

	__m128 exp_mp(const __m128& x)
	{
		//[-0.5,0.5] 2^x approx polynomial ~ 2.4 ulp
		const __m128 c0 = _mm_set1_ps(1.000000119e+00f);
		const __m128 c1 = _mm_set1_ps(6.931469440e-01f);
		const __m128 c2 = _mm_set1_ps(2.402212024e-01f);
		const __m128 c3 = _mm_set1_ps(5.550713092e-02f);
		const __m128 c4 = _mm_set1_ps(9.675540961e-03f);
		const __m128 c5 = _mm_set1_ps(1.327647245e-03f);

		const __m128 l2e = _mm_set1_ps(1.442695041f);

		//f = log2(e) *x
		__m128 f = _mm_mul_ps(x, l2e);

		//i = round(f)
		//f = f - i   
		__m128i i = _mm_cvtps_epi32(f);          
		f = _mm_sub_ps(f, _mm_cvtepi32_ps(i));

		//estrin-horner evaluation scheme
		__m128 f2 = _mm_mul_ps(f, f);
		__m128 p = _mm_add_ps(_mm_mul_ps(c5, f), c4);
		p = _mm_mul_ps(p, f2);
		p = _mm_add_ps(p, _mm_add_ps(_mm_mul_ps(c3, f), c2));
		p = _mm_mul_ps(p, f2);
		p = _mm_add_ps(p, _mm_add_ps(_mm_mul_ps(c1, f), c0));

		// i << 23
		i = _mm_slli_epi32(i, 23);
		//r = (2^i)*(2^f)
		//directly in floating point exponent
		return _mm_castsi128_ps(_mm_add_epi32(i, _mm_castps_si128(p)));
	}
Your : 10.5952
exp_mp : 10.2479

More or less the same. Your version is probably slower on older processors because of the division.
I didn't check the precision of your function, but I don't think it can provide more than a few bits of mantissa.

exp_mp will give you about 20 mantissa bits on reasonable input values, plus some uncompensated range-reduction error, which should not matter anyway (because of the condition number of exp).

Post

Thanks for doing this comparison. I was able to get around 41 cycles on my Q9450 (using doubles; haven't tested floats yet) with my formula, after implementing a change to the rounding: double in_rounded = (double)((uint64_t)(in + 0.5));

https://www.desmos.com/calculator/xb96mds8d3
Here's the error comparison with the accuracy of my formula enhanced (notice the changed magic constant), at the expense of two more multiplications and less mantissa precision of the input (because of the large separation between the values added to x). Also, adding more multipliers to my formula hardly slowed it down at all, and I'm not sure why; I was able to raise to the 64th power before even 5 or so cycles were added.
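For reference, the truncation-based rounding from that change looks like this in isolation (a sketch; note the uint64_t conversion truncates toward zero, so this is only valid for in >= -0.5, because converting a negative double to an unsigned integer is undefined behavior):

```c
#include <stdint.h>

/* Round-to-nearest by adding 0.5 and truncating: valid only when
   in + 0.5 >= 0, since converting a negative double to uint64_t is
   undefined. Ties (x.5) round upward, matching round() for positive
   inputs. */
static double round_fast(double in)
{
    return (double)((uint64_t)(in + 0.5));
}
```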

Post

Division is slower everywhere, BTW. Seems like processor designers can't optimize it as much as they can optimize multiplication (which is only a little bit slower than addition on modern CPUs). Avoid divisions in time-critical code :-)

Post

Aleksey Vaneev wrote:Division is slower everywhere, BTW. Seems like processor designers can't optimize it as much as they can optimize multiplication (which is only a little bit slower than addition on modern CPUs). Avoid divisions in time-critical code :-)
Well, division has certainly gotten a lot better. Intel processors have used pipelined division units since Ivy Bridge, 8-wide since Skylake.

A 32-bit division costs roughly 3-4 multiplications on Skylake. On the other hand, you get an FMA operation at the cost of an addition/multiplication. This gives you some interesting performance tradeoffs between rationals and polynomials: rationals have better approximation power (asymptotically), but polynomials are cheaper.

I think there are transcendental functions that always require a division (tanh, tan).

I'd say that divisions are perfectly OK for time-critical code. Just not as cheap.

BTW, multiplication has the same cost as addition (as a full FMA op) on Skylake.

Post

How does it perform compared to fmath exp?

Post

Miles1981 wrote:How does it perform compared to fmath exp?
Not sure about fmath, but it does well against the C stdlib exp(). I'm currently writing a test program to compare functions for accuracy and precision.

Post

fmath is far faster than std::exp, and for double it is quite precise (probably better than some Intel math-library versions); it also has a vectorized version (no idea how it compares to SVML).

Post

Skylake, turbo boost off, 8 MB of noise data, minimum values over 50 runs, rdtsc to measure average clock cycles.
32 bit float tests:

scalar:
camsr : 10.5653
MSVC std::exp : 14.9873
exp_mp : 10.2527
fmath::exp : 8.42949

vector:
exp_mp : 2.75294
fmath::exp_ps : 4.0874
camsr : ~3 (didn't test).
MSVC vec exp : 3.98002

Accuracy tests on [-0.34657, 0.34657], plus range-reduction accuracy (DAZ & FTZ on):

MSVC scalar: close to 0.5 ulp, correct range reduction.
MSVC vector: 3.215 ulp, correct range reduction.
fmath: 2.836 ulp, simplified range reduction.
exp_mp: 3.27 ulp, simplified range reduction.
camsr: 643.69 ulp , simplified range reduction.

MSVC and fmath perform additional checks that the argument is in range.
fmath has a 4 KB lookup table (which will probably thrash the cache in real use).
MSVC has some kind of lookup too.
MSVC will use dynamic dispatch even when targeting SSE2; it used SSE4.1 on my machine.
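For anyone wanting to reproduce ULP figures like these, one common harness trick (my sketch, not necessarily 2DaT's setup) is to map float bit patterns onto a monotonic integer scale and take differences:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Map a finite float's bit pattern onto a scale that is monotonic
   with the float ordering (negative floats map below zero). */
static int32_t float_to_ordered(float f)
{
    int32_t i;
    memcpy(&i, &f, sizeof i);                 /* avoids strict-aliasing issues */
    return (i < 0) ? (int32_t)0x80000000 - i : i;
}

/* ULP distance between two finite floats: number of representable
   values separating them (0 means bit-identical or +/-0). */
static uint32_t ulp_distance(float a, float b)
{
    int64_t d = (int64_t)float_to_ordered(a) - (int64_t)float_to_ordered(b);
    return (uint32_t)(d < 0 ? -d : d);
}
```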
Last edited by 2DaT on Thu Sep 06, 2018 2:36 am, edited 1 time in total.

Post

You can use this incrementer to step upward through a domain of [value, stop_value]. It had to be done this way to avoid incrementing by FLT_MIN (which I'm not sure even works; I never tried). It also works through zero, whereas a plain bit increment normally only moves a float away from zero.

Code: Select all

// value, stop_value (float) and result (double) are declared elsewhere
uint32_t inc;
while (value < stop_value)
{
// loop logic
if ((*((uint32_t*)&value)) == 2147483648u) { value = 0.f; } // 0x80000000u is the bit pattern of negative zero
inc = *((uint32_t*)&value);
(value < 0.f) ? inc-- : inc++; // stepping the bits down for negatives moves the value upward
value = *((float*)&inc);

// call functions
result = exp(value);
}

Post

Somehow I couldn't get this (known) trick to work (Linux) until I changed the EXP(y) (_eco.n.j ... .

How does this old trick compare today against those already mentioned?

EDIT:

A couple of links, if speed is important:

https://stackoverflow.com/questions/488 ... -using-avx
https://stackoverflow.com/questions/470 ... -using-sse

EDIT: a few more SSE2-optimized exp (and log) implementations:
http://www.chokkan.org/software/dist/fastexp.c.html
Last edited by juha_p on Sat Sep 08, 2018 12:44 pm, edited 1 time in total.

Post

How does this old trick compare today against those already mentioned?
I ran a timing test for a few implementations using the code listing from http://www-h.eng.cam.ac.uk/help/tpl/lan ... ricks.html as the base method. It looks like EXP() is faster than std::exp and std::pow, but maybe not very accurate without tweaking the parameter values.
Last edited by juha_p on Sun Sep 16, 2018 6:05 am, edited 11 times in total.

Post

I think I've managed to fix the result scaling for other bases; I still have to test the formula on bases other than e.

Post

2DaT wrote: Which is okay i guess. But precision is awful compared to what you can get for the same cost.
I've plotted some error graphs (compared to minmax poly).
https://www.desmos.com/calculator/ix85wrsjyw
The error doesn't look like a proper minimax; I would guess this is a Chebyshev approximation?

Post

Aleksey Vaneev wrote:Division is slower everywhere, BTW. Seems like processor designers can't optimize it as much as they can optimize multiplication (which is only a little bit slower than addition on modern CPUs). Avoid divisions in time-critical code :-)
Division in general is slower than sqrt() actually. :)
