Improved Taylor/Padé approximation of cos/sin functions

DSP, Plugin and Host development discussion.

Post

juha_p wrote: Sat Mar 26, 2022 7:47 am Results for approximations of sin() and cos() in the ranges [-pi/4, pi/4], [-pi, pi] and [-2pi, 2pi] (Float64 not tested):
These results are wrong. To test correctly you need to compare the results of your function to a known higher-precision variant. To test 32-bit floats you need to compare them to the library double (which is proven to be within a few ulps). And even then it's still technically not correct; there are cases where proper rounding of the result requires more than double precision.

For approximation we don't care about correct rounding at all. All we care about is precision versus performance, and usually a few ulps are more than enough.

For sine and cosine the precision/performance optimum is a minimax polynomial consisting of odd/even terms. You can't do much better without using tables. Padé/Taylor approximations or loops of any kind will not deliver the performance.
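
Not from 2DaT's post, but as a concrete illustration of the shape such an approximation takes: a minimal C++ sketch of an odd polynomial for sin evaluated in Horner form. The coefficients here are plain Taylor values used as placeholders; an actual minimax fit (e.g. from a Remez/Sollya run over the reduction range) would substitute its own coefficients.

Code: Select all

// Sketch only: odd-polynomial sin approximation in Horner form.
// Taylor coefficients are used as placeholders; a real minimax fit over the
// reduction range would replace c1..c7.
float sin_poly7(float x)   // intended for a reduced range such as [-pi/4, pi/4]
{
    const float c1 =  1.0f;
    const float c3 = -1.0f / 6.0f;
    const float c5 =  1.0f / 120.0f;
    const float c7 = -1.0f / 5040.0f;
    const float x2 = x * x;
    return x * (c1 + x2 * (c3 + x2 * (c5 + x2 * c7)));
}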

Fiddling around with approximations is not really worth it unless you really know what you are doing.

Post

2DaT wrote: Sat Mar 26, 2022 7:34 pm ...
To test correctly you need to compare the results of your function to a known higher-precision variant.
...
Hmm... I did mention it's not a comparison against library functions. I don't know how I could compare against those with this code (I tried manually changing the c=fun (in test_acc()) to be the library function, but got errors).

Post

2DaT wrote: Sat Mar 26, 2022 7:34 pm For sine and cosine the precision/performance optimum is a minimax polynomial consisting of odd/even terms. You can't do much better without using tables. Padé/Taylor approximations or loops of any kind will not deliver the performance.
To clarify this point... Taylor approximations (or Padé, for functions poorly approximated by polynomials) are great when you want a lot of precision around a single point and don't care about larger error away from that point. When that's the case, you can often get away with a lower-order approximation. I do this a lot for things like waveshapers, where I might want it to be linear around zero and follow a general shape elsewhere, but often the original function I'm trying to approximate isn't really a "ground truth" to begin with (rather just something in the approximate ballpark of what I'm looking for), so precision over the whole range is mostly irrelevant.

If you care about the precision over the whole range that you're trying to approximate, then basically by definition you want a minimax polynomial (or a minimax rational in the case of functions poorly approximated by polynomials), because minimizing the maximum error is precisely how you get the function to behave well over a range.
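
In symbols (notation added here, not from the post), a minimax fit of degree $n$ on an interval $[a,b]$ picks

$$p^{*} = \operatorname*{arg\,min}_{p \in \mathcal{P}_n} \; \max_{x \in [a,b]} \lvert f(x) - p(x) \rvert,$$

whereas a Taylor polynomial about $x_0$ only guarantees $f(x) - p(x) = O\bigl(\lvert x - x_0 \rvert^{\,n+1}\bigr)$ as $x \to x_0$.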

There is nothing wrong as such with approximating a function around some point without worrying too much about error further away, but this is a choice you have to make and for a general purpose sine or cosine approximation it's usually the wrong choice, because typically you want your sines and cosines to be usable over their whole range and not just close to a single point.

Post

mystran wrote: Sat Mar 26, 2022 8:54 pm
To clarify this point... Taylor approximations (or Padé, for functions poorly approximated by polynomials) are great when you want a lot of precision around a single point and don't care about larger error away from that point. When that's the case, you can often get away with a lower-order approximation.
....
Yes.

Taylor approximation:

Code: Select all

function taylor_sin(x::AbstractFloat)
    # Same 9th-order Taylor series for sin, but written via o = p/x, so each
    # term is computed as p^k / (o^k * k!) instead of x^k / k!; mathematically
    # identical to the commented-out form below, but the roundings differ.
    p = 3.141592653589793238462643383279502884
    o = p/x
    return p/o - (p^3/(o^3*6)) + (p^5/(o^5*120)) - (p^7/(o^7*5040)) + (p^9/(o^9*362880)) # - (p^11/(o^11*39916800))
    #return x - 1/6*x^3 + 1/120*x^5 - 1/5040*x^7 + 1/362880*x^9
end
works well too in the range [-pi/4, pi/4]:

Code: Select all

Tol debug failed 0.0% of the time.
sin
ULP max 0.5000696182250977 at x = 0.23337582
ULP mean 0.2499294509920759
Test Summary: | Pass  Total
Float16 sin   |    1      1
Tol debug failed 0.0% of the time.
sin
ULP max 0.5000696182250977 at x = 0.23337582
ULP mean 0.2499294509920759
Test Summary: | Pass  Total
Float32 sin   |    1      1
Tol debug failed 0.0% of the time.
taylor_sin
ULP max 0.029365498572587967 at x = 0.7853982
ULP mean 0.002467445844114846
Test Summary:      | Pass  Total
Float16 taylor_sin |    1      1
Tol debug failed 0.0% of the time.
taylor_sin
ULP max 0.029365498572587967 at x = 0.7853982
ULP mean 0.002467445844114846
Test Summary:      | Pass  Total
Float32 taylor_sin |    1      1
This time I compared against sin(map(big, x)...) .

BTW, somehow the commonly used Taylor polynomial form (x - 1/6*x^3 + 1/120*x^5 - 1/5040*x^7 + 1/362880*x^9) gives slightly different results:

Code: Select all

Tol debug failed 0.0% of the time.
taylor_sin
ULP max 0.13420990109443665 at x = -0.78216547
ULP mean 0.012377569464872914
Test Summary:      | Pass  Total
Float16 taylor_sin |    1      1
Tol debug failed 0.0% of the time.
taylor_sin
ULP max 0.13420990109443665 at x = -0.78216547
ULP mean 0.012377569464872914
Test Summary:      | Pass  Total
Float32 taylor_sin |    1      1

Post

juha_p wrote: Sun Mar 27, 2022 7:17 am
mystran wrote: Sat Mar 26, 2022 8:54 pm
To clarify this point... Taylor approximations (or Padé, for functions poorly approximated by polynomials) are great when you want a lot of precision around a single point and don't care about larger error away from that point. When that's the case, you can often get away with a lower-order approximation.
....
Yes, Taylor approximation:

[...]

works well in the range [-pi/4, pi/4]:
This is not what "around a point" means. Around a point (e.g. zero) means that you're seeking the "best" approximation in the range [-eps, eps] where eps tends to zero in the limit sense. It is really fundamental to understand that we're literally talking about limits here.

Your [-pi/4, pi/4] is not "around a point" (in any mathematical sense) but rather a range. If you want to approximate a function within a range, then for any chosen order you will always(*) get a smaller maximum error within that range by doing a minimax fit instead. Usually such an approximation won't just be better, it will be much better.

The other thing to keep in mind is that monomials in particular are terrible when it comes to numerical behaviour. When you're computing a high-order approximation in actual finite precision floating point, the result will typically be nowhere near as good as it theoretically should be. This is what I believe 2DaT tried to point out: what you really want to measure is the error between your approximation computed with the actual machine precision you plan to use and a much more precise "known good" implementation computed in higher precision.

In other words, if you're trying to do an approximation with single-precision floats, then what you want to measure is the accuracy of your approximation computed in single-precision arithmetic against something like a library function computed in doubles. This is because you're not just trying to measure your theoretical approximation, but also the numerical accuracy of the actual implementation.

The second one is important, because monomials in particular tend to be rather poor when it comes to numerical behaviour and it gets worse quite rapidly as the order increases. Rewriting in Horner's (or Estrin's, etc) form can improve things slightly, but there can still be somewhat of a tradeoff between trying to get a good theoretical approximation and trying to keep the order low enough that the numerical issues don't negate any gains from increasing the order (not to mention that a higher order approximation obviously also increases the computation cost).
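
As a concrete example of what such a rewrite looks like (illustrative, not from the post), here is the same degree-7 odd polynomial in an Estrin-style form, which shortens the dependency chain relative to straight Horner evaluation; c1..c7 are whatever coefficients the fit produced:

Code: Select all

// Estrin-style evaluation of c1*x + c3*x^3 + c5*x^5 + c7*x^7 (sketch).
float sin_poly7_estrin(float x, float c1, float c3, float c5, float c7)
{
    const float x2 = x * x;
    const float x4 = x2 * x2;
    const float lo = c1 + c3 * x2;   // c1 + c3*x^2
    const float hi = c5 + c7 * x2;   // c5 + c7*x^2
    return x * (lo + hi * x4);       // x * (c1 + c3*x^2 + c5*x^4 + c7*x^6)
}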

(*) Assuming the function is not a polynomial of the same or lower order, in which case it could obviously be represented exactly.

ps. Also, given that 1 ulp is the spacing between two floating point numbers, the theoretical best maximum error you can possibly ever get is .5 ulps (i.e. the exact value is halfway between the closest two representable floating point numbers). If you're measuring any value less than this, then your measurement is basically guaranteed to be wrong.
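
For reference, the ULP error such tests report is essentially (my formulation; the exact tool may differ in details)

$$\operatorname{err}_{\mathrm{ulp}}(x) = \frac{\lvert \hat{f}(x) - f(x) \rvert}{\operatorname{ulp}\bigl(f(x)\bigr)},$$

where $\hat{f}$ is the approximation and $f$ the exact value, so a correctly rounded result is at most 0.5 ulp away.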

Post

mystran wrote: Sun Mar 27, 2022 9:22 am
...
Sorry that you misunderstood my post, written in a hurry ... believe it or not, I already knew the difference between "around a point" and "range" :?

On previous page I wrote:
I have not managed to test the ULP error against the library sin() or cos() functions, but by the Julia Function Accuracy Test tool it looks like these are "excellent" (whatever this ULP test means) approximation methods (add the code below to the runtest.jl module which belongs to the Julia test suite):
Actually, I could not get the float approximation tested faithfully against the double-precision library function, because the function is not giving correct return values:

Code: Select all

For example, x = -0.7853982f0 gives

Mathematica: sin(x): = -0.707106807068459559730712471249650919086696271494555843865
Julia: 	  sin(x...): = -0.70710677 
 sin(map(big, x)...) = -0.7071067966408574982218873068639201062404313604169267415232690981467895117867091
I have to learn the Julia language a bit... so I can get correct results.

Post

juha_p wrote: Sun Mar 27, 2022 8:38 pm Sorry that you misunderstood my post, written in a hurry ... believe it or not, I already knew the difference between "around a point" and "range" :?
So why do you only care about limit behaviour around a single point?

Post

juha_p wrote: Sun Mar 27, 2022 8:38 pm
Actually, I could not get the float approximation tested faithfully against the double-precision library function, because the function is not giving correct return values:

...

I have to learn the Julia language a bit... so I can get correct results.
I think I have now found the answer to this matter.

Code: Select all

x = -0.7853982f0 --> 
x = -0.785398185253143310546875 -->  ( https://www.h-schmidt.net/FloatConverter/IEEE754.html )
sin(x)=-0.70710679664085749822188730686392010624043136041692674152326909814678951178671103
which matches well:

Code: Select all

Julia sin(map(big, -0.7853982f0)...) =
-0.7071067966408574982218873068639201062404313604169267415232690981467895117867091
Mathematica sin(-0.785398185253143310546875) = 
-0.7071067966408574982218873068639201062404313604169267415232690981467895117867110
This makes it hard to test a range exactly, because, for example, the range [-3.1415926, 3.1415926] actually becomes [-3.141592502593994140625, 3.141592502593994140625] in single precision.
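
A quick way to see the exact value a float literal actually stores (what the linked converter shows); a small sketch, assuming an implementation such as glibc that prints exact decimal expansions:

Code: Select all

#include <cstdio>

int main()
{
    float x = -0.7853982f;              // nearest float to the decimal literal
    std::printf("%.30f\n", (double)x);  // float -> double is exact;
                                        // prints -0.785398185253143310546875...
    return 0;
}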

Tested also against library sin:

approximation vs library sin

test_sin() as listed earlier

Code: Select all

pi/4  (3.1415926535897932)
Tol debug failed 0.0% of the time.
sin
ULP max 0.5000696182250977 at x = 0.23337582
ULP mean 0.2499294509920759
Test Summary: | Pass  Total
Float16 sin   |    1      1
Tol debug failed 0.0% of the time.
sin
ULP max 0.5000696182250977 at x = 0.23337582
ULP mean 0.2499294509920759
Test Summary: | Pass  Total
Float32 sin   |    1      1

Tol debug failed 0.0% of the time.
test_sin
ULP max 0.5000757575035095 at x = 0.78457195
ULP mean 0.24992944972881506
Test Summary:    | Pass  Total
Float16 test_sin |    1      1
Tol debug failed 0.0% of the time.
test_sin
ULP max 0.5000757575035095 at x = 0.78457195
ULP mean 0.24992944972881506
Test Summary:    | Pass  Total
Float32 test_sin |    1      1

pi (3.1415)

Tol debug failed 0.0% of the time.
sin
ULP max 0.5008804798126221 at x = -0.930309
ULP mean 0.24972600217893073
Test Summary: | Pass  Total
Float16 sin   |    1      1
Tol debug failed 0.0% of the time.
sin
ULP max 0.5008804798126221 at x = -0.930309
ULP mean 0.24972600217893073
Test Summary: | Pass  Total
Float32 sin   |    1      1

Tol debug failed 0.0% of the time.
test_sin
ULP max 3.3329622745513916 at x = -3.1414747
ULP mean 0.24995011772343917
Test Summary:    | Pass  Total
Float16 test_sin |    1      1
Tol debug failed 0.0% of the time.
test_sin
ULP max 3.3329622745513916 at x = -3.1414747
ULP mean 0.24995011772343917
Test Summary:    | Pass  Total
Float32 test_sin |    1      1
float test_sin()

Code: Select all

pi/4
Tol debug failed 0.0% of the time.
test_sin
ULP max 3.5467605590820312 at x = -0.12518394
ULP mean 0.5468282984622705
Test Summary:    | Pass  Total
Float16 test_sin |    1      1
Tol debug failed 0.0% of the time.
test_sin
ULP max 3.5467605590820312 at x = -0.12518394
ULP mean 0.5468282984622705
Test Summary:    | Pass  Total
Float32 test_sin |    1      1

Post

This may or may not be useful (it could be if you're simply looking for a good approximation), but it could be interesting given the Taylor vs. minimax question. Last year, while reading up on approximations and experimenting a little as a hobby, I made a program which minimizes error by tweaking a Taylor polynomial to make it more minimax-y. If all scaling coefficients for the polynomial are optimized, the result is a plain minimax approximation -- but it's also possible to skip one or more of them, for an in-between result more like Taylor closer to the middle (as in the first graph in the article).

My program does not scale well when searching for minimax approximations of larger degrees, because each time another coefficient is to be narrowed down, another dimension is added to the search, making it an order of magnitude slower. But for low-order stuff, it can be used instead of proper techniques like the Remez algorithm. My program takes a few minutes running on a laptop from 2016 to minimax-ify a Taylor polynomial of degree 7 (which has 4 terms, thus 4 scaling factors). But it's not user-friendly: to change what's searched for, you need to hardcode the test function and some parameters.
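
For what it's worth, the quantity such a coefficient search tries to drive down can be measured with a simple brute-force sweep; a sketch (not the poster's program), assuming a degree-7 odd polynomial with coefficients c[0..3] for x, x^3, x^5, x^7:

Code: Select all

#include <algorithm>
#include <cmath>

// Maximum absolute error of x*(c0 + x^2*(c1 + x^2*(c2 + x^2*c3))) against
// sin(x), sampled on a grid over [-pi/2, pi/2].
double max_error_deg7(const double c[4], int samples = 10000)
{
    const double half_pi = 1.5707963267948966;
    double worst = 0.0;
    for (int i = 0; i <= samples; ++i)
    {
        const double x  = -half_pi + 2.0 * half_pi * i / samples;
        const double x2 = x * x;
        const double p  = x * (c[0] + x2 * (c[1] + x2 * (c[2] + x2 * c[3])));
        worst = std::max(worst, std::fabs(p - std::sin(x)));
    }
    return worst;
}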

A sin(x) approximation of degree 7 is the lowest that allows for a fully clean 16-bit audio signal to be generated, I found, for input values from -PI/2 to +PI/2. What does it take to reduce the error from the level in a Taylor approximation? Optimize the coefficient for the highest-order term alone, and error is reduced to less than a tenth. Do the two highest, and it gets 5 times better than that. Do three, and the error is then a third or so of that. Four (that's all of them), and it's a little lower still.

If you want to use approximations from the article, the last two printouts (one for degree 5, one for degree 7) are from the "long double" version of the program, and a little better than the earlier ones. Their graphs and final numbers are for the tweak I did to remove all error at +/- PI/2 at the cost of doubling the maximum error. For plain minimax, instead pick the numbers above the last part of those printouts.

I found it interesting that the final tweaked degree 5 polynomial I ended up with, which has no error for its biggest and smallest inputs +PI/2 and -PI/2, adds a 5th harmonic at -84 dB or so, but otherwise makes for a clean signal (when analyzing a 16-bit WAV file).

I think it should be possible to generate minimax approximations which instead use a slightly larger domain in order to place zero error at +/- PI/2, instead of removing error there by rescaling as I did, but that would require a different program. If done, that should be a little better in terms of less maximum error than my final tweaked/compromise approximations.

Post

18 months later, I quickly implemented this using SSE instructions but left the two loops undone (I actually don't know how loops are controlled in the SSE world). The implementation is the sincos version from here.

https://godbolt.org/z/veb5vY661

Sorry, but I could not properly test the performance or accuracy ATM (and my coding is crappy).

Post

OK, I'm trying to make the loops work, but there's something I don't get right in the double-angle identity loop calculations. I think the multipliers are computed correctly, but the summing does not go as it should in cases where multiplier(s) are involved.

https://godbolt.org/z/T5r1xPv8f

When the f parameter is set to, let's say, 0.75, the results are correct; f = 0.5 gives a few correct results, and the smaller the number, the more looping is done and the more wrong the results are (if the multiplier was 0 from the beginning, the result for that element is correct). It may be a simple thing I'm doing wrong, either because I don't understand what the code does or because I've been staring at the issue for too long.

Are there any SSE pros online to help with this?

Post

juha_p wrote: Wed Dec 13, 2023 1:15 pm OK, I'm trying to make the loops work, but there's something I don't get right in the double-angle identity loop calculations.

...
Vectorized code can be a great tool to achieve performance, but it is also harder to write. If you can't write good intrinsics, start with plain C++ and go from there.

Besides, what is the ultimate point of reimplementing a function that is already in the standard library?

Post

2DaT wrote: Wed Dec 13, 2023 7:34 pm Vectorized code can be a great tool to achieve performance, but it is also harder to write. If you can't write good intrinsics, start with plain C++ and go from there.

Besides, what is the ultimate point of reimplementing a function that is already in the standard library?
That's just what I've done and tried to do :D ... I did it first with Matlab/Octave, then with plain C++, and am now trying to translate it to SIMD ... with full learning mode ON (first get it working right, then look at how to improve).

Hmm... in what standard library?

I think the issue is the following, and it is located in the double-angle identity loop:

Let's say f = 0.25 and the input x vector to be calculated is

{0.5, 1.01, -0.23, 0.99}

The x values are halved before the loop.
After the repeated-halving loop (where every element with x[n] > f[n] is halved as many times as needed to get x[n] <= f[n]), the k's (in kvec) for the x vector elements are:

{0, 2, 0, 1}

so, in the double-angle identity loop, one element needs to go through the calculation twice, one element once, and the remaining two skip it.

Loop control is handled by kvec, so that the kvec value decreases by one after each loop iteration. The kvec value (1 or 0) is used as a multiplier in the calculation.

The problem is how to sum the values in the loop, because the loop is executed as many times as the highest k value (twice in this example), while the kvec values (i.e. the multipliers 1 or 0) on each round are:

{0 1 0 1}
{0 1 0 0}
{0 0 0 0}

i.e. the last kvec element zeroes the result in the 2nd calculation round, and the 2nd element does it too if we go to a 3rd round. So I would need to save those values before they are zeroed. I do it now by summing the values, but that is done wrongly, because the final result is wrong. By using _mm_max_ps() in the iteration loop, all positive inputs are calculated correctly, but negative ones then come out as -0, so the function works for positive values only ATM.
Any suggestions on how to save the result correctly?
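
Not the poster's code, but one standard way to handle per-lane iteration counts without zeroing finished lanes is to blend: keep applying the double-angle step only where the counter is still positive and carry the old value elsewhere. A sketch assuming SSE4.1 and hypothetical names s, c (current sin/cos of the reduced arguments), kvec (per-lane doubling counts stored as floats) and maxloops:

Code: Select all

#include <immintrin.h>

// Apply the double-angle identities kvec[i] times in each lane, freezing
// lanes whose counter has reached zero instead of multiplying them away.
static inline __m128 double_angle_lanes(__m128 s, __m128 c, __m128 kvec, int maxloops)
{
    const __m128 one  = _mm_set1_ps(1.0f);
    const __m128 zero = _mm_setzero_ps();

    for (int i = 0; i < maxloops; ++i)
    {
        const __m128 active = _mm_cmpgt_ps(kvec, zero);   // lanes still iterating

        const __m128 s2 = _mm_mul_ps(_mm_add_ps(s, s), c);                      // sin(2x) = 2*s*c
        const __m128 c2 = _mm_sub_ps(_mm_mul_ps(_mm_add_ps(c, c), c), one);     // cos(2x) = 2*c^2 - 1

        s = _mm_blendv_ps(s, s2, active);                  // keep old value in finished lanes
        c = _mm_blendv_ps(c, c2, active);

        kvec = _mm_sub_ps(kvec, _mm_and_ps(active, one));  // decrement active lanes only
    }
    return s;   // sine; the cosine is left in c
}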

Post

After a good night's sleep I had an answer in mind to get the function to work correctly. What was needed was just to do all the calculations with positive input values, so I had to take the signs of the input values and restore them before the final calculation. This answer is based on my own code given in the post before, but I made the changes @PeterCordes on Stack Overflow suggested to get the performance improved. I have not yet tested whether performance is worse or better compared to the std library functions.

https://godbolt.org/z/qsjWsaPcK (cleaned: https://godbolt.org/z/h7qdf5zrq)

Here is the version based on Peter Cordes's suggestions.

https://godbolt.org/z/T177YYPPr (cleaned: https://godbolt.org/z/csK19EEP3 )

EDIT: Performance depends on how many loops are run. I measured about a 20% speed increase over std::sin() when maxloops=10. This was for my version. I don't know what is fed as input values in 2DaT's rdtsc/val test routine, so it may exceed 10 or not ... I also don't know what would be the best way to pass the input parameter to the test routine. I made it this way:

Code: Select all

void testProc2(const float * __restrict in, float * __restrict out, size_t size)
{
    __m128 f = _mm_set1_ps(0.25f);
    for (size_t i = 0; i < size; i += 4)
    {
        // note: _mm_set1_ps broadcasts the single scalar in[i]+i to all four
        // lanes rather than loading four consecutive inputs
        _mm_store_ps(out + i, sincos_sse(_mm_set1_ps(in[i] + i), f, true, 10));
    }
}
(which made Peter's version much slower?)

Post

juha_p wrote: Fri Dec 15, 2023 6:09 am EDIT: Performance depends on how many loops are run. I measured about a 20% speed increase over std::sin() when maxloops=10.

...
Use this one and don't bother with implementing "library" sin.
You are unlikely to write anything better.

viewtopic.php?p=8789508#p8789508

If it's for an oscillator, you never ever need to do range reduction, because phasors are limited by definition.

If it's for a waveshaper and you need large ranges, you can do some bit-shifting trickery (viewtopic.php?p=7032074#p7032074) - something like that one, with the corresponding polynomials defined on [-0.5,0.5].
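
Not necessarily the trick in the linked post, but one common branch-free variant of that idea: treat the argument as a period-1 phase and subtract the nearest integer, so the remainder lands in [-0.5, 0.5] where the fitted polynomial applies. A sketch, assuming the default round-to-nearest mode and arguments small enough to fit in int32:

Code: Select all

#include <immintrin.h>

// Reduce a period-1 phase to [-0.5, 0.5] without branches (SSE2).
static inline __m128 reduce_phase(__m128 x)
{
    const __m128 n = _mm_cvtepi32_ps(_mm_cvtps_epi32(x));  // round to nearest integer
    return _mm_sub_ps(x, n);
}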

All these "iteration" based formulas are total garbage. And I see 6 _mm_div_ps intrinsics which means you don't really understand the performance implication of using certain intrinsics (the division is much much slower than muls and adds).

And in general, if a simple function is more than 20 instructions, it's probably garbage. Look at how the 2 functions from my post are constructed and do something similar: no loops, at most 1 division per function, polynomials of reasonable order, no branching, etc.
