KVR Audio

Z1202 · Post by **Z1202** » Wed Feb 13, 2019 9:34 am

mystran wrote: ↑Tue Feb 12, 2019 8:36 pm
2DaT wrote: ↑Tue Feb 12, 2019 6:11 pm I did my own tests.
Code: Select all
Factored: 3.8990848138928413ulp
Horner: 0.76185148768126965ulp
The error is not dramatic, but not neligible either. And i have a feeling that it will be worse for higher orders.
I'm curious, did you solve the factors from the monomial, or build the monomial from the factors?

Besides that, there is a question what was the comparison reference: were both forms compared to a higher-precision version of the respective formula (Horner to Horner, factored to factored), or was a higher-precision Horner form always the reference. This is related to the question of the error source: precision losses during computation or coefficient quantization.

Anyway, FWIW, I personally usually wouldn't have any second thoughts using a 4ulp precise routine. If this precision is insufficient, then probably the representation precision is not sufficient for the surrounding computation either. IIRC some implementations of the C math library have lower precision.

Returning to the original question: I wonder if the factored form of e.g. this specific polynomial will be faster than Horner form (due to better parallelizability) and than Estrin's form (due to having less operations) and whether the answer depends on the FMA support

Maybe one day I'll bother checking this, so far my interest is mostly academic.

2DaT · Post by **2DaT** » Wed Feb 13, 2019 2:08 pm

Z1202 wrote: ↑Wed Feb 13, 2019 9:34 am Besides that, there is a question what was the comparison reference: were both forms compared to a higher-precision version of the respective formula (Horner to Horner, factored to factored), or was a higher-precision Horner form always the reference. This is related to the question of the error source: precision losses during computation or coefficient quantization.

Anyway, FWIW, I personally usually wouldn't have any second thoughts using a 4ulp precise routine. If this precision is insufficient, then probably the representation precision is not sufficient for the surrounding computation either. IIRC some implementations of the C math library have lower precision.

Returning to the original question: I wonder if the factored form of e.g. this specific polynomial will be faster than Horner form (due to better parallelizability) and than Estrin's form (due to having less operations) and whether the answer depends on the FMA support Maybe one day I'll bother checking this, so far my interest is mostly academic.

Yes, I tested it against double precision Horner's form. I redid the test with Horner-Horner, factored-factored.

Code: Select all

Horner: 0.76ulp
Factored: 3.12ulp

Coefficient conditioning does matter, because coefficients have finite precision. Even if you do root-finding with extended precision, rounding alters the roots, and the polynomial is not the same. For Horner's scheme it's more tolerable, because of "decaying" coefficients.
Actually there may be the cases where factored form does better than Horner's, but even then factored form won't be 1ulp accurate.

Performance tests for those who are curious.

Code: Select all

float Factored(float x)
{
	return 0.00015403599537f*(x + 5.14278588728699f)*
		((x - 2.522928813467399f)*x + 32.9484316700106f)*
		((x + 6.073420085609848f)*x + 26.55645202416117f);
}
float Horner(float x)
{
	return ((((0.00015403599537f*x + 0.00133907760028f)*x + 0.00961823761463f)*x
		+ 0.05550357326865f)*x + 0.24022649228573f)*x + 0.69314718246460f;
}
float Estrin(float x)
{
	float x2 = x * x;
	return (0.00015403599537f*x + 0.00133907760028f)*x2*x2 + (0.00961823761463f*x + 0.05550357326865f)*x2 + (0.24022649228573f*x + 0.69314718246460f);
}

Code: Select all

MSVC SSE2 Factored: 1.25586
MSVC SSE2 Horner: 1.48486
MSVC SSE2 Estrin: 1.38135
MSVC AVX2 FMA Factored: 0.503662
MSVC AVX2 FMA Horner: 0.459473
MSVC AVX2 FMA Estrin: 0.439453
Clang SSE2 Factored: 1.28906
Clang SSE2 Horner: 1.47046
Clang SSE2 Estrin: 1.38135
Clang FMA Factored: 0.45166
Clang FMA Horner: 0.375244
Clang FMA Estrin: 0.399902

As expected, Horner's form is great with FMA instructions.

Z1202 · Post by **Z1202** » Wed Feb 13, 2019 5:56 pm

2DaT wrote: ↑Wed Feb 13, 2019 2:08 pm
Code: Select all
SSE2 Factored: 1.25586
SSE2 Horner: 1.48486
AVX2 FMA Factored: 0.51001
AVX2 FMA Horner: 0.463867
Clang FMA Factored: 0.456055
Clang FMA Horner: 0.382324
As expected, Horner's form is great with FMA instructions.

Thanks for the figures! Did you use all FMA possibilities in the factored form (including factoring the leading coefficient into the first factor)?
Also would be nice to have a comparison to Estrin's scheme

2DaT · Post by **2DaT** » Wed Feb 13, 2019 7:23 pm

Z1202 wrote: ↑Wed Feb 13, 2019 5:56 pm Thanks for the figures! Did you use all FMA possibilities in the factored form (including factoring the leading coefficient into the first factor)?
Also would be nice to have a comparison to Estrin's scheme

See updated results.

Music Engineer · Post by **Music Engineer** » Thu Feb 14, 2019 7:48 pm

i guess guess i must misunderstand what these numbers show. i was assuming that it would be the maximum error in units of ulp - but then it couldn't be below 0.5, right? or am i missing something? or are these average errors?

Miles1981 · Post by **Miles1981** » Fri Feb 15, 2019 10:44 am

Music Engineer wrote: ↑Thu Feb 14, 2019 7:48 pm i guess guess i must misunderstand what these numbers show. i was assuming that it would be the maximum error in units of ulp - but then it couldn't be below 0.5, right? or am i missing something? or are these average errors?

Performance tests, not precision tests, I think!

Music Engineer · Post by **Music Engineer** » Sat Feb 16, 2019 8:13 pm

hmmm...maybe...but then - assuming we see cpu-cycles per computed value - shouldn't it be at least 1? or, well, ok, the throughput may be higher than that due to parallelization - but if these are the values for evaluating the full polynomial, they just look too good to be true. 0.5 cycles or less for evaluating a 5th order polynomial?

2DaT · Post by **2DaT** » Sat Feb 16, 2019 9:10 pm

Music Engineer wrote: ↑Sat Feb 16, 2019 8:13 pm hmmm...maybe...but then - assuming we see cpu-cycles per computed value - shouldn't it be at least 1? or, well, ok, the throughput may be higher than that due to parallelization - but if these are the values for evaluating the full polynomial, they just look too good to be true. 0.5 cycles or less for evaluating a 5th order polynomial?

Well, my cpu can do two 8-wide FMAs per cycle (Skylake). Horner's scheme for 5th order polynomial is 5 FMAs.

Code: Select all

(1/8)*(1/2)*5 = 0.3125

Close enough.

Music Engineer · Post by **Music Engineer** » Sat Feb 16, 2019 11:24 pm

ok - nice! thanks for clarification.

Estrin's scheme vs. factored polynomials