Accurate SSE2/FMA Tan for BLT prewarp.

DSP, Plugin and Host development discussion.

Post

2DaT wrote: Sun Aug 18, 2019 8:57 pm
mystran wrote: Sun Aug 18, 2019 8:29 pm
2DaT wrote: Thu Aug 15, 2019 12:50 pm I don't think that reading constants from memory is bad, since it will be cached and memory loads integrate with arithmetic.
Tried this augmented version, and it does OK. There are loads, but there is no register spilling which is good.
Actually given how few memory references the disassembly actually has, it's probably just a pure latency penalty here.
And it's "good" kind of latency, because it doesn't add to dependency chain. Any decent out-of-order cpu will issue those loads ahead of time, and by decent I mean anything that's better than Pentium 4.
yup

Post

I came up with a much better solution. It's also tan(x*pi), so you can compute the filter coefficient easily, like this:

Code: Select all

g = tanpi_blt(freq*sample_rate_inverse)
Functions:

Code: Select all


#include <immintrin.h>  // SSE2/AVX/FMA intrinsics

// Approximates tan(x*pi), x on (1e-35, 0.4999999). Useful for BLT prewarp.
// SSE2 instructions, 4-way parallel
inline __m128 tanpi_blt(__m128 x) {
  // R(x^2)*x is relative error minmax rational fit to tan(x*pi):
  // R(x) =
  // (-3.13832354545593x + 3.31913113594055) / (x^2 - 4.47475242614746x + 1.05651223659515)
  // ~4 ULP error
  const __m128 num0 = _mm_set1_ps(3.319131136e+00);
  const __m128 num1 = _mm_set1_ps(-3.138323545e+00);
  const __m128 den0 = _mm_set1_ps(1.056512237e+00);
  const __m128 den1 = _mm_set1_ps(-4.474752426e+00);

  const __m128 half = _mm_set1_ps(0.5f);
  const __m128 quarter = _mm_set1_ps(0.25f);
  const __m128 signmask = _mm_set1_ps(-0.0f);

  // We use this useful identity to compute on finite range of values:
  // tan(x - pi/2)  == 1 / -tan(x)
  // Since we compute tan(x*pi), identity is tanpi(x-0.5) == 1 / -tanpi(x)

  // Force the value into the finite range (-0.25,0.25)

  // Compute the mask: whether we compute using the identity (x > 0.25) or
  // directly (x <= 0.25)
  __m128 inverse_mask = _mm_cmpgt_ps(x, quarter);
  x = _mm_sub_ps(x, _mm_and_ps(inverse_mask, half));

  __m128 f2 = _mm_mul_ps(x, x);
  // Numerator, Horner's scheme
  __m128 num = _mm_add_ps(_mm_mul_ps(num1, f2), num0);
  // Andnot with sign mask to produce only positive results
  num = _mm_mul_ps(_mm_andnot_ps(signmask, x), num);

  // Denominator, Horner's scheme
  __m128 denom = _mm_add_ps(f2, den1);
  denom = _mm_add_ps(den0, _mm_mul_ps(denom, f2));

  // Since we already compute a rational function, we just need to swap
  // numerator and denominator to produce an inverse

  // If (mask) then swap(a,b)
  // Conditional swap with one simple trick...
  __m128 swap_xor = _mm_and_ps(inverse_mask, _mm_xor_ps(num, denom));
  num = _mm_xor_ps(num, swap_xor);
  denom = _mm_xor_ps(denom, swap_xor);

  return _mm_div_ps(num, denom);
}

// Approximates tan(x*pi), x on (1e-35, 0.4999999). Useful for BLT prewarp.
// AVX2+FMA instructions, 8-way parallel
inline __m256 tanpi_blt_fma(__m256 x) {
  const __m256 num0 = _mm256_set1_ps(3.319131136e+00);
  const __m256 num1 = _mm256_set1_ps(-3.138323545e+00);
  const __m256 den0 = _mm256_set1_ps(1.056512237e+00);
  const __m256 den1 = _mm256_set1_ps(-4.474752426e+00);
  const __m256 half = _mm256_set1_ps(0.5f);
  const __m256 quarter = _mm256_set1_ps(0.25f);
  const __m256 signmask = _mm256_set1_ps(-0.0f);

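  // Range reduction: for x > 0.25, use the identity tanpi(x - 0.5) == -1/tanpi(x)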
  __m256 inverse_mask = _mm256_cmp_ps(x, quarter, _CMP_GT_OS);
  x = _mm256_sub_ps(x, _mm256_and_ps(inverse_mask, half));

  __m256 f2 = _mm256_mul_ps(x, x);

  __m256 num = _mm256_fmadd_ps(f2, num1, num0);
  num = _mm256_mul_ps(_mm256_andnot_ps(signmask, x), num);

  __m256 denom = _mm256_add_ps(f2, den1);
  denom = _mm256_fmadd_ps(denom, f2, den0);

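  // Where the identity was used, swap numerator and denominator (blend selects)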
  __m256 denom_select = _mm256_blendv_ps(denom, num, inverse_mask);
  __m256 num_select = _mm256_blendv_ps(num, denom, inverse_mask);

  return _mm256_div_ps(num_select, denom_select);
}
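A minimal usage sketch, assuming four cutoff frequencies strictly between 0 and fs/2 (the helper name is just for illustration):

Code: Select all

// Sketch: compute g = tan(pi * fc / fs) for four filters at once.
// Assumes every fc is in (0, fs/2), so fc/fs stays inside tanpi_blt's input range.
inline void compute_g4(const float fc[4], float sample_rate, float g[4]) {
  __m128 x = _mm_mul_ps(_mm_loadu_ps(fc), _mm_set1_ps(1.0f / sample_rate));
  _mm_storeu_ps(g, tanpi_blt(x));
}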
Perf (compared to the old version; I also reduced the test size, so results differ slightly).

Code: Select all

Old tanpi4_blt     : 3.44653
New tanpi_blt      : 2.4563
Old tanpi4_blt_fma : 1.06836
New tanpi_blt_fma  : 0.902832
Error is around 3.6 ULPs, which is library quality.

Post

That's a beast!

BTW, how do you use the version in your 1st post?

Post

juha_p wrote: Tue Aug 20, 2019 4:33 pm That's a beast!

BTW, how do you use the version in your 1st post?
Don't use it, because the new version is better. :D

Post

thanks for sharing this. the numbers (performance and accuracy) look really good! :-)

Post

Edited the initial post, because the new version is faster, more accurate and more convenient.

Post

2DaT wrote: Tue Aug 20, 2019 5:19 pm Don't use it, because new version is better. :D.
Yep, but for learning purposes ...
I prepared this Padé-based tan(x*pi/8) approximation for [-0.25, 0.25] use, but I have not yet figured out how to use it ... :?

Code: Select all

tanpiper8=(x*(168892.0210569872844997517-2480.5021344239856140381*x^2))/(430080.0+x^2*(-28424.460675137352822243334+97.40909103400243723644*x^2))
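One way it could be hooked up, assuming the [-0.25, 0.25] range refers to the normalized frequency freq/fs, so the argument is x = 8*freq/fs and x*pi/8 = pi*freq/fs (that scaling, and the function below, are only a sketch):

Code: Select all

// Sketch: Horner evaluation of the Pade fit above.
// Assumes x = 8*freq/fs with freq/fs in [-0.25, 0.25], so the result approximates tan(pi*freq/fs).
inline float tanpiper8(float x) {
  float x2  = x * x;
  float num = x * (168892.0210569872845f - 2480.5021344239856f * x2);
  float den = 430080.0f + x2 * (-28424.460675137353f + 97.40909103400244f * x2);
  return num / den;
}

// g = tanpiper8(8.0f * freq / fs);  // BLT prewarp coefficient (sketch)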

Post

juha_p wrote: Tue Aug 20, 2019 7:57 pm Yep, but for learning purposes ...
I prepared this Padé-based tan(x*pi/8) approximation for [-0.25, 0.25] use, but I have not yet figured out how to use it ... :?
I wouldn't bother with Padé/Taylor approximations these days. Too inefficient.

Post

I wouldn't bother with Padé/Taylor approximations these days. Too inefficient.
i think, that depends on your goals. for filter tuning, it may actually be reasonable to favor accuracy in the low-frequency range - and taylor/pade approximations give the optimal approximation around the expansion point (in the sense of matching the maximum number of derivatives for given order), whereas minimax weights all errors the same everywhere in the range.

Post

Music Engineer wrote: Wed Aug 21, 2019 7:43 am
I wouldn't bother with Padé/Taylor approximations these days. Too inefficient.
i think, that depends on your goals. for filter tuning, it may actually be reasonable to favor accuracy in the low-frequency range - and taylor/pade approximations give the optimal approximation around the expansion point (in the sense of matching the maximum number of derivatives for given order), whereas minimax weights all errors the same everywhere in the range.
You can minimax with weighting or against relative error (or both), that's not really an issue. Sometimes you might prefer Taylor/Padé to avoid minimax ripple (i.e. trade a larger error for a more preferable error shape), but here the error is so tiny that it's essentially irrelevant for the intended purpose of tuning filters.

In fact, I keep wondering whether it would be possible to do a less accurate version even faster, because in practice somewhere around 14 mantissa bits would be plenty. Then again, 2DaT's second function is already very low order, so it's probably hard to get a significant speedup without getting rid of the division, and handling the [0.25, 0.5] range without a rational function is probably not going to work that well.

Post

You can minimax with weighting or against relative error (or both), that's not really an issue. Sometimes you might prefer Taylor/Padé to avoid minimax ripple
good point! bottom line: both kinds of approximations have their place - it all depends...

Post

2DaT wrote: Mon Aug 19, 2019 3:06 am ...

Code: Select all

g = tanpi_blt(freq*sample_rate_inverse)
Functions:
...

Is that multiply outside the function, before the call, faster than (or as fast as) multiplying inside the function (i.e., passing two parameters, fs and fc, in the call and multiplying using SSE/FMA ...)?

Is any help available with the tan(x*pi/8) usage I'm still struggling with?
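For reference, the in-function variant I mean would be something like this sketch (just a wrapper around the posted function; the name is only for illustration):

Code: Select all

// Sketch: multiply inside the function instead of at the call site.
inline __m128 tanpi_blt_fc(__m128 freq, __m128 sample_rate_inverse) {
  return tanpi_blt(_mm_mul_ps(freq, sample_rate_inverse));
}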
Last edited by juha_p on Wed Aug 28, 2019 3:23 am, edited 2 times in total.

