First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?

DSP, Plugin and Host development discussion.

Post

stratum wrote: Tue Dec 04, 2018 7:30 pm
Next time, I need to "reset" it again setting all of its values to mMin...
yes.
If that does not work well, IPP documentation is long and you can still look for a better option (if any exists) or switch to some hand written code.
It seems I can't do that, because the result gets stored in the temp buffer, and then I would have to copy it back to pValue.
Too much overhead:

Code: Select all

ippsSet_64f(mMin, mMinValues, blockSize);

// this computes pValue * mRange for each value, which is good!
// but then it adds the product to mMinValues and STORES it in mMinValues, not pValue
ippsAddProductC_64f(pValue, mRange, mMinValues, blockSize);
Doing something like this still works nicely:

Code: Select all

ippsMulC_64f_I(mRange, pValue, blockSize);
ippsAddC_64f_I(mMin, pValue, blockSize);
But I'm not using "packed" instructions so... :ud:
Last edited by Nowhk on Thu Dec 06, 2018 10:58 am, edited 1 time in total.

Post

how could that mul and add be parallel anyway, I couldn't understand, but maybe a cup of coffee might help.

p.s. perhaps you should not be using mMinValues at all.

Code: Select all

ippsSet_64f(mMin, pValue,  blockSize);
ippsAddProductC_64f(pValue, mRange, pValue, blockSize);
~stratum~

Post

stratum wrote: Thu Dec 06, 2018 10:52 am how could that mul and add be parallel anyway, I couldn't understand, but maybe a cup of coffee might help.
:) I meant "packed", sorry. Corrected.
stratum wrote: Thu Dec 06, 2018 10:52 am p.s. perhaps you should not be using mMinValues at all.

Code: Select all

ippsSet_64f(mMin, pValue,  blockSize);
ippsAddProductC_64f(pValue, mRange, pValue, blockSize);
pValue already holds different values, and I need to add TO it, not FROM it. Remember:

Code: Select all

pValue[sampleIndex] = mMin + pValue[sampleIndex] * mRange
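To make the difference concrete, here is a minimal plain-C++ sketch. `setAll` and `addProductC` are hypothetical stand-ins that mimic the behaviour described in this thread (fill with a constant, then `dst[i] += src[i] * c`); they are not IPP calls, and the sketch assumes that accumulate-into-destination semantics.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the "set" step: fill dst with a constant.
void setAll(double value, std::vector<double>& dst) {
    for (double& d : dst) d = value;
}

// Hypothetical stand-in for the accumulate step: dst[i] += src[i] * c.
void addProductC(const std::vector<double>& src, double c, std::vector<double>& dst) {
    assert(src.size() == dst.size());
    for (std::size_t i = 0; i < dst.size(); ++i) dst[i] += src[i] * c;
}

// Variant A: seed a scratch buffer with mMin, then accumulate pValue * mRange.
// The result lands in the scratch buffer, so a copy back would still be needed.
std::vector<double> viaScratch(const std::vector<double>& pValue,
                               double mMin, double mRange) {
    std::vector<double> scratch(pValue.size());
    setAll(mMin, scratch);
    addProductC(pValue, mRange, scratch);
    return scratch;
}

// Variant B: seed pValue itself with mMin. The input is overwritten before it
// is read, so every element ends up as mMin + mMin * mRange.
std::vector<double> viaOverwrite(std::vector<double> pValue,
                                 double mMin, double mRange) {
    setAll(mMin, pValue);
    addProductC(pValue, mRange, pValue);
    return pValue;
}
```

Variant A gives the intended `mMin + pValue[i] * mRange` but in the wrong buffer; variant B keeps the buffer but loses the input, which is why the in-place multiply followed by an in-place add is the simpler fit here.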
------------------------------------------------------------------------

Anyway, from this:

Code: Select all

pValue[sampleIndex] = exp((mMin + std::clamp(mSmoothedValues[sampleIndex] + pMod[sampleIndex], 0.0, 1.0) * mRange) * ln2per12);
to this:

Code: Select all

ippsAdd_64f(mSmoothedValuesVectorized, pMod, pValue, blockSize);
ippsThreshold_64f_I(pValue, blockSize, 0.0, ippCmpLess);
ippsThreshold_64f_I(pValue, blockSize, 1.0, ippCmpGreater);

ippsMulC_64f_I(mRange, pValue, blockSize);
ippsAddC_64f_I(mMin, pValue, blockSize);

ippsMulC_64f_I(ln2per12, pValue, blockSize);
ippsExp_64f_I(pValue, blockSize);
The whole app (all iterations) now runs in ~1.5 seconds, instead of ~7 seconds.

AWESOME!!!! :party:
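When splitting a fused expression into passes like this, it helps to keep a scalar reference around and compare the two. The sketch below does that in plain C++ with no IPP dependency: `fusedPitch` is the original per-sample expression and `stagedPitch` applies one simple loop per operation in the same order as the calls above. The function names are made up for illustration, and `ln2per12` is assumed to be ln(2) / 12.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumption: ln2per12 is ln(2) / 12, the usual semitone-to-ratio factor.
static const double kLn2Per12 = std::log(2.0) / 12.0;

// Fused reference: the original per-sample expression.
std::vector<double> fusedPitch(const std::vector<double>& smoothed,
                               const std::vector<double>& mod,
                               double mMin, double mRange) {
    assert(smoothed.size() == mod.size());
    std::vector<double> out(smoothed.size());
    for (std::size_t i = 0; i < out.size(); ++i) {
        const double x = std::min(std::max(smoothed[i] + mod[i], 0.0), 1.0);
        out[i] = std::exp((mMin + x * mRange) * kLn2Per12);
    }
    return out;
}

// Staged version: one simple pass per operation, in the order of the calls above.
std::vector<double> stagedPitch(const std::vector<double>& smoothed,
                                const std::vector<double>& mod,
                                double mMin, double mRange) {
    assert(smoothed.size() == mod.size());
    std::vector<double> v(smoothed.size());
    for (std::size_t i = 0; i < v.size(); ++i) v[i] = smoothed[i] + mod[i];  // add
    for (double& x : v) x = std::max(x, 0.0);   // clamp from below
    for (double& x : v) x = std::min(x, 1.0);   // clamp from above
    for (double& x : v) x *= mRange;            // scale
    for (double& x : v) x += mMin;              // offset
    for (double& x : v) x *= kLn2Per12;         // semitones to log-ratio
    for (double& x : v) x = std::exp(x);        // log-ratio to frequency ratio
    return v;
}
```

Each pass has no dependency between samples, which is what makes it a candidate for packed instructions. Results should match the fused version up to rounding, so compare with a small tolerance rather than exact equality.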

Post

@stratum: what about a "recursive" set of IPP functions? For example, if I have an exponential moving average ("1-pole smoothing filter") such as:

Code: Select all

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
  mSmoothedValues[sampleIndex] = z1 = inputA0 + z1 * b1;
}
I can't do AddC + MulC, since z1 depends on the previous value every time.

Is there some sort of IPP function for this as well? Or a built-in function ready out of the box?

Post

As far as I can see, these do not have any D.C. term corresponding to mMinValues, and it would be odd for any filter to have such a term anyway. http://www.dspguide.com/CH19.PDF
~stratum~

Post

Nowhk wrote: Mon Dec 10, 2018 9:46 am @stratum: what about a "recursive" set of IPP functions? For example, if I have an exponential moving average ("1-pole smoothing filter") such as:
Whether or not IPP has a function for doing this, your naive scalar code is almost certainly the fastest you can get, unless you can run several such filters in parallel. Breaking the serial dependency inherent in recursive filters generally involves at least log(n) parallel passes and that's never profitable on any CPU (not even close; it's quite tricky to make it profitable even on GPUs), because the SIMD architectures are far too narrow to make the parallel passes parallel enough.

Post

mystran wrote: Mon Dec 10, 2018 1:48 pm Whether or not IPP has a function for doing this, your naive scalar code is almost certainly the fastest you can get
Yes, but it doesn't use any SIMD functions, it's scalar :)
Can't I use SIMD on recursive data (i.e. getting the value from a previous computation)?

Post

Nowhk wrote: Mon Dec 10, 2018 4:54 pm
mystran wrote: Mon Dec 10, 2018 1:48 pm Whether or not IPP has a function for doing this, your naive scalar code is almost certainly the fastest you can get
Yes, but it doesn't use any SIMD functions, it's scalar :)
Can't I use SIMD on recursive data (i.e. getting the value from a previous computation)?
You can process multi channel data in that way.
~stratum~

Post

Nowhk wrote: Mon Dec 10, 2018 4:54 pm
mystran wrote: Mon Dec 10, 2018 1:48 pm Whether or not IPP has a function for doing this, your naive scalar code is almost certainly the fastest you can get
Yes, but it doesn't use any SIMD functions, it's scalar :)
Can't I use SIMD on recursive data (i.e. getting the value from a previous computation)?
"...unless you can run several such filters in parallel."

In other words: you can use SIMD to compute multiple (independent) recursive filters at the same time. You generally can't use SIMD in any useful way to compute one such recursive filter faster.
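As a rough illustration of "several filters at the same time", here is a portable sketch with no intrinsics. The `SmootherBank` struct and `smoothBank` function are hypothetical names, not taken from any library. The outer loop over samples stays serial; the inner loop over filters has no dependency between iterations, so it is the part a compiler (or hand-written SIMD) can process several lanes at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bank of independent one-pole smoothers, stored as
// struct-of-arrays so the same field of neighbouring filters is contiguous.
struct SmootherBank {
    std::vector<double> inputA0;  // per-filter input already multiplied by a0
    std::vector<double> b1;       // per-filter feedback coefficient
    std::vector<double> z1;       // per-filter state
};

// Writes sample-major output: out[sample * numFilters + filter].
void smoothBank(SmootherBank& bank, std::vector<double>& out, std::size_t blockSize) {
    const std::size_t n = bank.z1.size();
    assert(bank.inputA0.size() == n && bank.b1.size() == n);
    out.resize(blockSize * n);
    for (std::size_t s = 0; s < blockSize; ++s) {      // serial in time
        for (std::size_t f = 0; f < n; ++f) {          // independent across filters
            bank.z1[f] = bank.inputA0[f] + bank.z1[f] * bank.b1[f];
            out[s * n + f] = bank.z1[f];
        }
    }
}
```

Whether this beats N separate scalar loops depends on the compiler and on N, so it is worth measuring rather than assuming.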

Post

mystran wrote: Mon Dec 10, 2018 5:43 pm
Nowhk wrote: Mon Dec 10, 2018 4:54 pm
mystran wrote: Mon Dec 10, 2018 1:48 pm Whether or not IPP has a function for doing this, your naive scalar code is almost certainly the fastest you can get
Yes, but it doesn't use any SIMD functions, it's scalar :)
Can't I use SIMD on recursive data (i.e. getting the value from a previous computation)?
"...unless you can run several such filters in parallel."

In other words: you can use SIMD to compute multiple (independent) recursive filters at the same time. You generally can't use SIMD in any useful way to compute one such recursive filter faster.
:) I see.

Yes, I have different filters running in parallel (such as 10 parameters that are smoothed constantly), but each one has "its own path",
i.e. they have different fc/settings:

Code: Select all

// smooth params
for (int i = 0; i < mParamsSmoothed.GetSize(); i++) {
  mParamsSmoothed.Get(i)->SmoothBlock(blockSize);
}

...

inline void SmoothBlock(int blockSize) {
  double inputA0 = mNormalizedValue * mParamSmoother.a0;
  double b1 = mParamSmoother.b1;
  double z1 = mParamSmoother.z1;

  for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    mSmoothedValues[sampleIndex] = z1 = inputA0 + z1 * b1;
  }

  mParamSmoother.z1 = z1;
}
Is this what you mean by "parallel"?

Post

Nowhk wrote: Tue Dec 11, 2018 9:11 am Yes, I have different filters running in parallel (such as 10 parameters that are smoothed constantly), but each one has "its own path",
i.e. they have different fc/settings:
Different coefficients don't matter, those are data... but you do have to organise that data properly.
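A small SSE2 sketch of "coefficients are data": two smoothers with different settings share one 128-bit register, with each lane carrying its own input, b1 and state. `smoothPair` is a hypothetical helper written for this example. The intrinsics used (`_mm_loadu_pd`, `_mm_mul_pd`, `_mm_add_pd`, `_mm_storel_pd`, `_mm_storeh_pd`, `_mm_storeu_pd`) are standard SSE2 ones from `<emmintrin.h>`, and a scalar fallback is included for targets without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define SMOOTH_PAIR_SSE2 1
#else
#define SMOOTH_PAIR_SSE2 0
#endif

// Advance two independent one-pole smoothers together.
// Index 0 of each array belongs to the first parameter, index 1 to the second.
void smoothPair(const double inputA0[2], const double b1[2], double z1[2],
                std::vector<double>& outA, std::vector<double>& outB,
                std::size_t blockSize) {
    assert(inputA0 && b1 && z1);
    outA.resize(blockSize);
    outB.resize(blockSize);
#if SMOOTH_PAIR_SSE2
    const __m128d vIn = _mm_loadu_pd(inputA0);  // lane 0 and lane 1 inputs
    const __m128d vB1 = _mm_loadu_pd(b1);       // lane 0 and lane 1 coefficients
    __m128d vZ = _mm_loadu_pd(z1);              // lane 0 and lane 1 states
    for (std::size_t s = 0; s < blockSize; ++s) {
        vZ = _mm_add_pd(vIn, _mm_mul_pd(vZ, vB1));  // both filters step one sample
        _mm_storel_pd(&outA[s], vZ);                // low lane to first output
        _mm_storeh_pd(&outB[s], vZ);                // high lane to second output
    }
    _mm_storeu_pd(z1, vZ);
#else
    // Scalar fallback with the same results.
    for (std::size_t s = 0; s < blockSize; ++s) {
        outA[s] = z1[0] = inputA0[0] + z1[0] * b1[0];
        outB[s] = z1[1] = inputA0[1] + z1[1] * b1[1];
    }
#endif
}
```

The recurrence is still serial in time; the gain, if any, comes only from stepping two parameters per instruction, and splitting the lanes back into separate output buffers eats into it. Laying the outputs out interleaved would avoid that split.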

I would highly recommend you look into some SIMD programming tutorial and get the hang of exactly what can and cannot be done with SIMD. Even if you end up using a library, this will give you a much better idea of where to expect performance gains.

That said, I also want to point out that "constantly smoothing parameters" generally doesn't scale very well past a few parameters. Normally you would only want to smooth parameters that actually need smoothing (i.e. those that recently changed). While this can be quite a pain when it comes to keeping track of what to smooth at any given time, it can have a much larger impact on your performance than any vectorisation of those smoothing filters.

Post

mystran wrote: Tue Dec 11, 2018 4:13 pm I would highly recommend you look into some SIMD programming tutorial
Any good suggestions? The documentation I find on this topic is presented as some sort of "black magic", with lots of intrinsics code and no detailed explanation.
mystran wrote: Tue Dec 11, 2018 4:13 pm That said, I also want to point out that "constantly smoothing parameters" generally doesn't scale very well past a few parameters. Normally you would only want to smooth parameters that actually need smoothing (ie. those that recently changed).
Not really my case: for the plugin I'm making, params are moving constantly, without any sort of "pause". That tells me to plan for the worst case anyway.
mystran wrote: Tue Dec 11, 2018 4:13 pm While this can be quite a pain when it comes to keeping track of what to smooth at any given time, it can have a much larger impact on your performance than any vectorisation of those smoothing filters.
Not really a huge effort, I did it once: just save the last smoothed value at the end of a processed block and compare it with the first one of the next block. If they differ, I process the whole new block; otherwise I simply skip it. I.e. a single branch evaluation that could save lots of steps later.
But as I said, if I know that I need to smooth every time, it just adds useless overhead.
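For readers who do have parameters that come to rest, the skip described above could look roughly like this. It is a sketch, not the poster's code: `SkippingSmoother` and its members are invented names, it assumes a0 = 1 - b1 so the filter settles on the target, it assumes the same output buffer is reused from block to block, and it uses an assumed tolerance `epsilon` because a one-pole filter may never land on the target exactly in floating point.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical smoother that skips the per-sample loop once it has settled.
struct SkippingSmoother {
    double a0 = 0.1;         // assumes a0 = 1 - b1, so the output settles on target
    double b1 = 0.9;
    double z1 = 0.0;         // filter state carried across blocks
    double target = 0.0;     // value being smoothed towards
    double epsilon = 1e-9;   // assumed "close enough" threshold
    bool flat = false;       // true once the buffer already holds the target

    void setTarget(double t) {
        if (t != target) { target = t; flat = false; }
    }

    // Returns true if the buffer was written, false if the block was skipped.
    bool smoothBlock(std::vector<double>& out) {
        assert(!out.empty());
        const bool settled = std::fabs(z1 - target) <= epsilon;
        if (settled && flat) return false;   // one branch, nothing else to do
        if (settled) {                       // snap once, then skip from now on
            z1 = target;
            std::fill(out.begin(), out.end(), target);
            flat = true;
            return true;
        }
        const double inputA0 = target * a0;
        for (double& v : out) v = z1 = inputA0 + z1 * b1;
        return true;
    }
};
```

As noted in the post, this only pays off when parameters actually settle; if they move on every block, the branch is pure overhead.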

Post

Nowhk wrote: Wed Dec 12, 2018 12:24 pm
mystran wrote: Tue Dec 11, 2018 4:13 pm I would highly recommend you look into some SIMD programming tutorial
Any good suggestions? The documentation I find on this topic is presented as some sort of "black magic", with lots of intrinsics code and no detailed explanation.
The detailed information can be found here: https://software.intel.com/sites/landin ... sicsGuide/

Post

Or if you want to learn from actual code:
https://github.com/kurasu/surge/tree/master/src

He uses a lot of SSE intrinsics in DSP code; for example, here is a process_block() for an ADSR envelope in SSE2:
https://github.com/kurasu/surge/blob/ma ... Envelope.h
or a convolute() for a wavetable osc:
https://github.com/kurasu/surge/blob/ma ... llator.cpp

Post

mystran wrote: Wed Dec 12, 2018 2:22 pm
Nowhk wrote: Wed Dec 12, 2018 12:24 pm
mystran wrote: Tue Dec 11, 2018 4:13 pm I would highly recommend you look into some SIMD programming tutorial
Any good suggestions? The documentation I find on this topic is presented as some sort of "black magic", with lots of intrinsics code and no detailed explanation.
The detailed information can be found here: https://software.intel.com/sites/landin ... sicsGuide/
Yes, I've used that document often recently, trying to understand the underlying levels.
It's more about "intrinsics" than "SIMD" though.

Do you code directly with intrinsics/asm for some fixed instruction sets?

I'm curious: do you use only SSE2? Or your own instruction-set dispatch code?
Or ready-made libraries like IPP?
PurpleSunray wrote: Thu Dec 13, 2018 9:45 am Or if you want to learn from actual code:
https://github.com/kurasu/surge/tree/master/src

He uses a lot of SSE intrinsics in DSP code; for example, here is a process_block() for an ADSR envelope in SSE2:
https://github.com/kurasu/surge/blob/ma ... Envelope.h
or a convolute() for a wavetable osc:
https://github.com/kurasu/surge/blob/ma ... llator.cpp
Nice, thanks :tu: (again, and again, and again...)
