KVR Audio

Nowhk · Post by **Nowhk** » Fri Nov 30, 2018 8:04 am

stratum wrote: ↑Fri Nov 30, 2018 6:52 am On the other hand, I hope you are not comparing the execution speed of MSVC debug builds with IPP and assuming performance improvement!

I've used the same piece of code (except for the replaced functions above, obviously), in Release with the x86/x64 configurations, using optimization flags /O2 /Ot.

I tried buffer sizes from 64 to 256; in both cases the IPP version was about 2x faster.
The two become roughly "equal" around 32.
Below that, the "homemade" version is faster than IPP.
But I'm assuming that average buffers are larger than 32 samples.
Or at least, I'm only considering this case for now...

stratum wrote: ↑Fri Nov 30, 2018 6:52 am Depends on the value of blockSize. For very small values IPP function calls will have considerable overhead and intrinsics or custom assembly code could be better.

You suggested IPP to me, and other users also suggest using libraries.

The reason (I think) is that these libraries can dispatch to different SIMD instruction sets depending on the current CPU, which makes the plugins more accessible.

I could learn and become a pro at SSE2 intrinsics (not like you, of course), but what if my plugin runs on a machine without that instruction set? It will fail...
I think you are suggesting that the overhead introduced by wrapper libraries is the price to pay for making your builds more accessible on real-world hardware.

So some sort of dispatch table/indirection is always needed, and I believe (as stated) that IPP would do a better job than me on both dispatch and intrinsics; that's why I'm trying it.

Or, coming back to the original question, which sets do you use? A single one? Multiple? A templated version?
That's also why I opened the topic: to understand the average hardware and the support you give your audio customers.
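
To make the "dispatch table/indirection" idea concrete, here is a minimal sketch of a function pointer chosen once at startup. It is only an illustration and not how IPP works internally: clampScalar, clampSse2, cpuHasSse2 and selectClamp are made-up names, and the feature probe is a compile-time stand-in (real code would query CPUID at startup instead).

```cpp
#include <algorithm>
#include <cstddef>

// Scalar fallback: runs on any CPU.
static void clampScalar(double* data, std::size_t n, double lo, double hi) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] = std::min(std::max(data[i], lo), hi);
}

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_PATH 1

// SSE2 path: two doubles per iteration, scalar code for the leftover sample.
static void clampSse2(double* data, std::size_t n, double lo, double hi) {
    const __m128d vlo = _mm_set1_pd(lo);
    const __m128d vhi = _mm_set1_pd(hi);
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(data + i);
        v = _mm_min_pd(_mm_max_pd(v, vlo), vhi);
        _mm_storeu_pd(data + i, v);
    }
    clampScalar(data + i, n - i, lo, hi);
}

// Hypothetical probe (assumption): this build already targets SSE2, so it
// simply returns true. A real plugin would ask the CPU at startup.
static bool cpuHasSse2() { return true; }
#endif

using ClampFn = void (*)(double*, std::size_t, double, double);

// Pick the implementation once; callers then only pay for an indirect call.
ClampFn selectClamp() {
#if defined(HAVE_SSE2_PATH)
    if (cpuHasSse2())
        return clampSse2;
#endif
    return clampScalar;
}
```

A plugin would call selectClamp() once (for example when it is instantiated), store the pointer, and use it for every block.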

stratum · Post by **stratum** » Fri Nov 30, 2018 9:22 am

This instruction set question is not an issue with IPP. In the past IPP had issues with AMD processors and AMD filed a lawsuit and won. I guess one still cannot assume that Intel has the most optimal code for every processor out there, but one can expect a reasonable job I guess.

Unless your algorithm has a loop whose body operates on small chunks of data (that blockSize value above), IPP is OK and you can forget about the instruction set, intrinsics and assembly code.

Nowhk · Post by **Nowhk** » Fri Nov 30, 2018 9:53 am

stratum wrote: ↑Fri Nov 30, 2018 9:22 am This instruction set question is not an issue with IPP. In the past IPP had issues with AMD processors and AMD filed a lawsuit and won. I guess one still cannot assume that Intel has the most optimal code for every processor out there, but one can expect a reasonable job I guess.

So does everybody here use IPP (or other similar libraries) that does the work automatically? Or a single instruction set? Or do you write your own code with dispatch as well? Just curious.

stratum wrote: ↑Fri Nov 30, 2018 9:22 am IPP is OK and you can forget about the instruction set, intrinsics and assembly code.

Not sure PurpleSunray has the same opinion about this:

"I don't think he will gain anything by simply replacing arithmetic operations with IPP functions, it will rather be slower because you trade a cpu instruction for a library call (including all the overhead)."

PurpleSunray · Post by **PurpleSunray** » Fri Nov 30, 2018 9:59 am

Nah, you are doing it right, and that's why it is faster. The loop is implemented by IPP and operates on blockSize. ippsThreshold_64f_I is a hand-crafted piece of SIMD assembly code optimized for individual CPUs - it will always win the battle against std::clamp + compiler (if not, the Intel guys are doing something wrong).

I was worried about a for-loop in your code where +, - and * are replaced by ippsAdd_32f, ippsSub_32f and ippsMul_32f and called with blockSize=1
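
The pattern being warned about can be sketched like this. vecAdd is a hypothetical stand-in for a library vector primitive (it is not an IPP function), and the call counter exists only to make the number of library calls visible:

```cpp
// Hypothetical stand-in for a library vector primitive such as ippsAdd_64f.
// The counter only exists to make the number of calls visible.
static int g_vecAddCalls = 0;

void vecAdd(const double* a, const double* b, double* dst, int len) {
    ++g_vecAddCalls;
    for (int i = 0; i < len; ++i)
        dst[i] = a[i] + b[i];
}

// Anti-pattern: one library call per sample (len = 1), so the call
// overhead is paid blockSize times and nothing can be vectorized.
void addPerSample(const double* a, const double* b, double* dst, int blockSize) {
    for (int i = 0; i < blockSize; ++i)
        vecAdd(a + i, b + i, dst + i, 1);
}

// Intended use: one call per block, the loop lives inside the library.
void addPerBlock(const double* a, const double* b, double* dst, int blockSize) {
    vecAdd(a, b, dst, blockSize);
}
```

Both variants produce the same samples; the difference is one call per block versus blockSize calls.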

So does everybody here use IPP (or other similar libraries) that does the work automatically? Or a single instruction set? Or do you write your own code with dispatch as well? Just curious.

We used IPP back in the old days at Nero AG, and it worked fine on >100 million installations.

Nowhk · Post by **Nowhk** » Fri Nov 30, 2018 10:13 am

PurpleSunray wrote: ↑Fri Nov 30, 2018 9:59 am Nah, you are doing it right, and that's why it is faster. The loop is implemented by IPP and operates on blockSize. ippsThreshold_64f_I is a hand-crafted piece of SIMD assembly code optimized for individual CPUs - it will always win the battle against std::clamp + compiler (if not, the Intel guys are doing something wrong).

Ffff, saved this time

Yes, it also wins against MKL: if I do the same with MKL, it's way slower.
Also, ippsThreshold_LTValGTVal_64f_I is slower than two ippsThreshold_64f_I calls.

PurpleSunray wrote: ↑Fri Nov 30, 2018 9:59 am I was worried about a for-loop in your code where +, - and * are replaced by ippsAdd_32f, ippsSub_32f and ippsMul_32f and called with blockSize=1

Anyway, I only ever see separate add or multiply calls, never the two together.
I know SIMD can pack different operations into a single instruction.
Is there a sort of single ADD & PRODUCT intrinsic in IPP?

For something like this, following on from the previous example:

Code: Select all

pValue[sampleIndex] = mMin + pValue[sampleIndex] * mRange;

stratum · Post by **stratum** » Fri Nov 30, 2018 10:24 am

Nowhk wrote: ↑Fri Nov 30, 2018 9:53 am So everybody here use IPP (or other similar libraries) that does automatically the work? Or use single set? Or make your own with dispatch as well? Just curious

Well, I'm not a plugin developer and can't speak on their behalf, but my guess is that they do not use IPP.
One reason is that blockSize issue: you won't typically find very small vector sizes in image processing, but for audio they are entirely reasonable. For example blockSize=2 is a special case that corresponds to the number of channels. One can write custom code for that.

Another is that music-dsp programming is a speciality for which they even write books that depart from the literature. It's unlikely that IPP developers know about the algorithms in use. But for the basic 'first steps on vectorizing' stuff it should be usable.

PurpleSunray · Post by **PurpleSunray** » Fri Nov 30, 2018 10:32 am

Is there a sort of single ADD & PRODUCT intrinsic in IPP?

Sure, https://software.intel.com/en-us/ipp-de ... ddproductc
"Grouping" stuff together is the whole point of using IPP. The higher the level you can get, the better (assuming you will never beat an Intel coder at writing code for Intel CPUs). Next question would be: why do you need the ADD & PRODUCT? Is it some common DSP operation? Maybe it is in IPP already?

Your multiply-add is a good example of this. Such operations are very common; there is even a CPU instruction for it on newer CPUs (FMA, aka fused multiply-add). So you give all the options to IPP. Very old CPU? Run SSE2 with ADD+MUL. Not so old CPU? Use SSE4 ADD+MUL. New CPU? Use FMA. Or something like that...

PurpleSunray · Post by **PurpleSunray** » Fri Nov 30, 2018 10:55 am

Another is that music-dsp programming is a speciality for which they even write books that depart from the literature. It's unlikely that IPP developers know about the algorithms in use. But for the basic 'first steps on vectorizing' stuff it should be useable.

That one is easy: IPP only has arbitrary-order and biquad IIR.
Implementing the Chebyshev or analog ladder filter is still your job.

Edit: I must say that we used IPP mainly for image processing. I don't really know how helpful it is for audio in terms of pre-implemented algorithms.

stratum · Post by **stratum** » Fri Nov 30, 2018 11:08 am

PurpleSunray wrote: ↑Fri Nov 30, 2018 10:55 am Edit: I must say that we used IPP mainly for image processing. I don't really know how helpful it is for audio in terms of pre-implemented algorithms.

My poor memory recalls it's a project that was originally initiated to accelerate OpenCV, but I could be wrong. Audio is not their primary focus. In the past IPP had speech processing code, and they have removed it.

Nowhk · Post by **Nowhk** » Tue Dec 04, 2018 7:52 am

PurpleSunray wrote: ↑Fri Nov 30, 2018 10:32 am Sure, https://software.intel.com/en-us/ipp-de ... ddproductc
"Grouping" stuff together is the whole point about using IPP.

That's more like:

Code: Select all

pValue[sampleIndex] += pValue[sampleIndex] * mRange;

Rather than:

Code: Select all

pValue[sampleIndex] = mMin + pValue[sampleIndex] * mRange;

At least, if I don't pre-fill a buffer with the mMin values first.
But isn't that a bit redundant?
I would have to do it every time.

PurpleSunray wrote: ↑Fri Nov 30, 2018 10:32 am The higher the level you can get, the better (assuming you will never beat an Intel coder at writing code for Intel CPUs). Next question would be: why do you need the ADD & PRODUCT? Is it some common DSP operation? Maybe it is in IPP already?

I simply first add the (normalized) modulation amount to a (normalized) param, clamping the result:

Code: Select all

pValue[sampleIndex] = std::clamp(pStart[sampleIndex] + pMod[sampleIndex], 0.0, 1.0);
//become...
ippsAdd_64f(pStart, pMod, pValue, blockSize);
ippsThreshold_64f_I(pValue, blockSize, 0.0, ippCmpLess);
ippsThreshold_64f_I(pValue, blockSize, 1.0, ippCmpGreater);

Then I de-normalize it (i.e. map it to real values):

Code: Select all

pValue[sampleIndex] = mMin + pValue[sampleIndex] * mRange;
// become...
... I'm trying to understand...
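
One possible way to do this step without a pre-filled buffer is two in-place passes: scale by mRange, then offset by mMin. As far as I know IPP has in-place constant operations that look like a fit (ippsMulC_64f_I and ippsAddC_64f_I), but nobody in the thread confirms that mapping, so treat it as an assumption. The sketch below uses plain C++ stand-ins (scaleInPlace, offsetInPlace and denormalize are hypothetical names) so the arithmetic can be checked without IPP:

```cpp
// Hypothetical stand-ins for in-place "multiply by constant" and
// "add constant" vector primitives; plain loops, no IPP required.
void scaleInPlace(double val, double* srcDst, int len) {
    for (int i = 0; i < len; ++i)
        srcDst[i] *= val;
}

void offsetInPlace(double val, double* srcDst, int len) {
    for (int i = 0; i < len; ++i)
        srcDst[i] += val;
}

// pValue[i] = mMin + pValue[i] * mRange, done as two passes over the block.
void denormalize(double* pValue, int blockSize, double mMin, double mRange) {
    scaleInPlace(mRange, pValue, blockSize);
    offsetInPlace(mMin, pValue, blockSize);
}
```

Whether two passes beat a single hand-written loop at a given blockSize is something only a measurement can tell.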

Last (only if needed; for some params such as Pitch I do, as in my previous topic), I would need to apply an exp function:

Code: Select all

pValue[sampleIndex] = exp(pValue[sampleIndex] * ln2per12);
// become...
ippsExp_64f_I(pValue, blockSize); // missing the  * ln2per12

Not sure if there's a single IPP function for this; I really doubt it.
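
For the missing `* ln2per12`, one option is to scale the block in place first and then apply the exponential, which is the same as computing 2^(x/12). A minimal sketch with plain C++ stand-ins (scaleInPlace, expInPlace and semitonesToRatio are hypothetical names, not IPP functions):

```cpp
#include <cmath>

// Hypothetical stand-in for an in-place "multiply by constant" primitive.
void scaleInPlace(double val, double* srcDst, int len) {
    for (int i = 0; i < len; ++i)
        srcDst[i] *= val;
}

// Hypothetical stand-in for an in-place vector exp such as ippsExp_64f_I.
void expInPlace(double* srcDst, int len) {
    for (int i = 0; i < len; ++i)
        srcDst[i] = std::exp(srcDst[i]);
}

// pValue[i] = exp(pValue[i] * ln2per12), i.e. 2^(pValue[i] / 12).
void semitonesToRatio(double* pValue, int blockSize) {
    const double ln2per12 = std::log(2.0) / 12.0;
    scaleInPlace(ln2per12, pValue, blockSize);
    expInPlace(pValue, blockSize);
}
```

With this, an input of 12 (one octave up) maps to a ratio of about 2, and -12 to about 0.5.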

stratum · Post by **stratum** » Tue Dec 04, 2018 10:31 am

"pValue[sampleIndex] = mMin + pValue[sampleIndex] * mRange "
looks like "Set" followed by "AddProductC"

https://software.intel.com/en-us/ipp-dev-reference-set
https://software.intel.com/en-us/ipp-de ... ddproductc

Nowhk · Post by **Nowhk** » Tue Dec 04, 2018 11:17 am

stratum wrote: ↑Tue Dec 04, 2018 10:31 am "pValue[sampleIndex] = mMin + pValue[sampleIndex] * mRange "
looks like "Set" followed by "AddProductC"

https://software.intel.com/en-us/ipp-dev-reference-set
https://software.intel.com/en-us/ipp-de ... ddproductc

Yes, that's what I meant by "pre-fill" a vector of data. But since it's a temp, and AddProductC only has a pSrcDst (no separate destination), I need to pre-fill that temp vector every time I do this operation. That's why it seems "redundant" to me, no?

stratum · Post by **stratum** » Tue Dec 04, 2018 11:30 am

Nowhk wrote: ↑Tue Dec 04, 2018 11:17 am Yes, that's what I meant by "pre-fill" a vector of data. But since it's a temp, and AddProductC only has a pSrcDst (no separate destination), I need to pre-fill that temp vector every time I do this operation. That's why it seems "redundant" to me, no?

This one works:

Code: Select all

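	// In-place use: the same array is passed as both pSrc and pSrcDst.
	Ipp32f src[3];
	src[0] = 1.0f;
	src[1] = 1.1f;
	src[2] = 1.2f;
	// pSrcDst[i] += pSrc[i] * val, so here src[i] = src[i] + src[i] * 2.0f
	ippsAddProductC_32f(src, 2.0f, src, 3);
	for (int i = 0; i < 3; i++)
		std::cout << src[i] << std::endl;

But it is just another indication that the IPP documentation leaves a few things to be desired.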

Nowhk · Post by **Nowhk** » Tue Dec 04, 2018 6:52 pm

stratum wrote: ↑Tue Dec 04, 2018 11:30 am This one works:

Code: Select all

	Ipp32f src[3];
	src[0] = 1.0f;
	src[1] = 1.1f;
	src[2] = 1.2f;
	ippsAddProductC_32f(src, 2.0f, src, 3);
	for (int i = 0; i < 3; i++)
		std::cout << src[i] << std::endl;

But is just another indication of the fact that IPP documentation leaves a few things to be desired.

Uhm? You are overwriting src, so you can't use it as the buffer filled with mMin. Otherwise, the next time you do it, you need to refill src.

I mean, if I do ippsAddProductC_32f(pValue, mRange, pMin, blockSize), it will write the result into pMin.
Next time, I need to "reset" it again by setting all of its values to mMin...
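
The "Set followed by AddProductC" idea, and why the destination has to be refilled for every block, can be sketched with plain C++ reference loops. setRef and addProductCRef only mimic the semantics described in this thread (fill with a constant; srcDst[i] += src[i] * val), and denormalizeInto is a hypothetical helper, not an IPP function:

```cpp
// Reference semantics of a "Set" primitive: fill dst with a constant.
void setRef(float val, float* dst, int len) {
    for (int i = 0; i < len; ++i)
        dst[i] = val;
}

// Reference semantics of AddProductC as discussed in the thread:
// srcDst[i] = srcDst[i] + src[i] * val (it accumulates into srcDst).
void addProductCRef(const float* src, float val, float* srcDst, int len) {
    for (int i = 0; i < len; ++i)
        srcDst[i] += src[i] * val;
}

// pOut[i] = mMin + pValue[i] * mRange.
// pOut must be reset to mMin on every call, because the second step
// accumulates into whatever pOut already contains.
void denormalizeInto(const float* pValue, float mMin, float mRange,
                     float* pOut, int blockSize) {
    setRef(mMin, pOut, blockSize);
    addProductCRef(pValue, mRange, pOut, blockSize);
}
```

Skipping the setRef step on a later block would add the new product on top of the previous result, which is exactly the "reset" problem described above.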

stratum · Post by **stratum** » Tue Dec 04, 2018 7:30 pm

Next time, I need to "reset" it again by setting all of its values to mMin...

Yes.
If that does not work well, the IPP documentation is long and you can still look for a better option (if one exists) or switch to hand-written code.

First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?