KVR Audio

stratum · Post by **stratum** » Thu Nov 22, 2018 2:27 pm

Non-Intel CPUs do not fail, but Intel's code may not be optimal for them. There are also initialization functions that need to be called for proper execution. For IPP, that's ippInit (or maybe ippInitStatic), which selects the optimal code for the CPU in use. For MKL, I don't recall needing to call any initialization function, but one may exist; we rarely use MKL.

Nowhk · Post by **Nowhk** » Thu Nov 22, 2018 8:46 pm

stratum wrote: ↑Thu Nov 22, 2018 2:27 pm Non-Intel CPUs do not fail, but Intel's code may not be optimal for them. There are also initialization functions that need to be called for proper execution. For IPP, that's ippInit (or maybe ippInitStatic), which selects the optimal code for the CPU in use. For MKL, I don't recall needing to call any initialization function, but one may exist; we rarely use MKL.

Oh nice! It seems I don't need to make any extra effort, since by default it dispatches to the optimal function automatically, based on the CPU it is running on: https://software.intel.com/en-us/articl ... -functions

stratum · Post by **stratum** » Thu Nov 22, 2018 9:20 pm

If ippInit fails to detect an AMD processor correctly, you can also use this: https://software.intel.com/en-us/ipp-de ... pufeatures. It's not really practical to test, though, as you would need many different machines.

syntonica · Post by **syntonica** » Sun Nov 25, 2018 6:46 am

I played around with auto-vectorization using pragma hints, but the compiler just kinda laughed at my code and threw up its hands. Unfortunately, not much of what I do lends itself to vectorization. There's nothing I use multiply/adds for that uses constants; any scalars are usually computed per sample. I finally gave up. The compiler already gets me 40-50% on the Mac side of things. I forget what I get on the PC side, but it's quite a bit less, just due to the compiler. The MS defaults seem to do quite a bit of optimization automatically. In the end, I get about the same result (my PC specs are almost identical to my Mac specs).

That said, do learn about how your compiler works so you can set up your code for the compiler to take advantage of it. Keep things simple in your for loops, especially.

And I do recommend using a SIMD framework as it will save you a ton of time and hair loss if you do want to go down this road.

Nowhk · Post by **Nowhk** » Sun Nov 25, 2018 9:33 am

Yeah, in fact I'm trying IPP, which seems nice.

Do you link those libraries statically or dynamically? For the latter, I believe you include the DLLs in your installer bundle.

Any performance differences? Especially when dispatching...

Urs · Post by **Urs** » Sun Nov 25, 2018 11:07 am

Auto-vectorization *never* worked for me. Never. The only wonder I've ever seen was for this loop: for( int i = 0; i < 256; i++ ) var[ i ] = i; - the compiler did amazing things for this. Never figured out how it worked.

We use vector intrinsics wrapped into objects. This usually gives us 2x the performance over scalar code. I more and more use templated functions so that scalar code and vectorized code are identical, and I just implement for either float or float vector. This makes the scalar code a tad slower (no conditional branches), but then it's only there for reference anyway.

Wrapping intrinsics into objects has helped the transition from PowerPC to Intel, and possibly to ARM as well (whenever/if that's taking over).
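To illustrate the "intrinsics wrapped into objects" idea, here is a minimal sketch. It is not Urs's actual code: the `Float4` type, the `crossfade` function and the `crossfadeBlock` helper are all hypothetical names invented for this example. The wrapper uses SSE intrinsics when available and falls back to a plain array otherwise, so the same templated function compiles for both `float` and the 4-lane type.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define SKETCH_HAVE_SSE 1
#else
  #define SKETCH_HAVE_SSE 0
#endif

// Hypothetical 4-lane float wrapper; name and interface are illustrative only.
struct Float4 {
#if SKETCH_HAVE_SSE
    __m128 v;
    explicit Float4(__m128 x) : v(x) {}
    explicit Float4(float x) : v(_mm_set1_ps(x)) {}  // broadcast one value to all lanes
    static Float4 load(const float* p) { return Float4(_mm_loadu_ps(p)); }
    void store(float* p) const { _mm_storeu_ps(p, v); }
    friend Float4 operator+(Float4 a, Float4 b) { return Float4(_mm_add_ps(a.v, b.v)); }
    friend Float4 operator-(Float4 a, Float4 b) { return Float4(_mm_sub_ps(a.v, b.v)); }
    friend Float4 operator*(Float4 a, Float4 b) { return Float4(_mm_mul_ps(a.v, b.v)); }
#else
    // Portable fallback so the sketch also builds on non-x86 targets.
    float v[4];
    explicit Float4(float x) : v{x, x, x, x} {}
    static Float4 load(const float* p) {
        Float4 r(0.0f);
        for (int i = 0; i < 4; ++i) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const { for (int i = 0; i < 4; ++i) p[i] = v[i]; }
    friend Float4 operator+(Float4 a, Float4 b) { for (int i = 0; i < 4; ++i) a.v[i] += b.v[i]; return a; }
    friend Float4 operator-(Float4 a, Float4 b) { for (int i = 0; i < 4; ++i) a.v[i] -= b.v[i]; return a; }
    friend Float4 operator*(Float4 a, Float4 b) { for (int i = 0; i < 4; ++i) a.v[i] *= b.v[i]; return a; }
#endif
};

// One implementation, usable with T = float (reference) or T = Float4 (vectorized).
template <typename T>
inline T crossfade(T dry, T wet, T mix) {
    return dry + (wet - dry) * mix;
}

// Processes n samples four at a time; this sketch assumes n is a multiple of 4.
inline void crossfadeBlock(const float* dry, const float* wet, float mix,
                           float* out, std::size_t n) {
    assert(n % 4 == 0);
    const Float4 m(mix);
    for (std::size_t i = 0; i < n; i += 4)
        crossfade(Float4::load(dry + i), Float4::load(wet + i), m).store(out + i);
}
```

Calling `crossfade(1.0f, 3.0f, 0.5f)` exercises the scalar instantiation, while `crossfadeBlock` runs the same expression on four samples per iteration. Porting to another instruction set would, under this design, only mean adding another branch inside `Float4`.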

stratum · Post by **stratum** » Sun Nov 25, 2018 11:13 am

Nowhk wrote: ↑Sun Nov 25, 2018 9:33 am Yeah, in fact I'm trying IPP, which seems nice.
Do you link those libraries statically or dynamically? For the latter, I believe you include the DLLs in your installer bundle.

If it's for a plugin you should use the static version otherwise those shared libraries will clutter the user's plugin folder.

Any performance differences? Especially when dispatching...

No. But if you are sure about the CPU type (for example, AVX only), then there is a way to avoid dispatching altogether: see the section named "Single Processor Static Linkage" at https://software.intel.com/en-us/articl ... ce-guide#3

mystran · Post by **mystran** » Sun Nov 25, 2018 12:33 pm

Urs wrote: ↑Sun Nov 25, 2018 11:07 am We use vector intrinsics wrapped into objects. This usually gives us 2x the performance over scalar code. I more and more use templated functions so that scalar code and vectorized code are identical, and I just implement for either float or float vector. This makes the scalar code a tad slower (no conditional branches), but then it's only there for reference anyway.

With regard to branches: ISPC uses a strategy similar to GPUs, where you maintain masks for conditions, then evaluate all branches that are required for at least one "thread" (using predication, which, if not supported by hardware, can be emulated by bitwise logic or similar). This method can support essentially arbitrary control flow. For example, with loops you simply keep looping until your mask indicates that all the "threads" are done. Obviously you only get the full benefit from SIMD when your control flow is reasonably coherent.
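The mask-and-select strategy can be sketched in portable C++ by emulating four lanes with plain arrays. This is not ISPC output or ISPC's API, just an illustration of the idea; every name here (`Lanes`, `Mask`, `select`, `softKnee`, `halveUntilAtMost`) is hypothetical. On real SIMD hardware the per-lane loops would be single vector instructions.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

constexpr int kLanes = 4;
using Lanes = std::array<float, kLanes>;
using Mask  = std::array<std::uint32_t, kLanes>;  // all-ones = lane active, zero = inactive

inline std::uint32_t bitsOf(float x) { std::uint32_t u; std::memcpy(&u, &x, sizeof u); return u; }
inline float floatOf(std::uint32_t u) { float x; std::memcpy(&x, &u, sizeof x); return x; }

// A per-lane comparison yields a mask instead of a single bool.
inline Mask greaterThan(const Lanes& a, float threshold) {
    Mask m{};
    for (int i = 0; i < kLanes; ++i) m[i] = a[i] > threshold ? 0xFFFFFFFFu : 0u;
    return m;
}

inline bool any(const Mask& m) {
    for (std::uint32_t lane : m) if (lane != 0u) return true;
    return false;
}

// Predication emulated with bitwise logic: (mask & a) | (~mask & b).
inline Lanes select(const Mask& m, const Lanes& a, const Lanes& b) {
    Lanes out{};
    for (int i = 0; i < kLanes; ++i)
        out[i] = floatOf((m[i] & bitsOf(a[i])) | (~m[i] & bitsOf(b[i])));
    return out;
}

// Scalar intent: if (x > limit) x = limit + (x - limit) * 0.5f;
inline Lanes softKnee(const Lanes& x, float limit) {
    const Mask over = greaterThan(x, limit);
    if (!any(over)) return x;  // branch not needed by any lane, so skip it entirely
    Lanes compressed{};
    for (int i = 0; i < kLanes; ++i)
        compressed[i] = limit + (x[i] - limit) * 0.5f;  // evaluated for every lane
    return select(over, compressed, x);                 // kept only where the mask is set
}

// Scalar intent: while (x > limit) x *= 0.5f;
// The loop keeps running until no lane is active any more.
inline Lanes halveUntilAtMost(Lanes x, float limit, int& iterations) {
    iterations = 0;
    Mask active = greaterThan(x, limit);
    while (any(active)) {
        Lanes halved{};
        for (int i = 0; i < kLanes; ++i) halved[i] = x[i] * 0.5f;
        x = select(active, halved, x);  // finished lanes keep their value
        active = greaterThan(x, limit);
        ++iterations;
    }
    return x;
}
```

In `halveUntilAtMost`, the iteration count is set by the slowest lane, which is the coherence cost mentioned above: lanes that finish early still ride along until the last one is done.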

Ichad.c · Post by **Ichad.c** » Sun Nov 25, 2018 1:32 pm

Yeah, auto-vectorization tends not to work most of the time, especially with the fun stuff (i.e. anything with feedback). Sometimes just rearranging an algorithm can help a bit in cases where vector and scalar code are in use at the same time; forward stalls can eat performance, especially on older machines (i.e. Sandy Bridge). I really should just sit down one day and figure out how to do a simple 1 pole 1 zero in vector form; they have a habit of breaking my algorithm's intrinsic sequence.
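As a rough illustration of why feedback gets in the way, the sketch below splits a 1 pole 1 zero filter, y[n] = b0*x[n] + b1*x[n-1] + a1*y[n-1], into two passes. This is only one possible rearrangement, not a recommended or optimal vector form, and the names (`Coeffs`, `processDirect`, `processSplit`) are hypothetical. The first pass has no loop-carried dependency, so a compiler or hand-written intrinsics can vectorize it; the second pass carries y[n-1] from one sample to the next and stays serial.

```cpp
#include <cassert>
#include <cstddef>

struct Coeffs { float b0, b1, a1; };

// Reference form: everything in one loop, starting from zero state.
inline void processDirect(const Coeffs& c, const float* x, float* y, std::size_t n) {
    float x1 = 0.0f, y1 = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float out = (c.b0 * x[i] + c.b1 * x1) + c.a1 * y1;
        x1 = x[i];
        y1 = out;
        y[i] = out;
    }
}

// Split form: x and y must not alias, and state starts at zero in this sketch.
inline void processSplit(const Coeffs& c, const float* x, float* y, std::size_t n) {
    assert(x != y);
    if (n == 0) return;

    // Pass 1: the zero (feedforward) part. Each output depends only on inputs,
    // so iterations are independent of each other.
    y[0] = c.b0 * x[0];
    for (std::size_t i = 1; i < n; ++i)
        y[i] = c.b0 * x[i] + c.b1 * x[i - 1];

    // Pass 2: the pole (feedback) part. Each output needs the previous output,
    // which is the dependency that defeats straightforward vectorization.
    float y1 = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        y1 = y[i] + c.a1 * y1;
        y[i] = y1;
    }
}
```

Whether the split is actually faster depends on the block size and the rest of the signal path; it mainly shows where the serial dependency sits.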

A template/wrapper is a bit of upfront work but will save you a lot of time in the long run.

P.S. Somebody at Intel had Fabrication-Diarrhea when they did AVX512; all those different versions seem pretty half-baked to me.

Nowhk · Post by **Nowhk** » Sun Nov 25, 2018 1:35 pm

stratum wrote: ↑Sun Nov 25, 2018 11:13 am But if you are sure about the CPU type (for example AVX only), then there is a way to avoid dispatching altogether

Why would one do this? AVX is not supported by all CPUs. Users without it won't be able to use the plug. Isn't it better to rely on dispatching? At least it works; not optimized, but it still works.

Is dispatch so heavy?

stratum · Post by **stratum** » Sun Nov 25, 2018 2:50 pm

Nowhk wrote: ↑Sun Nov 25, 2018 1:35 pm
Why would one do this? AVX is not supported by all CPUs. Users without it won't be able to use the plug. Isn't it better to rely on dispatching? At least it works; not optimized, but it still works.

Is dispatch so heavy?

Roughly the same as calling a virtual function, probably. Not an issue with large amounts of data (image processing). For audio, I don't know (I didn't measure).
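For a sense of what "roughly the same as calling a virtual function" means, here is a minimal sketch of runtime dispatch through a function pointer. It is not how IPP implements dispatching; all names (`GainFn`, `applyGainScalar`, `applyGainUnrolled`, `cpuHasAvx`, `GainProcessor`) are hypothetical. The CPU probe uses the GCC/Clang builtin `__builtin_cpu_supports` on x86 and conservatively reports false elsewhere, and the "optimized" variant is just an unrolled stand-in for code that would really be built with wider SIMD.

```cpp
#include <cassert>
#include <cstddef>

using GainFn = void (*)(float* buf, std::size_t n, float gain);

// Baseline variant that runs on any CPU.
inline void applyGainScalar(float* buf, std::size_t n, float gain) {
    for (std::size_t i = 0; i < n; ++i) buf[i] *= gain;
}

// Stand-in for an optimized variant (in a real build: compiled with AVX enabled).
inline void applyGainUnrolled(float* buf, std::size_t n, float gain) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        buf[i] *= gain; buf[i + 1] *= gain; buf[i + 2] *= gain; buf[i + 3] *= gain;
    }
    for (; i < n; ++i) buf[i] *= gain;  // remaining samples
}

// Hypothetical feature probe; returns false when the check is not available.
inline bool cpuHasAvx() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false;
#endif
}

// The decision is made once; afterwards each call costs one indirect jump.
inline GainFn selectApplyGain() {
    return cpuHasAvx() ? applyGainUnrolled : applyGainScalar;
}

struct GainProcessor {
    GainFn fn = selectApplyGain();
    void process(float* buf, std::size_t n, float gain) const {
        assert(fn != nullptr);
        fn(buf, n, gain);  // one indirect call per block, not per sample
    }
};
```

Because the indirect call happens once per buffer, its cost is spread over however many samples the host delivers per block; with very small blocks or per-sample calls it would weigh more.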

Richard_Synapse · Post by **Richard_Synapse** » Sun Nov 25, 2018 4:21 pm

Nowhk wrote: ↑Sun Nov 25, 2018 1:35 pm Why would one do this? AVX is not supported by all CPUs. Users without it won't be able to use the plug. Isn't it better to rely on dispatching? At least it works; not optimized, but it still works.

Is dispatch so heavy?

Of course dispatching is fine when it works. But when it doesn't work, it's worse than not having it at all, because your plugins will simply crash without any warning message. Note that there's more than one potential source for crashing, the CPU, the OS, as well as the packages you use (such as IPP) come to mind.

Richard

Nowhk · Post by **Nowhk** » Mon Nov 26, 2018 8:18 am

Richard_Synapse wrote: ↑Sun Nov 25, 2018 4:21 pm
Nowhk wrote: ↑Sun Nov 25, 2018 1:35 pm Why would one do this? AVX is not supported by all CPUs. Users without it won't be able to use the plug. Isn't it better to rely on dispatching? At least it works; not optimized, but it still works.

Is dispatch so heavy?
Of course dispatching is fine when it works. But when it doesn't work, it's worse than not having it at all, because your plugins will simply crash without any warning message. Note that there's more than one potential source for crashing, the CPU, the OS, as well as the packages you use (such as IPP) come to mind.

Richard

From what I understand of IPP, it resolves (by default) at runtime which function variant to call, based on the current CPU. So in the worst case (i.e. no CPU match), it will call the "basic", non-optimized one.
Not sure what you mean by "crash" here.

Any example?

stratum · Post by **stratum** » Mon Nov 26, 2018 8:58 am

IPP doesn't crash because it fails to detect the CPU. Rather, it may crash because on your own machines it dispatches to AVX and SSE2 (assuming those are your test CPUs), while it may have a bug in the MMX implementation, which might be a forgotten piece of rusty code that a customer then discovers. Just theoretical, you may say, but a possibility nevertheless.

Richard_Synapse · Post by **Richard_Synapse** » Mon Nov 26, 2018 12:58 pm

Nowhk wrote: ↑Mon Nov 26, 2018 8:18 am From what I understand of IPP, it resolves (by default) at runtime which function variant to call, based on the current CPU. So in the worst case (i.e. no CPU match), it will call the "basic", non-optimized one.
Not sure what you mean by "crash" here.

Like I wrote, if it works properly then yes. But compilers and libraries are not free of bugs, and dispatch issues have been around ever since SSE. Of course they get fixed at some point, so you may want to google whether your specific compiler version or library has such issues.

A crash is typically caused by an illegal instruction.

Richard

First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?