I've used the same piece of code (except for the replaced functions above, obviously), in Release with the x86/x64 configuration, using the optimization flags /O2 /Ot.
I tried buffer sizes from 64 to 256; in both cases the IPP functions were about 2x faster.
The two become roughly equal at around 32 samples.
Below that, the "homemade" version is faster than IPP.
But I'm assuming that typical buffers are larger than 32 samples.
Or at least, that's the case I'm considering for now...
You suggested IPP to me, and other users also suggest using libraries.
The reason (I think) is that these libraries can dispatch to different SIMD instruction sets depending on the CPU they run on, which makes the plugins usable on more machines.
I could learn SSE2 intrinsics and become a pro (not like you, of course), but what if my plugin runs on a machine without that instruction set? It will fail...
I think you are suggesting that the overhead introduced by wrapper libraries is the price to pay for making your builds work on real-world hardware.
So some sort of dispatch table/indirection is always needed, and I believe (as stated) that IPP would do a better job than me on both the dispatch and the intrinsics; that's why I'm trying it.
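To make the "dispatch table/indirection" idea concrete, here is a minimal sketch of what I mean, not how IPP actually does it internally. The kernel (a simple gain) and all the names (applyGain, applyGainScalar, applyGainSimd, cpuHasSse2) are made up for illustration. It assumes an x86/x64 build with MSVC, or GCC/Clang with SSE2 enabled at compile time; on anything else it just falls back to the scalar loop.

```cpp
#include <cstddef>

// Assumption: the SIMD path is only compiled on x86/x64 toolchains that ship <emmintrin.h>.
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  #include <intrin.h>     // __cpuid
  #include <emmintrin.h>  // SSE/SSE2 intrinsics
  #define HAVE_SIMD_PATH 1
#elif defined(__SSE2__)
  #include <emmintrin.h>
  #define HAVE_SIMD_PATH 1
#else
  #define HAVE_SIMD_PATH 0
#endif

// Hypothetical kernel signature: multiply every sample in place by a gain.
using GainFn = void (*)(float* buf, std::size_t n, float gain);

// Plain fallback that runs on any CPU.
static void applyGainScalar(float* buf, std::size_t n, float gain) {
    for (std::size_t i = 0; i < n; ++i) buf[i] *= gain;
}

#if HAVE_SIMD_PATH
// Four floats per iteration, unaligned loads/stores, scalar loop for the tail.
static void applyGainSimd(float* buf, std::size_t n, float gain) {
    const __m128 g = _mm_set1_ps(gain);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(buf + i, _mm_mul_ps(_mm_loadu_ps(buf + i), g));
    for (; i < n; ++i) buf[i] *= gain;
}

// Runtime check: does the CPU we are running on report SSE2?
static bool cpuHasSse2() {
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 1);
    return (regs[3] & (1 << 26)) != 0;  // CPUID leaf 1, EDX bit 26 = SSE2
#else
    return __builtin_cpu_supports("sse2") != 0;
#endif
}
#endif

// Choose an implementation for the current machine.
static GainFn selectApplyGain() {
#if HAVE_SIMD_PATH
    if (cpuHasSse2()) return applyGainSimd;
#endif
    return applyGainScalar;
}

// The indirection: resolved on the first call, then one pointer call per buffer.
void applyGain(float* buf, std::size_t n, float gain) {
    static const GainFn impl = selectApplyGain();
    impl(buf, n, gain);
}
```

The audio callback would only ever call applyGain(buffer, numSamples, gain); the extra pointer call is paid once per buffer, not once per sample. Supporting more instruction sets (AVX, etc.) would mean more branches in selectApplyGain, and that is exactly the bookkeeping I'd rather leave to a library.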
Or, coming back to the original question: which instruction sets do you use? A single one? Multiple? Templated versions?
That's also why I opened the topic: to understand what the average hardware looks like and what you support for your audio customers.