KVR Audio

Fire Sledge - Ohm Force · Wed Apr 06, 2005 10:39 am

I'm going to start an article on techniques for optimising DSP algorithms with SIMD, in any possible way.

Therefore I would like to collect as many developers' ideas and opinions as possible about the problems they have encountered, and about their possible solutions and compromises. For example: block processing with short feedback loops, multirate processing, look-up tables, conditional code, etc.
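For the "conditional code" item, a common SIMD approach is to evaluate both branches for all four lanes and blend the results with a comparison mask. Below is a minimal sketch in C with SSE intrinsics (`xmmintrin.h`). The asymmetric gain and the function names `select_ps` / `asym_gain_ps` are hypothetical examples chosen for illustration:

```c
#include <xmmintrin.h>

/* Per-lane select: takes a where mask is all-ones, b where mask is zero. */
static __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

/* Branchless version of  y = (x > 0) ? x * gain_pos : x * gain_neg,
   applied to four samples at once. Both products are always computed. */
static __m128 asym_gain_ps(__m128 x, float gain_pos, float gain_neg)
{
    __m128 positive = _mm_cmpgt_ps(x, _mm_setzero_ps()); /* all-ones where x > 0 */
    __m128 ypos = _mm_mul_ps(x, _mm_set1_ps(gain_pos));
    __m128 yneg = _mm_mul_ps(x, _mm_set1_ps(gain_neg));
    return select_ps(positive, ypos, yneg);
}
```

Because both branches are always evaluated, this only pays off when each branch is cheap.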

Links to existing articles are welcome too, if there are any...

Thank you

-- Laurent

Big Tick · Wed Apr 06, 2005 12:57 pm

Excellent idea, Laurent... You could also address different strategies, like using SSE to process 4 voices at once vs. using SSE to optimize the code for 1 single voice...
Since I'm deep into SSE at the moment, I'm very interested. I could contribute some snippets, although my current challenge is how to move from a naive SSE implementation to an optimized one. As far as I know, there are no tools to help with this (except maybe Intel VTune, but it's quite expensive).

'Tick

Fire Sledge - Ohm Force · Wed Apr 06, 2005 2:57 pm

Big Tick wrote: You could also address different strategies, like using SSE to process 4 voices at once vs. using SSE to optimize the code for 1 single voice...

Yes, that's what I meant by "in any possible way". I've identified three main cases:
- Block processing (N samples)
- N-ways (the classic case, N voices at once)
- N-serial (one sample, N chained identical modules)
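The three cases can be sketched in C with SSE intrinsics (`xmmintrin.h`), using a one-pole lowpass `y += a * (x - y)` as a stand-in module. This is a minimal sketch under stated assumptions, not a reference implementation: the function names, the interleaved four-voice buffer layout and the choice of filter are all hypothetical.

```c
#include <xmmintrin.h>

/* 1) Block processing: a feedforward gain on n consecutive samples,
      four at a time. Assumes n is a multiple of 4. */
static void gain_block(float *buf, int n, float g)
{
    __m128 vg = _mm_set1_ps(g);
    for (int i = 0; i < n; i += 4)
        _mm_storeu_ps(buf + i, _mm_mul_ps(_mm_loadu_ps(buf + i), vg));
}

/* 2) N-ways: four independent voices, each with its own one-pole lowpass.
      in/out hold n frames of 4 voices; voice v of frame i is at 4*i + v. */
static void onepole_4voices(const float *in, float *out, int n,
                            __m128 *state, __m128 a)
{
    __m128 y = *state;
    for (int i = 0; i < n; ++i) {
        __m128 x = _mm_loadu_ps(in + 4 * i);
        y = _mm_add_ps(y, _mm_mul_ps(a, _mm_sub_ps(x, y)));
        _mm_storeu_ps(out + 4 * i, y);
    }
    *state = y;
}

/* 3) N-serial: four identical one-poles chained on one mono signal.
      Lane k is fed with lane k-1's output from the previous sample, so all
      stages update in parallel and the output (lane 3) comes out 3 samples
      later than a plain scalar cascade would produce it. */
static float onepole_serial4(float in, __m128 *state, __m128 a)
{
    __m128 y = *state;
    __m128 x = _mm_shuffle_ps(y, y, _MM_SHUFFLE(2, 1, 0, 0)); /* y0 y0 y1 y2 */
    x = _mm_move_ss(x, _mm_set_ss(in));                        /* in y0 y1 y2 */
    y = _mm_add_ps(y, _mm_mul_ps(a, _mm_sub_ps(x, y)));
    *state = y;
    /* broadcast lane 3, then return lane 0 of the result */
    return _mm_cvtss_f32(_mm_shuffle_ps(y, y, _MM_SHUFFLE(3, 3, 3, 3)));
}
```

In the N-serial case each stage after the first adds one sample of delay. For a plain cascade that only shifts the output, but it would change the response if the chain sat inside a feedback loop.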

Big Tick wrote: I could contribute some snippets

Great!

-- Laurent

Borogove · Wed Apr 06, 2005 3:26 pm

I've got a bizarro SSE-ized Chamberlin SVF in Meridian that might be worth writing up. The code path implements a 12 dB/octave filter, but the outputs of elements 1 and 2 are always used as the next inputs to elements 3 and 4, so in effect you have two filters, each with a 12 dB output and a 24 dB output delayed by one sample. It's a hybrid of 2-way serial and 2-way parallel vectorization. I suspect I'd be better off ditching it, actually, but the idea might be useful, and I could write it up in more detail.
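The arrangement described above might be sketched roughly as follows, assuming the usual Chamberlin recurrence and taking the lowpass output. The struct, the function names and the coefficient conventions are hypothetical and not taken from Meridian's code:

```c
#include <xmmintrin.h>

/* Hypothetical lane layout:
   lanes 0,1 = first stage of filters A and B (2-way parallel),
   lanes 2,3 = second stage of A and B, fed with the previous sample's
               lane 0,1 lowpass output (2-way serial). */
typedef struct {
    __m128 low;   /* lowpass state per lane */
    __m128 band;  /* bandpass state per lane */
    __m128 f;     /* frequency coefficient, e.g. 2*sin(pi*fc/fs) */
    __m128 q;     /* damping, 1/Q */
} HybridSvf;

static void hybrid_svf_init(HybridSvf *s, float f, float q)
{
    s->low  = _mm_setzero_ps();
    s->band = _mm_setzero_ps();
    s->f    = _mm_set1_ps(f);
    s->q    = _mm_set1_ps(q);
}

/* One sample of the Chamberlin recurrence on all four lanes:
     low  += f * band
     high  = in - low - q * band
     band += f * high
   out12[] receives the 12 dB/oct lowpass of A and B, out24[] the
   24 dB/oct lowpass of A and B, one sample late. */
static void hybrid_svf_tick(HybridSvf *s, float inA, float inB,
                            float out12[2], float out24[2])
{
    /* [inA, inB, lowA(n-1), lowB(n-1)]: movelh joins the two low halves */
    __m128 in = _mm_movelh_ps(_mm_setr_ps(inA, inB, 0.0f, 0.0f), s->low);
    __m128 high;
    float tmp[4];

    s->low  = _mm_add_ps(s->low, _mm_mul_ps(s->f, s->band));
    high    = _mm_sub_ps(_mm_sub_ps(in, s->low), _mm_mul_ps(s->q, s->band));
    s->band = _mm_add_ps(s->band, _mm_mul_ps(s->f, high));

    _mm_storeu_ps(tmp, s->low);
    out12[0] = tmp[0];  out12[1] = tmp[1];
    out24[0] = tmp[2];  out24[1] = tmp[3];
}
```

The serial link costs a single `movelh` per sample: the lower half carries the new inputs, and the upper half carries the first stage's last lowpass output into the second stage.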

Big Tick · Wed Apr 06, 2005 7:48 pm

OK, here is a first snippet. It sums the four lanes of xmm0, which you typically need for the final mix-down of N-ways processing.

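```asm
// xmm0 = 1 2 3 4  (lane 0 first)
movhlps xmm1, xmm0     // xmm1 = 3 4 ? ?  (upper lanes keep xmm1's previous contents)
addps   xmm1, xmm0     // xmm1 = 1+3 2+4 ?+3 ?+4
movaps  xmm0, xmm1
shufps  xmm0, xmm0, 1  // xmm0 = 2+4 1+3 1+3 1+3
addss   xmm0, xmm1     // xmm0 lane 0 = (2+4)+(1+3), the total
```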

As is often the case with SSE, this code should be interleaved with other, independent instructions for best performance, since each step depends on the result of the previous one.
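For compilers where intrinsics are more convenient than inline asm, the same shuffle-and-add pattern can be written with `xmmintrin.h`. This is a minimal sketch; the function name `hsum_ps` is hypothetical, and the lane values in the comments assume v = 1 2 3 4:

```c
#include <xmmintrin.h>

/* Horizontal sum of the four lanes of v. */
static float hsum_ps(__m128 v)
{
    __m128 hi   = _mm_movehl_ps(v, v);           /* 3 4 3 4 */
    __m128 sum  = _mm_add_ps(v, hi);             /* 1+3 2+4 3+3 4+4 */
    __m128 swap = _mm_shuffle_ps(sum, sum, 1);   /* 2+4 1+3 1+3 1+3 */
    sum = _mm_add_ss(sum, swap);                 /* lane 0 = (1+3)+(2+4) */
    return _mm_cvtss_f32(sum);                   /* extract lane 0 */
}
```

After an N-ways loop, calling `hsum_ps` on the output vector mixes the four voices down to a single sample.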
'Tick

peteblues · Fri Apr 08, 2005 9:19 pm

It would be interesting to compare SSE and AltiVec functions, so that cross-platform developers can optimize for both Mac and x86 machines.
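As a starting point for such a comparison, a thin wrapper can hide which instruction set is underneath. This is a minimal sketch assuming C99 and the standard `altivec.h` / `xmmintrin.h` intrinsics. The `vf4_*` names and `madd_block` are hypothetical, and the AltiVec branch is an untested sketch that assumes a compiler with AltiVec enabled (e.g. GCC's `-maltivec`):

```c
/* Hypothetical 4-wide float layer: the same calls compile to AltiVec or SSE. */
#if defined(__ALTIVEC__)
#include <altivec.h>
typedef vector float vf4;
/* vec_ld ignores the low 4 address bits, so p must be 16-byte aligned */
static inline vf4  vf4_load(const float *p)      { return vec_ld(0, p); }
static inline void vf4_store(float *p, vf4 v)    { vec_st(v, 0, p); }
/* a single multiply-add instruction */
static inline vf4  vf4_madd(vf4 a, vf4 b, vf4 c) { return vec_madd(a, b, c); }
#elif defined(__SSE__) || defined(_M_X64) || \
      (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
typedef __m128 vf4;
/* _mm_load_ps faults on an unaligned address, so p must be 16-byte aligned */
static inline vf4  vf4_load(const float *p)      { return _mm_load_ps(p); }
static inline void vf4_store(float *p, vf4 v)    { _mm_store_ps(p, v); }
/* SSE itself has no multiply-add: separate mulps and addps */
static inline vf4  vf4_madd(vf4 a, vf4 b, vf4 c) { return _mm_add_ps(_mm_mul_ps(a, b), c); }
#else
#error "no SIMD backend for this target"
#endif

/* out[i] = a[i] * b[i] + c[i]. Assumes n is a multiple of 4 and all
   pointers are 16-byte aligned. */
static void madd_block(float *out, const float *a, const float *b,
                       const float *c, int n)
{
    for (int i = 0; i < n; i += 4)
        vf4_store(out + i, vf4_madd(vf4_load(a + i), vf4_load(b + i),
                                    vf4_load(c + i)));
}
```

Two of the differences such a comparison would need to cover already show up here: how unaligned loads are handled, and the lack of a multiply-add instruction on the SSE side.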

Peter

Ideas for an article on SIMD optimisations