Ideas for an article on SIMD optimisations

DSP, Plugin and Host development discussion.

Post

I'm going to start an article on techniques for optimising DSP algorithms with SIMD, in any possible way.

So I would like to collect as many developers' ideas and opinions as possible about the problems they have encountered, and about possible solutions and compromises. For example: block processing with short feedback loops, multirate processing, look-up tables, conditional code, etc.

Links to existing articles are welcome too, if there are any... :D
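On the conditional-code item: a common SIMD approach is to evaluate both sides of the condition and merge them with a comparison mask, so there is no per-sample branch. The sketch below is an illustration in C with SSE intrinsics, not code from this thread; it assumes a hypothetical asymmetric gain stage (`asym_gain`, with invented `pos_gain`/`neg_gain` parameters) and a block length that is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical example: y = (x > 0) ? x * pos_gain : x * neg_gain,
   computed 4 samples at a time without any branch. */
static void asym_gain(const float *in, float *out, size_t n,
                      float pos_gain, float neg_gain)
{
    assert(n % 4 == 0);  /* this sketch has no scalar tail loop */
    const __m128 zero = _mm_setzero_ps();
    const __m128 gp = _mm_set1_ps(pos_gain);
    const __m128 gn = _mm_set1_ps(neg_gain);

    for (size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);
        /* each lane: all ones where x > 0, all zeros elsewhere */
        __m128 mask = _mm_cmpgt_ps(x, zero);
        /* evaluate both branches */
        __m128 yp = _mm_mul_ps(x, gp);
        __m128 yn = _mm_mul_ps(x, gn);
        /* select: (mask & yp) | (~mask & yn) */
        __m128 y = _mm_or_ps(_mm_and_ps(mask, yp), _mm_andnot_ps(mask, yn));
        _mm_storeu_ps(out + i, y);
    }
}
```

Both branches are always computed, so this pays off only when they are cheap. On SSE4.1 hardware, `_mm_blendv_ps` from `<smmintrin.h>` can replace the and/andnot/or triple.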

Thank you

-- Laurent

Post

Excellent idea, Laurent... You could also address different strategies, like using SSE to process 4 voices at once vs. using SSE to optimize the code for a single voice...
Since I'm deep into SSE at the moment, I'm very interested. I could contribute some snippets, although the challenge I'm facing right now is how to move from a naive SSE implementation to an optimized one. As far as I know there are no tools to help with this (except maybe Intel VTune, but it's quite expensive).

'Tick

Post

Big Tick wrote: You could also address different strategies, like using SSE to process 4 voices at once vs. using SSE to optimize the code for 1 single voice...
Yes, that's what I meant by "in any possible way". I identified three main cases:
- Block processing (N samples)
- N-ways (classic one, N voices at once)
- N-serial (one sample, N chained identical modules)
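A minimal sketch of the N-ways case, assuming a one-pole lowpass (`y += a * (x - y)`) as the per-voice module: each of the four SSE lanes holds one voice's state and coefficient, so a single tick advances all four voices. The `onepole4` names are hypothetical.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Four independent one-pole lowpass filters, one voice per SSE lane. */
typedef struct {
    __m128 y;  /* filter state, one per voice */
    __m128 a;  /* smoothing coefficient, one per voice */
} onepole4;

static void onepole4_init(onepole4 *f, const float coef[4])
{
    for (int i = 0; i < 4; i++)
        assert(coef[i] > 0.0f && coef[i] <= 1.0f);  /* stable range */
    f->y = _mm_setzero_ps();
    f->a = _mm_loadu_ps(coef);
}

/* One sample for each of the 4 voices: y += a * (x - y). */
static __m128 onepole4_tick(onepole4 *f, __m128 x)
{
    f->y = _mm_add_ps(f->y, _mm_mul_ps(f->a, _mm_sub_ps(x, f->y)));
    return f->y;
}
```

All voices run the same instruction stream; anything that differs per voice has to live in the lane data (state, coefficients) rather than in branches.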
Big Tick wrote: I could contribute some snippets
Great!

-- Laurent

Post

I've got a bizarro SSE-ized Chamberlin SVF in Meridian that might be worth writing up. The code path implements a 12dB/octave filter, but the outputs of elements 1 and 2 are always used as the next inputs to elements 3 and 4. In effect you get two filters, each with a 12dB output and a 24dB output delayed by one sample, so it's a hybrid of 2-way serial and 2-way parallel vectorization. I suspect I'd actually be better off ditching it, but the idea might be useful and I could write it up in more detail.
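A guess at the layout described above, not the actual Meridian code: it assumes the textbook Chamberlin update (`low += f*band; high = in - low - q*band; band += f*high`) and uses C with SSE intrinsics. Lanes 0 and 1 (elements 1 and 2 above) are two 12dB/octave filters; lanes 2 and 3 take as input the lowpass outputs that lanes 0 and 1 produced on the previous sample, which yields 24dB/octave outputs delayed by one sample. The `svf4` names and the lowpass-only tap are hypothetical.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Four Chamberlin SVF lanes in one SSE register.
   Lanes 0,1: two 12 dB/oct filters on inputs a and b.
   Lanes 2,3: second stages fed with the previous lowpass of lanes 0,1,
   i.e. 24 dB/oct lowpass outputs, one sample late. */
typedef struct {
    __m128 low, band;  /* per-lane state */
    __m128 f, q;       /* frequency and damping coefficients */
} svf4;

static void svf4_init(svf4 *s, float f, float q)
{
    /* stability condition for this update order: f^2 + 2fq < 4 */
    assert(f > 0.0f && q > 0.0f && f * f + 2.0f * f * q < 4.0f);
    s->low = _mm_setzero_ps();
    s->band = _mm_setzero_ps();
    s->f = _mm_set1_ps(f);
    s->q = _mm_set1_ps(q);
}

static void svf4_tick(svf4 *s, float a, float b,
                      float out12[2], float out24[2])
{
    /* in = {a, b, low[0], low[1]}: the last two lanes take the
       previous tick's lowpass output of the first two lanes. */
    __m128 in = _mm_shuffle_ps(_mm_set_ps(0.0f, 0.0f, b, a), s->low,
                               _MM_SHUFFLE(1, 0, 1, 0));
    s->low = _mm_add_ps(s->low, _mm_mul_ps(s->f, s->band));
    __m128 high = _mm_sub_ps(_mm_sub_ps(in, s->low),
                             _mm_mul_ps(s->q, s->band));
    s->band = _mm_add_ps(s->band, _mm_mul_ps(s->f, high));

    float tmp[4];
    _mm_storeu_ps(tmp, s->low);
    out12[0] = tmp[0]; out12[1] = tmp[1];
    out24[0] = tmp[2]; out24[1] = tmp[3];
}
```

The one-sample delay is what makes the serial part vectorizable: all four lanes update in the same instructions because none of them needs a result from the current sample.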
Don't do it my way.

Post

OK, here is a first snippet. It sums all four floats in xmm0. You typically need this for the final mixing of N-ways processing.


// xmm0 = 1 2 3 4
movhlps xmm1, xmm0   // xmm1 = 3 4 x x (upper half of xmm1 is left unchanged)
addps xmm1, xmm0     // xmm1 = 1+3 2+4 x x
movaps xmm0, xmm1
shufps xmm0, xmm0, 1 // xmm0 = 2+4 ...
addss xmm0, xmm1     // low element of xmm0 = (1+3)+(2+4)
As is often the case with SSE, this code should be interleaved with other independent instructions for best performance, since each step depends on the result of the previous one.
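The same sequence written with SSE intrinsics, as a minimal sketch for compilers where inline assembly is not an option (MSVC, for instance, has no inline assembler for x64 targets). The name `hsum_ps` is hypothetical.

```c
#include <xmmintrin.h>

/* Horizontal sum of the 4 floats in v, mirroring the asm above:
   movhlps / addps / shufps / addss. */
static float hsum_ps(__m128 v)
{
    __m128 hi = _mm_movehl_ps(v, v);           /* {v2, v3, v2, v3} */
    __m128 sum = _mm_add_ps(v, hi);            /* {v0+v2, v1+v3, ...} */
    __m128 shuf = _mm_shuffle_ps(sum, sum, 1); /* lane 0 = v1+v3 */
    sum = _mm_add_ss(sum, shuf);               /* lane 0 = (v0+v2)+(v1+v3) */
    return _mm_cvtss_f32(sum);                 /* extract lane 0 */
}
```

SSE3 adds `_mm_hadd_ps`, which can do the reduction in two instructions, although it is not necessarily faster than the shuffle-and-add version.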
'Tick

Post

It would be interesting to compare SSE and AltiVec functions, so cross-platform developers can optimize for both Mac and x86 machines.
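One common way to share code between SSE and AltiVec is a thin wrapper over operations both instruction sets provide. A minimal sketch, assuming the `__SSE__`/`__ALTIVEC__` macros that GCC-style compilers define (plus `_M_X64` for MSVC); the `v4f_*` names are hypothetical. Pointers must be 16-byte aligned, because AltiVec's `vec_ld`/`vec_st` ignore the low four address bits. AltiVec's `vec_madd` is a single multiply-add instruction while SSE needs a separate multiply and add, so results may differ slightly in rounding.

```c
/* Thin 4-float vector wrapper: SSE, AltiVec, or plain C fallback. */
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
typedef __m128 v4f;
static v4f v4f_load(const float *p)      { return _mm_load_ps(p); }
static void v4f_store(float *p, v4f v)   { _mm_store_ps(p, v); }
static v4f v4f_add(v4f a, v4f b)         { return _mm_add_ps(a, b); }
/* a*b + c: separate multiply and add on SSE */
static v4f v4f_madd(v4f a, v4f b, v4f c) { return _mm_add_ps(_mm_mul_ps(a, b), c); }
#elif defined(__ALTIVEC__)
#include <altivec.h>
typedef vector float v4f;
static v4f v4f_load(const float *p)      { return vec_ld(0, p); }
static void v4f_store(float *p, v4f v)   { vec_st(v, 0, p); }
static v4f v4f_add(v4f a, v4f b)         { return vec_add(a, b); }
/* a*b + c: one multiply-add instruction on AltiVec */
static v4f v4f_madd(v4f a, v4f b, v4f c) { return vec_madd(a, b, c); }
#else
/* Scalar fallback with the same interface. */
typedef struct { float f[4]; } v4f;
static v4f v4f_load(const float *p)
{ v4f v; for (int i = 0; i < 4; i++) v.f[i] = p[i]; return v; }
static void v4f_store(float *p, v4f v)
{ for (int i = 0; i < 4; i++) p[i] = v.f[i]; }
static v4f v4f_add(v4f a, v4f b)
{ for (int i = 0; i < 4; i++) a.f[i] += b.f[i]; return a; }
static v4f v4f_madd(v4f a, v4f b, v4f c)
{ for (int i = 0; i < 4; i++) c.f[i] += a.f[i] * b.f[i]; return c; }
#endif
```

Code written against `v4f_*` compiles to native instructions on either platform; operations without a direct counterpart, such as AltiVec's `vec_perm` byte permutes, still need per-platform code.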

Peter

