RFFT (rather Flexible Fourier Transform), ...

DSP, Plugin and Host development discussion.

Post

If you have virtual functions, you are kind of screwed. FFTW doesn't support selected frequencies, but as it prepares everything in advance (all the coefficients), the transform itself is more or less just two for loops with some additional multiplies and adds. So it can be really fast, without any overhead.
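To illustrate the "precompute everything, then it's just loops of multiplies and adds" point: below is a minimal, illustrative radix-2 FFT sketch (not FFTW's actual code) where the twiddle factors are filled into a table once, so the transform body contains no virtual calls and no transcendental functions — only index arithmetic, multiplies, and add/subs. Assumes the length is a power of two.

```cpp
#include <cmath>
#include <complex>
#include <utility>
#include <vector>

// Illustrative in-place radix-2 DIT FFT with precomputed twiddles.
// Assumes a.size() is a power of two. Names are ours, not FFTW's.
void fft(std::vector<std::complex<double>>& a)
{
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);

    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }

    // Precompute all twiddle factors once, up front.
    std::vector<std::complex<double>> w(n / 2);
    for (std::size_t k = 0; k < n / 2; ++k)
        w[k] = std::polar(1.0, -2.0 * pi * double(k) / double(n));

    // The transform itself: nested loops of butterflies, nothing else.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const std::size_t stride = n / len;  // step through the twiddle table
        for (std::size_t i = 0; i < n; i += len) {
            for (std::size_t k = 0; k < len / 2; ++k) {
                const std::complex<double> t = w[k * stride] * a[i + k + len / 2];
                a[i + k + len / 2] = a[i + k] - t;
                a[i + k] += t;
            }
        }
    }
}
```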

Post

Well, code generation is always an option. And in any case this doesn't depend on virtual calls; it could use conditionals. But I don't mind code generation.
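A lightweight middle ground between full code generation and runtime conditionals is compile-time dispatch via templates — a hypothetical sketch (the name `twiddle_mul` is ours) of special-casing a trivial twiddle without a virtual call:

```cpp
#include <complex>

// Hypothetical sketch: compile-time dispatch instead of a virtual call.
// When the twiddle is known to be 1, the branch compiles away entirely.
template <bool TrivialTwiddle>
inline std::complex<double> twiddle_mul(std::complex<double> x,
                                        std::complex<double> w)
{
    if constexpr (TrivialTwiddle)
        return x;       // w == 1: no multiply emitted at all
    else
        return x * w;   // general complex multiply
}
```

Instantiating `twiddle_mul<true>` and `twiddle_mul<false>` at the call sites gives two specialised code paths, which is effectively the "code generation" option done by the compiler.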

Post

Okay, so I've been calculating the cost savings of not processing redundant paths and I have discovered the following.

Reading / writing samples periodically (i.e. every 4 samples) yields real benefits (even if just for every other sample). Based on the set of costs given below I got the following core results for selectively transforming to and from the time domain.

Code: Select all

Cost:
- Multiply by twiddle factor: 7 (complex multiply = 4 multiplies and 3 add/sub)
- Multiply by twiddles at multiples of pi/2 radians: 2 (takes 2 add/sub)
- Execution of path cost: 4 (assumes at least 2 arithmetic operations plus an arbitrary cost for control flow)
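For concreteness, here is what the two multiply costs above correspond to in code (an illustrative sketch; the op counts are the ones assumed in the cost model above). A general twiddle multiply needs real multiplies and add/subs, while twiddles at multiples of pi/2 are just ±1 or ±i, so the "multiply" degenerates to sign flips and swaps:

```cpp
#include <complex>

// General twiddle multiply (a+bi)(c+di): 4 multiplies plus add/subs.
inline std::complex<double> mul(std::complex<double> x, std::complex<double> w)
{
    return { x.real() * w.real() - x.imag() * w.imag(),
             x.real() * w.imag() + x.imag() * w.real() };
}

// Twiddle at 3*pi/2 (i.e. multiply by -i): no multiplies at all,
// just a swap and a sign flip.
inline std::complex<double> mul_minus_i(std::complex<double> x)
{
    return { x.imag(), -x.real() };
}
```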

Code: Select all

Size        |    First 64     |  Every 2 | Every 4   |    all
------------|-----------------|----------|-----------|---------
2^11        |    117644       |  82432   |  52724    |  149516   
2^16        |    3923084      | 3866624  |  2301940  |  7241740
2^17        |    7854476      | 8224768  |  4849652  |  15466508  

It seems that writing only the first 64 samples yields a benefit only at larger sizes, but transforming to and from every other sample is another story. The opposite should hold in the frequency domain, where more benefit ought to come from targeting contiguous frequency groups than from a periodic selection.
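There is a clean identity behind the every-other-sample savings: the even-indexed outputs of an N-point inverse transform equal a half-size inverse transform of the folded spectrum, x[2t] = (1/N) Σₖ (X[k] + X[k+N/2]) e^(2πi·tk/(N/2)). A sketch using a naive O(n²) inverse DFT just to keep it short (the function names are ours):

```cpp
#include <cmath>
#include <complex>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive inverse DFT (O(n^2)), used only to keep this sketch compact.
cvec idft(const cvec& X)
{
    const double pi = std::acos(-1.0);
    const std::size_t n = X.size();
    cvec x(n);
    for (std::size_t t = 0; t < n; ++t)
        for (std::size_t k = 0; k < n; ++k)
            x[t] += X[k] * std::polar(1.0, 2.0 * pi * double(t * k) / double(n))
                         / double(n);
    return x;
}

// Even-indexed time samples of the length-N inverse transform, via a
// half-size inverse transform of the folded (and pre-scaled) spectrum:
//   x[2t] = IDFT_{N/2}{ (X[k] + X[k + N/2]) / 2 }[t]
cvec even_samples(const cvec& X)
{
    const std::size_t m = X.size() / 2;
    cvec folded(m);
    for (std::size_t k = 0; k < m; ++k)
        folded[k] = 0.5 * (X[k] + X[k + m]);
    return idft(folded);
}
```

So producing every other output sample costs a fold pass plus a transform of half the size — which is why the periodic selection prunes so well compared with a contiguous block of time-domain samples.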

I would have to create a more *complicated test to see what happens with selections in both the time and frequency domains (for example, you might be performing an overlap on a decimation filter, and only want to output 10 Hz to 1000 Hz to every 4th sample).

I'd imagine, though, that it wouldn't fare well against a fully optimised SSE version**, even in the case of partial rendering of a large transform. Again, this needs to be benchmarked, given the potential for auto-vectorisation.

*: well not that complicated
**: I'm writing this in plain C/C++

Post

If you want to have nice C++ code with vectorisation, take a look at Boost.SIMD. Best library ever.

Post

I will check that out. Just FYI, I'm writing this to support Rack Extensions, which does not allow processor-specific optimisations (so I have to rely on auto-vectorisation for SIMD).

On another note, I've done some research and it seems that the speed gains from SSE alone can be in the region of 800%, and that mixed radix (or at least radix 4) yields even further gains. So what I'm writing can only really be (until such optimisations are made) a proof of concept for pruning calculations. Bearing in mind I started this just for the sake of learning how to write my own, though my #1 priority is its application, I'll slow down on this as I've learnt what I needed to — and if anything it can be a fun exercise.

That being said, the potential 200-300% speedup when transforming to/from every 4th sample is rather interesting. Even the 100% speedup when outputting only the first 64 samples of larger transforms is very attractive.

I wonder if fftw's "wisdom" is making these sorts of optimizations?

Post

Auto-vectorisation is processor-specific optimisation... RE doesn't stop you from doing processor-specific optimisations. After all, the Intel compiler has an option to auto-vectorise for a specific processor with a dispatch call.
The main thing is how much memory is used for how many computations. If your computation is memory bound, vectorisation won't bring you much speedup.
FFTW has different implementations for different radices and selects the fastest for the processor on which it is compiled. I don't think there are any intrinsics inside.

Post

Yes, that's what I meant, sorry: basically RE doesn't allow assembly code or intrinsics, but does perform auto-vectorisation. And if FFTW really does operate purely in the C domain*, that's reassuring.

A little analysis pointed out the obvious reason why optimising transforms for input at every other index gives large gains while successive blocks don't: basically, even indices are cheaper to calculate. So really the gains are in interpolation and decimation, even for a 9-tap filter.

*: Edit: just checked the FFTW source, and it does use intrinsics.

Post

If the compiler supports assembly and has intrinsics, I can't understand why RE would stop you from using them, as auto-vectorisation is exactly the same: it uses the same instructions in the end!

Post

Miles1981 wrote: If the compiler supports assembly and has intrinsics, I can't understand why RE would stop you from using them, as auto-vectorisation is exactly the same: it uses the same instructions in the end!
Basically, the RE SDK is designed for absolute platform and processor agnosticism, so those features are blocked and you can only rely on auto-vectorisation, which apparently isn't a problem for FFTs.
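For what it's worth, the kind of plain C++ loop that compilers reliably auto-vectorise looks like this — a hedged sketch of a butterfly pass (our own naming), using split real/imaginary arrays, unit-stride inner loops, and no intrinsics; with optimisation enabled, mainstream compilers can emit SIMD code for it without any processor-specific source:

```cpp
#include <cstddef>

// Auto-vectorisation-friendly butterfly pass in plain C++: contiguous
// arrays, unit-stride inner loop, no intrinsics. Split real/imaginary
// (structure-of-arrays) data often vectorises better than std::complex.
void butterfly_pass(const float* wr, const float* wi,  // twiddles
                    float* xr, float* xi,              // in-place data
                    std::size_t half)                  // half the block length
{
    for (std::size_t k = 0; k < half; ++k) {
        const float tr = wr[k] * xr[k + half] - wi[k] * xi[k + half];
        const float ti = wr[k] * xi[k + half] + wi[k] * xr[k + half];
        xr[k + half] = xr[k] - tr;
        xi[k + half] = xi[k] - ti;
        xr[k] += tr;
        xi[k] += ti;
    }
}
```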
