FFT Implementation issues (GPU Shader)

DSP, Plugin and Host development discussion.

Post

I am trying to implement an FFT in a shader to run on the GPU. I grabbed the algorithm from here: https://github.com/jbouny/fft-ocean/blo ... nShader.js

However, when I implement it and compare the results against other FFT objects (I'm doing this in Max/MSP), the output of the above algorithm doesn't seem to match at all.

Would anyone be able to tell me if there is an issue with the algorithm or implementation? If nothing is wrong with the algorithm then there must be an issue with how I am sending the values or something...

Some points to consider:

1. I am performing the FFT on audio samples, which are 1D, instead of an image. The input I'm sending is a window of 1024 audio samples (i.e. vec4(sample, 0, 0, 0) per texel), so the texture dimensions are 1024x1.

2. I don't really understand the subtransform size, and it's hard to find clear info about it by googling, so I am not sure if I'm setting it correctly.

3. I set the transform size to the FFT size; is that correct?

Here is the algorithm implemented in my shader:

Code: Select all

const float TWO_PI_NEG = -6.2831853072;
varying vec2 texcoord0;
uniform sampler2D tex0;
uniform float fftSize = 1024;
uniform float subtransformSize = 4;
uniform float subtransformSizeHalf = 2;
uniform float horizontal = 1;
float fftSizeHalf = fftSize * 0.5;



// complex multiply: (a.x + i*a.y) * (b.x + i*b.y)
vec2 multiplyComplex (vec2 a, vec2 b) {
    return vec2(a[0] * b[0] - a[1] * b[1], a[1] * b[0] + a[0] * b[1]);
}


//GPU FFT using a Stockham formulation
vec4 fft (
    sampler2D src, 
    vec2 coord,
    float transformSize, 
    float transformSizeHalf,
    float subtransformSize,
    float subtransformSizeHalf,
    float horizontal
) {
    float index;
    if (horizontal == 1) {
        index = coord.x * transformSize - 0.5;
    } else {
        index = coord.y * transformSize - 0.5;
    }

    float evenIndex = floor(index / subtransformSize) * subtransformSizeHalf + mod(index, subtransformSizeHalf);

    //transform two complex sequences simultaneously
    vec4 even;
    vec4 odd;
    if (horizontal == 1) {
        even = texture2D(src, vec2(evenIndex + 0.5, gl_FragCoord.y) / transformSize).rgba;
        odd = texture2D(src, vec2(evenIndex + transformSizeHalf + 0.5, gl_FragCoord.y) / transformSize).rgba;
    } else {
        even = texture2D(src, vec2(gl_FragCoord.x, evenIndex + 0.5) / transformSize).rgba;
        odd = texture2D(src, vec2(gl_FragCoord.x, evenIndex + transformSizeHalf + 0.5) / transformSize).rgba;
    }

    float twiddleArgument1D = TWO_PI_NEG * (index / subtransformSize);
    vec2 twiddle1D = vec2(cos(twiddleArgument1D), sin(twiddleArgument1D));

    vec2 outputA = even.xy + multiplyComplex(twiddle1D, odd.xy); // butterfly for the first complex pair (xy)
    vec2 outputB = even.zw + multiplyComplex(twiddle1D, odd.zw); // butterfly for the second complex pair (zw)
    return vec4(outputA, outputB);
}



// run one horizontal FFT pass over the input texture
void main(void){
    gl_FragColor = fft(tex0, texcoord0, fftSize, fftSizeHalf, subtransformSize, subtransformSizeHalf, 1.);
}
Thanks for any help!

Post

Don't know if the code is correct or not, but it does a single pass over the data. The FFT is a recursive algorithm: for Cooley-Tukey (on which the Stockham formulation is built) we start with the N-point transform in the first pass, which breaks it down into two smaller transforms of half the size, which are then further broken down into smaller transforms, until you have small transforms that you can evaluate directly... or it can be done the other way as well, starting with many small transforms and then gradually combining them.
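
For reference, that recursive decomposition on the CPU looks something like this; a quick, untested JavaScript sketch (not from the linked code, names are mine) that you could also use to sanity-check the GPU output against your Max/MSP objects:

Code: Select all

// Untested sketch: recursive radix-2 decimation-in-time FFT.
// re/im are arrays of equal power-of-two length; returns new arrays, input untouched.
function fftReference(re, im) {
    var n = re.length;
    if (n === 1) return { re: [re[0]], im: [im[0]] };

    // split into even-indexed and odd-indexed samples
    var evRe = [], evIm = [], odRe = [], odIm = [];
    for (var i = 0; i < n; i += 2) {
        evRe.push(re[i]);     evIm.push(im[i]);
        odRe.push(re[i + 1]); odIm.push(im[i + 1]);
    }
    var even = fftReference(evRe, evIm);
    var odd  = fftReference(odRe, odIm);

    // combine: X[k] = E[k] + W^k * O[k],  X[k + n/2] = E[k] - W^k * O[k]
    var outRe = new Array(n), outIm = new Array(n);
    for (var k = 0; k < n / 2; k++) {
        var a = -2.0 * Math.PI * k / n;   // twiddle argument
        var tRe = Math.cos(a) * odd.re[k] - Math.sin(a) * odd.im[k];
        var tIm = Math.sin(a) * odd.re[k] + Math.cos(a) * odd.im[k];
        outRe[k]         = even.re[k] + tRe;
        outIm[k]         = even.im[k] + tIm;
        outRe[k + n / 2] = even.re[k] - tRe;
        outIm[k + n / 2] = even.im[k] - tIm;
    }
    return { re: outRe, im: outIm };
}
Each level of that recursion does the same butterfly as the shader above; the shader just works through the levels iteratively, which is why it needs multiple passes.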

So.. really the shader itself is just part of the algorithm, and since this is written as a fragment shader (for WebGL) you need to use it to ping-pong the data between two textures (1024 = 2^10, so we need 10 passes) to fully evaluate the FFT. Combine that with the fact that GPUs run fragment shaders in 2x2 blocks, so a 1024x1 target wastes half the shader invocations, and the fact that 1024 (or well, 2048, since we're wasting half) concurrent invocations is not a whole lot of utilization, and this is not necessarily all that efficient (compared to just using the CPU) unless you can evaluate many transforms at once, or you can avoid moving the data back and forth between CPU and GPU memory.

Post

Thanks!! This will be used to process samples offline; I'll be doing many frames simultaneously, as well as extracting features like mel, bark, etc. So I think that makes the GPU desirable(??). It's definitely not for a real-time application, nor for doing a single frame at a time; I was just trying to get a single frame to work!

Thanks for the advice! I posted the issue in a few other places and was given some iterative solutions, so hopefully it'll work out OK (I haven't had the time to try them yet).

If anyone is curious here are some links I was given:

https://www.shadertoy.com/view/lsVXWt

https://github.com/dli/waves/blob/maste ... #L566-L593

Post

ndivuyo wrote: Thu Nov 02, 2023 7:58 pm Thanks!! This will be used to process samples offline, which I will do many frames simultaneously as well as extracting features like mel, bark, etc. So I think this will make GPU desirable(??), it's definitely not for real-time application nor doing a single frame at a time I was just trying to get a single frame to work!
Oh, that makes sense.

With Shadertoy, keep in mind that because you cannot control the host code, only ping-pong between fixed buffers, doing "fancy things" sometimes means doing them in weird ways.

The JavaScript looks like what I was describing, though, with the multi-pass ping-pong.

If you're not stuck with pixel shaders though, using compute would probably be more efficient.

Post

Unfortunately I can only use fragment shaders (via Max/MSP's jit.gl.slab object). I see! Yeah, I've been going through both, and while each is a headache to port (for someone like me, for whom this is taking a lot to understand), the Shadertoy one is giving me an even greater headache, so I might try porting the JS one, since it uses the original solution I was attempting.

Post

OK, so I am trying to understand the JS code in order to port it. I am new to shaders (I literally started learning them 2 days ago), so I'm a little unsure about things, but maybe someone can confirm whether I am understanding this correctly / clarify some things. First off, here is the code I am referring to (this is just the loop; the shader used by the loop is in my first post):

https://github.com/dli/waves/blob/maste ... #L566-L593

So let me pseudocode each iteration of the loop to make sure I am understanding...

- Determine the number of iterations: "log2(FFT_SIZE) * 2". I am not sure if this should be half that number, since I am only doing 1D audio samples instead of a 2D image.

////////////// for each iteration: /////////////////////////

- if the first iteration: set the output (Frame Buffer?) to the ping buffer, set the input('u_input') to SPECTRUM_UNIT (which is the original audio samples????)

- else if the last iteration: set the output to, I suppose, wherever I want to send the final results(?). Set the input to: (iterations % 2 == 0) ? Ping buffer : Pong buffer

- else if the iteration is an odd index: Set the output to pong and input to ping

- else if the iteration is an even index: Set the output to ping and the input to pong

- if at the halfway index: switch to vertical processing instead of horizontal. For my purposes (1D audio samples) I don't think I should do this, correct??? And if so, that is why I am wondering if I should do half as many iterations, as mentioned earlier

- Set the Subtransformsize to: "Math.pow(2,(i % (iterations / 2)) + 1)"

At the end of each loop iteration it calls "gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);", which I assume runs the program

/////////////////////// END /////////////////////////////

Now, when I say set the input to ping/pong etc., in the code it is done with units like SPECTRUM_UNIT, PING_TRANSFORM_UNIT, etc., and they are initialized in another file like so:

var INITIAL_SPECTRUM_UNIT = 0,
SPECTRUM_UNIT = 1,
DISPLACEMENT_MAP_UNIT = 2,
NORMAL_MAP_UNIT = 3,
PING_PHASE_UNIT = 4,
PONG_PHASE_UNIT = 5,
PING_TRANSFORM_UNIT = 6,
PONG_TRANSFORM_UNIT = 7;

In function calls such as:
gl.uniform1i(subtransformProgram.uniformLocations['u_input'], PING_TRANSFORM_UNIT);

TBH I'm not completely sure how this works, other than that uniform1i() can set the value of an integer uniform in the program (I think??), but if I am understanding the general concept correctly, I don't think it matters, because I will set it up a little differently in my program. My guess is that they are indices that refer to the different buffers to store and recall data from(?).


So that was a lot, but to recap my questions are:
- Is my pseudocode explanation of the loop correct, and if not, where did I go wrong?
- Is my assumption correct that only horizontal processing is needed (and not vertical), because I am only doing it on a 1D audio sample buffer?
- Is "log2(FFT_SIZE) * 2" the right number of iterations to do? That I can also mess with on my own if unsure.
- It seems like the "transformSize" parameter is set to RESOLUTION, which is 512 in the example; am I safe to assume that RESOLUTION is, for me, the FFT size?

Thanks so much!!!!

Post

ndivuyo wrote: Thu Nov 02, 2023 10:43 pm - Determine the number of iterations: "log2(FFT_SIZE) * 2" I am not sure if this should be half of the size if I am only doing 1D audio samples instead of a 2D image.
The code is doing both the horizontal and the vertical transform in the same loop. There's a condition where, once it's done half the iterations, it swaps to vertical. For 1D transforms, you need log2(FFT_SIZE) iterations with radix-2.
- if the first iteration: set the output (Frame Buffer?) to the ping buffer, set the input('u_input') to SPECTRUM_UNIT (which is the original audio samples????)
- else If the last iteration: set the output to I suppose wherever I want to send the final results(?) Set the input to: (iterations % 2 == 0) ? Ping buffer : Pong buffer
- else if the iteration is an odd index: Set the output to pong and input to ping
- else if the iteration is an even index: Set the output to ping and the input to pong
Yeah. You can't render to the same texture you're reading in WebGL or traditional OpenGL shaders, so what we have to do is keep two temporary textures, one we are reading from and the other we're rendering to. On the first iteration we read the original input instead.
- if in the halfway index: switch to vertical processing instead of horizontal. For my purposes (1D audio samples) I don't think I should do this, correct??? And if so that is why I am wondering if I do half as many iterations like mentioned earlier
Yes. If you only want horizontal transform, you'd break out of the loop here.
- Set the Subtransformsize to: "Math.pow(2,(i % (iterations / 2)) + 1)"
Right, so the subtransform size is 2,4,8,16...FFT_SIZE.
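
For your 1D case, the driving loop could then look roughly like this. Just an untested sketch: it follows the style of the linked code, but assumes you already have the compiled subtransform program wrapper, the input samples in a texture bound on SPECTRUM_UNIT, and two float ping/pong textures wrapped in framebuffers (pingFramebuffer/pongFramebuffer and FFT_SIZE are placeholder names, not from the linked code):

Code: Select all

// Sketch of driving a horizontal-only 1024-point FFT: log2(1024) = 10 passes,
// ping-ponging between two textures.
var iterations = Math.log2(FFT_SIZE);   // 10 for 1024; no vertical half

gl.useProgram(subtransformProgram.program);
gl.viewport(0, 0, FFT_SIZE, 1);
gl.uniform1f(subtransformProgram.uniformLocations['u_transformSize'], FFT_SIZE);

for (var i = 0; i < iterations; i += 1) {
    // pass 0 reads the original samples; later passes read the previous pass' output
    var inputUnit = (i === 0) ? SPECTRUM_UNIT
                  : (i % 2 === 1) ? PING_TRANSFORM_UNIT : PONG_TRANSFORM_UNIT;
    var outputFramebuffer = (i % 2 === 0) ? pingFramebuffer : pongFramebuffer;

    gl.uniform1i(subtransformProgram.uniformLocations['u_input'], inputUnit);
    gl.uniform1f(subtransformProgram.uniformLocations['u_subtransformSize'],
        Math.pow(2, i + 1));             // 2, 4, 8, ..., FFT_SIZE

    gl.bindFramebuffer(gl.FRAMEBUFFER, outputFramebuffer);
    gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
}
// the result is in whichever texture the last pass wrote
// (pong here, since the iteration count is even for 1024)
On the last pass you could instead bind the framebuffer of wherever you actually want the result to go, like the linked code does.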
At the end of each loop it says "gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);" I assume this will run the program
This basically tells GL to draw 4 vertices as a triangle strip (of two triangles). Now, the next part might sound a bit complicated if you've never done GL, but basically there is what is known as a "vertex array object" (aka VAO), which is not actually an array of vertices, but rather an array of vertex attribute pointer declarations. I think the code is using the default one, but if you used a (desktop) OpenGL core profile you'd need to create one explicitly.

Then we have the actual buffer object storing the actual vertices:

Code: Select all

    var fullscreenVertexBuffer = gl.createBuffer();
    gl.bindBuffer(gl.ARRAY_BUFFER, fullscreenVertexBuffer);
    gl.bufferData(gl.ARRAY_BUFFER, new Float32Array([-1.0, -1.0, -1.0, 1.0, 1.0, -1.0, 1.0, 1.0]), gl.STATIC_DRAW);
Buffer objects in GL are literally just generic buffers to hold data (any data), so we then need to configure the VAO to tell it how to fetch the vertices from this buffer:

Code: Select all

        gl.bindBuffer(gl.ARRAY_BUFFER, fullscreenVertexBuffer);
        gl.enableVertexAttribArray(0);
        gl.vertexAttribPointer(0, 2, gl.FLOAT, false, 0, 0);
Note that these calls are a bit spread around in the original code, because it's all state that GL remembers until you set something else, but what we do here is say that vertex attribute 0 is enabled and should be fetched as a two-component float from the buffer bound at the time of the vertexAttribPointer call, which is fullscreenVertexBuffer.

This is, strictly speaking, a little bit wasteful: because pixel shaders always run in 2x2 squares, rendering two triangles as a quad results in the pixels along the diagonal being computed twice. That can be avoided for the full-screen case by drawing only one triangle and making it so large that it covers the whole screen.. but that's a minor optimization.

Now.. do we absolutely need a vertex buffer?

Not necessarily. It's also possible to use a vertex shader that looks something like this:

Code: Select all

            const vec2 vert[3]= vec2[3](vec2(-1.,-1.),vec2(3.,-1.),vec2(-1.,3.));
            void main() { gl_Position = vec4(vert[gl_VertexID], 0., 1.); }
Note that this is the "large enough triangle" I was talking about. The downside of this approach is that gl_VertexID is not in GLES 2.0 (aka WebGL1), so I think you need WebGL2 (i.e. basically GLES 3.0) for this to work on the web, but it has basically always been there on desktop. With this kind of vertex shader, you could just draw a single triangle with gl.drawArrays(gl.TRIANGLES, 0, 3); and it doesn't need any actual vertex data to be bound. Other than boilerplate though, defining a buffer of 3-4 vertices isn't a big deal.
TBH I'm not completely sure how this is working other than that uniform1i() can change the value of a 1D uniform integer in the program (I think??), but if I am understanding the general concept correctly I don't think it matters because I will set it up a little differently in my program. My guess is they are indices that relate to the different buffers to store and recall data from(?).
Yup. Uniforms are how you pass information to the shader that is the same across all vertices/fragments but needs to change between passes. I think the list of units is just so they can have an array of them and index by friendly name; you can set those things up however you want.

Post

Thanks a ton! Honestly, I tried doing it how I described and it wasn't working properly. But I'll give it another go!

Post

OK, so I didn't answer the framebuffer part.. in GL there's an object called a "framebuffer object" (aka FBO; note that this is not a "buffer object"), which is sort of similar in concept to a VAO, except it describes the rendering targets rather than the vertex sources. The code you referred to uses a helper function to create one for each texture we're rendering to:

Code: Select all

var buildFramebuffer = function (gl, attachment) {
    var framebuffer = gl.createFramebuffer();
    gl.bindFramebuffer(gl.FRAMEBUFFER, framebuffer);
    gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, attachment, 0);
    return framebuffer;
};
Once the attachments have been configured (here we have only COLOR_ATTACHMENT0, but one can have several in order to render to multiple textures at once, and there can also be a "renderbuffer" attachment for depth/stencil if needed), it's enough to just bind the FBO itself to render to those textures... so the normal thing to do is to define one FBO for every configuration of textures you're interested in rendering into, then just bind the correct FBO before the actual draw call.
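
So for the ping-pong case you end up with something like this (sketch; the texture variable names are placeholders):

Code: Select all

// Sketch: one FBO per ping-pong texture, built with the helper above.
var pingFramebuffer = buildFramebuffer(gl, pingTransformTexture);
var pongFramebuffer = buildFramebuffer(gl, pongTransformTexture);

// ...per pass, bind the FBO you want to render into before the draw call...
gl.bindFramebuffer(gl.FRAMEBUFFER, pingFramebuffer);   // draws now land in pingTransformTexture
gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);

// binding null goes back to the default framebuffer (the canvas)
gl.bindFramebuffer(gl.FRAMEBUFFER, null);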

Note that with both VAOs and FBOs, they are just configuration objects and you can have multiple referring to the same underlying buffers or textures just fine... though that mostly comes into play if you are rendering to multiple textures at once and need to shuffle which set of textures is used in a more complex way than just doing basic ping-pong.

Post

ndivuyo wrote: Thu Nov 02, 2023 10:43 pm In functions calls such as:
gl.uniform1i(subtransformProgram.uniformLocations['u_input'], PING_TRANSFORM_UNIT);
Perhaps I wasn't quite clear on this either.. the "uniform" in this case is the texture sampler and the value is the "texture unit":

Code: Select all

var buildTexture = function (gl, unit, format, type, width, height, data, wrapS, wrapT, minFilter, magFilter) {
    var texture = gl.createTexture();
    gl.activeTexture(gl.TEXTURE0 + unit);
    gl.bindTexture(gl.TEXTURE_2D, texture);
    gl.texImage2D(gl.TEXTURE_2D, 0, format, width, height, 0, format, type, data);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, wrapS);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, wrapT);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, minFilter);
    gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, magFilter);
    return texture;
};
There are many texture units, but the good thing is that the values glActiveTexture (or gl.activeTexture in WebGL) takes are sequential, so one can just add the unit number to the GL_TEXTURE0 value. I'm not sure how many units WebGL requires, but it's probably at least 8, 'cos that's how many that code is using (just keep in mind that there IS a limit). Once you set the "active" texture unit, whatever you bind goes into that unit until you set some other unit active. This API is a bit weird, but it has to do with the fact that very old GL only had one texture unit and glActiveTexture was originally an extension (which is kinda the general theme with a lot of GL).

gl.texImage2D here is what actually allocates the texture ("creating" a texture just creates a name for it). For the rest, refer to the GL documentation for details, but basically they just configure filtering and how to wrap at the edges. For pure GPGPU, wrapping doesn't matter, and the filters can be set to gl.NEAREST so you don't accidentally do interpolation if your offsets are wrong (assuming you're not using texelFetch, which again is WebGL2, but present in pretty much any desktop OpenGL version where shaders make sense; texelFetch takes integer texel coordinates directly rather than normalized coordinates and gives you individual texels without going through any filtering, which often simplifies GPGPU.. but the code does it the WebGL1-compatible way with regular texture2D).
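
For the two FFT ping-pong textures you'd call that helper with something like this (sketch; the variable names are placeholders, and note that float textures, and rendering to them, need the relevant WebGL1 extensions such as OES_texture_float):

Code: Select all

// Sketch: two FFT_SIZE x 1 float RGBA textures for the ping-pong passes,
// no initial data, nearest filtering so nothing gets interpolated by accident.
var pingTransformTexture = buildTexture(gl, PING_TRANSFORM_UNIT, gl.RGBA, gl.FLOAT,
    FFT_SIZE, 1, null, gl.CLAMP_TO_EDGE, gl.CLAMP_TO_EDGE, gl.NEAREST, gl.NEAREST);
var pongTransformTexture = buildTexture(gl, PONG_TRANSFORM_UNIT, gl.RGBA, gl.FLOAT,
    FFT_SIZE, 1, null, gl.CLAMP_TO_EDGE, gl.CLAMP_TO_EDGE, gl.NEAREST, gl.NEAREST);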

Now, whether you want to keep reconfiguring the sampler uniform so it refers to different units, or just rebind a different texture to the same unit is up to you... but you DO need to configure the uniform at least once to make sure it refers to whatever texture unit you're trying to use, 'cos the shader compiler is not required to assign any sensible defaults.
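
In other words, either of these works per pass (sketch; FIXED_INPUT_UNIT and pongTransformTexture are placeholders):

Code: Select all

// Option A (what the linked code does): each texture lives in its own unit,
// and the sampler uniform is pointed at a different unit per pass.
gl.uniform1i(subtransformProgram.uniformLocations['u_input'], PONG_TRANSFORM_UNIT);

// Option B: set the sampler uniform once to some fixed unit, then just rebind
// whichever texture you want to read into that unit before each pass.
gl.activeTexture(gl.TEXTURE0 + FIXED_INPUT_UNIT);       // FIXED_INPUT_UNIT is a placeholder
gl.bindTexture(gl.TEXTURE_2D, pongTransformTexture);    // placeholder texture handle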
