KVR Audio

Post by **camsr** » Mon Apr 15, 2024 8:06 pm

Spent some time on more ideas relating to FP error in general.
The reciprocal of the smallest non-denormal float (exponent value 1, mantissa value 0, i.e. 2^-126) has exponent value 253 and mantissa value 0 (2^126). The exponent field range is [0,255], where 0 is used for zero/denormals and 255 for INF/NAN, and 1.f sits at 127. We can observe that, with a symmetric zero, there is an even number of possible exponent states (256), so they cannot split evenly around 1.f. If you take the reciprocal of a float larger than 2^126, the result is a denormal. This is an asymmetric condition! Counting from the 1.f value, the lowest normal exponent is 126 levels down, right above the denormal range, but the highest finite exponent is 127 levels up, right below the INF/NAN area.
Defining the value with exponent value 103, mantissa 0 (that is, 2^-24) as the smallest linear addend that can reach 1.f by accumulation was interesting. 103 + 24 = 127, which is the exponent value of 1.f. Accumulating this value 2^24 times will reach 1.f, but it won't go higher... this was KIND OF expected, however I think it warrants more investigation. By adding small LSB offsets to the base and accumulator values I did notice some wacky results. I will just post this quick piece in case anyone wants to test what I was doing. Admittedly this test is not very robust, and I keep thinking of making some kind of function library to keep things organized.

Code:

#include <stdio.h>
#include <stdint.h>

// Print an object's bits, most significant byte first (assumes a little-endian host).
void printBits(size_t const size, void const * const ptr)
{
    const unsigned char *b = (const unsigned char*) ptr;
    unsigned char byte;
    int i, j;
    
    for (i = size-1; i >= 0; i--) {
        for (j = 7; j >= 0; j--) {
            byte = (b[i] >> j) & 1;
            printf("%u", byte);
        }
    }
    puts("");
}


int main(int argc, char* argv[])
{
    //printBits(sizeof(f), &f);

    // reciprocal of lowest normal
    uint32_t n = 0x00800000u; // exponent field 1, mantissa 0 (0b literals are not standard C before C23)
    printBits(4, &n);
    float f = *(float*)&n;
    printBits(4, &f);
    float r = 1.f/f;
    printBits(4, &r);
    uint32_t s = *(uint32_t*)&r >> 23;
    printf("%u\n", s);
    float r1 = 1.f;
    uint32_t s1 = *(uint32_t*)&r1 >> 23;
    printf("%u\n", s1);
    printf("%u\n", n>>23);

    // accumulating small linear value: 1.f with its exponent lowered by 24, i.e. 2^-24
    float l = 1.f;
    uint32_t li = *(uint32_t*)&l;
    li = li - (24 << 23) + 0;
    float la = *(float*)&li;
    //li += (1<<2);
    l = *(float*)&li;
    printBits(4, (uint32_t*)&l);
    printf("%.28f\n", l);
    printf("%.28f\n", la);
    int c = 0;
    for(c=1;c<(1<<(24))+0;c++)
    {
        l += la;
    }
    //l *= 16777216.f;
    printf("%.28f\n", l);
    printBits(4, (uint32_t*)&l);
    li = *(uint32_t*)&l;
    printf("%u\n", li>>23);
    return 0;
}

What is the absolute denormal threshold for floats and doubles (using SSE)?