What is the absolute denormal threshold for floats and doubles (using SSE)?

Spent some time on more ideas relating to FP error in general.
The reciprocal of the smallest non-denormal float (exponent field 1, mantissa 0, i.e. 2^-126) has exponent field 253 and mantissa 0 (2^126). The exponent field range is [0,255], so there is an even number of exponent states, but field 0 (zero and denormals) and field 255 (INF/NaN) are reserved, which leaves 254 normal exponents that cannot sit evenly around 1.f (field 127). If you take the reciprocal of a float larger than 2^126, the result will be denormal. This is an asymmetric condition! Counting from 1.f, the normal range extends 126 exponent steps down (to 2^-126) before hitting the denormal range, but 127 steps up (to 2^127) before hitting INF/NaN.
Defining the value with exponent field 103, mantissa 0 (2^-24), as the smallest addend that can still reach 1.f by repeated accumulation was interesting. 103 + 24 = 127, and 1.f has that exponent field. Accumulating this value 2^24 times reaches 1.f, but it won't go higher: above 1.f the spacing between adjacent floats is 2^-23, so adding 2^-24 is an exact halfway tie, and the default round-to-nearest-even mode lands back on 1.f. This was KIND OF expected, however I think it warrants more investigation. By adding small LSB offsets to the base and accumulator values I did notice some wacky results. I will just post this quick piece in case anyone wants to test what I was doing. Admittedly this test is not very robust, and I keep thinking of making some kind of function library to keep things organized.

Code:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

// Print an object's bits, most significant bit first (assumes a little-endian machine).
void printBits(size_t const size, void const * const ptr)
{
    unsigned char const *b = (unsigned char const *) ptr;
    size_t i;
    int j;

    for (i = size; i-- > 0; ) {
        for (j = 7; j >= 0; j--) {
            printf("%u", (unsigned)((b[i] >> j) & 1));
        }
    }
    puts("");
}

// Reinterpret bits with memcpy; casts like *(float*)&n violate strict aliasing.
float bitsToFloat(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }
uint32_t floatToBits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }

int main(void)
{
    // reciprocal of lowest normal
    uint32_t n = UINT32_C(1) << 23;   // exponent field 1, mantissa 0 (2^-126)
    printBits(4, &n);
    float f = bitsToFloat(n);
    printBits(4, &f);
    float r = 1.f/f;
    printBits(4, &r);
    printf("%u\n", (unsigned)(floatToBits(r) >> 23));    // exponent field of 1/f
    printf("%u\n", (unsigned)(floatToBits(1.f) >> 23));  // exponent field of 1.f
    printf("%u\n", (unsigned)(n >> 23));                 // exponent field of f

    // accumulating small linear value
    uint32_t li = floatToBits(1.f) - (UINT32_C(24) << 23) + 0;  // exponent field 127 - 24 = 103
    float la = bitsToFloat(li);   // addend
    //li += (1<<2);               // optional LSB offset on the base only
    float l = bitsToFloat(li);    // accumulator base
    printBits(4, &l);
    printf("%.28f\n", l);
    printf("%.28f\n", la);
    int c;
    for (c = 1; c < (1 << 24) + 0; c++)
    {
        l += la;
    }
    //l *= 16777216.f;
    printf("%.28f\n", l);
    printBits(4, &l);
    li = floatToBits(l);
    printf("%u\n", (unsigned)(li >> 23));
    return 0;
}
