Why we need a fast inverse square root algorithm
The inverse square root of a number $x$ means $\frac{1}{\sqrt{x}}$.
In the field of computer graphics, the inverse square root of a number is usually used for normalizing vectors. As this operation is performed very frequently, deriving a fast algorithm for it becomes very important.
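For example, normalizing a 3D vector amounts to multiplying each component by the inverse square root of its squared length. A minimal sketch (the Vec3 type here is my own illustration, not from the original source):

#include <math.h>

typedef struct { float x, y, z; } Vec3;

/* Scale v to unit length: each component is multiplied by 1/sqrt(len^2). */
Vec3 normalize(Vec3 v) {
    float inv_len = 1.0f / sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
    v.x *= inv_len;
    v.y *= inv_len;
    v.z *= inv_len;
    return v;
}

Every normalization pays for one square root and one division; the algorithm below removes both.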
Algorithm itself
float Q_rsqrt(float number) {
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit hack
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, can be removed
    return y;
}
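Given this routine, the normalization sketch from above can be rewritten so that it performs no division and no library square root at all (again my own illustration, reusing the hypothetical Vec3 type):

Vec3 normalize_fast(Vec3 v) {
    /* One call to Q_rsqrt replaces both sqrtf() and the division. */
    float inv_len = Q_rsqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    v.x *= inv_len;
    v.y *= inv_len;
    v.z *= inv_len;
    return v;
}

The rest of this post unpacks why the routine works and how fast it actually is.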
Intuitions behind this algorithm
Use multiplication instead of division in code
In general, dividing two numbers is much more costly on a computer than multiplying them. This can easily be verified by running the following C program.
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

const int sz = 1 << 20;
const int trials = 100;

int main() {
    clock_t start, end;
    float * arr1 = (float*)malloc(sizeof(float) * sz);
    float * arr2 = (float*)malloc(sizeof(float) * sz);
    float * arr3 = (float*)malloc(sizeof(float) * sz);
    /*
        In order to avoid the timing inconsistency caused by a cold cache, we touch every
        single element of the arrays used for profiling before taking any measurement.
    */
    for (int i = 0; i < sz; ++i) {
        arr1[i] = (float)rand() / (float)RAND_MAX;
        arr2[i] = (float)rand() / (float)RAND_MAX;
        arr3[i] = 0;
    }
    double time_used_multiply = 0;
    double time_used_division = 0;
    /*
        To keep the compiler from optimizing either loop away, we alternate the
        multiplication job and the division job inside each profiling iteration.
    */
    for (int t = 0; t < trials; ++t) {
        start = clock();
        for (int i = 0; i < sz; ++i) {
            arr3[i] = arr1[i] * arr2[i];
        }
        end = clock();
        time_used_multiply += ((double) (end - start)) / CLOCKS_PER_SEC;

        start = clock();
        for (int i = 0; i < sz; ++i) {
            arr1[i] = arr3[i] / arr2[i];
        }
        end = clock();
        time_used_division += ((double) (end - start)) / CLOCKS_PER_SEC;
    }
    time_used_multiply /= trials;
    printf("Average time used for multiplication job: %f\n", time_used_multiply);
    time_used_division /= trials;
    printf("Average time used for division job: %f\n", time_used_division);
}
Average time used for multiplication job: 0.001960
Average time used for division job: 0.005490
As division is much more costly than multiplication, we want to minimize the number of division operations we perform.
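One direct consequence, sketched below with a hypothetical helper (not from the original post): when many values must be divided by the same quantity, it pays to compute the reciprocal once and multiply.

/* One up-front division replaces n per-element divisions. */
void scale_all(float *out, const float *in, int n, float divisor) {
    float inv = 1.0f / divisor;        /* the only division      */
    for (int i = 0; i < n; ++i) {
        out[i] = in[i] * inv;          /* cheap multiplications  */
    }
}

This is exactly the spirit in which Q_rsqrt produces a multiplier, $\frac{1}{\sqrt{x}}$, rather than leaving a division to the caller.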
Try not to use the sqrt() function
The cost of sqrt() can be estimated using the following C program.
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <math.h>

const int sz = 1 << 20;
const int trials = 100;

int main() {
    clock_t start, end;
    float * arr1 = (float*)malloc(sizeof(float) * sz);
    float * arr2 = (float*)malloc(sizeof(float) * sz);
    // Touch every element up front so a cold cache does not skew the timing.
    for (int i = 0; i < sz; ++i) {
        arr1[i] = (float)rand() / (float)RAND_MAX * 1024;
        arr2[i] = 0;
    }
    double time_used = 0;
    for (int t = 0; t < trials; ++t) {
        start = clock();
        for (int i = 0; i < sz; ++i) {
            arr2[i] = sqrt(arr1[i]);
        }
        end = clock();
        time_used += ((double) (end - start)) / CLOCKS_PER_SEC;
    }
    time_used /= trials;
    printf("Average time used for sqrt job: %f\n", time_used);
}
Average time used for sqrt job: 0.004730
Algorithm explained
Newton’s Method
Newton’s method can be used to find a root of a function in an iterative manner. Therefore, we can convert the problem of computing $\frac{1}{\sqrt{x}}$ into a root-finding problem.

We want to find a $y$ such that $y = \frac{1}{\sqrt{x}}$, which is the same as finding the root of:

$$f(y) = \frac{1}{y^2} - x = 0$$

Using Newton’s method $y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)}$ with $f'(y) = -\frac{2}{y^3}$ to derive the iterative routine for the problem:

$$y_{n+1} = y_n + \frac{y_n^3}{2} \left(\frac{1}{y_n^2} - x\right) = y_n \left(\frac{3}{2} - \frac{x}{2}\, y_n^2\right)$$

This is exactly the y = y * (threehalfs - (x2 * y * y)) line in the code. In order to achieve the desired accuracy with the minimum number of iterations, we want our initial guess $y_0$ to be as close to the true value as possible; producing such a guess is the job of the floating point trick below.
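As a quick sanity check of this recurrence (a standalone sketch, not part of the original Quake code), a few Newton steps converge rapidly even from a mediocre starting point:

#include <stdio.h>

int main() {
    float x = 4.0f;   /* we are after 1/sqrt(4) = 0.5     */
    float y = 0.3f;   /* deliberately poor initial guess  */
    for (int n = 0; n < 5; ++n) {
        /* Newton step for f(y) = 1/y^2 - x */
        y = y * (1.5f - (x * 0.5f) * y * y);
        printf("iteration %d: y = %f\n", n + 1, y);
    }
}

With a guess as good as the bit trick below provides, a single step already lands within a fraction of a percent of the true value.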
Floating point number
According to IEEE 754, a single-precision floating point number is represented by 32 bits in the following manner.
![](/2021/05/02/Fast-Inverse-Square-Root-Algorithm-Explained/float_point.png)
For a detailed description of this representation, take a look at the IEEE 754 single-precision format.
The first bit represents the sign $S$ of the number: 0 for positive, 1 for negative.
The next 8 bits store an unsigned number $E$, the biased exponent; the actual exponent of the number is $E - 127$.
The remaining 23 bits represent the mantissa $M$ of the floating point number. Since scientific notation requires the leading digit to be non-zero, and the only non-zero binary digit is 1, the leading digit of the mantissa is always 1 and does not need to be stored; saving this digit buys one extra bit of accuracy.
In the end, the value of the number behind the scenes is:

$$x = (-1)^S \left(1 + \frac{M}{2^{23}}\right) 2^{E - 127}$$
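To make the layout concrete, here is a small program of my own (using memcpy for the bit reinterpretation, which is well defined in C) that extracts the three fields from a float:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main() {
    float x = 0.15625f;   /* 0.15625 = 1.25 * 2^-3 */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);       /* grab the raw 32 bits    */
    uint32_t S = bits >> 31;              /* 1 sign bit              */
    uint32_t E = (bits >> 23) & 0xFF;     /* 8 biased exponent bits  */
    uint32_t M = bits & 0x7FFFFF;         /* 23 mantissa bits        */
    printf("S = %u, E = %u (actual exponent %d), M = %u\n", S, E, (int)E - 127, M);
}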
Dark Magic
Take the base-2 logarithm of the number (we can ignore the sign bit, since we only take square roots of positive numbers), and we get:

$$\log_2 x = (E - 127) + \log_2\left(1 + \frac{M}{2^{23}}\right)$$

As $m = \frac{M}{2^{23}}$ lies in $[0, 1)$, $\log_2(1 + m)$ stays close to $m$ itself on this interval. A good approximation for the function would be:

$$\log_2(1 + m) \approx m + \mu$$

where $\mu$ is a small constant that spreads the approximation error evenly over the interval; $\mu \approx 0.0430$ is a common choice.

Apply the approximation to the formula above:

$$\log_2 x \approx \frac{M}{2^{23}} + \mu + E - 127 = \frac{M + 2^{23} E}{2^{23}} + \mu - 127$$

Notice here, $2^{23} E + M$ is exactly the integer $I_x$ we obtain when we reinterpret the bits of $x$ as a long, which is what the "evil floating point bit hack" line computes. So:

$$\log_2 x \approx \frac{I_x}{2^{23}} + \mu - 127$$

We also have the following relationship between $y = \frac{1}{\sqrt{x}}$ and $x$:

$$\log_2 y = -\frac{1}{2} \log_2 x$$

Substituting the integer approximation on both sides and solving for $I_y$:

$$\frac{I_y}{2^{23}} + \mu - 127 = -\frac{1}{2}\left(\frac{I_x}{2^{23}} + \mu - 127\right) \quad\Longrightarrow\quad I_y = \frac{3}{2} \cdot 2^{23} (127 - \mu) - \frac{I_x}{2}$$

Therefore, the WTF line in the code becomes clear at this stage. The -(i >> 1) term is calculating $-\frac{I_x}{2}$, and the constant 0x5f3759df in front of this term should be the value of $\frac{3}{2} \cdot 2^{23} (127 - \mu)$. We can check this against the actual constant with the following C program.
#include <stdio.h>

int main() {
    long target = 0x5f3759df;                // the constant from Quake III
    long a = 3 * (1 << 22) * 127;            // (3/2) * 2^23 * 127, i.e. mu = 0
    long b = 3 * (1 << 22) * 0.0430;         // (3/2) * 2^23 * mu, with mu = 0.0430
    double mu = (double)(a - target) / (3 * (1 << 22));   // mu implied by the constant
    printf("%e\n", mu);
    printf("%08lx\n", a - b);
    printf("%08lx\n", target);
}
4.504657e-002
5f37be77
5f3759df
We can see that the value of $\mu$ implied by the actual constant is about $0.0450$, close to the $0.0430$ used in our approximation, so the magic constant is consistent with this derivation even though its exact $\mu$ differs slightly.
Assemble everything together
After all these efforts, we finally get the binary representation $I_y = \texttt{0x5f3759df} - (I_x \gg 1)$ of a number whose value approximates $\frac{1}{\sqrt{x}}$. Reinterpreting those bits as a float gives the initial guess, and one iteration of Newton’s method then refines it.
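As a worked example (the numbers are my own and easy to verify by hand): for $x = 4.0$ the bits are $I_x = \texttt{0x40800000}$, so the magic line yields $I_y = \texttt{0x5f3759df} - \texttt{0x20400000} = \texttt{0x3ef759df}$. Reinterpreted as a float this is $y_0 \approx 0.48311$, already close to the true value $0.5$; one Newton step, $y_1 = y_0(1.5 - 2 y_0^2)$, brings it to about $0.49916$, an error below $0.2\%$.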
Error Estimation and Performance Profiling
The error of the algorithm is estimated over the entire positive range of float (from FLT_MIN to FLT_MAX) using a log-spaced grid. Here are the code and the output of this profiling process.
#include <stdio.h>
#include <math.h>
#include <float.h>
#include <time.h>
#include <stdlib.h>

/* Fill u with n points spaced evenly in log2-space between start and end. */
void logspace(float start, float end, int n, float* u) {
    float pow_start = log2f(start);
    float pow_end = log2f(end);
    float step = (pow_end - pow_start) / (n - 1);
    for (int i = 0; i < n; ++i) {
        u[i] = powf(2, pow_start + i * step);
    }
}

static inline float Q_rsqrt(float y) {
    /* Note: the pointer casts assume long is 32 bits wide, as on the
       platforms the original code targeted; see the safer variant below. */
    long i;
    float x2;
    x2 = y * 0.5F;
    i = 0x5f3759df - ((*( long * ) &y) >> 1);   // what the fuck?
    y = * ( float * ) &i;
    y *= ( 1.5F - (x2 * y * y) );               // 1st iteration
    // y *= ( 1.5F - (x2 * y * y) );            // 2nd iteration, can be removed
    return y;
}

int main() {
    int profiling = 1;
    int trials = 100;
    int sz = 1 << 20;
    float * xs = (float*)malloc(sizeof(float) * sz);
    logspace(FLT_MIN, FLT_MAX, sz, xs);
    float * result = (float*)malloc(sizeof(float) * sz);
    float * ground = (float*)malloc(sizeof(float) * sz);
    for (int i = 0; i < sz; ++i) {
        result[i] = Q_rsqrt(xs[i]);
    }
    for (int i = 0; i < sz; ++i) {
        ground[i] = 1 / sqrt(xs[i]);
    }
    if (profiling) {
        double quick_alg_time = 0;
        double orig_alg_time = 0;
        clock_t start, end;
        for (int t = 0; t < trials; ++t) {
            // Reset both result arrays so each trial does the full work again.
            for (int i = 0; i < sz; ++i) {
                result[i] = 0;
            }
            for (int i = 0; i < sz; ++i) {
                ground[i] = 0;
            }
            start = clock();
            for (int i = 0; i < sz; ++i) {
                result[i] = Q_rsqrt(xs[i]);
            }
            end = clock();
            quick_alg_time += ((double) (end - start)) / CLOCKS_PER_SEC;

            start = clock();
            for (int i = 0; i < sz; ++i) {
                ground[i] = 1 / sqrt(xs[i]);
            }
            end = clock();
            orig_alg_time += ((double) (end - start)) / CLOCKS_PER_SEC;
        }
        quick_alg_time /= trials;
        printf("Average time used for quick algorithm: %f\n", quick_alg_time);
        orig_alg_time /= trials;
        printf("Average time used for naive approach: %f\n", orig_alg_time);
    }
    FILE* output_f = fopen("output.bin", "wb");
    if (output_f == NULL) {
        perror("Cannot open file!");
        return 1;
    }
    if (fwrite(xs, sizeof(float), sz, output_f) != sz) {
        perror("Cannot write to file!");
        return 1;
    }
    if (fwrite(result, sizeof(float), sz, output_f) != sz) {
        perror("Cannot write to file!");
        return 1;
    }
    if (fwrite(ground, sizeof(float), sz, output_f) != sz) {
        perror("Cannot write to file!");
        return 1;
    }
    fclose(output_f);
}
Average time used for quick algorithm: 0.004890
Average time used for naive approach: 0.005990
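One caveat about my implementation above: reading a float through a long pointer is undefined behavior in modern C, and on LP64 platforms long is 64 bits wide, so the bit hack can silently break. A safer variant (my own sketch, not the original Quake code) uses a fixed-width integer and memcpy, which compilers turn into the same instructions:

#include <stdint.h>
#include <string.h>

static inline float Q_rsqrt_safe(float number) {
    float x2 = number * 0.5F;
    uint32_t i;
    memcpy(&i, &number, sizeof i);     /* well-defined bit reinterpretation */
    i = 0x5f3759df - (i >> 1);         /* same magic constant and shift     */
    memcpy(&number, &i, sizeof number);
    number *= 1.5F - (x2 * number * number);   /* one Newton iteration */
    return number;
}

The following Python script then reads output.bin back and plots the absolute error on a log-log scale.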
import struct
import numpy as np
import matplotlib.pyplot as plt

sz = 1 << 20
with open("./output.bin", "rb") as f:
    form = "f" * sz
    xs = np.array(struct.unpack(form, f.read(struct.calcsize(form))))
    results = np.array(struct.unpack(form, f.read(struct.calcsize(form))))
    grounds = np.array(struct.unpack(form, f.read(struct.calcsize(form))))

plt.loglog(xs, np.abs(grounds - results), base=2)
plt.show()
![](/2021/05/02/Fast-Inverse-Square-Root-Algorithm-Explained/error.png)
From the output we can see that even with my naive implementation of the quick inverse square root algorithm, it still outperforms the straightforward 1 / sqrt(x) approach.
The error of the algorithm is also acceptable for graphics work: with one Newton iteration, the maximum relative error is known to stay below roughly 0.2% across the whole range.
Reference
- "Fast Inverse Square Root — A Quake III Algorithm" (YouTube video): https://www.youtube.com/watch?v=p8u_k2LIZyo