A semi-curated blog of computer graphics and rendering.
Writing a Pathtracer in Lua, Week 5: LuaJIT and BVH

Hi! This week marks a great change for LuaPT. To further accelerate the tracing process, multiple things have been optimized and changed, and some are rebuilt from the ground up. This week, we will be introducing LuaJIT, and most importantly, the ffi extension. After that, we will implement a BVH acceleration structure in full Lua (or at least 80% Lua). LuaJIT has accelerated Lua so much and ffi has greatly simplified C++/Lua interoperation, I don’t even know where to begin. So let’s begin!

A highly reflective image of Spot the cow.


I have started trying improving the speed of LuaPT since week 4. Since week 3 begins LuaPT’s performance has been rapidly degrading and I know that soon enough, it will be so slow that I might as well trace the rays by hand. But this is when LuaJIT comes to the rescue. By simply replacing my exisitng Lua library and linking LuaJIT instead I can already see a significant increase in speed.

Next, by using ffi extension I ported my whole math library to C++.

Math library ported to C++.

As models & meshes are represented using a continuous glm::vec3 vector, we can exploit this fact and force a pointer cast when the Lua script is asking for triangles. Vector math in C++ is magnitudes faster than in Lua, and by doing so I achieved another speed up.

The ffi library is so cool. I just need to implement the functions in C++ (with an extern C block), then copy & paste my header into a ffi.cdef block. Now I can just call these functions in Lua. It’s crazy how convenient it is.

A side by side comparison.

The switch to LuaJIT is time-consuming but nothing too technical. All the previous methods and APIs (including Image and Model) ported onto LuaJIT, however losing their object-oriented APIs in the process. But I think it’s a rather small sacrifice to obtain the Speed. If you are interested in learning how LuaJIT works, I strongly recommend you check out the official website. It’s short, concise, and in less than 10 lines of code showed me how I can use it. It’s awesome. So after deprecating the Lua library and switching to LuaJIT, I begin working on a few other improvements.


Ray-Box Intersection

As we will be implementing a BVH, we have to implement a ray-box intersection method first. Here we will make use of the one in Scratchapixel, aka the slab method.

function intersect_box(ro, rd, box)
    local rdinv = vec3(1 / rd.x, 1 / rd.y, 1 / rd.z)
    local tmin = -1e309 -- 1e309 is infinity in Lua
    local tmax = 1e309

    if rd.x ~= 0 then
        local tx1 = (box.min.x - ro.x) * rdinv.x
        local tx2 = (box.max.x - ro.x) * rdinv.x

        tmin = math.max(tmin, math.min(tx1, tx2))
        tmax = math.min(tmax, math.max(tx1, tx2))
    if rd.y ~= 0 then
        local ty1 = (box.min.y - ro.y) * rdinv.y
        local ty2 = (box.max.y - ro.y) * rdinv.y

        tmin = math.max(tmin, math.min(ty1, ty2))
        tmax = math.min(tmax, math.max(ty1, ty2))
    if rd.z ~= 0 then
        local tz1 = (box.min.z - ro.z) * rdinv.z
        local tz2 = (box.max.z - ro.z) * rdinv.z

        tmin = math.max(tmin, math.min(tz1, tz2))
        tmax = math.min(tmax, math.max(tz1, tz2))

    if tmax >= tmin then
        return tmin, tmax
        return nil

Of course then we will need a bounding box data structure. The bounding box BBox is first defined in plain C then passed onto Lua using LuaJIT.

typedef struct
    Vec3C min, max;
} BBox;

Better Ray-Triangle Intersection

The brute force method of ray-triangle intersection test introduced in week 2 is also starting to slow us down. To counter this issue, I have switched the ray-triangle intersection test to a faster one, namely the Moller-Trumbore method. both the Scotty3D website and Scratchapixel are excellent learning sources.

function intersect_mt(ro, rd, tri, tmin, tmax)
    local e1 = sub3(tri.b.position, tri.a.position)
    local e2 = sub3(tri.c.position, tri.a.position)

    local pvec = cross(rd, e2)
    local det = dot3(e1, pvec)

    -- Are they almost parallel?
    if (math.abs(det) < 0.0001) then
        return nil

    local inv = 1.0 / det
    local tvec = sub3(ro, tri.a.position)
    local u = dot3(pvec, tvec) * inv
    if u < 0 or u > 1 then
        return nil

    local qvec = cross(tvec, e1)
    local v = dot3(rd, qvec) * inv

    if v < 0 or u + v > 1 then
        return nil

    local t = dot3(e2, qvec) * inv
    if t < tmin or t > tmax then
        return nil

    return vec3(u, v, t)

To further increase render precision, add more zeros to if (math.abs(det) < 0.0001) then. The implementation of Moller-Trumbore method gave us a slight speed increase for each pixel.

BVH Construction

And finally, now it’s time for us to construct a BVH. We can’t implement the whole BVH in Lua, because multithreading is achieved through multiple Lua instances and there isn’t an effective way to sync Lua variables between Lua states. So instead, the BVH data structure will be defined in C/C++, with a few getters/setters, but no serious functions that could construct a BVH directly.

The definition of a BVH.

BVH Definition (C++ Part)

The only notable functions are the constructor of BVH and the partition function. We will go over them one by one. In BVH::BVH, we accept a pointer to a Model. The BVH class then copies all triangles into its own array, and creates a root node in the nodes variable. The nodes variable acts like a complete binary tree stored in continuous memory. You can think of it as a heap.

BVH::BVH(std::shared_ptr<Model> model) : model(model)
    // First make an empty node to fit ALL triangles inside

    BBox root_bbox = bbox();
    for (int i = 0; i < model->get_num_tris(); i++)
        TriC *t = model_get_tri(model.get(), i);

        // The enclose method encloses a point into a bounding box
        enclose(root_bbox, t->a.position);
        enclose(root_bbox, t->b.position);
        enclose(root_bbox, t->c.position);

    make_node(root_bbox, 0, tri.size(), 0, 0);

The make_node method is just that - it makes a node, pushes it onto nodes, and return its index. The node is defined as such:

typedef struct
    BBox bbox;
    int start, size;
    int l, r; // Left child & right child
} Node;

In this case, we construct a root node for our BVH, make it contain the full triangle array (0 to tri.size()), and its bounding box enclosing all the vertices of the input mesh.

Because BVH stores triangles in whatever order the input mesh give us, we will need some way to rearrange the order of triangles. And to stay true to our project name, Lua has to be the one who does the job. std::partition therefore is out of the question - its final parameter pred requires a lambda function in C/C++. We therefore have to resort to a homemade partition method. By accepting an array of booleans, and putting the true values to the left, false values to the right, we can partition basically anything coming from Lua, with just an extra smidge of memory cost.

int BVH::partition(bool *pred, int begin, int end)
    assert(begin >= 0 && end < tri.size() && "Invalid partition range");

    while (begin <= end)
        if (!pred[begin])
            const Triangle *t = tri[begin];
            tri[begin] = tri[end];
            tri[end] = t;

            // Gotta swap that pred as well
            bool p = pred[begin];
            pred[begin] = pred[end];
            pred[end] = p;

    return begin;

BVH Construction (Lua Part)

That’s the whole of C++ part. As you can see, not a lot is going on in there. The real interesting thing happens in Lua.

  1. The current (root) node and its bounding box is obtained via the Lua ffi API.
  2. For each axis, 8 valid partitions among the bounding box domain is tried.
  3. A “Surface Area Heuristic” (SAH) is evaluated for each partition along each domain, and the best one with the lowest score is kept.
  4. Finally revert all the changes and execute the partition method with the best SAH.
  5. The mesh is now split into 2 following the partition; make 2 new nodes and for each of them, either make leaf or continue partitioning (back to step 1).

Demonstrated below. First, we find the perfect partition with the lowest SAH score:

Step 1: find the perfect partition with the lowest SAH score

Then, we partition the triangles within the node accordingly. Then, we split the array and give them to two new children, and recurse to obtain the best partition for them. This goes on until there aren’t enough triangles to partition, or the best partition is no partition. Then we stop.

Step 2: Recurse

For us, the SAH equation is

\[\text{SAH} = C_\text{trav} + \frac{S_\text{left}}{S} N_\text{left} C_\text{isect} + \frac{S_\text{right}}{S} N_\text{right} C_\text{isect}\]

In which \(C_\text{trav}\) (the travel cost) is a constant of 1 and the intersect cost \(C_\text{isect}\) is 2. The \(\frac{S_\text{left}}{S}\) and \(\frac{S_\text{right}}{S}\) are simply the split ratio (the \(\frac{k}{8}\) above).

This is how that looks in Lua:

-- First call the BVH constructor with a model
local bvh = make_bvh(model)

function determine_side(p, offset, axis)
    if axis == 0 then
        return p.x < offset.x
    elseif axis == 1 then
        return p.y < offset.y
        return p.z < offset.z

function determine_area_ratio(poff, span, axis)
    if axis == 0 then
        return poff.x / span.x, (span.x - poff.x) / span.x
    elseif axis == 1 then
        return poff.y / span.y, (span.y - poff.y) / span.y
        return poff.z / span.z, (span.z - poff.z) / span.z

function bvh_construct(node_idx, start, fin)
    if fin - start + 1 <= 8 then
        -- No need to make BVH; not a lot of triangles here

    local n = bvh_get_node(bvh, node_idx)
    local span = sub3(n.bbox.max, n.bbox.min)

    -- split them into 8 buckets
    local num_buckets = 8
    local split_step = scl3(span, 1 / num_buckets)

    -- Bests
    local best_axis = 0
    local best_step = 1
    local best_sah = 1e309
    local best_offset = 0

    for axis = 0, 2 do
        -- Tentatively partition them along these axis

        for step = 1, 7 do
            -- 1. Somehow partition it.
            local table = make_partitioning_table(bvh)
            local poff = scl3(split_step, step) -- Plane offset
            local plane = add3(n.bbox.min, poff)
            for i = start, fin do
                local tri = bvh_get_tri(bvh, i)
                local centroid = scl3(add3(add3(tri.a.position, tri.b.position), tri.c.position), 1 / 3)

                table[i] = determine_side(centroid, plane, axis)
            local offset = partition(bvh, table, start, fin)

            -- 2. Calculate SAH. Record the best one.
            local left, right = determine_area_ratio(poff, span, axis)
            local sah = 1 + 2 * left * (offset - start) + 2 * right * (fin - offset + 1)
            if sah < best_sah then
                best_sah = sah
                best_step = step
                best_axis = axis
                best_offset = offset

    -- That's not very constructive.
    if best_offset == start or best_offset == fin + 1 then

    -- Partition using the best one.
    local table = make_partitioning_table(bvh)
    local poff = scl3(split_step, best_step)
    local plane = add3(n.bbox.min, poff)
    for i = start, fin do
        local tri = bvh_get_tri(bvh, i)
        local centroid = scl3(add3(add3(tri.a.position, tri.b.position), tri.c.position), 1 / 3)

        table[i] = determine_side(centroid, plane, best_axis)
    local offset = partition(bvh, table, start, fin)

    local left_box = bbox()
    local right_box = bbox()
    for i = start, offset - 1 do
        local tri = bvh_get_tri(bvh, i)
        enclose(left_box, tri.a.position)
        enclose(left_box, tri.b.position)
        enclose(left_box, tri.c.position)
    for i = offset, fin do
        local tri = bvh_get_tri(bvh, i)
        enclose(right_box, tri.a.position)
        enclose(right_box, tri.b.position)
        enclose(right_box, tri.c.position)

    local l = bvh_push_node(bvh, left_box, start, offset - start, 0, 0)
    local r = bvh_push_node(bvh, right_box, offset, fin - offset + 1, 0, 0)
    bvh_node_set_children(bvh, node_idx, l, r)

    -- Recurse into l and r ???
    bvh_construct(l, start, offset - 1)
    bvh_construct(r, offset, fin)

-- Construct BVH for the whole thing
bvh_construct(0, 0, bvh_tri_count(bvh) - 1)

Current Result

With the shiny new BVH in place, combined with LuaJIT, our render is now much, much faster. Compared to a non-BVH (yes-LuaJIT) rendering of Spot the cow, which can take up to 10 minute, BVH-enabled version of the same scene now only takes 2 and a half minutes.

A highly reflective image of Spot the cow.

Here’s the Blender monkey Suzanne:

A highly reflective image of Suzanne the monkey.

Aaaand here’s a 720p torus.

A highly reflective image of Suzanne the monkey.

If you have you missed it, here’s the link to the source code. Hit a star if you feel like it. It will always shoot a stream of dopamine straight to my brain. But in any case, I will see you next week!

Further Readings

+ Loading comments +
Copyleft 2023 42yeah.