Before we begin, **massive thanks** to Marek Fiser’s blog post and Yilun Yang’s previous work, both of which I drew inspiration (and code) from, without which this paper wouldn’t have been possible.

In the paper, I proposed an adaptive method for locating *planar critical points* during streamline and streamtube tracing. New seed points are dynamically added on-the-fly and traced iteratively. See it in action here:

Notice anything different? The left one is uniformly seeded, and the right one is seeded with the adaptive planar critical point method. Compared to the left one, our method dynamically spawns new seed points when critical points are discovered, which is where the vector field changes most significantly. This results in much richer streamlines and streamtubes, capable of conveying more information with the same number of seed points. For example, more streamlines trace the turbulence underneath the upper drafts.

We also made a complete vector field visualization system for this, called *AdaptiFlux*. It supports various vector field visualization modes, including lines, arrow glyphs, streamlines and streamtubes, with our method implemented. Here’s a screenshot:

To trace streamlines in a vector field, initial “seeds”, or “seed points”, must be specified. Those seeds are then shifted along the underlying vector field’s direction and velocity, drawing out the streamlines in the process. The fancy name for this process is Line Integral Convolution (LIC): to draw the line, we have to move in tiny delta steps, effectively performing an integration. Think of a camera pointing at the starry night with a crazy long exposure.
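Here’s a bare-bones sketch of that integration loop in C++ (hypothetical names, forward Euler only; real tracers usually use RK4 and interpolate the field from a grid):

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

struct Vec3 { float x, y, z; };

// Trace a single streamline with forward Euler steps. `field` samples the
// vector field, `dt` is the tiny delta step, `steps` caps the line length.
std::vector<Vec3> traceStreamline(const std::function<Vec3(const Vec3 &)> &field,
                                  Vec3 seed, float dt, int steps)
{
    std::vector<Vec3> line{seed};
    Vec3 p = seed;
    for (int i = 0; i < steps; i++)
    {
        Vec3 v = field(p);
        float mag = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
        if (mag < 1e-6f) // stagnation: stop tracing
            break;
        p = {p.x + v.x * dt, p.y + v.y * dt, p.z + v.z * dt};
        line.push_back(p);
    }
    return line;
}
```

Each iteration nudges the current point along the local vector, which is exactly the “integration” part of the name.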

Since initial seeds must be provided, it’s easy to see that the final visualization result will be highly dependent on the seeding locations. Here’s the delta wing scene visualized using two slightly different sets of initial seeds. This is stolen directly from Marek’s blog:

However, sometimes the traced streamlines won’t be good even with a relatively good initial streamline placement. Some important twists and turns only happen later down the line; how can we capture those if we’ve already spent all the seeds at the beginning? And how do we even find these twists and turns, if that’s even possible?

Introducing our adaptive seeding method. In a massive nutshell, suppose we have a seed quota of \(N = 200\). The algorithm first conserves half of the quota for later, then traces the initial 100 seeds. During the initial trace, we detect whether these seeds fall into critical points. If they do, a *seed point explosion event* occurs, where new seed points are created around the critical region. Those new seed points are then traced during the second iteration, and so on. The tracing doesn’t stop until:

- No new seed points are created during the last iteration.
- The seed point quota is used up (that is, \(N\) seeds have already been traced.)
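The whole iteration, boiled down to a C++ sketch (with `traceAndDetectCritical` and `explode` as stand-in stubs for the actual tracing and explosion logic; the real implementation is of course much more involved):

```cpp
#include <cassert>
#include <vector>

struct Seed { float x, y, z; };

// Hypothetical stand-ins for the paper's tracing and detection routines.
bool traceAndDetectCritical(const Seed &) { return false; } // stub: no critical point found
std::vector<Seed> explode(const Seed &s) { return {s, s, s, s}; } // stub explosion

// Iterative adaptive seeding: trace the pending seeds, collect any
// explosion seeds, and repeat until no new seeds appear or the quota N runs out.
int runAdaptiveSeeding(std::vector<Seed> pending, int N)
{
    int traced = 0;
    while (!pending.empty() && traced < N)
    {
        std::vector<Seed> next;
        for (const Seed &s : pending)
        {
            if (traced >= N) break; // quota used up
            traced++;
            if (traceAndDetectCritical(s)) // explosion event
                for (const Seed &n : explode(s))
                    next.push_back(n);
        }
        pending = std::move(next); // seeds for the next iteration
    }
    return traced;
}
```

With the stub detector never firing, only the initial batch is traced; in practice each explosion feeds new seeds into the next iteration until one of the two stopping conditions hits.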

How do we determine when to trigger an explosion event, though? Well, we split it into two categories. First, we sample the vector field at point \(p\). If \(v(p) = 0\), that means we are approaching a **3D critical point**. A simple rationale for this is that the storm is always quietest in its center. 3D critical points tend to be quite complex, with the vectors around them potentially pointing in all sorts of directions. Like this:

This is only a cross section of a conventional critical point, though we can already see it’s quite messy around the edges. When we encounter a 3D critical point, we just generate new seed points around the old one randomly, uniformly within a sphere, and hope for the best.
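The “uniform sphere” explosion can be sketched with simple rejection sampling (hypothetical names; the actual paper code may differ):

```cpp
#include <cassert>
#include <random>
#include <vector>

struct Vec3 { float x, y, z; };

// Scatter `count` new seeds uniformly inside a sphere of `radius` around a
// 3D critical point, via rejection sampling in the unit cube.
std::vector<Vec3> sphereExplosion(Vec3 center, float radius, int count, unsigned rngSeed)
{
    std::mt19937 rng(rngSeed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<Vec3> seeds;
    while ((int) seeds.size() < count)
    {
        float x = dist(rng), y = dist(rng), z = dist(rng);
        if (x * x + y * y + z * z > 1.0f) continue; // reject points outside the unit sphere
        seeds.push_back({center.x + radius * x, center.y + radius * y, center.z + radius * z});
    }
    return seeds;
}
```

Rejection sampling keeps the distribution uniform over the ball, which naive spherical-coordinate sampling would not.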

The second category, which is also the main contribution of our paper, is the **planar critical point**. At a planar critical point, \(v(p)\) is not required to be 0; au contraire, it is supposed to be a non-zero vector pointing in an arbitrary direction. The catch? The vectors around point \(p\), when projected onto a plane, must satisfy (or approach) the 2D critical point rules.

Take a look at the point in the center. Though technically not a critical point, it still kind of is, because the vectors around it are…well, swirling around it. Look, finding proper words for these is hard. But hopefully, you know what I mean. This is quite useful because planar critical points can help us capture what 3D critical points can’t. Even better, we can make use of the 2D critical point types to place new seed points so that the traced result better reflects the critical region.

If a planar critical point type is determined to be a source, sink, or saddle point, new seed points are generated in a *circular* fashion; this is because circular placement best reflects the overall topological shape of the region. In contrast, new seeds for centers, attracting foci, and repelling foci are placed in a *linear* fashion. This makes the most out of all the seed points, maximizing the amount of information:

“Well that’s good and all,” I hear you say, “but can you even locate these planar critical points guv’nor?” Well first, I haven’t heard the word “guv’nor” in ages. Second, it’s quite simple, actually. Since the vector at a planar critical point cannot be zero, it must point in some sort of direction. This direction can act as a normal vector, so that the right vector *r* and the up vector *u* can be:

Where \(a[v(p)]\) is an arbitrary up vector for \(v(p)\). Then, \(\|N\|\) neighbors surrounding \(p\) are taken on this plane, in a circular fashion. The sample points are quite like the “circular explosion” in the figure above:

\[N_i = p + \cos\left(2 \pi \frac{i - 1}{\|N\|}\right) r + \sin\left(2\pi \frac{i - 1}{\|N\|}\right) u\]

An *alignment metric* \(\mu_p\) is evaluated by computing the normalized dot product of \(v(p)\) and \(v(N_i)\):

If they align poorly, as in, \(\mu_p\) falls under a certain threshold, then we determine that we have encountered a planar critical point. In ideal circumstances, \(\mu_p\) should equal 0, because all the neighbors are perpendicular to \(v(p)\); but obviously this is quite difficult to happen exactly.
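Here’s a sketch of the whole detection step, under a few assumptions of mine: I build the plane basis with cross products, sample the neighbors at unit radius, and average the *absolute* normalized dot products, so perfectly perpendicular neighbors give \(\mu_p = 0\) and perfectly aligned ones give 1 (`alignmentMetric` and friends are hypothetical names, not the paper’s code):

```cpp
#include <cassert>
#include <cmath>
#include <functional>

struct Vec3 { float x, y, z; };

static Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 normalize(Vec3 a) {
    float l = std::sqrt(dot(a, a));
    return {a.x / l, a.y / l, a.z / l};
}

// Average absolute alignment between v(p) and its circular neighbors on the
// plane perpendicular to v(p). Near 0 means a planar critical point candidate.
float alignmentMetric(const std::function<Vec3(Vec3)> &field, Vec3 p, int numNeighbors)
{
    Vec3 n = normalize(field(p)); // v(p) acts as the plane normal
    // Arbitrary up vector a[v(p)]: pick one that is not parallel to n.
    Vec3 a = std::fabs(n.z) < 0.9f ? Vec3{0, 0, 1} : Vec3{1, 0, 0};
    Vec3 r = normalize(cross(a, n)); // right vector
    Vec3 u = cross(n, r);            // up vector
    float sum = 0.0f;
    for (int i = 0; i < numNeighbors; i++)
    {
        float t = 2.0f * 3.14159265f * i / numNeighbors;
        Vec3 ni = {p.x + std::cos(t) * r.x + std::sin(t) * u.x,
                   p.y + std::cos(t) * r.y + std::sin(t) * u.y,
                   p.z + std::cos(t) * r.z + std::sin(t) * u.z};
        sum += std::fabs(dot(n, normalize(field(ni))));
    }
    return sum / numNeighbors;
}
```

For a swirl field whose neighbors all circle around \(v(p)\), this returns roughly 0; for a field parallel to \(v(p)\), roughly 1.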

In any case, once its planar critical point status is determined, we next need to find out its critical point type, by finding the eigenvalues of its 2D planar Jacobian matrix, shown in Table 1. To do this, we first find the matrix *U* to transform \(v(p)\) to local space (or tangent space), so that we only have to deal with 2D vectors. That is,

*U* can be solved by imagining \(v(p)\) is rotated from \(\begin{bmatrix}0 \\ 0 \\ 1\end{bmatrix}\) and then reversing that rotation:

We can now calculate the vector gradients along \(r\) and \(u\):

\[\begin{cases} \Delta_r v(p) &= \frac{v(p) - v(p - \varepsilon r)}{\varepsilon} \\ \Delta_u v(p) &= \frac{v(p) - v(p - \varepsilon u)}{\varepsilon} \end{cases}\]

And plug them into the Jacobian matrix:

\[J_p = \begin{bmatrix} U \Delta_r v(p)_x & U \Delta_r v(p)_y \\ U \Delta_u v(p)_x & U \Delta_u v(p)_y \\ \end{bmatrix}\]

Now that we have the Jacobian matrix, notice that sources, sinks and saddle points have real eigenvalues, while centers, attracting foci, and repelling foci do not. That means we don’t really need to find the actual eigenvalues - we just need to find out whether they *are* real. Recall that we only have two new seed placement strategies: circular corresponds to the former three, and linear to the latter.

With the aforementioned information, we can actually do a little trick to accelerate computation. Take a look at *A quick trick for computing eigenvalues* from 3Blue1Brown (and yes, he’s my favorite math YouTuber) - especially around 4:28, where he shows the quick formula for finding the eigenvalues:

\[\lambda_{1,2} = m \pm \sqrt{m^2 - p}\]

Where \(m = \frac{J_p^{1,1} + J_p^{2,2}}{2}\) and \(p = \det(J_p)\). Notice that the roots can only be real when \(m^2 - p \geq 0\). This can be used to quickly determine whether the eigenvalues are real without ever solving for them (though, to be frank, solving them from here is not much extra work), and, in turn, determine the critical point type and the seed point spawning strategy.
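In code, the trick boils down to checking the sign of the discriminant (a sketch; `hasRealEigenvalues` is my own name for it):

```cpp
#include <cassert>

// Quick real-eigenvalue test for a 2x2 Jacobian using the mean/product trick:
// the eigenvalues are m +- sqrt(m^2 - p), so they are real iff the
// discriminant m^2 - p is non-negative.
bool hasRealEigenvalues(float j11, float j12, float j21, float j22)
{
    float m = (j11 + j22) / 2.0f;    // mean of the eigenvalues (half the trace)
    float p = j11 * j22 - j12 * j21; // product of the eigenvalues (determinant)
    return m * m - p >= 0.0f;
}
```

`true` maps to the circular strategy (source, sink, saddle), `false` to the linear one (center, attracting/repelling focus).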

After tracing the streamlines, a simple method is used to de-clutter them, based on the global distortion value. It’s just a kind of fancy name for the ratio between the actual line length and the distance between the streamline’s starting and ending points:

\[T(L) = \frac{\|L\|}{\|L_\text{end} - L_\text{begin}\|}\]

Streamlines are removed if \(T(L)\) is too low, or in other words, if they are too straight. Here’s what the decluttered streamlines look like. Mostly-straight lines are removed to, again, prioritize showing interesting streamlines first.
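Computing \(T(L)\) for a traced polyline is straightforward (a sketch with hypothetical names):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

static float dist(Vec3 a, Vec3 b)
{
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Global distortion T(L): arc length over endpoint distance.
// A perfectly straight line gives 1; swirly lines give larger values.
float distortion(const std::vector<Vec3> &line)
{
    float len = 0.0f;
    for (size_t i = 1; i < line.size(); i++)
        len += dist(line[i - 1], line[i]);
    return len / dist(line.front(), line.back());
}
```

Since arc length is always at least the endpoint distance, \(T(L) \geq 1\), with 1 being the “perfectly boring” straight case.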

… And this is how we improve upon the original streamline tracing methods. Our method can work as a direct plug-in for other initial seeding strategies as well; since it is adaptive in nature, it could be combined with, say, spherical seeding, and still produce overall better streamlines. Think of it as a plug-in for existing initial seeding strategies.

Finally, let’s take a look at the aforementioned vector field visualization program, *AdaptiFlux*. I extended the debugger so much during the implementation that it became a full-fledged CUDA-accelerated vector field visualizer. So, let’s take a look at what it’s capable of.

- Four visualization methods: Lines, arrow glyphs, streamlines, and streamtubes.
- Contains the full implementation of all proposed methods.
- A real-time FPS monitor is available right within the system.
- Debugging features: a seed point visualizer, camera pose save/retrieve, and more.
- Two initial seeding strategies: linear and spherical.
- All visualizations are real time; tweak the parameters, and they update instantaneously. The rendering procedure is done via OpenGL/CUDA interoperation.

The source code of *AdaptiFlux* is available at https://github.com/42yeah/AdaptiFlux. Here are some extra goodies:

So, in conclusion, we defined a new sort of critical point in 3D, the *planar critical point*, which we find by calculating an alignment metric over a point’s neighborhood. Once one is found, the planar critical point type is determined based on the eigenvalues of the Jacobian, which we optimized a little since we don’t actually need to evaluate the eigenvalues themselves. The critical point type is used to determine new seed placement locations, so that we can maximize the amount of information and decrease repetitive streamlines. Finally, we made a vector field visualization system.

Obviously, quite a lot of things can be improved upon. For example, there are waaay too many parameters right now; in the future, maybe some form of machine learning can help decrease that? The implementation’s code quality is also something to be improved, as I was still learning CUDA as of this paper (well, actually, I still am). However, seeing that I am about to graduate, I may not have time to work on this anymore (not to mention this is not a particularly good paper). So I guess there’s that.

The reason I write this blog post is that I think accessible knowledge is important - not only in the sense of, well, being free, but also in being easier to read, and therefore to learn from. And I always have trouble reading academic papers. That’s why I’m trying to write a blog post on my own paper: hopefully, when someone stumbles upon my paper one day, they will Google it online, and then they’ll encounter this. And hopefully the more relaxed explanation is easier to understand.

And that’s it! Whew. First blog post in 6 months! I was extremely busy the last 6 months due to school work and whatnot, but I am glad to be back. Stay tuned for more <3

OSAT has officially turned *one year old* today. Just absolutely incredible. I did not anticipate this blog living this long; in fact, I thought I would’ve abandoned it at, like, June or something. But it somehow persevered! Though the update schedule has become unstable, we’re still posting. And hey, that’s really all we need.

Anyway, new year, new us. Maybe I’ll update the look & feel of the blog and clean it up a bit. The current look is kind of retro, but maybe I’ll make the interface more minimal so that the UI elements are less distracting (I designed & wrote the CSS myself! :). Other than that, not much will change in 2024 for the blog. I will continue writing new stuff for you guys. Writing is enjoyable, and getting responses from you guys even more so. And now, a list of what I’ll be writing this year:

- More Gaussian Splatting!
- Instancing grass
- Quaternions
- Glass
- Water?
- Global illumination?
- Unreal Engine stuffs??
- Noita??? (I’ve been playing it recently and it’s **such** a cool game)

Well, that’s something. Hopefully we will cover all of them in ‘24! And hey, if you want to write about CG-related stuff, I will happily share this platform with you. It may not be the best platform, but according to Google, our blog gets at least 7 clicks a day (wow!). Crazy, I know. Plus, I don’t mind some guest articles. It doesn’t have to be convoluted, so long as it’s computer graphic-y things. Contact me here.

That’s all for now. To all my readers: take care, and have a fantastic year.

At the end of the day, the “gaussians” the paper mentions are just fancy ellipsoids. So the problem becomes: how do we render fancy ellipsoids *fast* (because boy, there sure are a lot of ellipsoids), and can we do it without CUDA?

Note: if you are trying to follow & implement the same thing in this blog post, you can check out the modified `SIBR_Viewers`’ “ellipsoid” visualization method and use it as a reference, available on GitHub. The full source code of the rasterizer in this blog post is also available on GitHub.

In any case, let’s first take a look at the trained scene. Gaussian Splatting is awesome because there’s no trained-scene vendor lock-in (I have no idea if I used the right word). The trained scene is just one giant binary .PLY file, which we can straight up open in MeshLab. Only we can’t really do that; we are only shown a bunch of points.

So how are we supposed to view it? Time to delve deeper into `SIBR_Viewers`’ source code. Specifically, how `RichPoint` is defined.

Warning: this blog post will have **tons** of code segments. Sorry in advance. Especially you guys, mobile readers <3

```
template<int D>
struct RichPoint
{
    Pos pos;
    float n[3];
    SHs<D> shs;
    float opacity;
    Scale scale;
    Rot rot;
};
```

We learn that a `RichPoint` is just a vertex with extra properties:

- *Pos* is a 3-dimensional vector.
- *SHs* is a \((D + 1)^2\)-element array of 3-dimensional vectors used for spherical harmonics.
- *Scale* is a 3-dimensional vector to denote scaling along the *x*, *y*, and *z* axes.
- *Rot* is a 4-d vector to represent rotation in the form of a quaternion.

The PLY file is just densely packed with these `RichPoint`s. One extra thing to note, though: the size of a `RichPoint` changes depending on the spherical harmonics dimension, which is defined in your training parameters. But usually, when no training parameters are supplied, \(D = 3\). So let’s go ahead and just assume that - and then read the whole PLY model into memory.

```
// Short for "Gaussian Splatting Splats"
struct GSSplats
{
    bool valid;    // Are the in-memory splats valid?
    int numSplats; // How many splats are there?
    std::vector<RichPoint> splats;
};

std::unique_ptr<GSSplats> GSPointCloud::loadFromSplatsPly(const std::string &path)
{
    std::unique_ptr<GSSplats> splats = std::make_unique<GSSplats>();
    splats->numSplats = 0;
    splats->valid = false;

    std::ifstream reader(path, std::ios::binary);
    if (!reader.good())
    {
        std::cerr << "Bad PLY reader: " << path << "?" << std::endl;
        return splats;
    }

    // Get the headers out of the way
    std::string buf;
    std::getline(reader, buf);
    std::getline(reader, buf);
    std::getline(reader, buf);
    std::stringstream ss(buf);
    std::string dummy;

    // Read the number of splats and resize the `splats` array
    ss >> dummy >> dummy >> splats->numSplats;
    splats->splats.resize(splats->numSplats);
    std::cout << "Loading " << splats->numSplats << " splats.." << std::endl;
    while (std::getline(reader, dummy))
    {
        if (dummy.compare("end_header") == 0)
        {
            break;
        }
    }

    // Read the whole thing into memory. "The lot", as they say.
    reader.read((char *) splats->splats.data(), splats->numSplats * sizeof(RichPoint));
    if (reader.eof())
    {
        std::cerr << "Reader is EOF?" << std::endl;
        splats->valid = false;
        return splats;
    }
    splats->valid = true;
    return splats;
}
```

The above code snippet tries to parse a trained PLY scene; if it fails, or encounters EOF during reading, it returns an *invalid* `GSSplats`. Otherwise, it returns a valid PLY scene, with the points safely stored inside the `splats` vector.

What’s cool about this is the fact that by simply reading the scene, we already have enough information to visualize the scene (in the form of a point cloud). So let’s do that!

```
// ShaderBase::Ptr is just a shared_ptr<ShaderBase> with a .use() method
bool GSPointCloud::configureFromPly(const std::string &path, ShaderBase::Ptr shader)
{
    const auto splatPtr = loadFromSplatsPly(path);
    if (!splatPtr->valid)
    {
        return false;
    }
    _numVerts = splatPtr->numSplats;
    _shader = shader;

    // Initialize VAO and VBO
    glBindVertexArray(_vao);
    glBindBuffer(GL_ARRAY_BUFFER, _vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(GSPC::RichPoint) * splatPtr->numSplats, splatPtr->splats.data(), GL_STATIC_DRAW);

    // #1: Set vertex attrib pointers
    constexpr int numFloats = 62;
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(float) * numFloats, nullptr);
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(float) * numFloats, (void *) (sizeof(float) * 6));

    // #2: Set the model matrix
    _model = glm::scale(_model, glm::vec3(-1.0f, -1.0f, 1.0f));
    return true;
}
```

We set the vertex attrib pointers in `#1`. Why 62 though? Well, one `RichPoint` comprises 62 floats in grand total: 3 for the position, 3 for the normal, \(3 \times (3 + 1)^2 = 48\) for the spherical harmonics, 1 for the opacity, 3 for the scale, and 4 for the rotation quaternion.

We set two vertex attrib pointers: one for vertex positions, and the other for vertex colors. For the vertex color one, we simply use the first band of the spherical harmonics.

Then, in `#2`, we set the model matrix by negating the scene’s X and Y coordinates. For some weird reason, the scene is totally inverted - I am unsure as to why, but I guess it’s due to right hand/left hand shenanigans. Anyway, we need to correct that by setting the model matrix. Or, alternatively, you can correct it by changing your camera’s arbitrary `up` to (0, -1, 0), which we will do later (and explain why). For now, setting the model matrix should suffice.

Next up, we implement our (very simple) vertex shader:

```
#version 430 core

layout (location = 0) in vec3 aPos;
layout (location = 1) in vec3 aSH;

// ... uniform MVP matrices

out vec3 color;

void main() {
    gl_Position = perspective * view * model * vec4(aPos, 1.0);
    color = aSH * 0.28 + vec3(0.5, 0.5, 0.5);
}
```

And fragment shader:

```
#version 430 core

in vec3 color;
out vec4 outColor;

void main() {
    outColor = vec4(color, 1.0);
}
```

And BAM! We have our point cloud, ladies and gentlemen.

Obviously we can’t just stop at point clouds. Our target is to render the splats, after all. However, we are already halfway there, since splats are just points enlarged into various forms (“gaussians” in our case). We can’t directly use the `RichPoint`s though, not yet; we have to preprocess them first.

The preprocessing involves:

- Exponentiate scales;
- Normalize quaternions;
- And activate opacities (by passing them through a *sigmoid* function).

While we’re doing that, let’s also transform the input data from a densely packed array of structures (AoS) into a structure of arrays (SoA). This is required since rendering splats is not as trivial as rendering points, and will require new OpenGL functionality.

```
std::vector<glm::vec4> positions;
std::vector<glm::vec4> scales;
std::vector<glm::vec4> colors;
std::vector<glm::vec4> quaternions;
std::vector<float> alphas;

positions.resize(splatPtr->numSplats);
// ... other resizes

for (int i = 0; i < splatPtr->splats.size(); i++)
{
    const glm::vec3 &pos = splatPtr->splats[i].position;
    positions[i] = glm::vec4(pos.x, pos.y, pos.z, 1.0f);

    const glm::vec3 &scale = splatPtr->splats[i].scale;
    scales[i] = glm::vec4(exp(scale.x), exp(scale.y), exp(scale.z), 1.0f);

    const SHs<3> &shs = splatPtr->splats[i].shs;
    glm::vec4 color = glm::vec4(shs.shs[0], shs.shs[1], shs.shs[2], 1.0f);
    colors[i] = color;

    quaternions[i] = glm::normalize(splatPtr->splats[i].rotation);
    alphas[i] = sigmoid(splatPtr->splats[i].opacity);
}
```
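The `sigmoid` helper used in the loop isn’t shown above; it’s just the standard logistic function (a minimal version):

```cpp
#include <cassert>
#include <cmath>

// Standard logistic sigmoid, used to squash raw opacities into (0, 1).
float sigmoid(float x)
{
    return 1.0f / (1.0f + std::exp(-x));
}
```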

Our next problem is splat rendering. Again, recall that the “gaussians” are just fancy ellipsoids - so now, we have to render those. But how do we rasterize one, really? OpenGL doesn’t really support rendering ellipsoids. Well, you’re in luck, because I **just** wrote a blog post on how to render them - enter Rendering a Perfect Sphere in OpenGL, then an Ellipsoid. Please go ahead and read that first!

Alright, have you read that? If you haven’t, I will make a short summary for you: the core idea is to first render the ellipsoid’s bounding box, then discard fragments during the fragment processing pass by checking if the camera ray hits the inner ellipsoid or not. One thing to add here is the fact that we are not just rendering one ellipsoid. Rather, we are rendering tens of thousands of them. This calls for the first technique we’ll have to use, and that is instancing.

We first create our cube VAO:

```
bool GSEllipsoids::configureFromPly(const std::string &path, ShaderBase::Ptr shader)
{
    const auto splatPtr = loadFromSplatsPly(path);
    if (!splatPtr->valid)
    {
        return false;
    }
    _numInstances = splatPtr->numSplats;
    _numVerts = 36;

    glBindVertexArray(_vao);
    glBindBuffer(GL_ARRAY_BUFFER, _vbo);
    glBufferData(GL_ARRAY_BUFFER, sizeof(cube), cube, GL_STATIC_DRAW);
    glEnableVertexAttribArray(0);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(float) * 3, nullptr);
    return true;
}
```

The static `cube` variable is defined here. During rendering, we instance-render one cube per splat:

```
_shader->use();
// ... uniforms
glBindVertexArray(_vao);
glDrawArraysInstanced(GL_TRIANGLES, 0, _numVerts, _numInstances);
```

Now we have tens of thousands of cubes in one place. Cool! It is also unbelievably laggy.

Next, we need to put the cubes into their proper places - as the cubes are the bounding boxes of the splats, each cube’s center should coincide exactly with its splat’s center, and likewise for the scales and rotations. How can we access these properties in our shaders, though? They are not available as vertex attributes; we only have the vertices of the cube there. We could try using Uniform Buffer Objects (UBOs), but those fall short due to the **large** amount of data.

That, and a UBO requires a predetermined size before rendering, making UBOs unrealistic here. So how can we pass the data into the shader? Time to bust out the big gun: the Shader Storage Buffer Object (SSBO)! SSBOs are very much like UBOs, but without any of their shortcomings, and are generally just better. Their differences are:

- SSBOs can be much, much larger. According to Khronos, UBOs can be up to 16KB (implementation specific). In comparison, the spec guarantees SSBOs can be up to 128MB, and most implementations let you use the whole GPU memory.
- SSBOs are writable. In shader!
- SSBOs can have **variable storage**. No more fixed-size preallocation!
- SSBOs are likely slower than UBOs.

Time to allocate 5 SSBOs, for `positions`, `scales`, `colors`, `quaternions`, and `alphas`, respectively. We can implement a template function capable of creating SSBOs of different atomic types, and it’s really simple:

```
template<typename T>
GLuint generatePointsSSBO(const std::vector<T> &points)
{
    GLuint ret = GL_NONE;
    glCreateBuffers(1, &ret);
    glNamedBufferStorage(ret, sizeof(T) * points.size(), points.data(), GL_MAP_READ_BIT);
    return ret;
}
```

Then we create SSBOs using

```
_positionSSBO = generatePointsSSBO(positions);
_scaleSSBO = generatePointsSSBO(scales);
_colorSSBO = generatePointsSSBO(colors);
_quatSSBO = generatePointsSSBO(quaternions);
_alphaSSBO = generatePointsSSBO(alphas);
```

And we’re good to go. Just don’t forget to delete them when the program exits, via `glDeleteBuffers`. Pass the SSBOs into the shaders using `glBindBufferBase`:

```
_shader->use();
// ... uniforms
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, _positionSSBO);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, _scaleSSBO);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 2, _colorSSBO);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 3, _quatSSBO);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 4, _alphaSSBO);
glBindVertexArray(_vao);
glDrawArraysInstanced(GL_TRIANGLES, 0, _numVerts, _numInstances);
```

Next up, we declare the buffers at their respective binding points in the vertex shader. We can now access them by using `positions[gl_InstanceID]` etc.

```
layout (std430, binding = 0) buffer splatPosition {
    vec4 positions[];
};
layout (std430, binding = 1) buffer splatScale {
    vec4 scales[];
};
layout (std430, binding = 2) buffer splatColor {
    vec4 colors[];
};
layout (std430, binding = 3) buffer splatQuat {
    vec4 quats[];
};
layout (std430, binding = 4) buffer splatAlpha {
    float alphas[];
};
```

With all these in our hands, we can now transform the input vertices to their correct position.

```
// layout (std430, ...

out vec3 position;
out vec3 ellipsoidCenter;
out vec3 ellipsoidScale;
out mat3 ellipsoidRot;
out float ellipsoidAlpha;

void main() {
    // #1: scale the input vertices
    vec3 scale = vec3(scales[gl_InstanceID]);
    vec3 scaled = scale * aPos;

    // #2: transform the quaternions into rotation matrices
    mat3 rot = quatToMat(quats[gl_InstanceID]);
    vec3 rotated = rot * scaled;

    // #3: translate the vertices
    vec3 posOffset = rotated + vec3(positions[gl_InstanceID]);
    vec4 mPos = vec4(posOffset, 1.0);

    // #4: pass the ellipsoid parameters to the fragment shader
    position = vec3(mPos);
    ellipsoidCenter = vec3(positions[gl_InstanceID]);
    ellipsoidScale = scale;
    ellipsoidRot = rot;
    ellipsoidAlpha = alphas[gl_InstanceID];

    gl_Position = perspective * view * model * mPos;
    color = vec3(colors[gl_InstanceID]) * 0.28 + vec3(0.5, 0.5, 0.5);
}
```

We need to first scale the vertices, then rotate them, and finally translate them, as shown in `#1`, `#2`, and `#3`. Imagine we are transforming a sphere: if we rotated the sphere first *then* scaled it, it wouldn’t make much sense, as rotating a sphere is completely meaningless - spheres are rotation invariant. So we have to transform them in this order. In `#2`, we transform the quaternions into rotation matrices. Here I just Googled & used the quaternion-to-rotation-matrix conversion function from Automatic Addison; if you are interested and want to learn more about quaternions, do check out Visualizing Quaternions by Ben Eater and 3Blue1Brown.

```
mat3 quatToMat(vec4 q) {
    return mat3(2.0 * (q.x * q.x + q.y * q.y) - 1.0, 2.0 * (q.y * q.z + q.x * q.w), 2.0 * (q.y * q.w - q.x * q.z), // 1st column
                2.0 * (q.y * q.z - q.x * q.w), 2.0 * (q.x * q.x + q.z * q.z) - 1.0, 2.0 * (q.z * q.w + q.x * q.y), // 2nd column
                2.0 * (q.y * q.w + q.x * q.z), 2.0 * (q.z * q.w - q.x * q.y), 2.0 * (q.x * q.x + q.w * q.w) - 1.0); // last column
}
```

In `#4`, we pass all the ellipsoid parameters into the fragment shader to perform sphere tracing later; let’s not do that just yet, though. What happens if we try to visualize the result now? It’s quite interesting, as it’s already very close to the actual scene. We can even walk around and take a look.

Warning: large video! (13.2MB)

The final step involves tracing rays into the screen and determining whether any of them hit the ellipsoids within the bounding boxes. Again, give this blog post a read if you want to properly understand how. With all the information we have, though, it is more than enough for us to render the ellipsoids.

```
in vec3 position;
in vec3 ellipsoidCenter;
in vec3 ellipsoidScale;
in float ellipsoidAlpha;
in mat3 ellipsoidRot;
in vec3 color;

uniform vec3 camPos;
// ... MVP uniforms

out vec4 outColor;

void main() {
    vec3 normal = vec3(0.0);

    // Check if we intersect the sphere, given the ellipsoid center, scale, and rotation.
    // `normal` is an output variable; we additionally obtain the normal once we know there is an intersection.
    vec3 intersection = sphereIntersect(ellipsoidCenter, camPos, position, normal);

    // Discard if there's no intersection
    if (intersection == vec3(0.0)) {
        discard;
    }

    vec3 rd = normalize(camPos - intersection);
    float align = max(dot(rd, normal), 0.1);

    vec4 newPos = perspective * view * model * vec4(intersection, 1.0);
    newPos /= newPos.w;
    gl_FragDepth = newPos.z;

    // Lightly shade it by making it darker around the scraping angles.
    outColor = vec4(align * color, 1.0);
}
```

The fragment depth is updated once an intersection is found. Since the world position of the shading point, which is on the box, differs from the world position of the point on the inner ellipsoid, we have to reproject the intersection to NDC and set the depth appropriately. And since we can obtain the normal from the intersection check, we might as well shade it a little to make it look exactly like the ellipsoids in `SIBR_Viewers`.

Warning: large video! (13.2MB)

This is also when we have to ditch the model matrix flip and flip the camera upside down instead. The sphere tracing has not accounted for the model matrix, and will result in all sorts of janky ellipsoids.

The result is quite close, but some places are still unsatisfactory. For example, there are a lot of noisy ellipsoids blocking most of the view during our visualization. Fixing that is simple enough: those ellipsoids usually have a very low alpha value, so we can filter them out easily via a simple alpha threshold check.

```
void main() {
    if (ellipsoidAlpha < 0.3) {
        discard;
    }
    vec3 normal = vec3(0.0);
    vec3 intersection = sphereIntersect(...
```

This can greatly clean up the scene:

And finally, let’s scale the size of each ellipsoid by 2. I have no concrete idea why, but it is what `SIBR_Viewers` does.

```
void main() {
    vec3 scale = vec3(scales[gl_InstanceID]) * 2.0;
    vec3 scaled = scale * aPos;
    mat3 rot = quatToMat(...
```

It has the benefit of making the text on the train much more prominent and easier to read (take a look at “Western Pacific”). The frame rate drops greatly, though, probably because we now need to trace more rays for each ellipsoid:

And that’s it!

Isn’t it supposed to look like this though?

Yeah, but that render is done in CUDA. Though we have avoided CUDA throughout this blog post, and achieved… results, there is something that OpenGL just cannot do, the most crippling one being alpha blending.

I have missed one key thing in the paragraphs above. “Gaussians” are not merely ellipsoids; they are transparent ellipsoids, as seen in the `opacity` field of `RichPoint`. This means we have to perform alpha compositing while rendering our gaussians. But alpha compositing is non-commutative; that is, A over B is not equal to B over A.

To perform alpha compositing correctly, we have to render the furthest thing first, then composite bit by bit up to the closest. Putting this into the context of our gaussian splat rendering program, it means we need to sort the splats by depth relative to the camera *every frame*: because the viewing camera is constantly on the move, the order of splats can change between frames. OpenGL alone is obviously not enough; not to mention that instanced rendering, the thing that gives us more than 1 FPS in the first place, totally ignores the drawing order.
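Conceptually, the per-frame ordering we’d need is just a depth sort; here’s a CPU sketch with `std::sort` (far too slow for hundreds of thousands of splats per frame, but it shows the requirement):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

static float dist2(Vec3 a, Vec3 b)
{
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Produce splat indices ordered back-to-front relative to the camera, which is
// the order alpha compositing needs. This has to be re-run every frame.
std::vector<int> sortBackToFront(const std::vector<Vec3> &centers, Vec3 camPos)
{
    std::vector<int> order(centers.size());
    for (size_t i = 0; i < order.size(); i++)
        order[i] = (int) i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return dist2(centers[a], camPos) > dist2(centers[b], camPos); // furthest first
    });
    return order;
}
```

Real implementations push this onto the GPU (e.g. a parallel radix sort), which is exactly where CUDA earns its keep.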

However, not all hope is lost. If we find some (very quick) method capable of sorting more than two hundred thousand splats by depth per frame, we may just be onto something. After that, we could compute the *real* SH color in the fragment shader as well, by taking all 16 coefficients into consideration. Then maybe, just maybe, we can avoid using CUDA altogether. For now though, CUDA offers much better flexibility during splat rasterization, so it will have to do - and we’ll have another blog post on how to *properly* rasterize the gaussian splats in the unspecified future.

Again, if you are interested in the implementation of the rasterizer in this blog post, the full source code is available online on GitHub, so go and take a look. Until then, toodles!

But now here comes the **real** problem: how is this “screen-space raytracing” done? Well, let’s take a look!

As we are, in actuality, rendering a bounding box of the sphere, we already have the world position of the pixel being shaded during the fragment pass. However, that is not the world position of a point on our sphere. Rather, it is a point on the bounding box, which almost always lies outside of the sphere, except at the six points where the box touches it. Nevertheless, we still need to calculate that position first, in the vertex shader:

```
#version 430 core

layout (location = 0) in vec3 aPos;

uniform mat4 model;
uniform mat4 view;
uniform mat4 perspective;

out vec3 worldPos;

void main() {
    vec4 wPos = model * vec4(aPos, 1.0);
    worldPos = vec3(wPos);
    gl_Position = perspective * view * wPos;
}
```

To recover the real world position of the shading fragment of our sphere, we need a ray-sphere intersection check. Luckily, we know the ray direction: since the box intersection point \(p_\text{box}\), sphere intersection point \(p_\text{sphere}\), and camera position \(o\) are collinear, we can deduce that the ray direction is

\[d = \frac{p_\text{box} - o}{|| p_\text{box} - o ||}\]Let’s assume the sphere has a radius of 1. If the ray does hit the sphere, the intersection point \(p_\text{sphere}\) lies **on the sphere**, so its distance to the sphere center must equal the radius (1).

So now we have

\[r = ||p_\text{sphere} - c|| = || o + t d - c || = 1\]If we can solve for \(t\), we will know exactly where the sphere shading point is located. So let’s expand the above equation by squaring both sides. And if you recall, the length of a vector is the square root of the dot product of the vector with itself. So:

\[\langle (o + t d - c), (o + t d - c) \rangle = 1\]Let’s represent \(o - c\) as \(u\). It represents the camera’s position relative to the center of the sphere, i.e. the camera’s sphere-local coordinates. By throwing the 1.0 to the left side as well, the above equation can be rewritten as:

\[\langle td + u, td + u \rangle - 1.0 = 0\]And now we further expand those dot operators and isolate \(t\):

\[\begin{align} (u_x + t d_x)^2 + (u_y + t d_y)^2 + (u_z + t d_z)^2 - 1 &= 0 \\ t^2 d_x^2 + u_x^2 + 2t u_x d_x + t^2 d_y^2 + u_y^2 + 2t u_y d_y + t^2 d_z^2 + u_z^2 + 2t u_z d_z - 1 &= 0 \\ (d_x^2 + d_y^2 + d_z^2) t^2 + 2(u_x d_x + u_y d_y + u_z d_z) t + u_x^2 + u_y^2 + u_z^2 - 1 &= 0 \end{align}\]Waaait a minute… Something is looking kind of fishy. We can bust out the quadratic formula!

\[\begin{cases} a = (d_x^2 + d_y^2 + d_z^2) = \langle d, d \rangle \\ b = 2(u_x d_x + u_y d_y + u_z d_z) = 2 \langle u, d \rangle \\ c = u_x^2 + u_y^2 + u_z^2 - 1.0 = \langle u, u \rangle - 1 \\ \end{cases}\]Discriminant \(\Delta\) can be calculated using \(b^2 - 4 a c\). And if it is less than 0? That means the ray has no intersection with the sphere. This is definitely a thing:

When this happens, we just `discard` the current fragment. No extra steps needed. And with the discriminant at hand, the only thing left for us to do is to actually solve for \(t\).

We will take the \(t\) with the smallest value, since the ray intersects the closer point (to us) first, not the further one. Now \(p_\text{sphere}\) can be properly recovered by using

\[p_\text{sphere} = o + t d\]```
#version 430 core

in vec3 worldPos;
out vec4 color;

uniform vec3 camPos;
uniform vec3 sphereCenter;

vec3 sphereIntersect(vec3 c, vec3 ro, vec3 p) {
    vec3 rd = normalize(p - ro);
    vec3 u = ro - c; // ro relative to c
    float a = dot(rd, rd);
    float b = 2.0 * dot(u, rd);
    float cc = dot(u, u) - 1.0;
    float discriminant = b * b - 4.0 * a * cc;
    // No intersection
    if (discriminant < 0.0) {
        return vec3(0.0);
    }
    float t1 = (-b + sqrt(discriminant)) / (2.0 * a);
    float t2 = (-b - sqrt(discriminant)) / (2.0 * a);
    float t = min(t1, t2);
    return ro + t * rd;
}

void main() {
    vec3 sp = sphereIntersect(sphereCenter, camPos, worldPos);
    if (sp == vec3(0.0)) {
        discard;
    }
    color = vec4(abs(sp), 1.0);
}
```

The code should provide a perfect sphere. And it stays a sphere, even when you are very, very close to it.

You will find the above shaded result looks suspiciously like the normals of the sphere. Though we are shading the coordinates, it feels very normal-ly. And it is! Remember that we are shading a unit sphere, and the sphere presented above is located at the dead center of the world, making each shading point equal to its normal. Even if the sphere is not located at the center, the normal is still simple enough to obtain: just subtract the sphere center from \(p_\text{sphere}\), and BAM! That’s our normal.

```
vec3 localIntersection = intersection - c;
normal = localIntersection;
```

Since we have already taken care of screen-space sphere raytracing, ellipsoids are basically free - with a few extra steps. Since we can treat ellipsoids as transformed spheres, we can assign two extra properties to our sphere:

- The scaling of the ellipsoid \(s\);
- The rotation of the ellipsoid \(R\).

During our tracing, we require that a sphere is first scaled, then rotated. If we rotate first, then the rotation will be meaningless - as spheres are rotation-invariant.

The scaling is not required to be uniform; that is, the scaling vector is not required to be equal on all axes. So now that we have the scale \(s\) and the rotation \(R\), how do we trace out the transformed sphere? Let’s go over scaling and rotation one by one.

To scale the inner ellipsoid, we need to first scale the outer bounding box, as the bounding box must contain the whole thing.

```
void Ellipsoid::compositeTransformations()
{
    // Composite transformations in the reverse order:
    // translate <- scale
    _model = glm::mat4(1.0f);
    _model = glm::translate(_model, _center);
    _model = glm::scale(_model, _scale);
}
```

Now that that’s out of the way, recall the sphere raytracing code above:

```
// We added a new "normal" output parameter
vec3 sphereIntersect(vec3 c, vec3 ro, vec3 p, out vec3 normal) {
    vec3 rd = normalize(p - ro);
    vec3 u = ro - c; // ro relative to c
    // ...
    vec3 intersection = ro + t * rd;
    vec3 localIntersection = intersection - c;
    normal = localIntersection;
    return intersection;
}
```

You might’ve noticed that the whole tracing procedure is done in *local space*: `u` is the camera origin relative to the sphere center, and `rd` is the incoming ray direction. This means not much modification is needed to achieve non-uniform sphere scaling: in lieu of transforming the sphere itself, we can **inverse-transform the camera into local space based on the sphere’s transformation parameters**.

In other words, if the ellipsoid is scaled up, the relative camera position should move closer to the unit sphere, as it takes fewer radii to reach the camera (because the unit of length in local space grows). Perhaps an illustration is better.

To account for this, we apply the inverse scaling to both the ray direction `rd` and the relative camera position `u`:

```
vec3 rd = normalize(p - ro) / vec3(sphereScale);
vec3 u = vec3(ro - c) / vec3(sphereScale); // ro relative to c
```

We don’t need to re-normalize `rd` after the scaling: we are only using `rd` to solve for \(t\) anyway, and we can always recover the correct intersection point by evaluating \(p = o + t d\). Plus, we need to scale the intersection point from local space back to world space anyway, and normalizing `rd` would mess that up. Now, we update how `intersection` and `normal` are calculated as well:

```
vec3 intersection = ro + vec3(t * rd) * sphereScale;
vec3 localIntersection = (intersection - c) / sphereScale;
normal = localIntersection;
```

Now we can freely scale our ellipsoid!

Rotation is not too different. First, we update the model matrix again:

```
void Ellipsoid::compositeTransformations()
{
    // Composite transformations in the reverse order:
    // translate <- rotation <- scale
    _model = glm::mat4(1.0f);
    _model = glm::translate(_model, _center);
    _model = _model * _rotation;
    _model = glm::scale(_model, _scale);
}
```

Recall that the inverse of a rotation matrix is equal to its transpose. As we are transforming in the order scale → rotation, we need to first inverse-rotate, **then** inverse-scale back to local space.

```
vec3 sphereIntersect(vec3 c, vec3 ro, vec3 p, out vec3 normal) {
    mat3 sphereRotationT = transpose(sphereRotation);
    vec3 rd = (sphereRotationT * normalize(p - ro)) / sphereScale;
    vec3 u = (sphereRotationT * (ro - c)) / sphereScale; // ro relative to c
    // ...
```

The traced intersection point now needs to go back to world space by scaling then rotating, with the local intersection obtained by the inverse of that:

```
vec3 intersection = ro + sphereRotation * (vec3(t * rd) * sphereScale);
vec3 localIntersection = ((mat3(sphereRotationT) * (intersection - c)) / sphereScale);
```

One little catch: while the normal should **not** be affected by the (non-uniform) scaling, it does need to follow the rotation, so we apply the rotation matrix to the local intersection once more.

```
normal = sphereRotation * localIntersection;
```

And that’s it! We have effectively ~~rasterized~~ raytraced an ellipsoid.

Here’s the complete fragment shader source code, with a few added extra tidbits:

```
#version 430 core

in vec3 worldPos;
out vec4 color;

uniform mat4 model;
uniform mat4 view;
uniform mat4 perspective;
uniform vec3 camPos;
uniform vec3 sphereCenter;
uniform vec3 sphereScale;
uniform mat3 sphereRotation;

/**
 * Checks whether the ray from ro towards p intersects an ellipsoid
 * located at c (a unit sphere with radius r = 1.0, scaled and rotated).
 */
vec3 sphereIntersect(vec3 c, vec3 ro, vec3 p, out vec3 normal) {
    mat3 sphereRotationT = transpose(sphereRotation);
    vec3 rd = (sphereRotationT * normalize(p - ro)) / sphereScale;
    vec3 u = (sphereRotationT * (ro - c)) / sphereScale; // ro relative to c
    float a = dot(rd, rd);
    float b = 2.0 * dot(u, rd);
    float cc = dot(u, u) - 1.0;
    float discriminant = b * b - 4.0 * a * cc;
    // No intersection
    if (discriminant < 0.0) {
        return vec3(0.0);
    }
    float t1 = (-b + sqrt(discriminant)) / (2.0 * a);
    float t2 = (-b - sqrt(discriminant)) / (2.0 * a);
    float t = min(t1, t2);
    vec3 intersection = ro + sphereRotation * (t * rd * sphereScale);
    vec3 localIntersection = (sphereRotationT * (intersection - c)) / sphereScale;
    normal = sphereRotation * localIntersection;
    return intersection;
}

void main() {
    vec3 nor = vec3(0.0);
    vec3 sp = sphereIntersect(sphereCenter, camPos, worldPos, nor);
    if (sp == vec3(0.0)) {
        discard;
    }
    // Update the fragment depth to prevent Z fighting. sp is already in
    // world space, so no model matrix here; NDC z in [-1, 1] is mapped
    // to the default depth range [0, 1].
    vec4 shadingPos = perspective * view * vec4(sp, 1.0);
    shadingPos /= shadingPos.w;
    gl_FragDepth = shadingPos.z * 0.5 + 0.5;
    // Cheap AF diffuse
    float col = max(dot(normalize(vec3(1.0, 2.0, 3.0)), nor), 0.0);
    // I love orange
    color = vec4(col * vec3(1.0, 0.5, 0.0), 1.0);
}
```

As illustrated in the very first image, the current shading point (the point on our bounding box) and the **real** shading point (the point on the ellipsoid) are different; therefore, we need to update the fragment depth by re-projecting the real shading point back to NDC, to prevent Z fighting. But other than that, it’s pretty much swell.

The shader above can certainly be optimized in various ways. For example, instead of calculating the world intersection position outright, we can calculate the local intersection position first, using `localIntersection = u + t * d`, and obtain the normal with much more ease. On that note, I strongly recommend checking out SIBR Viewers’ ellipsoid rasterization shader, which I treated as a reference during my implementation. However, since I still implemented mine from scratch, things might range from a little bit different to wildly different.

In any case, now we know how to raytrace a sphere, then an ellipsoid. But why, you ask? Well, the need to render a perfect sphere will come sooner or later, and there may soon be another blog post about it… But hey, who knows.

I picked sokol_app and sokol_gfx for low-level graphics (it turns out I don’t need much), and of course, ImGui for the sweet, sweet immediate mode GUI. Using `ImGui::TextWrapped` for most of the things even makes the app kinda responsive (?), and that’s just so cool. The result? A **fast**, **standalone** batch image compression tool on the internet, while also being fully local. No image uploads needed. The whole thing is just a static web page, with a ~780K WASM binary. Follow me on this journey to discover the future of frontend development (?).

To start off, let me explain why I want to do this. The reason is fairly simple: I have always had kind of a distaste for Electron and browser-based apps. I don’t know where it stems from, it’s just there. VSCode, Discord, and almost all productivity apps that claim to be cross-platform, you name it. I think JS-backed apps are clunky and slow; there’s often this notable frame drop, and they always make me uncomfortable. It also doesn’t help that I make like a ton of them.

Graphics apps, on the other hand, always feel very responsive (in the sense that when I click, something happens without delay) and fast. That feeling’s like a drug, and I am fully addicted to it. With WASM, graphics apps have the true potential of porting to every device on the planet, including embedded devices. So to test things out, I thought up a thing I may need to repeatedly use - batch image compression, in my case - and tried to implement it, to see where it goes.

Alright, that’s a very long rant. I know JS code can be fast; this is largely just personal opinion. Anyway, I hope you now see why I think WASM apps are worth exploring. So, let’s begin!

To kickstart the project, we will need a few libraries and tools. Emscripten is the first thing: we need `emcc` and `em++` to create WASM binaries. Then, clone floooh/cimgui-sokol-starterkit and compile it to get started. I replaced CImGui with ImGui because I prefer the C++ version more, and `sokol_imgui` is capable of handling C++ ImGui anyway. If you disagree, you can just go on with CImGui; there really isn’t that much of a difference here.

```
# I added imgui to the project root and changed up line 12 of the
# CMakeLists.txt in cimgui-sokol-starterkit:
add_library(imgui STATIC
    imgui/imgui.cpp
    imgui/imgui.h
    imgui/imgui_widgets.cpp
    imgui/imgui_draw.cpp
    imgui/imgui_tables.cpp
    imgui/imgui_demo.cpp)
target_include_directories(imgui INTERFACE imgui)
```

Then we need to switch up `sokol.cpp`, as we will be including `<imgui.h>` now instead of `<cimgui.h>`. The same goes for `demo.c` (which I renamed to `main.cpp`):

```
#include <imgui.h>
```

By also replacing the CImGui calls in `main.cpp` with ImGui calls, we should now be able to run the demo without any noticeable changes.

Next up, we need to design our interface. I want it to be website-like, so no visible ImGui windows (a goal we will partly fail later). The initial interface should also be clean, with a big button for image compression configs and another big button to upload images. I added a third big button to explain how the compression tool works and to shill myself.

We are going to delete the demo background color selection window, and replace it with a windowless ImGui window:

```
if (ImGui::Begin("Windowless", nullptr, ImGuiWindowFlags_NoTitleBar | ImGuiWindowFlags_NoBackground |
                 ImGuiWindowFlags_NoMove | ImGuiWindowFlags_NoResize | ImGuiWindowFlags_NoBringToFrontOnFocus))
{
    ImGui::TextWrapped("IMZIP by 42yeah");
    ImGui::Separator();
    ImGui::TextWrapped("<INTRODUCTION>");
    if (ImGui::CollapsingHeader("How ImZip works"))
    {
        // ... UI elements explaining how it works
    }
    if (ImGui::CollapsingHeader("Compression Configs"))
    {
        // Compression configurations
    }
    if (ImGui::Button("Upload images ..."))
    {
        // Handle image uploads...
    }
    ImGui::SameLine(); ImGui::TextWrapped("%d images selected.", 0);
    // <GALLERY CODE>
}
ImGui::End(); // End() must be called regardless of what Begin() returns
```

After the user has uploaded images to the WASM MEMFS (the in-memory filesystem), we want to display a little gallery as well, so that users can see what they have uploaded and manipulate the images. But that comes later. So far, the UI-defining code is probably done, so let’s fill it with logic.

An image compression tool needs to have these four (4) core features:

- The app should have the ability to receive images;
- The images should be viewable in-app;
- The app should be able to compress them;
- The compressed image should be downloadable.

Let’s go over them one by one.

So the very first thing we need to do is handle image uploads. When the user clicks on the **“Upload images …”** button, a file dialog should appear. Normally, we would use ImGuiFileDialog, but as the WASM runtime is a sandboxed environment with no knowledge of the host machine (the MEMFS is in-memory), we can’t really do that. Instead, we have to resort to good ol’ HTML + JavaScript (OH NO!). A file input needs to be added to the shell file `shell.html`:

```
<input id="file-input" type="file" multiple onchange="upload()" accept="image/png, image/jpeg">
```

And when the user clicks on the upload button, we can just simulate a click event, using `EM_ASM`:

```
if (ImGui::Button("Upload images ..."))
{
    EM_ASM(
        document.querySelector("#file-input").click();
    );
}
```

As we’ll see down the road, this method has a major pitfall, but it’ll have to do for now.

When the input changes, it calls the `upload()` function, which somehow needs to tell our WASM module that the files are ready. But before all that, we need the full file in MEMFS, or the WASM module won’t be able to load it. To do that, we first load the file using a `FileReader`, then write it into memory using the Emscripten filesystem API, `Module.FS`. We need to update `CMakeLists.txt` to export the filesystem APIs:

```
# I don't know if WASMFS works here but the documentation says it's better
# Relevant GitHub issue: https://github.com/emscripten-core/emscripten/issues/6061
target_link_options(imzip PRIVATE -sWASMFS -sFORCE_FILESYSTEM -sEXPORTED_RUNTIME_METHODS=['FS'] -sALLOW_MEMORY_GROWTH)
```

Loading the images in WASM tends to take up a lot of memory and can easily overflow the 16M default limit, so we need `-sALLOW_MEMORY_GROWTH`. The `upload` function is listed below; it writes the list of images into the root directory of the MEMFS.

```
const files = document.querySelector("#file-input");

function upload() {
    let doneFiles = 0;
    for (let i = 0; i < files.files.length; i++) {
        const reader = new FileReader();
        function fileLoaded(e) {
            // TODO: catch errors
            const buffer = new Uint8Array(reader.result);
            Module.FS.writeFile("/" + files.files[i].name, buffer);
            console.log("File written: ", files.files[i].name);
            doneFiles++;
            if (doneFiles == files.files.length) {
                uploadDone();
            }
        }
        reader.addEventListener("loadend", fileLoaded);
        reader.readAsArrayBuffer(files.files[i]);
    }
}
```

`uploadDone` will be called once all files have been uploaded. In `uploadDone`, we need to figure out how to pass the list of images to our WASM module, which is actually not a trivial task. We have two available options:

- The cool option: somehow pass an array of strings to WASM.
- The wimp option: write the file list into a file and tell WASM to read that file.

I have opted to go for the wimp option (NOOOO!). The file list is written to `/info.txt` and our WASM module is notified (`Module._images_selected`) to check out that file.

```
function uploadDone() {
    let fileInfo = "";
    for (let i = 0; i < files.files.length; i++) {
        fileInfo += "/" + files.files[i].name + "\n";
    }
    try {
        Module.FS.unlink("/info.txt");
    } catch (e) {
        // We don't care about the unlink result (the file may not exist yet,
        // in which case FS.unlink throws)
    }
    Module.FS.writeFile("/info.txt", fileInfo);
    Module._images_selected();
}
```

Though I have not read through these articles, they’re worth checking out if you want the cool option:

- How to pass a string to C code compiled with emscripten for WebAssembly
- Passing arrays and objects from JavaScript to c++ in Web Assembly
- WASM – Pass Array Between C++ And JS (man, this one has a lot of ads.)

Back to our WASM C++ code, we need to define and implement the `images_selected` function. It should be wrapped in an `extern "C"` block to avoid name mangling.

```
static struct
{
    sg_pass_action pass_action;
    // Update the state to add the following:
    std::vector<std::string> files;
    std::vector<std::shared_ptr<Image> > images;
} state;

extern "C"
{
    // https://stackoverflow.com/questions/61496876/how-can-i-load-a-file-from-a-html-input-into-emscriptens-memfs-file-system
    void images_selected()
    {
        std::ifstream reader("info.txt");
        if (!reader.good())
        {
            std::cerr << "Bad reader: info.txt?" << std::endl;
            return;
        }
        std::string path;
        while (std::getline(reader, path))
        {
            state.files.push_back(path);
            std::shared_ptr<Image> img(new Image());
            if (!img->load(path))
            {
                // TODO: a better error
                std::cerr << "Cannot load: " << path << "?" << std::endl;
            }
            std::cout << "Image loaded: " << img->w << ", " << img->h << std::endl;
            state.images.push_back(img);
        }
    }
}
```

The `Image` here is a simple image wrapper class. It keeps track of some of the image’s metadata (width, height, and file name), and the image data itself, in the form of an `std::unique_ptr`. The image is loaded via stb_image. If you are curious about its implementation, the source code is available on GitHub.

Finally, we update the image number indicator so that it correctly reflects how many images have been chosen so far:

```
ImGui::SameLine(); ImGui::TextWrapped("%d images selected.", (int) state.files.size());
```

Time to test it out! Click on the **“Upload images …”** button, and a file dialog should pop up. Choose as many as you want, and the following things will happen in order:

- Chosen images are read by a `FileReader`.
- The images are written into MEMFS using `Module.FS.writeFile`.
- The list of images is written into `/info.txt`.
- The WASM module loads the list of images.
- The WASM module loads each image individually using stb_image.

If you are perceptive, you will notice the image gets copied *multiple times* during the loading procedure. That’s very unfortunate, and I have not figured out a way to reduce the number of copies so far. In fact, an extra copy will be created when we compress the image. Perhaps passing the `Uint8Array` straight into the WASM module would be better, but it is what it is.

Moving on, here’s what we see when we choose 5 images to upload:

Now that we have all these images, it’s time to make a little viewer. Sokol provides an API for sokol-texture-to-ImGui interoperation. We first create a sokol image (`sg_image`):

```
sg_image_desc desc = {
    .width = w,
    .height = h,
    .pixel_format = SG_PIXELFORMAT_RGBA8, // Hmm...
    .sample_count = 1,
};
desc.data.subimage[0][0] = {
    .ptr = image.get(),
    .size = (size_t) (w * h * ch)
};
sg_image = sg_make_image(&desc);
```

Then create an `simgui_image_t` based on the `sg_image`:

```
sg_sampler_desc sam_desc = {
    .min_filter = SG_FILTER_LINEAR,
    .mag_filter = SG_FILTER_LINEAR,
    .wrap_u = SG_WRAP_REPEAT,
    .wrap_v = SG_WRAP_REPEAT
};
simgui_image_desc_t imgui_desc = {
    .image = sg_image,
    .sampler = sg_make_sampler(&sam_desc)
};
imgui_image = simgui_make_image(&imgui_desc);
```

The `imgui_image`s can then be rendered by ImGui using:

```
ImGui::Image(simgui_imtextureid(imgui_image), { w, h }, { 0.0f, 0.0f }, { 1.0f, 1.0f });
```

We create an `simgui_image_t` for each uploaded image. After some formatting, we have made a little gallery:

I have gone one step further and made the images clickable by turning them into image buttons. A little preview window pops up when the user clicks on any of the images, and the user can choose to download *only this one* compressed image, or remove it from the gallery.

That is mostly ImGui UI code, so I won’t be putting it here. Check out the GitHub repo for more detail.

The app has two compression modes: compress one image, and compress all images. You may say they are exactly the same, and maybe they are; the difference is that when multiple images are compressed, we don’t want to spam the browser download window, so we must pack the compressed images into a single archive before downloading.

ImZip compresses the images by doing the following:

- If the image is larger than a certain threshold (2K by default), repeatedly halve it using stb_image_resize;
- Re-encode the image as JPEG with a certain quality (using stb_image_write).

We want these parameters to be configurable as well, so we add them to the `state`:

```
static struct
{
    sg_pass_action pass_action;
    ImVec2 fold_size;
    int quality;
    std::vector<std::string> files;
    std::vector<std::shared_ptr<Image> > images;
    std::vector<ImageInfo> image_windows; // This is for the pop-up windows mentioned above
} state;
```

Also, I *just* found out (as of the time of writing this blog post) that `stb_image_resize.h` has been deprecated. If you are implementing this yourself, it is recommended to use stb_image_resize2.h instead.

We first discuss how to compress one image. This happens when the user clicks the “Download” button in the preview popup window. Single-image compression is very straightforward; we just need to implement the two steps above, write the resulting JPEG to MEMFS, and tell the browser to download it. So let’s do it!

```
void compress_one_image(const std::shared_ptr<Image> &im)
{
    Image im_copy(*im);
    bool failed = false;
    while (im_copy.w >= state.fold_size.x && im_copy.h >= state.fold_size.y)
    {
        if (!im_copy.resize(im_copy.w / 2, im_copy.h / 2))
        {
            failed = true;
            break;
        }
    }
    if (failed)
    {
        return;
    }
    std::string path = "/";
    path += im_copy.file_name + "_cmpr.jpg";
    im_copy.save_compressed(path, state.quality);
    download(path.c_str());
}
```

Again, the code is very straightforward. The only problem here is the `download` function, which needs to tell the browser to download a specific file stored in MEMFS. This unfortunately requires JavaScript again, as WASM has no direct way to achieve it. Luckily, it is not that complex:

- Read the file from MEMFS using `Module.FS.readFile`.
- Create a blob based on the data.
- Export a URL from the blob.
- Set the `href` of an `<a>` element to the blob URL.
- Fake-click on the hyperlink to start the download.

```
EM_JS(void, download, (const char *path), {
    const pathStr = UTF8ToString(path).replace("/", "");
    const data = window.Module["FS"].readFile("/" + pathStr);
    const blob = new Blob([data]);
    const url = window.URL.createObjectURL(blob);
    const downloadEl = document.querySelector("#download");
    downloadEl.download = pathStr;
    downloadEl.href = url;
    downloadEl.click();
});
```

We use the `EM_JS` macro to declare our `download` function like a C function. Our `const char *` string can be passed directly into the JS code, so that’s extra cool. Take a look at the compression result:

That’s a 90.9% size reduction! Man, the stb libraries are just out of this world.

The compression procedure is basically the same for multiple images, but now we need to pack them into one archive to avoid spamming the download window. To that end, I chose miniz, a “lossless, high performance data compression library in a *single source file*”. To also avoid littering MEMFS with compressed images, this time we will compress them *in memory*, then add them, one by one, to an archive stored in MEMFS. The only difference for image compression is that we switch from `stbi_write_jpg` to `stbi_write_jpg_to_func`.

But what func? Maybe your initial thought is to write into, like, a `std::stringstream` or something. But that’s not good enough: to extract the buffer from a stringstream, we need `ss.str()`, which creates, you guessed it, another copy of the image. To counter that, we implement our own very simple streaming structure:

```
struct CompressedInfo
{
    std::string file_name;
    char *buf;
    int buf_size;
    int ptr;

    CompressedInfo();
    ~CompressedInfo();
    void write(const char *what, int n);
};
```

The `CompressedInfo` struct is a simple char buffer with dynamically allocated memory. The `buf_size` starts at 1024 and doubles every time it fills up.

```
// DEFAULT_BUF_SIZE = 1024
CompressedInfo::CompressedInfo() : file_name(""), buf(nullptr), buf_size(0), ptr(0)
{
    buf = new char[DEFAULT_BUF_SIZE];
    buf_size = DEFAULT_BUF_SIZE;
    ptr = 0;
}

void CompressedInfo::write(const char *what, int n)
{
    if (n + ptr > buf_size)
    {
        // Keep doubling; a single doubling may not be enough for a big write
        while (n + ptr > buf_size)
        {
            buf_size *= 2;
        }
        char *new_buf = new char[buf_size];
        memcpy(new_buf, buf, ptr);
        delete[] buf; // delete[], since buf was new[]-allocated
        buf = new_buf;
    }
    memcpy(&buf[ptr], what, n);
    ptr += n;
}
```

Since we can directly access the underlying `buf` pointer here, hurray! No copies created. Well, one copy created, because we need to copy the image for resizing and whatnot.

Next we need to put the image into said archive. The example2 of miniz is a good reference, and I have implemented my code based on that.

```
void compress_all_images()
{
    // 1. Compress all images and put them into a vector
    std::vector<std::shared_ptr<CompressedInfo> > infos;
    for (int i = 0; i < state.images.size(); i++)
    {
        Image im(*state.images[i]);
        // ... compress the image
        std::shared_ptr<CompressedInfo> cpr = im.save_compressed_memory(state.quality);
        if (cpr->ptr == 0)
        {
            continue;
        }
        infos.push_back(cpr);
    }

    // 2. Prepare the ZIP file
    mz_bool status = MZ_TRUE;
    mz_zip_error err;
    char archive_path[1024], sprinted[1024];
    // 2.5. Prepend a slash to the archive name
    sprintf(archive_path, "/%s", state.archive_name);
    // 3. Delete the archive if it already exists
    remove(archive_path);

    // 4. Iterate through all compressed images
    for (int i = 0; i < infos.size(); i++)
    {
        sprintf(sprinted, "%s_cmpr.jpg", infos[i]->file_name.c_str());
        // 4.1. Add them to the archive
        status = mz_zip_add_mem_to_archive_file_in_place_v2(archive_path, sprinted,
                                                            infos[i]->buf, infos[i]->ptr,
                                                            nullptr, 0, MZ_BEST_COMPRESSION, &err);
        if (status == MZ_FALSE)
        {
            std::cerr << "Cannot compress? " << mz_zip_get_error_string(err) << std::endl;
            break;
        }
    }
    if (status == MZ_FALSE)
    {
        return;
    }
    // 5. Tell the browser to download it
    download(archive_path);
}
```

And look at that! All done within a jiffy. Here’s the downloaded archive content:

In comparison, the original 5 images have a total size of 9.7M. And that’s it! An image compression tool written almost completely in C++ compiled to WASM, except in the few places where you can’t. But most of it is! Yay! And they lived happily ever after. The end.

Of course. Of course Safari doesn’t like what’s happening. Remember the file upload dialog? When I was testing the tool, both iOS Safari and iPadOS Safari failed to produce it when the upload button was pressed. Some searching later, I found out that for an `input` click event to be triggerable from JS code, the input must **not** be hidden. **BUT!** Other than that, during my extensive testing, I found another, hidden requirement:

The `input.click()` must be called in some sort of `onclick` callback function, and it must be in response to an explicit user click (not one synthesized from JS).

During my tests, when I click on a button that calls `input.click()`, the file selector appears; but if I give it a delay, say, wrap it in a `setTimeout`, it just never appears. I think Safari requires the real user click event to be propagated down the stack for the file input click to be allowed. And that’s very, very annoying, because we have no way to do that: the user click is captured by the canvas and somehow propagated into our ImGui button. The context may already be long lost by then, but I am not sure about the core issue here. Nevertheless, I have thought up two solutions:

- The cool option: directly handle `canvas.onclick` and compare the cursor click location against the button location by performing a bounding box check
- The wimp option: detect if the browser is Safari. Add an HTML overlay on top of the webpage when the button is clicked, and tell the user to click the HTML button again so that the dialog can appear
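For what it’s worth, the bounding-box check behind the cool option is only a few lines. Here is a quick sketch of the idea in Python (the names and coordinates are hypothetical, not from the actual app):

```python
# Hypothetical hit test: would this canvas click have landed on the button?
def hit_test(click, box_min, box_max):
    """Return True if click (x, y) falls inside the axis-aligned box."""
    return (box_min[0] <= click[0] <= box_max[0]
            and box_min[1] <= click[1] <= box_max[1])
```

If the test passes, the handler could call `input.click()` while still inside the real user click event.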

Guess which route I took? Yeeaaah…

This, in my opinion, is a critical flaw, as our WASM app is now no longer a *pure* WASM app. Well, it wasn’t that pure to begin with, but the file upload input and the hyperlink can hide safely under our canvas facade. With this abrupt popup? Not anymore. But again, it is what it is. Hopefully one day Safari can do something about it.

The popup itself is simple enough. Just a `<div>` with `position: absolute`.

```
<div class="hidden heck-safari">
    <div class="prompt">
        If nothing happens, please click the following button to select the files you want to compress.
    </div>
    <div class="button" onclick="document.querySelector('#file-input').click()">Upload images ...</div>
    <div class="button" onclick="document.querySelector('.heck-safari').classList.add('hidden')">Dismiss</div>
</div>
```

In our WASM module, we do an extra check in the upload button:

```
if (ImGui::Button("Upload images ..."))
{
    EM_ASM(
        if (navigator.userAgent.indexOf("iPhone OS") != -1 ||
            (navigator.userAgent.indexOf("Intel Mac OS X") != -1 &&
             navigator.userAgent.indexOf("Chrome") == -1)) {
            document.querySelector(".heck-safari").classList.remove("hidden");
        }
        document.querySelector("#file-input").click();
    );
}
```

And that’s it! And **now** they lived kinda-happily ever after.

That’s it for now. Although we still have to resort to good ol’ JavaScript at places, most of the work can be done in WASM. My one wish is that in the future, Emscripten can have some new API for direct DOM manipulation. There are some other optimizations that I haven’t mentioned, for example only redrawing when events happen, to prevent continuous high CPU usage. When static, our application takes as much CPU as the next static webpage.

Though not the 100% pure WASM implementation we wanted, this was still a wonderful journey, combining lots of libraries and technologies. It’s quite fun making a small project like this from time to time. I have also made the webpage title *very* click-baity, so let’s see if people Googling for tools like this will actually use it. Well, until next time. Toodles!

- Sokol: minimal cross-platform standalone C headers, GitHub
- How do I export FS in js file?, GitHub
- Emscripten Filesystem API
- How can I load a file from a HTML input into Emscripten’s MEMFS file system?, StackOverflow
- Emscripten: Interacting with code
- Passing arrays and objects from JavaScript to c++ in Web Assembly, StackOverflow
- Sokol example: imgui-images-sapp.c, GitHub
- Sokol example: loadpng-sapp.c, GitHub
- miniz: Single C source file zlib-replacement library, GitHub
- miniz example2.c, GitHub
- calling click event of input doesn’t work in Safari, StackOverflow

Looking back, we have set sail on so many different journeys. Every week is a different adventure. We took a sneak peek at the highs and lows, the basics and advanced topics of computer graphics (and some other things): from framebuffers to raymarching, from dithering to ReSTIR. Alright, sometimes I got lazy and just reused the old ones - but hey, it counts anyway.

To all my subscribed readers, friends, and random internet strangers: thank you for reading my blog. I know, it didn’t reach one year, and I am kinda sad about that - but nevertheless, I am honored to have you here. And worry not - I didn’t say I will stop updating it entirely, just that it will no longer receive regular updates. Every once in a while, when the strong urge to share arises again, this blog shall always receive new blog posts. Maybe I will even share new things encountered while I was working my job. And until then, I will see you all later.

So the very first thing is to install PyTorch. And to install PyTorch (the GPU-enabled version, obviously), you can follow this helpful link on the official website. If, however, our CUDA version is too low and becomes unlisted (again), we can follow this link and just search for our version. In our case (CUDA 11.1), we see that the latest supported version is `torch 1.10.1`.

So we copy & paste that into our command line and BAM! PyTorch done.

After that, try to `import torch` and verify that we can indeed create variables using `cuda`:

```
import torch
torch.randn(1, 1, device='cuda')
```

Wow! It is on device! Let’s go!

For unsupported OSes, sometimes there are no precompiled PyTorch wheels, and that calls for a complete compilation from source. If such a case arises, it is very important to use one compiler throughout the entire compilation (i.e. the compiler needs to be the same for both PyTorch and PyTorch Geometric). If you are using an archaic version of, say, Ubuntu, which leads to an ancient GCC and G++, I highly recommend installing Clang instead. For Ubuntu users, you can install Clang using `llvm.sh`, with a comprehensive guide available here. The Clang version can’t be too new or too old, otherwise we can’t compile PyTorch Geometric. My recommended version is clang-6.0, which works for me.

Once the installation finishes, we can tell `pip` to use our new compiler:

```
export CC=clang-6.0
export CXX=clang++-6.0
# And now install PyTorch
pip install torch==...
```

PyTorch Geometric (PyG for short) is more like a toolkit with four (five?) major dependencies:

- torch_scatter
- torch_sparse
- torch_cluster
- torch_spline_conv
- pyg_lib (???)

If you have relatively new PyTorch and CUDA installed, you can just follow the guide from its official website. Otherwise, we will have to take the matter into our own hands. Note that on its official website, there is a find link for specific PyTorch and CUDA versions:

By copy-pasting it and replacing the torch version and CUDA version with our version (PyTorch 1.10.1 + CUDA 11.1), we get https://data.pyg.org/whl/torch-1.10.1+cu111.html. And after accessing it we are presented with a list of supported versions of PyG dependencies:

After a quick look we can see that supported versions of PyG dependencies given our CUDA and PyTorch versions are:

- torch_cluster: 1.5.9, 1.6.0
- torch_scatter: 2.0.9
- torch_sparse: 0.6.12, 0.6.13
- torch_spline_conv: 1.2.1
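Building that find link is purely mechanical string substitution. A small Python helper illustrating the pattern (the URL template is the one shown above; the helper name is made up):

```python
def pyg_find_link(torch_version, cuda_version):
    """Build the PyG wheel index URL, e.g. CUDA '11.1' becomes 'cu111'."""
    cu = "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{torch_version}+{cu}.html"
```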

Now we simply need to pick their newest versions (or not as new, it’s really up to you) and `pip install` them.

```
pip install torch_scatter==2.0.9 torch_sparse==0.6.13 torch_cluster==1.6.0 torch_spline_conv==1.2.1 -f https://data.pyg.org/whl/torch-1.10.1+cu111.html
pip install torch_geometric
```

**WARNING!** It’s very important that you specify the versions here, otherwise pip will completely disregard the find link and just download the newest version of the four dependencies. You might have noticed that `pyg_lib` is absent from the above list. I have noticed that as well, but since my task then was simply to install PyG, I disregarded the discrepancy and no errors came of it. I guess my PyTorch Geometric version is too old for the `pyg_lib` thing.

If you want some new thrilling adventure (or if your OS is too old, as stated above), you can also compile PyG from source. The procedure is basically the same as above. We have to guarantee the compiler remains the same.

However, if you are using Clang as advised, you will encounter two errors.

- The linker will complain about “libomp5-10”. I am not sure about this, but it seems to be PyTorch’s bundled OpenMP library clashing with the system-wide OpenMP. This can be solved by specifying the OpenMP library explicitly: `CFLAGS="-fopenmp=libiomp5" pip install torch_cluster --no-cache`
- Some weird template error in PyTorch’s *variant.h*. Funnily enough, I found the solution to both issues from macOS users trying to install PyG with Clang. This reply solves this issue.

After a long wait, you should be able to build all four (five?) dependencies from source. After that, just install PyG (or build it as well).

That’s it! A grand adventure of version wrangling, all due to the outdated CUDA version on the server side. But I guess that’s why it’s fun, right? To verify that PyG has indeed been successfully installed, I import all four dependencies independently and print out each one’s CUDA version:

```
import torch_scatter
print(torch_scatter.torch.version.cuda)
# ... and so on.
```

And that’s about it. Have fun using PyG!

Now, you must be wondering why you should read my blog post if there are so many far better tutorials out there. And you are right, boi. I have absolutely no confidence in what I write, and that’s why I leave all those references. So, Worley Noise! If you haven’t heard about this before - or Cellular Noise, or something to that extent - maybe you’ve heard of the Voronoi diagram, right? It looks a little bit like this:

And with a little bit of tweaking, BAM! It becomes Worley Noise! Let’s begin!

It’s actually really, really simple. First, let’s get a canvas:

And then add a few random dots:

Then, for every pixel inside this canvas, we calculate the closest dot to the pixel. After this iteration, every pixel will be colored (because there will **always** be a closest point), and things would look triangular and cool.

Well, doodling sucks. Let’s take a look at the code!

```
// Five random points
vec2 points[5];
points[0] = vec2(0.3, 0.8);
points[1] = vec2(1.2, 0.1);
points[2] = vec2(1.0, 0.5);
points[3] = vec2(0.2, 0.4);
points[4] = vec2(0.6, 1.0);

// Keep track of the minimum distance,
float m = 1.0;
// and the closest dot
vec2 closestDot;
for (int i = 0; i < 5; i++) {
    float dist = distance(uv, points[i]);
    // Distance closer than minimum distance?
    if (dist < m) {
        // Update it
        m = dist;
        closestDot = points[i];
    }
}
// Set the pixel output color's R & G component to be the position of the dot
gl_FragColor = vec4(closestDot, 0.0, 1.0);
```
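The same search, written as a brute-force Python check using the five points from the snippet (a sketch for one pixel, not shader code):

```python
# The five fixed points from the shader
points = [(0.3, 0.8), (1.2, 0.1), (1.0, 0.5), (0.2, 0.4), (0.6, 1.0)]

def closest_dot(uv):
    """Return the point with the smallest (squared) distance to uv."""
    return min(points, key=lambda p: (p[0] - uv[0]) ** 2 + (p[1] - uv[1]) ** 2)
```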

And now, obviously, all pixels closest to the same point get **colored** as one area. Which is very cool! And yeah, it is that easy. It’s cool because it produces irregular shapes.

`if`s, no `but`s

Coloring areas is nice and all, but it’s a little bit boring. Also, it uses this `if`. We all know that `if`s in GLSL are bad! Bad for performance! Bad `if`! Well, we can take a step back, and instead of coloring areas, we set the output pixel’s color to the distance to the closest point. In this way, our output image becomes a continuous grayscale (or whatever scale you prefer) image:

This, by the way, could make really good looking lava moats, or dry rocks, if you could think about a way to animate those random dots, which we will cover later:

Just a different color! Well, let’s take a look at the source!

```
vec2 points[5];
points[0] = vec2(0.3, 0.8);
points[1] = vec2(1.2, 0.1);
points[2] = vec2(1.0, 0.5);
points[3] = vec2(0.2, 0.4);
points[4] = vec2(0.6, 1.0);

// ONLY keep track of the minimum distance.
float m = 1.0;
for (int i = 0; i < 5; i++) {
    float dist = distance(uv, points[i]);
    m = min(m, dist); // Or just minimize it to
    // m = min(m, distance(uv, points[i]));
}
// Set the pixel output color's grayscale component to be the distance to the closest point
gl_FragColor = vec4(m, m, m, 1.0);
```

Easy peasy lemon squeezy!

`for`

`if`s are gone now, and that’s good; but the `for` still exists. And as there is actually an `if` hiding inside every `for`, `for` isn’t good either, and we should remove it as well. We could procedurally generate the points along the way. Not only does generating points with pure math save memory, we could also have infinite points, and thus the Voronoi diagram can expand to infinity.

But first, we will just take a look at how we will remove the `for`. So how could we do that, actually? Well, of course we should use the space tiling technique:

```
uv *= 3.0; // zoom out; change it as you please
vec2 u = floor(uv);
vec2 f = fract(uv);
gl_FragColor = vec4(sin(u), 0.0, 1.0);
```

In this way, the space could be tiled elegantly:

*every color grid is a tiled space*. Also, in this image, I zoomed way out so you could see the tiling effect properly. Now, after tiling, `f` becomes our new `uv`; it is standardized, as it’s always in [(0, 0), (1, 1)). Then, we could just generate a point in every tile!

```
// our dear one-liner
vec2 rand(vec2 u) {
    return fract(sin(vec2(dot(u, vec2(127.1, 311.7)),
                          dot(u, vec2(269.5, 183.3)))) * 43758.5453);
}

void main(void)
{
    vec2 uv; // Get uv in some way
    uv *= 3.0; // Zoom out
    vec2 u = floor(uv);
    vec2 f = fract(uv);
    vec2 p = rand(u); // So every tile's point will always be the same
    float m = distance(f, p);
    gl_FragColor = vec4(m, m, m, 1.0);
}
```

Well, remember our `rand` function? If not, check it out here. After tiling & generating, we can make sure every tile always gets a point located somewhere inside it. However, it doesn’t look like the original Worley stuff yet, because there is only one point left to compare; there is no way to get the position of other points!

Well, obviously there is! As we can clearly see, the closest point to a pixel has at most 9 possibilities. Take the pixel in the green circle, for example. In other words, its 9 neighbors. Points outside those 9 tiles can’t possibly be closer than the 9 neighbors’ points, right? So what we need now is a double loop (well, power comes at a cost):

```
vec2 rand(vec2 u) {
    return fract(sin(vec2(dot(u, vec2(127.1, 311.7)),
                          dot(u, vec2(269.5, 183.3)))) * 43758.5453);
}

void main(void)
{
    vec2 uv; // Go get it yourself
    uv *= 3.0; // Zoom out
    vec2 u = floor(uv);
    vec2 f = fract(uv);
    float p = 1.0; // Assume furthest
    for (int y = -1; y <= 1; y++) {
        for (int x = -1; x <= 1; x++) {
            vec2 off = vec2(x, y);
            p = min(p, distance(f - off, rand(u + off)));
        }
    }
    gl_FragColor = vec4(p, p, p, 1.0);
}
```

And here we go!

***ZOOM***

Infinite Worley noise, at your disposal!

Adding isolines could make the thing kinda sorta look like a triangular contour. First, we are gonna use the Voronoi code, only it is the infinite version:

```
uv *= 10.0;
vec2 u = floor(uv);
vec2 f = fract(uv);
float m = 1.0;
vec2 mPos;
for (int y = -1; y <= 1; y++) {
    for (int x = -1; x <= 1; x++) {
        vec2 p = rand(u + vec2(x, y));
        float d = distance(f - vec2(x, y), p);
        if (d < m) {
            m = d;
            mPos = p;
        }
    }
}
vec3 color = vec3(rand(mPos), 1.0); // use rand so the color does not go out of bound
                                    // and set B=1.0 so the screenshot looks samsungy
color -= abs(sin(m * 80.0)) * 0.01; // This line shows the isoline
gl_FragColor = vec4(color, 1.0);
```

This is what it looks like without isoline: (Oooh, Samsung Galaxy something!)

Then we subtract the output color by `abs(sin(distance to the point * n)) * k`. Tweak n for the stripe count you want, and tweak k for how obvious the stripes are. And when you set k = 1 and lower the zoom level to 2.0 or something, you will get a cool-looking neon effect! That’s exactly how I got the featured image. This is what it looks like with isolines:
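The darkening term itself is tiny. In Python, with n and k as described (the function name is made up):

```python
import math

def isoline_darkening(m, n=80.0, k=0.01):
    """Amount subtracted from the color: abs(sin(m * n)) * k."""
    return abs(math.sin(m * n)) * k
```

Since `abs(sin(...))` stays in [0, 1], the subtracted amount is always between 0 and k, so the stripes never overshoot the base color by more than k.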

Animating the scene is extremely easy. Just use a sine function to move the generated points around (but not too much, otherwise a point would drift so far out that it would actually be the closest in other tiles, and the image would look jagged). Also, the color hash trick above would not work anymore, as the closest point will **change** now, resulting in rapid blinking. A solution exists, of course; but you gotta think of one yourself.

```
uv *= 10.0;
vec2 u = floor(uv);
vec2 f = fract(uv);
float m = 1.0;
vec2 mPos;
for (int y = -1; y <= 1; y++) {
    for (int x = -1; x <= 1; x++) {
        vec2 p = rand(u + vec2(x, y));
        // Just gotta add one single line here
        p = 0.5 + 0.5 * sin(1.23 * time + 10.1 * p);
        // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        float d = distance(f - vec2(x, y), p);
        if (d < m) {
            m = d;
            mPos = p;
        }
    }
}
vec3 color = vec3(rand(mPos), 1.0);
gl_FragColor = vec4(color, 1.0);
```

Well, we are ending things here. I’ve truly learned some interesting effects today. It could fake a 3D balls effect without actually ray marching; you know that’s hard. We could also just use it like regular noise, even though Worley noise has fewer applications than Perlin noise. There are usages for it, of course, and I hope one day they could actually come in handy!

Hi! Things have been busy for me. I have been trying to lock onto computer graphics jobs, so if any of my 3 viewers know some job openings, please let me know. Also, I appreciate your presence. Truly I do.

Back to the main topic. I don’t have much time this week, so I spent it on rendering Spot, optimizing BVH construction, and adding a few things here and there (for example, uv interpolation and rejection-sampled ray generation). I have also added texture sampling, so the renderings don’t have to be made out of crystal anymore. Let’s take a look at some of them!

First we see our dear global illuminated Spot. In the current state however, since we don’t have a proper material system in place, this is simply rendered via a two-bounce low-budget fake GI method:

- Rays are generated and traced from the camera.
- Upon intersecting with a triangle, 5 new rays are reflected from the surface (surfaces are always Lambertian.)
- Now trace these 5 rays. When they intersect with a triangle, we calculate its brightness via the Phong shading model (without specular). If it doesn’t, we sample the sky brightness over there.
- We average over the brightness these 5 rays bring back, and combine that with the texture color to produce the final color.
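Numerically, the last step is just an average and a modulate. A toy Python sketch of that combination (made-up values and names, not the actual LuaPT code):

```python
def combine(sample_brightness, texture_rgb):
    """Average the brightness the secondary rays bring back,
    then modulate the surface's texture color with it."""
    avg = sum(sample_brightness) / len(sample_brightness)
    return tuple(c * avg for c in texture_rgb)
```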

After gamma correction, a matte-looking Spot is produced. I think that’s kind of cute!

Here’s the Cornell Box rendered using the same approximate method. You can see the brightness calculated from the Phong shading model in the reflection of the balls. There is also some kind of light leaking around the edges. To increase realism (because the light source is kind of small here,) the second-bounce sample size is increased to 100 (except the ball reflection - that’s still 1.)

This Sponza model is downloaded from here, and man, it’s huge. Even fake GI is too much for LuaPT right now; and I think there’s also something wrong with the textures at the moment. I tried rendering this scene with the same parameters as the Cornell Box, and it took a full 30 minutes with unsatisfactory results (I have tuned the samples down to 5 - I don’t think 100 samples are doable.)

So, that’s where we are at the moment. No big improvements this week, but a few QoL changes here and there. Maybe later down the line, I will turn this into a proper open source project by adding a `README.md` and forcing all my friends to star it. My current plan is to implement BxDFs in the short, foreseeable future; due to the immense time needed to render Sponza, I guess you guys won’t be seeing that for another good while. So that’s all, guys. See you next time!

You can check out the source code here.

Hi! This week marks a great change for LuaPT. To further accelerate the tracing process, multiple things have been optimized and changed, and some were rebuilt from the ground up. This week, we will be introducing LuaJIT and, most importantly, its ffi extension. After that, we will implement a BVH acceleration structure in full Lua (or at least 80% Lua). LuaJIT has accelerated Lua so much, and ffi has greatly simplified C++/Lua interoperation, that I don’t even know where to begin. So let’s begin!

I started trying to improve the speed of LuaPT in week 4. Since week 3, LuaPT’s performance had been rapidly degrading, and I knew that soon enough it would be so slow that I might as well trace the rays by hand. But this is when LuaJIT comes to the rescue. By simply replacing my existing Lua library and linking LuaJIT instead, I could already see a significant increase in speed.

Next, using the ffi extension, I ported my whole math library to C++.

As models & meshes are represented using a contiguous `glm::vec3` vector, we can exploit this fact and force a pointer cast when the Lua script asks for triangles. Vector math in C++ is orders of magnitude faster than in Lua, and by doing so I achieved another speedup.

The ffi library is so cool. I just need to implement the functions in C++ (inside an `extern "C"` block), then copy & paste my header into an `ffi.cdef` block. Now I can just call these functions from Lua. It’s crazy how convenient it is.

The switch to LuaJIT was time-consuming but nothing too technical. All the previous methods and APIs (including `Image` and `Model`) were ported onto LuaJIT, although they lost their object-oriented APIs in the process. But I think it’s a rather small sacrifice to obtain the Speed. If you are interested in learning how LuaJIT works, I strongly recommend checking out the official website. It’s short, concise, and showed me in less than 10 lines of code how to use it. It’s awesome. So after deprecating the Lua library and switching to LuaJIT, I began working on a few other improvements.

As we will be implementing a BVH, we have to implement a ray-box intersection method first. Here we will make use of the one in Scratchapixel, aka the slab method.

```
function intersect_box(ro, rd, box)
    local rdinv = vec3(1 / rd.x, 1 / rd.y, 1 / rd.z)
    local tmin = -1e309 -- 1e309 is infinity in Lua
    local tmax = 1e309
    if rd.x ~= 0 then
        local tx1 = (box.min.x - ro.x) * rdinv.x
        local tx2 = (box.max.x - ro.x) * rdinv.x
        tmin = math.max(tmin, math.min(tx1, tx2))
        tmax = math.min(tmax, math.max(tx1, tx2))
    end
    if rd.y ~= 0 then
        local ty1 = (box.min.y - ro.y) * rdinv.y
        local ty2 = (box.max.y - ro.y) * rdinv.y
        tmin = math.max(tmin, math.min(ty1, ty2))
        tmax = math.min(tmax, math.max(ty1, ty2))
    end
    if rd.z ~= 0 then
        local tz1 = (box.min.z - ro.z) * rdinv.z
        local tz2 = (box.max.z - ro.z) * rdinv.z
        tmin = math.max(tmin, math.min(tz1, tz2))
        tmax = math.min(tmax, math.max(tz1, tz2))
    end
    if tmax >= tmin then
        return tmin, tmax
    else
        return nil
    end
end
```

Of course, we will then need a bounding box data structure. The bounding box `BBox` is first defined in plain C, then passed onto Lua using LuaJIT.

```
typedef struct
{
    Vec3C min, max;
} BBox;
```

The brute force method of ray-triangle intersection testing introduced in week 2 is also starting to slow us down. To counter this issue, I have switched the ray-triangle intersection test to a faster one, namely the *Moller-Trumbore* method. Both the Scotty3D website and Scratchapixel are excellent learning sources.

```
function intersect_mt(ro, rd, tri, tmin, tmax)
    local e1 = sub3(tri.b.position, tri.a.position)
    local e2 = sub3(tri.c.position, tri.a.position)
    local pvec = cross(rd, e2)
    local det = dot3(e1, pvec)
    -- Are they almost parallel?
    if (math.abs(det) < 0.0001) then
        return nil
    end
    local inv = 1.0 / det
    local tvec = sub3(ro, tri.a.position)
    local u = dot3(pvec, tvec) * inv
    if u < 0 or u > 1 then
        return nil
    end
    local qvec = cross(tvec, e1)
    local v = dot3(rd, qvec) * inv
    if v < 0 or u + v > 1 then
        return nil
    end
    local t = dot3(e2, qvec) * inv
    if t < tmin or t > tmax then
        return nil
    end
    return vec3(u, v, t)
end
```

To further increase render precision, add more zeros to `if (math.abs(det) < 0.0001) then`. The implementation of the Moller-Trumbore method gave us a slight speed increase for each pixel.

And finally, now it’s time for us to construct a BVH. We can’t implement the whole BVH in Lua, because multithreading is achieved through multiple Lua instances and there isn’t an effective way to sync Lua variables between Lua states. So instead, the `BVH` data structure will be defined in C/C++, with a few getters/setters, but no serious functions that could construct a BVH directly.

The only notable functions are the constructor of `BVH` and the `partition` function. We will go over them one by one. In `BVH::BVH`, we accept a pointer to a `Model`. The `BVH` class then copies all triangles into its own array, and creates a root node in the `nodes` variable. The `nodes` variable acts like a complete binary tree stored in contiguous memory. You can think of it as a heap.

```
BVH::BVH(std::shared_ptr<Model> model) : model(model)
{
    // First make an empty node to fit ALL triangles inside
    BBox root_bbox = bbox();
    for (int i = 0; i < model->get_num_tris(); i++)
    {
        TriC *t = model_get_tri(model.get(), i);
        // The enclose method encloses a point into a bounding box
        enclose(root_bbox, t->a.position);
        enclose(root_bbox, t->b.position);
        enclose(root_bbox, t->c.position);
        tri.push_back(&model->get_triangle(i));
    }
    make_node(root_bbox, 0, tri.size(), 0, 0);
}
```
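The `enclose` mentioned in the comment just grows a box to contain a point, component-wise. Here is the idea sketched in Python (the real version lives in C++; the dict-based box here is only for illustration):

```python
def enclose(box, p):
    """Grow an axis-aligned box (dict of min/max 3-tuples) so it contains p."""
    box["min"] = tuple(min(a, b) for a, b in zip(box["min"], p))
    box["max"] = tuple(max(a, b) for a, b in zip(box["max"], p))
    return box
```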

The `make_node` method is just that - it makes a node, pushes it onto `nodes`, and returns its index. The node is defined as such:

```
typedef struct
{
    BBox bbox;
    int start, size;
    int l, r; // Left child & right child
} Node;
```

In this case, we construct a root node for our BVH, make it contain the full triangle array (0 to `tri.size()`), with its bounding box enclosing all the vertices of the input mesh.

Because `BVH` stores triangles in whatever order the input mesh gives us, we need some way to rearrange the order of the triangles. And to stay true to our project name, Lua has to be the one that does the job. `std::partition` is therefore out of the question - its final parameter `pred` requires a C++ callable. We therefore have to resort to a homemade partition method. By accepting an array of booleans, and putting the `true` values to the left and the `false` values to the right, we can partition basically anything coming from Lua, with just an extra smidge of memory cost.

```
int BVH::partition(bool *pred, int begin, int end)
{
    assert(begin >= 0 && end < tri.size() && "Invalid partition range");
    while (begin <= end)
    {
        if (!pred[begin])
        {
            const Triangle *t = tri[begin];
            tri[begin] = tri[end];
            tri[end] = t;
            // Gotta swap that pred as well
            bool p = pred[begin];
            pred[begin] = pred[end];
            pred[end] = p;
            end--;
        }
        else
        {
            begin++;
        }
    }
    return begin;
}
```

That’s the whole of the C++ part. As you can see, not a lot is going on in there. The really interesting things happen in Lua.

- The current (root) node and its bounding box are obtained via the Lua ffi API.
- For each axis, 8 valid partitions across the bounding box domain are tried.
- A Surface Area Heuristic (SAH) is evaluated for each partition along each axis, and the best one, with the lowest score, is kept.
- Finally, revert all the changes and execute the partition method with the best SAH.
- The mesh is now split in two following the partition; make 2 new nodes and, for each of them, either make a leaf or continue partitioning (back to step 1).

Demonstrated below. First, we find the perfect partition with the lowest SAH score:

Then, we partition the triangles within the node accordingly. Then, we split the array and give them to two new children, and recurse to obtain the best partition for them. This goes on until there aren’t enough triangles to partition, or the best partition is no partition. Then we stop.

For us, the SAH equation is

\[\text{SAH} = C_\text{trav} + \frac{S_\text{left}}{S} N_\text{left} C_\text{isect} + \frac{S_\text{right}}{S} N_\text{right} C_\text{isect}\]
In which \(C_\text{trav}\) (the traversal cost) is a constant of 1 and the intersection cost \(C_\text{isect}\) is 2. The \(\frac{S_\text{left}}{S}\) and \(\frac{S_\text{right}}{S}\) are simply the split ratios (the \(\frac{k}{8}\) above).
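As a quick sanity check, here is the same score in Python, with \(C_\text{trav} = 1\), \(C_\text{isect} = 2\), and the split ratios standing in for the area terms:

```python
def sah(ratio_left, n_left, n_right, c_trav=1.0, c_isect=2.0):
    """SAH score: traversal cost plus area-weighted intersection costs."""
    ratio_right = 1.0 - ratio_left
    return c_trav + ratio_left * n_left * c_isect + ratio_right * n_right * c_isect
```

A balanced 50/50 split of 20 triangles scores 1 + 0.5·10·2 + 0.5·10·2 = 21, and fewer triangles per side means a cheaper (better) split.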

This is how that looks in Lua:

```
-- First call the BVH constructor with a model
local bvh = make_bvh(model)

function determine_side(p, offset, axis)
    if axis == 0 then
        return p.x < offset.x
    elseif axis == 1 then
        return p.y < offset.y
    else
        return p.z < offset.z
    end
end

function determine_area_ratio(poff, span, axis)
    if axis == 0 then
        return poff.x / span.x, (span.x - poff.x) / span.x
    elseif axis == 1 then
        return poff.y / span.y, (span.y - poff.y) / span.y
    else
        return poff.z / span.z, (span.z - poff.z) / span.z
    end
end

function bvh_construct(node_idx, start, fin)
    if fin - start + 1 <= 8 then
        -- No need to make BVH; not a lot of triangles here
        return
    end
    local n = bvh_get_node(bvh, node_idx)
    local span = sub3(n.bbox.max, n.bbox.min)
    -- split them into 8 buckets
    local num_buckets = 8
    local split_step = scl3(span, 1 / num_buckets)
    -- Bests
    local best_axis = 0
    local best_step = 1
    local best_sah = 1e309
    local best_offset = 0
    for axis = 0, 2 do
        -- Tentatively partition them along these axis
        for step = 1, 7 do
            -- 1. Somehow partition it.
            local table = make_partitioning_table(bvh)
            local poff = scl3(split_step, step) -- Plane offset
            local plane = add3(n.bbox.min, poff)
            for i = start, fin do
                local tri = bvh_get_tri(bvh, i)
                local centroid = scl3(add3(add3(tri.a.position, tri.b.position), tri.c.position), 1 / 3)
                table[i] = determine_side(centroid, plane, axis)
            end
            local offset = partition(bvh, table, start, fin)
            -- 2. Calculate SAH. Record the best one.
            local left, right = determine_area_ratio(poff, span, axis)
            local sah = 1 + 2 * left * (offset - start) + 2 * right * (fin - offset + 1)
            if sah < best_sah then
                best_sah = sah
                best_step = step
                best_axis = axis
                best_offset = offset
            end
        end
    end
    -- That's not very constructive.
    if best_offset == start or best_offset == fin + 1 then
        return
    end
    -- Partition using the best one.
    local table = make_partitioning_table(bvh)
    local poff = scl3(split_step, best_step)
    local plane = add3(n.bbox.min, poff)
    for i = start, fin do
        local tri = bvh_get_tri(bvh, i)
        local centroid = scl3(add3(add3(tri.a.position, tri.b.position), tri.c.position), 1 / 3)
        table[i] = determine_side(centroid, plane, best_axis)
    end
    local offset = partition(bvh, table, start, fin)
    local left_box = bbox()
    local right_box = bbox()
    for i = start, offset - 1 do
        local tri = bvh_get_tri(bvh, i)
        enclose(left_box, tri.a.position)
        enclose(left_box, tri.b.position)
        enclose(left_box, tri.c.position)
    end
    for i = offset, fin do
        local tri = bvh_get_tri(bvh, i)
        enclose(right_box, tri.a.position)
        enclose(right_box, tri.b.position)
        enclose(right_box, tri.c.position)
    end
    local l = bvh_push_node(bvh, left_box, start, offset - start, 0, 0)
    local r = bvh_push_node(bvh, right_box, offset, fin - offset + 1, 0, 0)
    bvh_node_set_children(bvh, node_idx, l, r)
    -- Recurse into l and r ???
    bvh_construct(l, start, offset - 1)
    bvh_construct(r, offset, fin)
end

-- Construct BVH for the whole thing
bvh_construct(0, 0, bvh_tri_count(bvh) - 1)
```

With the shiny new BVH in place, combined with LuaJIT, our renders are now much, much faster. Compared to a non-BVH (yes-LuaJIT) rendering of Spot the cow, which can take up to 10 minutes, the BVH-enabled version of the same scene now takes only two and a half minutes.

Here’s the Blender monkey Suzanne:

Aaaand here’s a 720p torus.

If you missed it, here’s the link to the source code. Hit a star if you feel like it - it will always shoot a stream of dopamine straight to my brain. In any case, I will see you next week!

- LuaJIT, the Just-in-Time Lua compiler
- LuaJIT ffi, the ffi extension documentation
- Ray-Box Intersection, Scratchapixel
- Moller-Trumbore Fast Ray-Triangle Intersection, Wikipedia
- Surface Area Heuristic (SAH), CMU 15462/662
- Fast, Branchless Ray/Bounding Box Intersections
- How to create awesome accelerators: The Surface Area Heuristic