What?
Lum is a voxel renderer I develop to power some of my future games. It is not an extendable general-purpose engine, but rather a fast and simple library.
I am not an artist, and Lum embraces that - you are expected to reuse assets, program animations and draw some things with shaders.
Key structures
Material
- Represents a type of voxel by its properties, such as color, roughness, emittance, etc.
Material palette
- An array of materials. Currently limited to 255 (256, but the first one is never accessed and represents "air").
Voxel
- An index into the material palette (so the data of a voxel is its material). Represented by a single u8. 0 means "air" - an empty voxel.
Model
- A free-form, self-aligned (so not world-aligned) voxel mesh of any size (up to 255x255x255 to fit in u8). Each model's voxels live in its own 3D image.
Block
- A grid-aligned voxel mesh, exactly 16x16x16 in size. A block's voxel data lives in the block palette.
Block palette
- A large 3D image storing many 16^3 sub-blocks (we use an image cause hw sampling & tiling, but anything would do the job).
- Blocks in the block palette come in two flavors:
- Static - immutable assets, e.g. loaded from files
- Dynamic - ones that are [created and modified] every frame to represent some dynamic changes. When a model needs to be placed in the world, we want to represent its voxels in our data structure, but we can't modify the blocks it occupies - their data is the origin for multiple instances. Instead, new "dynamic" blocks are allocated in the palette, the original static blocks' data is copied into them, and the world grid's pointers are updated to these new (mutable) copies. This is effectively a `Cow` - copy-on-write.
- Note: code that reads a block pointer doesn't care whether it's static or dynamic - all processing is identical.
World
- A 3D grid of Block pointers (indices into the block palette).
Radiance field
- A 3D grid of lighting probes, one per block, used for fast approximate global illumination.
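To make the relationships concrete, here is a minimal CPU-side sketch of these structures. Names and exact types are assumptions, not Lum's actual API; in Lum the palette and block data live in GPU images.

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct Material {
    float color[3];
    float roughness;
    float emittance;
};

// 256 slots; index 0 stands for "air" and is never accessed.
using MaterialPalette = std::array<Material, 256>;

// A voxel is just an index into the material palette; 0 = air.
using Voxel = uint8_t;

constexpr int BLOCK_SIZE = 16;

// One 16^3 sub-block of the block palette.
struct Block {
    Voxel voxels[BLOCK_SIZE][BLOCK_SIZE][BLOCK_SIZE];
};

// "Block pointer": an index into the block palette (the width here is an assumption).
using BlockId = uint16_t;

// The world is a 3D grid of block pointers.
struct World {
    int sx, sy, sz;            // world size, in blocks
    std::vector<BlockId> grid; // sx * sy * sz entries

    BlockId& at(int x, int y, int z) { return grid[(z * sy + y) * sx + x]; }
};
```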
Rendering pipeline
Before the renderer even starts, "render requests" are collected from user code. A render request is a "command" to the renderer that looks like {draw this block id at this position} or {draw this model with this rotation and this translation} (a sketch of such a request follows the pipeline overview below). Then, the pipeline looks something like this:
- CPU preparation: Some regular GAPI-related CPU work. We also sort requests by depth (for Vulkan) / state (for WGPU), update per-frame GPU data, figure out which radiance probes need to be updated this frame (any block with a non-empty neighbor) and upload their positions to the GPU (btw this is the current bottleneck).
- Lightmaps: Sun lightmaps for blocks and models, pretty common stuff.
- GBuffer Pass: Lum is a "deferred" renderer, and this stage determines the voxel material and normal for each pixel.
- Compute: Updates all lighting state (radiance fields) and builds voxel representations of models in our world.
- Shading: Actually draws stuff on screen - diffuse & glossy, ambient occlusion, reflections, volumetrics, tonemapping and outputting into the swapchain image.
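A render request as described above might look roughly like the following sketch (the struct layout and field names are assumptions, not Lum's actual types):

```cpp
#include <cstdint>

// Hypothetical request types matching the two kinds of "commands" described above.
struct Quat { float x, y, z, w; }; // rotation, passed to shaders as a vec4
struct Vec3 { float x, y, z; };

struct BlockRenderRequest {
    uint16_t block_id;  // index into the block palette
    int32_t  x, y, z;   // grid-aligned position, in blocks
};

struct ModelRenderRequest {
    uint32_t model_id;  // which model (each model has its own 3D voxel image)
    Quat     rotation;
    Vec3     translation;
};
```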
Lightmaps
(believe me, this boring part is necessary for the interesting one after it)
We start by generating lightmaps for blocks and models. Lightmaps are a classic approach, and for this, we don't need voxel materials (until I implement transparency one day). We just need the geometry's "contour" - the vertices that define its silhouette.
To render this contour, we bind its vertex buffer and use push constants to pass in data for each draw call.
- For blocks, which are grid-aligned, this is just their position.
- For models, this is a vec4 quaternion for rotation and a vec3 for translation.
Since we only need the model-space position for vertices, models are not expected to be large, and the vertices of a voxel are snapped to its corners, each vertex position can be a tiny i8vec3 - just 3 bytes (in fact, model sizes are limited to 255 to fit into u8).
Also, for better culling, we don't actually store the contour as a single triangle mesh. We abuse the fact that voxel sides are grid-aligned and only have 6 possible normals. The mesh is divided into 6 sections, one for each normal, and we cull them separately against the camera direction (btw some big games also use it, but it works best for voxels).
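A rough sketch of both ideas - the tiny vertex format and the per-normal-section culling. All names are assumptions, and rotating the section normals for models is omitted:

```cpp
#include <cstdint>

// Vertices are snapped to voxel corners and models fit in 255^3,
// so three signed bytes of model-space position are enough.
struct ContourVertex { int8_t x, y, z; };

struct Vec3 { float x, y, z; };
static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// The contour mesh is split into 6 sections, one per possible face normal.
struct ContourSection {
    Vec3     normal;                   // one of +-X, +-Y, +-Z (rotate it for models)
    uint32_t first_index, index_count; // this section's range in the index buffer
};

// Cull whole sections against the camera direction, as described above
// (for a perspective camera you may prefer the camera-to-object vector instead).
void submit_visible_sections(const ContourSection (&sections)[6], Vec3 camera_forward) {
    for (const ContourSection& s : sections) {
        if (dot(s.normal, camera_forward) < 0.0f) {
            // issue the draw for [s.first_index, s.first_index + s.index_count) here
        }
    }
}
```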
If you are wondering why rasterization and not raytracing - rasterization IS raytracing; it is an optimization of a specialized case of raytracing.
At this point, the lightmap command buffers are done (of course, we execute them before shading).
GBuffer
Lum is a "deferred" renderer, and in this next stage, we determine the voxel material and normal for each pixel. This is where things get interesting. We actually don't need a second mesh - the contour we used for lightmapping is enough.
Since we render the 6 sides of the contour separately, they all have the same normal, which we just pass in push constants.
What about the material though? In the past, Lum encoded it into the vertex data (that's the obvious thing to do). This required a lot of vertices (a 16x16 side of a block with random-material voxels would generate many vertices, since they all have different materials, unlike the contour, which can merge them), and Lum was (as you might have guessed from experience) vertex-bound.
The fix was to move this work into the pixel shader. The vertex shader passes the interpolated model-space position of the vertex into the fragment shader. And as it turns out, that is enough to determine the material! We just fetch the voxel material at that position.
- For blocks, their voxel data is in the block palette, and the current block's index in that palette is passed via push constants.
- For models, we store each model's voxel data as a separate 3D image and bind it for every model draw call (or once for the WGPU backend, with a batched draw call).
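Roughly what the fragment shader does for the block case, rewritten as C++ for readability. The half-voxel nudge along the face normal and all names are my additions; in the real shader this would be a texel fetch from the block-palette image:

```cpp
#include <cstdint>

constexpr int BLOCK_SIZE = 16;
struct Block { uint8_t voxels[BLOCK_SIZE][BLOCK_SIZE][BLOCK_SIZE]; };

// ms_pos: interpolated model-space position in voxels; face_normal: the section's normal.
// The block's index in the palette arrives via push constants.
uint8_t fetch_block_material(const Block* block_palette, uint32_t block_index,
                             const float ms_pos[3], const float face_normal[3]) {
    // Positions lie exactly on a voxel face; nudge half a voxel inward along the
    // face normal so flooring lands inside the voxel that owns that face.
    int x = (int)(ms_pos[0] - 0.5f * face_normal[0]);
    int y = (int)(ms_pos[1] - 0.5f * face_normal[1]);
    int z = (int)(ms_pos[2] - 0.5f * face_normal[2]);
    return block_palette[block_index].voxels[z][y][x];
}
```

The model case is the same lookup, just into the model's own 3D image instead of the block palette.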
At this point, we have rasterized all blocks and models into our GBuffer.
There are some small visual features that I just really wanted to implement, so here we are:
- Foliage: I thought that since the intention is to construct the world out of blocks, the same can work for foliage - we don't need to store positions of individual foliage objects. So, foliage is generated entirely by shaders (mostly the vertex shader). We draw N instances of M vertices, where n ∈ [0, N) corresponds to the index of a blade, and m ∈ [0, M) corresponds to the index of a vertex in that blade. Actually, 1 instance of (M+1)*N vertices is enough - we render blades as a triangle strip (to avoid doing the same work multiple times), and we could insert a NaN vertex between strips to prevent the GPU from rendering triangles connecting them, but we might as well use the GPU's built-in instances - it just saves the GPU a few thrown-away NaN triangles (see the sketch after this list).
- Liquids: We could do a DFT/FFT at some big resolution, but it's too slow. Instead, we do a "fake/manual" FFT, where we compute a few separate small DFTs and combine them with different scales and amplitudes (I call them LOD levels). The height at a point p is ∑ᵢ Lodᵢ(p) * Scaleᵢ.
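A sketch of both tricks; the index math is straightforward, and the stand-in lod_height function plus all names and constants are assumptions:

```cpp
#include <cmath>
#include <cstdint>

// Foliage: with a single instance of (M+1)*N strip vertices, the blade index and
// the vertex index inside the blade fall out of a division (with instancing they
// are simply the instance index and the vertex index).
void blade_indices(uint32_t vertex_index, uint32_t verts_per_blade /* M+1 */,
                   uint32_t* blade, uint32_t* vert_in_blade) {
    *blade         = vertex_index / verts_per_blade;
    *vert_in_blade = vertex_index % verts_per_blade;
}

// Liquids: a few small DFT-like height layers ("LOD levels") combined at different
// scales and amplitudes: height(p) = sum_i Lod_i(p * freq_i) * Scale_i.
struct WaveLod { float freq; float amp; };

static float lod_height(float x, float z) { // stand-in for one small DFT layer
    return std::sin(x) * std::cos(z);
}

float water_height(float x, float z, const WaveLod* lods, int count) {
    float h = 0.0f;
    for (int i = 0; i < count; ++i)
        h += lod_height(x * lods[i].freq, z * lods[i].freq) * lods[i].amp;
    return h;
}
```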
Compute
There is not much Lum does in its compute workload - updating the data structure to represent models, updating the radiance field, and updating grass & water states.
At this point, all CPU parts of "copy-on-write" have happened - we "allocated" (which is just memorizing which indices are free and which are taken) temporary dynamic blocks on the CPU. In this stage we actually copy data from static blocks to the allocated dynamic ones (and for empty (air) blocks too, but for performance we "clear" (memset) the corresponding subimages to 0). After that, for every requested model draw, we submit a compute dispatch that reads a model-space voxel and writes it to the corresponding world voxel - it determines in which block that happens (reads world data), then indexes into the block palette and finds the actual voxel (previously we made sure all blocks occupied by models have been allocated, so we are guaranteed not to write to static blocks).
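A minimal sketch of the CPU half of that copy-on-write, assuming a free list of dynamic palette slots (names and bookkeeping details are mine; the actual copying/clearing of voxel data happens on the GPU in this stage):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct CowAllocator {
    std::vector<uint16_t> free_dynamic;              // unused dynamic block-palette slots
    std::unordered_map<uint64_t, uint16_t> replaced; // world cell -> dynamic copy, this frame

    // Called for every world cell a model overlaps. `grid_cell` is the packed world
    // coordinate, `current` the block id stored there now. Returns the block id the
    // world grid should point at for this frame. (Assumes the free list is not exhausted.)
    uint16_t make_writable(uint64_t grid_cell, uint16_t current) {
        auto it = replaced.find(grid_cell);
        if (it != replaced.end()) return it->second; // already copied this frame

        uint16_t dyn = free_dynamic.back();          // "allocate": mark a free slot as taken
        free_dynamic.pop_back();
        replaced[grid_cell] = dyn;

        // Queue GPU work: copy static block `current` into slot `dyn`,
        // or clear (memset) the slot if `current` was an empty/air block.
        (void)current;
        return dyn;
    }
};
```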
Now, with the updated data structure, it's time to update the radiance field.
The radiance field is a set of per-block lighting probes, which describe luminance in every direction with some encoding (currently, for performance tuning, there is no direction at all, LOL). For shading a pixel, we interpolate between multiple adjacent probes to estimate the light coming from all directions to this pixel (note that we don't really care for it to be ReAlIsTiC, we only need it to look good).
So, we submit a compute dispatch of NxMx1, where N is the number of radiance update requests, and M is the number of rays per request. Currently it's 64 (double the warp size, for swapping - you probably know this). In each thread, we send a ray in a random (uniform with some randomness) direction, raytrace it, downsample the result to store in the radiance probe (currently it has no direction, so we just average across threads; when I make it 6-directional, there will be mixing into 6 different directions) and mix it with what is currently in the radiance field (accumulating across time).
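Per probe, the accumulation boils down to something like this (a sketch; the blend factor and names are assumptions):

```cpp
// Average this frame's rays for one probe and blend with the stored value
// (an exponential moving average = accumulating across time).
float accumulate_probe(float previous, const float* ray_radiance, int ray_count,
                       float blend /* temporal blend factor, e.g. 0.1 */) {
    float sum = 0.0f;
    for (int i = 0; i < ray_count; ++i) sum += ray_radiance[i];
    float this_frame = sum / (float)ray_count; // no directional encoding yet, so a plain average
    return previous + (this_frame - previous) * blend;
}
```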
My raytracing is actually ray marching - making small steps and checking for collisions. I tried "precise" raymarchers (based on the famous paper from 1987) in multiple different variations (branches / masks / partially unrolled), I tried distance fields, I tried bit arrays instead of voxels for "checking" whether a voxel is non-empty, but the fastest good-looking solution turned out to be simple fixed-step raymarching with rollback on collision for a precise hit (and with manual "jumps" over air (empty) blocks).
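A sketch of that marcher. Step sizes, the epsilon and the helper names are assumptions; voxel_at / block_empty_at stand in for lookups into the world grid and block palette described earlier:

```cpp
#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };
static Vec3 along(Vec3 o, Vec3 d, float t) { return {o.x + d.x * t, o.y + d.y * t, o.z + d.z * t}; }

// Provided by the world data structures: material index at a position (0 = air)
// and whether the whole 16^3 block containing a position is empty.
uint8_t voxel_at(Vec3 p);
bool    block_empty_at(Vec3 p);

// Distance along the ray until it exits the 16^3 block containing p (slab test).
static float block_exit_distance(Vec3 p, Vec3 d) {
    float pv[3] = {p.x, p.y, p.z}, dv[3] = {d.x, d.y, d.z};
    float t = 1e9f;
    for (int i = 0; i < 3; ++i) {
        if (dv[i] == 0.0f) continue;
        float lo = std::floor(pv[i] / 16.0f) * 16.0f;
        float bound = dv[i] > 0.0f ? lo + 16.0f : lo;
        float ti = (bound - pv[i]) / dv[i];
        if (ti < t) t = ti;
    }
    return t;
}

bool raymarch(Vec3 origin, Vec3 dir /* normalized */, float max_dist,
              Vec3* hit, uint8_t* mat) {
    const float STEP = 0.5f;   // coarse fixed step, in voxels
    const float FINE = 0.125f; // rollback step for a more precise hit
    float t = 0.0f;
    while (t < max_dist) {
        Vec3 p = along(origin, dir, t);
        if (block_empty_at(p)) {                        // manual jump over an empty block
            t += block_exit_distance(p, dir) + 0.01f;
            continue;
        }
        if (voxel_at(p) != 0) {                         // collision
            // Rollback: walk backwards in finer steps to where the solid starts.
            while (t > 0.0f && voxel_at(along(origin, dir, t - FINE)) != 0) t -= FINE;
            *hit = along(origin, dir, t);
            *mat = voxel_at(*hit);
            return true;
        }
        t += STEP;
    }
    return false;
}
```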
Details about actual material light processing don't really matter - it's just a visual thing, you can use whatever you want. I like a primitive shading model with roughness, color and emittance.
As an optimization for tracing probe rays, instead of tracing until a hit / a long distance, you can trace for only a few blocks and, if there is no hit, inherit the light from the end point - I call it light propagation (it is another way to accumulate over time). It also creates a cool effect of visible light waves.
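Sketched against the raymarch function above, with shade_hit and probe_radiance_at as hypothetical stand-ins for material shading and the radiance-field lookup:

```cpp
#include <cstdint>

struct Vec3 { float x, y, z; };
bool  raymarch(Vec3 origin, Vec3 dir, float max_dist, Vec3* hit, uint8_t* mat);
float shade_hit(Vec3 p, uint8_t mat);     // whatever shading model you like
float probe_radiance_at(Vec3 p);          // read the stored radiance field near a point

// Trace only a few blocks; if nothing is hit, inherit the light already stored
// near the ray's end point. Light then "propagates" one segment per update.
float trace_probe_ray_short(Vec3 origin, Vec3 dir, float max_dist /* a few blocks */) {
    Vec3 hit; uint8_t mat;
    if (raymarch(origin, dir, max_dist, &hit, &mat))
        return shade_hit(hit, mat);
    Vec3 end = {origin.x + dir.x * max_dist,
                origin.y + dir.y * max_dist,
                origin.z + dir.z * max_dist};
    return probe_radiance_at(end);
}
```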
Shading
Modern games think that rendering in full-res is slow, so they shade in low-res and then upscale. However, upscaling can be slow, and subpass optimizations are so fast that this tradeoff is almost never worth it for a renderer like this. All of Lum's shading happens in a single render pass, and that is a key reason why it can run on laptops. The frame image we render to potentially never even leaves the GPU's on-chip memory.
The shading render pass order is: "diffuse" light shading → ambient occlusion → glossy reflections → volumetrics → tonemapping to swapchain.
- diffuse reads the GBuffer (also potentially loaded only once - it's a subpass input), samples the radiance field + sun lightmaps and produces the "diffuse" shading color of the pixel
- ambient occlusion samples the depth buffer of nearby pixels (so sadly it's not purely a subpass input, since we need adjacent pixels; but it was filled in a previous render pass, so we can safely do it) and darkens the pixel in "corners"
- the glossy reflections shader uses the same technique as the radiance field updater, but with slightly different settings (and, as I said, it is culled with a cheap shader marking pixels for processing with a stencil mask)
- volumetrics sample 3D Perlin noise as density along the traversal thickness of the volume at the corresponding pixel to calculate total obscuration, and then blend using it as alpha (a sketch follows this list)
- tonemap does all per-pixel color processing - remapping the color range, some filters - and outputs directly to the swapchain image
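For example, the volumetrics step boils down to something like this sketch, where noise3 stands in for the 3D Perlin noise sample and the extinction model is an assumption:

```cpp
#include <cmath>

float noise3(float x, float y, float z); // 3D Perlin noise, provided elsewhere

// Density at the pixel's position times the thickness the view ray travels through
// the volume gives a total obscuration, turned into a blend alpha via extinction.
float volumetric_alpha(float px, float py, float pz,
                       float thickness, float density_scale) {
    float density = noise3(px, py, pz) * density_scale;
    float transmittance = std::exp(-density * thickness); // Beer-Lambert style falloff
    return 1.0f - transmittance;                           // alpha for blending the fog color
}
```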