Triangle Strips
In real-time graphics, a mesh usually consists of triangles. Each triangle consists of three indices to vertex attributes. An index buffer
V=\left[V_0, V_1, V_2, V_3, V_4, V_5, V_6, V_7, V_8 \dots\right]
represents the connectivity of a triangle mesh.
Commonly, a GPU interprets V as a *triangle list*, where triples of indices in V form the triangles:
\left(V_0, V_1, V_2\right), \left(V_3, V_4, V_5\right), \left(V_6, V_7, V_8\right) \dots
Or here in a concrete example:
However, with three indices per triangle, their storage requirement is relatively high.
A less memory intense way to interpret V are triangle strips, where a triangle reuses an edge of the preceding triangle:
\begin{array}{llll} \left(V_0, V_1, V_2\right), & \left(V_3, V_2, V_1\right), & \left(V_2, V_3, V_4\right), & \left(V_5, V_4, V_3\right),\\ \left(V_4, V_5, V_6\right), & \left(V_7, V_6, V_5\right), & \left(V_6, V_7, V_8\right), & \dots \end{array}
Note that we inverted the vertex order in every second triangle to keep the winding order intact.
Now, each triangle in the strip requires only one new index, except for the first one.
Thus, using strips results in a 3:1 compression ratio.
Triangle strips describe a path over the triangles of a mesh. The previous example was of an alternating triangle strip, where the strip alternates between reusing the right and the left edge of the previous triangle. As this is not very flexible, we can introduce another bit flag per triangle that encodes which edge is reused, which we call the L/R flag. This is called a generalized triangle strip or GTS and is what we use for our meshlets. Our previous example encoded as a GTS could look like this:
Note that we also ordered the vertices of the meshlet, so that they appear in incrementing order in the strip. As this way, the first indices \left(0, 1, 2\right) are mandatory, we don’t need to store them.
To entirely represent a mesh, or in our case a meshlet, multiple strips are typically required. To describe multiple strips as a single one, we need a way to encode a strip restart. This can either be done by adding another restart bit per triangle, or by intentionally adding degenerate triangles to the strip.
A degenerate triangle has two or more identical vertex indices, thus its area is zero. As they will not change the result of the rendered image, the rasterizer of a GPU can just ignore them. For example, the index buffer for an alternating strip
\left[V_0, V_1, V_2, V_3, \textcolor{red}{V_3}, \textcolor{red}{V_4}, V_4, V_5, V_6\right]
with the duplicated vertices \textcolor{red}{V_3}, \textcolor{red}{V_4} produces four degenerate triangles:
\begin{array}{llll} \left(V_0, V_1, V_2\right), & \left(V_3, V_2, V_1\right), & \left(V_2, V_3, \textcolor{red}{V_3}\right), & \left(\textcolor{red}{V_4}, \textcolor{red}{V_3}, V_3\right),\\ \left(\textcolor{red}{V_3}, \textcolor{red}{V_4}, V_4\right), &\left(V_5, V_4, \textcolor{red}{V_4}\right), & \left(V_4, V_5, V_6\right). \end{array}
As you can see, we successfully ended a strip on a triangle \left(V_3, V_2, V_1\right) and started a new one with triangle \left(V_4, V_5, V_6\right). Using degenerate triangles instead of bit flags comes with the benefit that no additional logic is required in the decompression stage. The rasterizer will just ignore all degenerate triangles.
When encoding a meshlet in a GTS, the number of triangles grows when requiring a restart. Thus, in very rare cases, the number of triangles can exceed the 256 limit. In such a case, we choose to split the meshlet in two.
Finding good strips through a meshlet that only require a low number of restarts is a non-trivial problem. For a good tradeoff between run-time performance and quality, we recommend to use the enhanced tunneling algorithm. In our paper “Towards Practical Meshlet Decompression”, we describe an optimal algorithm, but that takes some time to compute.
Strip decoding
As established in the previous section, a L/R flag per triangle marks which edge of the previous triangle is reused. As all triangles are decompressed in parallel, a thread cannot access the decompressed previous triangle. Instead, each individual thread must find the correct indices from earlier in the strip. See the next figure for reference: the longer the strip spins around vertex in a fan-like manner, the further back the required index lies.
To find out how far exactly we have to go back in the strip, we count (starting with 1) how many steps we have to go back in the L/R bit mask until a change of flag occurs. In our example, triangle 7 requires the index of triangle 4, which is 0. To do this search in linear time, we can use the bit operation firstbithigh (or firstbitlow) in hlsl or findMSB (or findLSB) in glsl. Note that the bit mask has to be shifted and inverted accordingly.
This can be implemented in hlsl like the following:
uint LoadTriangleFanOffset(in int triangleIndex, in bool triangleFlag)
{
uint lastFlags;
int fanOffset = 1;
do
{
// Load last 32 L/R-flags
lastFlags = LoadLast32TriangleFlags(triangleIndex);
// Flip L/R-flags, such that
// 0 = same L/R flag
// 1 = different L/R flags, i.e. end of triangle fan
lastFlags = triangleFlag ? ~lastFlags : lastFlags;
// 31 - firstbithigh counts leading 0 bits, i.e. length of triangle fan
fanOffset += 31 - firstbithigh(lastFlags);
// Move to next 32 L/R flags
triangleIndex -= 32;
}
// Continue util
// - lastFlags contains different L/R flag (end of triangle fan)
// - start of triangle strip was reached (triangleIndex < 0)
while ((lastFlags == 0) && (triangleIndex >= 0));
return fanOffset;
}
We have to use a loop here, because a fan-like configuration could be longer than 32 triangles. In most cases, this loop will only run once. In the worst case, the whole strip is actually a fan, so the loop will run 256/32 = 8 times (256, because it is the maximum amount of triangles a mesh shader can output and 32, because we store one bit per triangle in a 32-bit dword). Inside the loop, we load the L/R bits in a way that the most significant bit is the flag of the current triangle, the second most significant bit is the form the previous triangle, and so on:
// 256/32 = 8 dwords for flags and 256 / 4 = 64 dwords for 8 bit indices
groupshared uint IndexBufferCache[8 + 64];
uint LoadLast32TriangleFlags(in int triangleIndex)
{
// Index of DWORD which contains L/R flag for triangleIndex
int dwordIndex = triangleIndex / 32;
// Bit index of L/R flag in DWORD
int bitIndex = triangleIndex % 32;
// Shift triangle L/R flags such that lastFlags only contains flag prior to triangleIndex
uint lastFlags = IndexBufferCache[dwordIndex] << (31 - bitIndex);
if (dwordIndex != 0 && bitIndex != 31)
{
// Fill remaining bits with previous L/R flag DWORD
lastFlags |= IndexBufferCache[dwordIndex - 1] >> (bitIndex + 1);
}
return lastFlags;
}
As you can see here, we do something we recommended in the previous posts of this series: make use of the group shared memory. In detail, instead of every thread having its own copy of the whole compressed strip, or loading that data from global memory multiple times, we load the data once and cache it in group shared memory. This improves run-time performance.
Finally, the whole triangle is loaded like this:
uint3 LoadTriangle(in int triangleIndex)
{
// Triangle strip L/R flag for current triangle
const bool triangleFlag = LoadTriangleFlag(triangleIndex);
// Offset in triangle fan
const uint triangleFanOffset = LoadTriangleFanOffset(triangleIndex, triangleFlag);
// New vertex index
uint i = LoadTriangleIndex(triangleIndex);
// Previous vertex index
uint p = LoadTriangleIndex(triangleIndex - 1);
// Triangle fan offset vertex index
uint o = LoadTriangleIndex(triangleIndex - triangleFanOffset);
// Left triangle
// i ----- p
// \ t / \
// \ / \
// \ / t-1 \
// o ----- *
//
// Right triangle
// p ----- i
// / \ t /
// / \ /
// / t-1 \ /
// *------ o
return triangleFlag ?
// Right triangle
uint3(p, o, i) :
// Left triangle
uint3(o, p, i);
}