BatchRendererGroup sample: Achieve high frame rate even on budget devices

October 3, 2023 in Engine & platform | 15 min. read
A side-by-side look at a still image of the BatchRendererGroup (BRG) shooter sample in action on a horizontal smartphone next to code for the sample.

In this post, we describe a small shooter game sample that animates and renders several interactive objects. Many demos are made for high-end PCs only, but the goal here is to achieve a high frame rate on a budget phone using GLES 3.0. This sample uses BatchRendererGroup, Burst compiler, and the C# Job System. It runs in Unity 2022.3 and doesn't require entities or entities.graphics DOTS packages.

Let’s get started.

Introducing the sample

Let’s jump right into what the sample is. This sample is running at a steady 60 fps on a budget 2019 Samsung Galaxy A51 (using a Mali G72-MP3 GPU). The graphics API is set to GLES 3.0.

You can study the code and try it on your favorite platform by downloading the project from GitHub. You’ll only need stock Unity 2022.3.

In this post we mainly focus on BatchRendererGroup and the sample class BRG_Container.cs. You can also study the animation and physics code in the BRG_Background.cs and BRG_Debris.cs classes.

Setting the scene

Let’s explore what we see before going deeper into how to make it.

  • The background floor is constructed from many cubes. All boxes are animated to move up and down.
  • The main ship moves horizontally on the screen and shoots missiles at colored spheres. (You can shoot missiles faster by tapping the screen.)
  • When a missile flies over the floor, a magnetic field slightly lifts and highlights the floor cells. It also throws ground debris into the air.
  • When a missile hits a sphere, it explodes into colored debris.
  • When debris hits the floor, the colliding cell on the floor flashes white. The more debris that hits a cell, the more the cell’s color darkens. In addition, the weight of the debris causes indents in the ground.

Rendering

Both the floor cells and debris are made of cubes. Each cube has a different position and color. We want to animate and manage everything using the CPU to make the interactions between floor and debris easier. (Debris isn’t just a cosmetic visual, so it can’t be done with the GPU only.)

For rendering, we aren’t creating a GameObject per item to avoid an unnecessary performance hit on a low-end mobile device. Instead, we’re using the newly introduced BatchRendererGroup API.

Why not use classic Graphics.DrawMeshInstanced?

Graphics.DrawMeshInstanced is a convenient and fast way to render many similar meshes at different positions. However, it has the following limitations compared to the BatchRendererGroup API:

  • It requires providing a managed memory array with matrices, so you may get garbage collection. Also, inverted matrices are CPU-computed, even if the shader doesn’t need them (for instance, with URP/unlit).
  • If you want to customize any property other than the obj2world matrix (like having one color per instance), you need to provide your own custom shader, either by writing it from scratch or by using Shader Graph.
  • Matrix or custom data must be uploaded to GPU memory at each draw. You can’t have persistent GPU memory data with Graphics.DrawMeshInstanced. Depending on context, this could be a huge performance hit.

What is BatchRendererGroup?

BatchRendererGroup (or BRG) is an API that efficiently generates draw commands from C# and produces GPU-instancing draw calls. Since it doesn’t use managed memory, you can also generate commands using the Burst compiler.

Pros
  • Ability to quickly generate DrawInstanced commands from Burst jobs
  • Use persistent large GPU buffer to store any custom properties per instance
  • Supported on a wide range of platforms, including OpenGLES 3.0 and above
  • Compatible with standard SRP shaders (lit and unlit). No need to write custom shaders

Cons
  • You have to generate optimal batches of draw commands yourself
  • You must manage GPU memory and custom properties offset allocation yourself

Tip: The entities.graphics package is made to render entities (ECS package) and is built on top of BRG. entities.graphics does all GPU memory management and optimal draw command creation for you. We’re not using ECS in this sample, so we’ll directly drive BRG.

BRG shader data model

BRG uses a specific GPU data layout and dedicated shader variant. The shader variant can fetch data from the standard constant buffer (UnityPerMaterial) or from a custom, large GPU buffer (BRG raw buffer). It’s up to you to manage how you store your data in the raw buffer, which is a Shader Storage Buffer Object (SSBO, or byte address buffer). The default BRG data layout is the structure of arrays (SoA) type.

Properties per instance – or not

You can instantiate any properties of a material without having to create a custom shader. In the sample, we want to instantiate obj2world matrix (to position cubes), world2obj matrix (for lighting), and BaseColor per box instance (because each floor cell or debris has its own color).

All other properties are the same for all cubes (e.g., smoothness value), and you can describe which properties will have custom values per instance using metadata.

BRG metadata

The BRG metadata is an optional 32-bit value you can set per shader property. It tells the shader code how to load the property value from GPU memory and from where. Bits 0–30 define the offset of the property within the BRG raw buffer, and bit 31 tells whether the property value is the same for all instances or the offset is the beginning of an array, with one value per instance.
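The bit layout described above can be sketched in plain C#. This is only an illustration: BrgMetadata and its members are hypothetical names, not part of the Unity API, and the sketch assumes nothing beyond "bits 0–30 hold the offset, bit 31 marks a per-instance array".

```csharp
using System;

// Hypothetical helper (not a Unity type): packs and unpacks a 32-bit BRG
// metadata value following the bit layout described in the text.
public static class BrgMetadata
{
    // Bit 31: set when the offset points to an array with one value per instance.
    public const uint PerInstanceBit = 0x80000000u;

    public static uint Pack(uint byteOffset, bool perInstance)
    {
        // The offset must fit in bits 0-30.
        if ((byteOffset & PerInstanceBit) != 0)
            throw new ArgumentOutOfRangeException(nameof(byteOffset));
        return perInstance ? (byteOffset | PerInstanceBit) : byteOffset;
    }

    public static bool IsPerInstance(uint metadata)
    {
        return (metadata & PerInstanceBit) != 0;
    }

    public static uint Offset(uint metadata)
    {
        return metadata & ~PerInstanceBit;
    }
}
```

For example, BrgMetadata.Pack(153600, true) yields 0x80025800: offset 153,600 with bit 31 set.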

The exact meaning of BRG metadata also depends on the shader property type. Let’s sum up all possibilities:

BRG metadata not defined
  • Any “per material” property (e.g., “BaseColor”): standard UnityPerMaterial constant buffer
  • obj2world, world2obj, MatrixPreviousM, MatrixPreviousMI: undefined. If the shader variant is using these properties, you should define metadata
  • LODFade, RenderingLayer, MotionVectorsParams, WorldTransformParams: automatically provided by Unity (you can’t override values)
  • SHAx, SHBx, SHC, ProbesOcclusion: global SH automatically provided by Unity, same value for all instances

BRG metadata defined, bit 31 cleared
  • Any “per material” property (e.g., “BaseColor”): standard UnityPerMaterial constant buffer
  • obj2world, world2obj, MatrixPreviousM, MatrixPreviousMI: BRG raw buffer, same value for all instances
  • LODFade, RenderingLayer, MotionVectorsParams, WorldTransformParams: automatically provided by Unity (you can’t override values)
  • SHAx, SHBx, SHC, ProbesOcclusion: BRG raw buffer, same value for all instances

BRG metadata defined, bit 31 set
  • Any “per material” property (e.g., “BaseColor”): BRG raw buffer, array (one value per instance)
  • obj2world, world2obj, MatrixPreviousM, MatrixPreviousMI: BRG raw buffer, array (one value per instance)
  • LODFade, RenderingLayer, MotionVectorsParams, WorldTransformParams: automatically provided by Unity (you can’t override values)
  • SHAx, SHBx, SHC, ProbesOcclusion: BRG raw buffer, array (one value per instance)

Figure 1: Using BRG metadata, you can describe which properties have a custom value per instance (like obj2world, world2obj, baseColor). All other properties have the exact same value for all instances (and still use the classic UnityPerMaterial cbuffer as the data source).

BRG culling and visibility indices

Unlike Graphics.DrawMeshInstanced, BRG uses a persistent GPU memory buffer. Let’s say you have 10 cube positions and colors in the raw buffer, but only cubes 0, 3, and 7 are visible. You only want to draw three cubes, but you need the shader to properly read the position and color of those cubes. To do that, the BRG shader uses a small additional indirection. This visibility buffer is just an array of “int” you fill when generating draw commands.

In this example, you need to fill an array of three ints with {0,3,7} and can then generate a BRG draw command of three instances.
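The indirection can be imitated on the CPU with a small sketch. The class and method names are hypothetical and not part of the Unity API; on the GPU, the equivalent lookup is performed by the BRG shader variant.

```csharp
using System;

// Hypothetical CPU-side imitation of the visibility indirection.
public static class VisibilityDemo
{
    // Instance N of the draw call reads element visibility[N] of the
    // persistent per-instance array, not element N.
    public static T[] Gather<T>(T[] perInstanceData, int[] visibility)
    {
        var result = new T[visibility.Length];
        for (int gpuInstanceId = 0; gpuInstanceId < visibility.Length; gpuInstanceId++)
        {
            int visibleId = visibility[gpuInstanceId];
            result[gpuInstanceId] = perInstanceData[visibleId];
        }
        return result;
    }
}
```

With ten colors named "c0" to "c9" and a visibility array of {0, 3, 7}, Gather returns "c0", "c3", and "c7": a draw of three instances that reads the data of cubes 0, 3, and 7.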

Figure 2: The BRG shader variant always uses the visibility indirection to fetch data from the persistent raw buffer. This small visibility indirection buffer can be generated for each frame according to your needs.

The shader code to fetch the “baseColor” property looks like this:

if ( metadata_baseColor&(1<<31) )
{
    // get the real index from the visibility buffer indirection
    int visibleId = brg_visibility_array[GPU_instanceId];
    uint base = (metadata_baseColor&0x7ffffffc);
    uint offset = visibleId * sizeof(baseColor);
    // fetch data from a custom array in BRG raw buffer
    baseColor = brg_raw_buffer.Load( base + offset );
}
else
{
    // fetch data from UnityPerMaterial (as usual)
    baseColor = UnityPerMaterial.baseColor;
}

Go further than the sample: As you can instantiate any property of SRP shaders (unlit, simplelit, lit), all material properties have an “if metadata&(1<<31)” branch. Even if you don’t need a custom smoothness value per instance, this has some performance cost. In the sample, we only want to instantiate baseColor. You can create a Shader Graph where only color is defined as BRG instantiatable, so that the generated code has the heavy data-fetching indirection only for the color property. The shader should then run slightly faster on a low-end GPU.

Rendering floor cells

In our game sample, the floor is made of 32x100 cells, or 3,200. Each has a position, height, and color, and the cells scroll while the camera remains static. When a row scrolls out of the view, we inject a new row of 32 cells.

A new row of cells is inserted when a full row has scrolled out of the view. Random height and color are used for new cells. You can have a look at BRG_Background.InjectNewSlice() in the sample.

With 3,200 cells at any point in time, culling is not really necessary (all cells are always within the camera’s view). To position each cell, you need an obj2world matrix per cell, the inverse matrix for lighting, and a color. To render the complete floor, we’ll use a single BRG draw command.

Rendering explosion debris

All debris have simple gravity physics and interact with floor cells. Everything is running on the CPU using Burst C# jobs.

The sample’s debris is made up of small cubes, each one having a position, color, and rotation on its vertical axis. This is very similar to the floor cells. To do this, we created BRG_Container.cs. The class manages a BRG object to render floor cells or explosion debris. All physics animation and interaction is done with C# code using BRG_Debris.cs.

Unlike floor cells, the amount of debris varies from frame to frame. At initialization, you specify the maximum number of items to BRG_Container. In our sample, it’s 16,384 for debris (each explosion consists of 1,024 debris cubes), and we use async jobs to animate debris in a gravity field. When debris hits a floor cell, it interacts by digging into the ground.

BRG matrix format

To optimize GPU memory storage and bandwidth, BRG uses a float3x4 to store a matrix instead of float4x4. Keep in mind that a BRG matrix in the raw buffer is 48 bytes, not 64.

A BRG matrix is only 48 bytes (i.e., three float4) to improve GPU bandwidth.
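As a rough illustration of the 48-byte format, the sketch below drops the fourth row of an affine 4x4 matrix and keeps 12 floats. The PackedMatrix name, the column-major input convention, and the column-by-column output order are assumptions made for this example; check BRG_Container.cs for the exact layout the sample writes.

```csharp
using System;

// Hypothetical helper (not a Unity type): packs a 4x4 affine matrix into 12 floats.
public static class PackedMatrix
{
    public const int FloatCount = 12;
    public const int SizeInBytes = FloatCount * sizeof(float); // 48 bytes

    // Assumption: 'm' holds 16 floats in column-major order (m[col * 4 + row]).
    // The fourth row of an affine matrix is always (0, 0, 0, 1), so it is not stored.
    public static float[] Pack(float[] m)
    {
        if (m == null || m.Length != 16)
            throw new ArgumentException("Expected 16 floats.", nameof(m));

        var packed = new float[FloatCount];
        for (int col = 0; col < 4; col++)
            for (int row = 0; row < 3; row++)
                packed[col * 3 + row] = m[col * 4 + row];
        return packed;
    }
}
```

Packing an identity matrix with a translation of (5, 6, 7) gives 1,0,0, 0,1,0, 0,0,1, 5,6,7: the three axis columns followed by the translation.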

The raw buffer will look like this:

Figure 3: A 350 KiB SSBO raw buffer contains data for 3,200 instances, using the SoA layout.

Tip: Debris raw buffer data looks similar to floor data as it also uses three custom properties (obj2world, world2obj, and color). The maximum number of items is 16,384 for debris, meaning a raw buffer of 112x16,384 bytes, or 1.75 MiB. Most of the time, only part of that buffer is rendered, depending on the number of debris cubes in existence at a given time.
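The sizes quoted for both buffers follow directly from the SoA layout. The sketch below reproduces the arithmetic; SoaLayout is a hypothetical name, and the only inputs are the 48-byte matrix and 16-byte color described in the text.

```csharp
using System;

// Hypothetical helper: byte offsets of the three per-instance arrays in an
// SoA raw buffer (all matrices, then all inverse matrices, then all colors).
public static class SoaLayout
{
    public const int MatrixSize = 48; // float3x4
    public const int ColorSize = 16;  // float4
    public const int BytesPerInstance = 2 * MatrixSize + ColorSize; // 112

    public static int ObjectToWorldOffset(int instanceCount) { return 0; }
    public static int WorldToObjectOffset(int instanceCount) { return instanceCount * MatrixSize; }
    public static int ColorOffset(int instanceCount) { return instanceCount * MatrixSize * 2; }
    public static int TotalSize(int instanceCount) { return instanceCount * BytesPerInstance; }
}
```

For the 3,200 floor cells, the arrays start at 0, 153,600, and 307,200, and the buffer is 358,400 bytes (350 KiB). For 16,384 debris cubes, the total is 1,835,008 bytes (1.75 MiB).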

Animating floor cells

We have a GPU GraphicsBuffer of 358,400 bytes. Since animation is done with the CPU, we also allocate a similar buffer in system memory (the CPU can process data at full speed in system memory). Let’s call this second buffer a “shadow copy” of the GPU memory. C# code animates the floor cells (using a sine function) and the debris in the shadow copy. When animation is done, we upload the shadow copy buffer to the GPU using the GraphicsBuffer.SetData API.

Go further than the sample: Optimizing GPU rendering often means optimizing the amount of data. In our sample, we use standard and stock SRP shaders. That’s why we employed three float4 for the matrix and one float4 for color. You could go further by writing a custom shader that reduces the data size. For example, you could upload only a 32-bit height value per floor cell.

If you wish to keep going, use the cell index to calculate its world position, then compute the matrix and inverse matrix in the shader. Finally, use a 32-bit integer to store the color. In the end, you upload 8 bytes per item instead of 112, which is 14 times less data per GPU upload. This would require rewriting the shader fetching code.

BRG BatchID

Any BRG draw command needs a MeshID, MaterialID, and BatchID. The first two are easy to understand, but BatchID is more subtle. Think of BatchID as “kind of a batch.” To render the floor, you need to register one kind of batch, defined as follows:

  1. “unity_ObjectToWorld” property is an array starting at offset 0 of the BRG raw buffer
  2. “unity_WorldToObject” property is an array starting at offset 153,600
  3. “_BaseColor” property is an array, starting at offset 307,200

Code to register this kind of batch at creation time will look similar to this:

    int objectToWorldID = Shader.PropertyToID("unity_ObjectToWorld");
    int worldToObjectID = Shader.PropertyToID("unity_WorldToObject");
    int colorID = Shader.PropertyToID("_BaseColor");
    var batchMetadata = new NativeArray<MetadataValue>(3, Allocator.Temp, NativeArrayOptions.UninitializedMemory);

    batchMetadata[0] = CreateMetadataValue(objectToWorldID, 0, true);           // matrices
    batchMetadata[1] = CreateMetadataValue(worldToObjectID, 3200*3*16, true);   // inverse matrices
    batchMetadata[2] = CreateMetadataValue(colorID, 3200*3*16*2, true);         // colors
    m_batchId = m_BatchRendererGroup.AddBatch(batchMetadata, m_GPUPersistentRawBuffer.bufferHandle, 0, 0);

We get the m_batchId at creation time, and can then use it for each BRG draw command (so the shader knows exactly how to fetch data for that kind of batch).

Tip: BatchRendererGroup.AddBatch is not a rendering command. It’s used to register a kind of batch, for future rendering commands.

The devil’s in the details: GLES exception

So far, we can animate floor cells, upload the shadow copy system memory buffer to the GPU, and render all cells using a single DrawCommand of 3,200 instances.

This will work on most platforms: DirectX, Vulkan, Metal, and various game consoles, but not on GLES. The problem is that most GLES 3.0 devices can’t access SSBO during the vertex stage (i.e., the GL_MAX_VERTEX_SHADER_STORAGE_BLOCKS value is 0). So, when the graphics API is set to GLES, BRG will use a constant buffer, or UBO, instead to store the raw data.

This adds constraints: A constant buffer can be any size, but only a small part of it (a window) is visible at any given time when the shader is running. The window size depends on the hardware and driver, but a widely accepted value is 16 KiB.

Tip: In UBO mode, you should always use the BatchRendererGroup.GetConstantBufferMaxWindowSize() API to get the correct BRG window size.

Let’s see how our code changes if we want to run on GLES. For floor cells, the total amount of data is 350 KiB. We can’t do a single DrawInstanced(3,200) because the shader won’t be able to see 350 KiB at once. So, we have to split data within the UBO to maximize the number of instances per draw, fitting into a 16 KiB block. One floor cell is 112 bytes (two matrices and one color), so you can fit 16,384 divided by 112, rounded down, or 146 instances in a 16 KiB block. To render 3,200 instances, we will need to issue 21 DrawInstanced(146) and a last DrawInstanced(134).
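The window arithmetic above can be sketched as follows. UboWindows and its method names are hypothetical; the 16 KiB window size is the article's readability value, and a real implementation would query it from BatchRendererGroup.GetConstantBufferMaxWindowSize().

```csharp
using System;

// Hypothetical helper: how many instances fit in one UBO window and how
// many windows (and therefore draws) a given instance count needs.
public static class UboWindows
{
    public static int MaxInstancesPerWindow(int windowSizeBytes, int bytesPerInstance)
    {
        return windowSizeBytes / bytesPerInstance; // integer division rounds down
    }

    public static int WindowCount(int instanceCount, int maxPerWindow)
    {
        return (instanceCount + maxPerWindow - 1) / maxPerWindow; // ceiling division
    }

    public static int InstancesInLastWindow(int instanceCount, int maxPerWindow)
    {
        if (instanceCount == 0) return 0;
        int remainder = instanceCount % maxPerWindow;
        return remainder == 0 ? maxPerWindow : remainder;
    }
}
```

With a 16,384-byte window and 112 bytes per instance, this gives 146 instances per window, 22 windows for 3,200 floor cells, and 134 instances in the last window.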

Now, the 350 KiB UBO will be split into 22 window blocks of 16 KiB each, like this:

Figure 4: In GLES, the raw buffer is UBO (not SSBO). Data for 3,200 instances is split into 22 windows. Each DrawInstanced(146) will fetch data from a 16 KiB region. Note that the last window contains 134 instances only, which is why there’s a small gap between the last yellow, green, and blue region.

Tip: In UBO mode, each window offset should be aligned to BatchRendererGroup.GetConstantBufferOffsetAlignment(). Typical alignment values range from 4 to 256 bytes.

In GLES, because of the UBO and the 16 KiB windows, you need to register 22 BatchIDs in order to store the offsets of each window. The initialization code then needs a loop:

    // Register one BatchID per 16KiB window, using the right offsets
    m_batchIDs = new BatchID[m_windowCount];
    for (int b = 0; b < m_windowCount; b++)
    {
        batchMetadata[0] = CreateMetadataValue(objectToWorldID, 0, true);                                   // matrices
        batchMetadata[1] = CreateMetadataValue(worldToObjectID, m_maxInstancePerWindow * 3 * 16, true);     // inverse matrices
        batchMetadata[2] = CreateMetadataValue(colorID, m_maxInstancePerWindow * 3 * 2 * 16, true);         // colors
        int offset = b * m_alignedGPUWindowSize;
        m_batchIDs[b] = m_BatchRendererGroup.AddBatch(batchMetadata, m_GPUPersistentInstanceData.bufferHandle, (uint)offset, (uint)m_alignedGPUWindowSize);
    }

Tip: To support GLES (UBO) and other graphics APIs (SSBO) in the game sample, BRG_Container.cs sets some variables at initialization time. In SSBO mode, m_windowCount is 1 and m_alignedGPUWindowSize is the total buffer size. In UBO mode, m_alignedGPUWindowSize is 16 KiB and m_windowCount contains the number of 16 KiB blocks. (The 16 KiB value is for readability. Use the GetConstantBufferMaxWindowSize() API to get the correct value.)
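One consequence of this windowed layout is that the CPU code writing into the shadow copy has to know where a given instance lives. The sketch below derives those byte addresses from the same offsets used in the AddBatch loop. WindowedLayout is a hypothetical class written for this example, not code from the sample.

```csharp
using System;

// Hypothetical helper: byte address of each property of instance 'i' in a
// raw buffer split into windows (UBO mode) or made of one window (SSBO mode).
public sealed class WindowedLayout
{
    private const int MatrixSize = 48;
    private const int ColorSize = 16;

    private readonly int _maxPerWindow;
    private readonly int _alignedWindowSize;

    public WindowedLayout(int maxInstancesPerWindow, int alignedWindowSize)
    {
        _maxPerWindow = maxInstancesPerWindow;
        _alignedWindowSize = alignedWindowSize;
    }

    private int WindowBase(int i) { return (i / _maxPerWindow) * _alignedWindowSize; }
    private int Local(int i) { return i % _maxPerWindow; }

    public int ObjectToWorldAddress(int i)
    {
        return WindowBase(i) + Local(i) * MatrixSize;
    }

    public int WorldToObjectAddress(int i)
    {
        return WindowBase(i) + _maxPerWindow * MatrixSize + Local(i) * MatrixSize;
    }

    public int ColorAddress(int i)
    {
        return WindowBase(i) + _maxPerWindow * MatrixSize * 2 + Local(i) * ColorSize;
    }
}
```

In UBO mode (146 instances per 16,384-byte window), instance 150 is slot 4 of window 1, so its matrix, inverse matrix, and color sit at bytes 16,576, 23,584, and 30,464. In SSBO mode for the floor (one window of 3,200 instances), the same instance sits at 7,200, 160,800, and 309,600.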

Uploading data

Once the CPU updates all matrices and colors in the system memory, you can upload the data to the GPU. This is done with the BRG_Container.UploadGpuData function. Because of the SoA data model, you can’t always upload a single block of memory when fewer items than the maximum are in use. For debris, the buffer holds up to 16,384 items. In GLES mode, that means 113 windows of 16 KiB each if all 16,384 debris cubes are on screen.

But what if only 5,300 debris cubes are in a given frame? Because you have 146 items per window, the first 36 windows are full and consecutive, so you can upload them with a single SetData (36x16 KiB). The last window holds only the remaining 44 debris cubes. Uploading those 44 matrices, inverse matrices, and colors takes three more SetData commands. In total, four SetData commands are issued.

Up to four GraphicsBuffer.SetData commands are needed to upload N items.

Tip: Even in SSBO mode, if the number of items is less than the max (for example, 5,300 debris over a max of 16,384), three SetData commands are required. You can take a look at BRG_Container.UploadGpuData(int instanceCount) for implementation details.
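The upload ranges described above can be computed with a short sketch. UploadPlanner is a hypothetical helper, not the sample's UploadGpuData implementation; it only lists the byte ranges that would each be passed to one SetData call.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: lists the contiguous byte ranges to upload for
// 'instanceCount' live items in a windowed SoA buffer.
public static class UploadPlanner
{
    private const int MatrixSize = 48;
    private const int ColorSize = 16;

    public static List<(int ByteOffset, int ByteCount)> Plan(
        int instanceCount, int maxPerWindow, int alignedWindowSize)
    {
        var ranges = new List<(int ByteOffset, int ByteCount)>();
        int fullWindows = instanceCount / maxPerWindow;
        int remainder = instanceCount % maxPerWindow;

        // All full windows are consecutive in memory: one upload covers them.
        if (fullWindows > 0)
            ranges.Add((0, fullWindows * alignedWindowSize));

        // A partially filled window needs one upload per property array.
        if (remainder > 0)
        {
            int windowBase = fullWindows * alignedWindowSize;
            ranges.Add((windowBase, remainder * MatrixSize));
            ranges.Add((windowBase + maxPerWindow * MatrixSize, remainder * MatrixSize));
            ranges.Add((windowBase + maxPerWindow * MatrixSize * 2, remainder * ColorSize));
        }
        return ranges;
    }
}
```

For 5,300 debris cubes in UBO mode (146 per 16,384-byte window), the plan has four ranges: one of 589,824 bytes for the 36 full windows, then three ranges for the 44 remaining items. In SSBO mode, where the single window holds all 16,384 items, the same call yields three ranges.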

Main BRG user callback

The main entry point of BRG is the culling callback function you provide at creation time. The prototype looks like:

public JobHandle OnPerformCulling(BatchRendererGroup rendererGroup, BatchCullingContext cullingContext, BatchCullingOutput cullingOutput, IntPtr userContext)

Your code in this callback is responsible for two things:

  1. To generate all draw commands into the output BatchCullingOutput struct
  2. To use (or not) information provided in the BatchCullingContext read-only struct within your own culling code

Note: The callback returns a JobHandle in case you want to launch an async job to perform these operations. The engine will use this to sync at the point the result is needed, so your command generation code won’t block the main thread.

BatchCullingContext contains information like the camera matrix, camera frustum planes, etc. Basically, all the data you need to cull and generate fewer draw commands. In the sample, all objects fit in the camera view (floor cells and debris), so there’s no need to use culling code.

The BatchCullingOutputDrawCommands struct contains various data, including arrays. It’s the user’s responsibility to allocate native memory for those arrays. The engine is responsible for releasing that memory once the data has been consumed (you’re allocating, Unity is responsible for releasing). Memory allocations should use the Allocator.TempJob type.

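    // Allocates native memory for 'count' items of type T.
    // Pointer types require an unsafe context.
    private static unsafe T* Malloc<T>(uint count) where T : unmanaged
    {
        return (T*)UnsafeUtility.Malloc(
            UnsafeUtility.SizeOf<T>() * count,
            UnsafeUtility.AlignOf<T>(),
            Allocator.TempJob);
    }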

The first array you should allocate is the visibility int array. In the sample, as we assume everything is visible, we just fill the visibility int array with incremental values, like {0,1,2,3,4,...}.

Draw commands generation

A BRG draw command is almost a GPU DrawInstanced call. The most important array to allocate and fill is the BatchDrawCommand. Let’s say there are 4,737 debris cubes in the current frame.

m_maxInstancePerWindow is 146 in GLES mode. You can compute the number of draw commands as the ceiling of m_instanceCount divided by m_maxInstancePerWindow, then allocate the buffer:

int drawCommandCount = (m_instanceCount + m_maxInstancePerWindow - 1) / m_maxInstancePerWindow;
drawCommands.drawCommands = Malloc<BatchDrawCommand>((uint)drawCommandCount);

To avoid duplicating similar parameters into several draw commands, BatchCullingOutputDrawCommands has an array of BatchDrawRange structs. You can set up various parameters within BatchDrawRange.filterSettings, like renderingLayerMask, receiveShadows, etc. As all draw commands share the same rendering settings, you can allocate a single BatchDrawRange struct that starts at draw command 0 and covers all drawCommandCount commands.

drawCommands.drawRanges[0] = new BatchDrawRange
{
    drawCommandsBegin = 0,
    drawCommandsCount = (uint)drawCommandCount,
    filterSettings = new BatchFilterSettings
    {
        renderingLayerMask = 1,
        layer = 0,
        motionMode = MotionVectorGenerationMode.Camera,
        shadowCastingMode = m_castShadows ? ShadowCastingMode.On : ShadowCastingMode.Off,
        receiveShadows = true,
        staticShadowCaster = false,
        allDepthSorted = false
    }
};

Then, fill the draw commands. Each BatchDrawCommand contains a meshID, batchID (to know how to use metadata), and materialID. It also contains the starting offset in the visibility int array buffer. As we don't need any frustum culling in our context, we fill the visibility array with {0,1,2,3,...}. All draw commands then refer to the same {0,1,2,3,...} indirection, so each BatchDrawCommand uses 0 as its visibility array starting offset.

The following code allocates and fills all needed draw commands:

drawCommands.drawCommands = Malloc<BatchDrawCommand>((uint)drawCommandCount);
int left = m_instanceCount;
for (int b = 0; b < drawCommandCount; b++)
{
    int inBatchCount = left > maxInstancePerDrawCommand ? maxInstancePerDrawCommand : left;
    drawCommands.drawCommands[b] = new BatchDrawCommand
    {
        visibleOffset = (uint)0,    // all draw commands use the same {0,1,2,3...} visibility int array
        visibleCount = (uint)inBatchCount,
        batchID = m_batchIDs[b],
        materialID = m_materialID,
        meshID = m_meshID,
        submeshIndex = 0,
        splitVisibilityMask = 0xff,
        flags = BatchDrawCommandFlags.None,
        sortingPosition = 0
    };
    left -= inBatchCount;
}
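To see what this loop produces for the 4,737 debris cubes mentioned earlier, here is a Unity-free sketch of the same splitting logic. DrawCommandSplit is a hypothetical name; it only returns the visibleCount each draw command would receive.

```csharp
using System;

// Hypothetical helper mirroring the loop above without any Unity types.
public static class DrawCommandSplit
{
    public static int[] InstanceCounts(int instanceCount, int maxInstancePerDrawCommand)
    {
        // Ceiling division: number of draw commands needed.
        int drawCommandCount =
            (instanceCount + maxInstancePerDrawCommand - 1) / maxInstancePerDrawCommand;

        var counts = new int[drawCommandCount];
        int left = instanceCount;
        for (int b = 0; b < drawCommandCount; b++)
        {
            int inBatchCount = Math.Min(left, maxInstancePerDrawCommand);
            counts[b] = inBatchCount;
            left -= inBatchCount;
        }
        return counts;
    }
}
```

In GLES mode (146 instances per draw), 4,737 debris cubes need 33 draw commands: 32 of 146 instances and a last one of 65.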

Wrapping up: Dive deeper in the forums

Directly driving BatchRendererGroup requires some work. However, it works out-of-the-box without needing custom shaders or additional packages. In some situations, like having to render plenty of CPU simulated objects with custom instantiated properties, BatchRendererGroup is your best friend.

You can download the project from this repository.

You can also visit the forums to discuss additional details on how we used the C# Job System and Burst compiler to handle all animations and interactions at full speed, even on a low-end CPU.
