Indirect rendering in DirectX 12 addresses CPU bottlenecks caused by excessive draw calls by moving command recording from the CPU to the GPU. Instead of the CPU issuing individual draw calls for each sub-mesh, the CPU creates a command buffer containing GPU addresses of data, and the GPU executes these commands directly. This technique involves creating command signatures (similar to root signatures) that define what commands can be in the buffer, grouping render items by Pipeline State Object (PSO) to minimize state changes, and using ExecuteIndirect to dispatch commands. The approach significantly reduces CPU overhead, especially in debug builds, while maintaining the same rendering quality as direct rendering.
Deep Dive
Game Engine Programming 088 - DirectX 12 Draw indirect aka indirect rendering
Hello everyone and welcome to the game engine programming series where we write a game engine from scratch. In the last episode, we improved material handling in the engine so that objects are rendered correctly when we assign materials to different parts of a mesh.
Now we are ready to add the ability to actually assign textures and material properties to any 3D model in the editor. However, I noticed another issue that bothered me enough to postpone working on materials and fix it first.
To demonstrate the problem, have a look at the frame rate in this scene. We can see that it's being rendered four times at a stable 60 frames per second, which is great. Now, let's replace the lab model with the Sponza scene, which we all know and love.
Now all of a sudden the frame rate has dropped to 15 fps. Granted, I'm running the editor in an unoptimized debug build and this doesn't happen in a release build, but it is still an indicator that something isn't quite right. So we have to fix this and therefore we need to guess what's causing it. At first one may think that this is because of the number of polygons in the Sponza model.
However, looking at the file size, we can see that the lab scene is almost 20 times larger. So, I don't think that the GPU is the bottleneck here.
Next, we see that the Sponza model consists of a lot of sub meshes. Here, we can see that there are almost 400 sub meshes, and we know that each sub mesh results in a draw call in the engine. So in order to render four view ports, we are making about 1,200 draw calls, which could really choke the CPU, especially in debug build. To make it even worse, we are using two pass rendering, which doubles the number of draw calls to 2400.
On top of that, we are also calling various API functions, which add even more cost.
Fortunately, with modern APIs, there are several ways available for solving this problem.
One way is to move draw calls to the GPU, which oddly enough is called indirect rendering. So when you tell the CPU to tell the GPU how to render, it's called direct rendering. But when you directly tell the GPU what to draw, it's called indirect rendering. Anyway, to do this, we have to create a command buffer. And instead of recording commands on the CPU, we let the GPU do it. Before filling in the buffer, we have to tell the GPU what's in the buffer. Therefore, we create a command signature. This is very similar to how we make a root signature. So, I'll put them close together. We'll implement this function in a bit.
First, I'll add a helper class similar to what we have for root signatures.
This class helps us fill in the indirect argument descriptions, which is similar to root parameter descriptions.
In case of root signature they are called root parameters and for command signatures they are called indirect arguments.
Here we can see all the types that can be used. Each command signature must end with a draw or dispatch argument in order to execute on the GPU.
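The rule about the final argument can be sketched as a small check. This is only an illustration: `argument_type` is a simplified stand-in for the D3D12 indirect argument types so that the sketch compiles without the Windows SDK, and `is_valid_signature` is a hypothetical helper, not something from the engine or the API. It assumes the rule is "exactly one draw or dispatch argument, placed last".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for the D3D12 indirect argument types (a subset only).
enum class argument_type {
    draw,
    draw_indexed,
    dispatch,
    vertex_buffer_view,
    index_buffer_view,
    constant,
    constant_buffer_view,
    shader_resource_view,
    unordered_access_view,
};

// Draw and dispatch arguments are the ones that actually execute work.
bool is_execute_argument(argument_type type)
{
    return type == argument_type::draw ||
           type == argument_type::draw_indexed ||
           type == argument_type::dispatch;
}

// Hypothetical check: the list must end with a draw or dispatch argument
// and must not contain another one earlier in the list.
bool is_valid_signature(const std::vector<argument_type>& arguments)
{
    if (arguments.empty() || !is_execute_argument(arguments.back())) return false;
    for (std::size_t i = 0; i + 1 < arguments.size(); ++i) {
        if (is_execute_argument(arguments[i])) return false;
    }
    return true;
}
```

A list such as constant buffer view, index buffer view, draw indexed passes this check, while the same list without the trailing draw indexed argument does not.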
Next, we fill in how many arguments are in the buffer. This is done using a command signature description. Here I add a helper class for this as well.
Now we are going to implement this function that creates the command signature from the description.
We add the implementation right after the create root signature function in D3D12Helpers.cpp.
As we can see, this is done simply by calling this API function.
Okay. Now that we can define what commands can be in a buffer, we have to make a data structure for the buffer itself. We are completely free to decide what we put in the buffer as long as it agrees with the command signature that goes with it. Our command buffer contains the GPU addresses of the data that we sent to the shader. In addition, it contains the index buffer view of the sub mesh and the draw indexed argument which as I mentioned earlier must be used in order to execute the command buffer. We add two new functions that we can call for indirect rendering.
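A rough sketch of what one entry in such a command buffer could look like is shown here. The structs mirror the memory layout of the D3D12 index buffer view and draw indexed arguments using plain integer types so that the sketch compiles anywhere. The three GPU address members and their names are assumptions made for illustration; the actual set of addresses depends on the root signature in use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using gpu_virtual_address = std::uint64_t;

// Same layout as an index buffer view: address, size and format (16 bytes).
struct index_buffer_view {
    std::uint64_t buffer_location;
    std::uint32_t size_in_bytes;
    std::uint32_t format;
};

// Same layout as the arguments of an indexed draw (20 bytes).
struct draw_indexed_arguments {
    std::uint32_t index_count_per_instance;
    std::uint32_t instance_count;
    std::uint32_t start_index_location;
    std::int32_t  base_vertex_location;
    std::uint32_t start_instance_location;
};

// One command per sub mesh. The address members are hypothetical examples.
struct indirect_command {
    gpu_virtual_address    per_object_data;   // assumed: per-object constants
    gpu_virtual_address    position_buffer;   // assumed: vertex positions
    gpu_virtual_address    element_buffer;    // assumed: remaining vertex elements
    index_buffer_view      index_buffer;
    draw_indexed_arguments draw_arguments;    // the draw argument comes last
};

static_assert(sizeof(index_buffer_view) == 16, "unexpected padding");
static_assert(sizeof(draw_indexed_arguments) == 20, "unexpected padding");
```

With this layout the members add up to 60 bytes and the compiler pads the struct to 64 because of the 8-byte alignment of the addresses. The stride given to the command signature would then have to be the padded size of the struct, not the sum of the argument sizes.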
Therefore, the old CPU path will remain available.
Before filling the command buffer, we have to group our sub meshes by PSO. I'll add a few data members to the G-pass cache for this purpose. We'll see how they are used in a minute.
We also need a pointer to the command signature for each sub mesh.
We get this pointer as part of the material cache. So it needs to be included in this data type.
Actually, let's make sure we create the signature before using it in the G-pass.
This happens right after we create the root signature.
So, let's add a new array that contains command signatures.
We'll always have the same number of command signatures as root signatures.
And like I said, we create the command signature right after creating the root signature.
Again, this is very similar to how we fill in root parameters except we have two extra elements in the array: one for the index buffer view and one for the draw indexed argument at the end.
We use the helper class to create the command signature.
And finally, we add it to the array of command signatures.
We also release them at shutdown and clear the array.
This should be an array of pointers, of course.
Okay, now we get the command signature along with other material data in the cache. Going back to the G-pass cache, we resize the arrays we just added to match the number of render items and set the pointers within the cache buffer as well.
In order to use a command buffer, we have to actually create a buffer resource on the GPU. This class will take care of that. Here we have a CPU accessible buffer resource and an array of actual commands that we fill in for each sub mesh to be rendered. Then we simply copy the content of this array to the GPU buffer.
Note that this buffer is twice the size of the commands array because we use half of it for the depth prepass and the other half for the G-pass.
As I said, the buffer is CPU accessible.
So, we map it to system memory and remember its address.
Then we add functions to upload commands to the first or second half of the buffer depending on render path. We also add a couple of accessor functions.
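The two-halves idea can be sketched without any GPU code. In this illustration a `std::vector` stands in for the mapped, CPU accessible buffer memory, and `command` is a hypothetical placeholder for the real command struct. The class and function names are made up for the example. Following the description later in the video, the first half holds G-pass commands and the second half holds depth prepass commands.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct command { std::uint32_t item_id; };  // hypothetical placeholder

class command_buffer {
public:
    // The storage is twice as large as the maximum number of commands.
    explicit command_buffer(std::size_t max_commands)
        : _capacity{ max_commands }, _mapped(2 * max_commands) {}

    // G-pass commands go to the first half.
    void upload_gpass_commands(const std::vector<command>& commands) { upload(commands, 0); }
    // Depth prepass commands go to the second half.
    void upload_depth_commands(const std::vector<command>& commands) { upload(commands, _capacity); }

    // Byte offset at which the depth prepass half begins.
    std::size_t depth_offset_in_bytes() const { return _capacity * sizeof(command); }
    const command* mapped_data() const { return _mapped.data(); }

private:
    void upload(const std::vector<command>& commands, std::size_t first_slot)
    {
        assert(commands.size() <= _capacity);
        if (commands.empty()) return;
        std::memcpy(_mapped.data() + first_slot, commands.data(),
                    commands.size() * sizeof(command));
    }

    std::size_t          _capacity;
    std::vector<command> _mapped;  // stands in for the mapped GPU buffer
};
```

In the engine there would be one such object per frame buffer, so that the CPU never overwrites commands the GPU is still reading.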
Finally, we create an instance of this class for each frame buffer. Currently, we are using three frame buffers, so there will be as many buffers.
Now, it's time to group render items by PSO. This way, we avoid changing the GPU state more often than necessary.
In this function, we clear the flags and grouped indices.
The purpose of the flags array is to mark items that already have been added to a group so that we can skip them.
It's also used to mark the first item in a group that uses a different PSO.
We use the lower four bits of the flags byte for the G-pass and the higher four bits for the depth prepass. So basically we use half of each byte, also known as a nibble, for each pass.
The first bit in the nibble is used to flag grouped items whereas the second bit is used to mark the first item that's using a different PSO.
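A minimal sketch of this flag layout could look as follows. The constant and function names are hypothetical, and it assumes that "first bit" means the least significant bit of each nibble.

```cpp
#include <cassert>
#include <cstdint>

namespace flags {
constexpr std::uint8_t grouped = 0x01;  // item already belongs to a group
constexpr std::uint8_t new_pso = 0x02;  // first item of a group with a different PSO
constexpr unsigned gpass_shift = 0;     // G-pass uses the lower nibble
constexpr unsigned depth_shift = 4;     // depth prepass uses the upper nibble
}

// Sets a flag in the nibble selected by shift.
inline void set_flag(std::uint8_t& byte, std::uint8_t flag, unsigned shift)
{
    byte = static_cast<std::uint8_t>(byte | (flag << shift));
}

// Tests a flag in the nibble selected by shift.
inline bool has_flag(std::uint8_t byte, std::uint8_t flag, unsigned shift)
{
    return ((byte >> shift) & flag) != 0;
}
```

Because each pass only touches its own nibble, grouping for the depth prepass does not disturb the flags that were set for the G-pass.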
The purpose of the G-pass and depth grouped indices is to record the item indices within the cache that belong to the same group and therefore use the same PSO.
So we run this code twice and for each pass we loop through all items until all of them have been assigned to a group.
If we didn't exit this for loop that means that we came across another item that uses a different PSO. From this point, we mark every item that hasn't been marked before and is using the same pipeline state. We also add its index to grouped indices array.
This algorithm is linear in case every object uses the same PSO and becomes quadratic if every object uses a different PSO. In most practical cases, I would expect the number of PSOs to be much smaller than the number of objects.
So this should be reasonably fast. You could also use an unordered map to make this more linear in case you expect the number of PSOs to be large. At the end, we record the number of PSOs used for each pass.
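The grouping pass described above can be sketched roughly like this. It is not the engine's code: PSOs are represented by plain integer IDs, a `std::vector<bool>` replaces the flags byte, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct grouping_result {
    std::vector<std::uint32_t> grouped_indices;  // item indices, ordered group by group
    std::vector<std::uint32_t> group_starts;     // where each group begins in grouped_indices
};

grouping_result group_by_pso(const std::vector<std::uint32_t>& pso_ids)
{
    const std::size_t count = pso_ids.size();
    std::vector<bool> grouped(count, false);
    grouping_result result;
    result.grouped_indices.reserve(count);

    for (std::size_t first = 0; first < count; ++first) {
        if (grouped[first]) continue;  // already placed in an earlier group

        // This item uses a PSO we have not seen yet, so it starts a new group.
        result.group_starts.push_back(
            static_cast<std::uint32_t>(result.grouped_indices.size()));
        const std::uint32_t pso = pso_ids[first];

        // Collect every remaining item that uses the same PSO.
        for (std::size_t i = first; i < count; ++i) {
            if (!grouped[i] && pso_ids[i] == pso) {
                grouped[i] = true;
                result.grouped_indices.push_back(static_cast<std::uint32_t>(i));
            }
        }
    }
    return result;
}
```

For the PSO IDs 5, 9, 5, 7, 9, 5 this produces the item order 0, 2, 5, 1, 4, 3 with groups starting at positions 0, 3 and 5. The size of `group_starts` is the number of PSOs used in the pass.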
Next, we need functions that fill in the command buffer for each render pass.
Starting with the depth command buffer, we get the index of each item within the cache.
If the item is the first in the group to use a new PSO, we record the PSO's index as well.
Then we simply fill in the command buffer for each item.
At the end, we copy the buffer to the GPU by calling upload depth commands function.
Recording commands for the G-pass is the same except we also need to provide the lighting information.
Make sure that you change everything from depth to G-pass in this function.
We release the command buffers when shutting down.
Now we are ready to implement the functions for indirect rendering.
In depth prepass function we call prepare render frame as before.
Then we get the command buffer, resize it to match the number of items and group the items by PSO. This will figure out the number of PSOs for both depth prepass and G-pass. We can use this to allocate a small array on the stack where we record PSO indices.
Calling record depth command buffer will write to this array as well as record the commands. Obviously, at this point, we have enough information in order to render groups of items per PSO. For each PSO, we get the cache index of the first item that's using it. This way, we can get the root signature and PSO of the whole group.
Also note that we use the same topology type for all items in the group.
Remember that the primitive topology is part of the pipeline state subobject stream and therefore different topologies would result in different PSOs. So it's guaranteed that items within a group use the same topology type. We can calculate the number of items in a group from the indices at which each group starts. This index is in new PSO indices.
Next, we set the root signature, the pipeline state, and the primitive topology. Then, all we have to do is call execute indirect with the command signature, number of commands, the command buffer, and the offset within the buffer where the command arguments start for this group.
The offset is the index of the group's PSO times the size of command arguments block. Recall that we use the second half of the buffer for depth prepass commands. So in case of depth prepass the offset is with respect to the second half of the buffer.
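The per-group command count and buffer offset can be worked out as in the sketch below. The function and struct names are hypothetical. `group_starts` plays the role of the new PSO indices, and `half_offset` is zero for the G-pass and the byte size of one half of the buffer for the depth prepass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct group_draw {
    std::uint32_t command_count;  // number of commands to execute for this group
    std::uint64_t byte_offset;    // where the group's commands start in the buffer
};

std::vector<group_draw> make_group_draws(const std::vector<std::uint32_t>& group_starts,
                                         std::uint32_t item_count,
                                         std::uint64_t command_size,
                                         std::uint64_t half_offset)
{
    std::vector<group_draw> draws;
    draws.reserve(group_starts.size());
    for (std::size_t g = 0; g < group_starts.size(); ++g) {
        const std::uint32_t begin = group_starts[g];
        // A group ends where the next one starts; the last one ends at item_count.
        const std::uint32_t end =
            (g + 1 < group_starts.size()) ? group_starts[g + 1] : item_count;
        draws.push_back({ end - begin, half_offset + begin * command_size });
    }
    return draws;
}
```

With six items grouped at positions 0, 3 and 5 and a command size of 64 bytes, the G-pass groups start at byte offsets 0, 192 and 320. The depth prepass groups start at the same offsets plus 384, which is the size of the first half.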
The function for the G-pass is the same except we have to use G-pass data.
Also, we use the first half of the command buffer.
Okay, we can still build and run the editor using the old render path. All we have to do in order to use indirect rendering is call the new functions in D3D12Core.cpp.
I'd like to be able to toggle between the two rendering methods. So, I'll add a new compile time definition for turning indirect rendering on and off.
So if this is set to one, we call the indirect functions.
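Such a compile-time switch might look like the following sketch. The macro name `USE_INDIRECT_RENDERING` and the function names are assumptions for illustration, and the functions only return a label instead of rendering anything.

```cpp
#include <cassert>
#include <string>

// Hypothetical macro name; set it to 0 to fall back to the CPU-recorded path.
#ifndef USE_INDIRECT_RENDERING
#define USE_INDIRECT_RENDERING 1
#endif

// Placeholders for the two render paths.
std::string depth_prepass_direct()   { return "direct"; }
std::string depth_prepass_indirect() { return "indirect"; }

// Only the selected path is compiled into the call site.
std::string render_depth_prepass()
{
#if USE_INDIRECT_RENDERING
    return depth_prepass_indirect();
#else
    return depth_prepass_direct();
#endif
}
```

Keeping both paths behind one definition makes it easy to compare the frame rate of the two methods on the same scene.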
Need to move these lines before calling render functions.
Okay, as we see here, nothing changed except now we are using indirect rendering. In order to prove that this is the case, I'll again replace the lab building with the Sponza model.
And as we can see, we are still rendering at 60fps, which means that the CPU bottleneck is removed.
And we also don't have any live objects after shutdown, which means that all resources have been freed correctly.
It's always a good idea to enable GPU validation to see if there are any issues that we are not aware of. And sure enough, there is this error which actually doesn't have anything to do with indirect rendering. It's telling us that we are trying to index outside of the bounds of the resource heap where we keep our texture resource descriptors and we are apparently doing this in the post-process shader. So let's have a look there.
Here we are trying to sample from the specular BRDF texture to render the skybox. But of course, we are not using this in the editor yet. So this index is invalid. I'll disable the skybox rendering for now till we have support for ambient lights in the editor.
Now we get another error which does have to do with our command buffer and it's telling us that GPU-based validation will be less accurate when we change the index buffer view using indirect rendering. Of course, we have to do this in order to set the index buffer for each sub mesh. And I don't understand why it's this way. So, I need to research it a bit more.
In the meantime, we can disable this particular error by setting a filter for it in the debug layer. This is done by defining an array of all error messages that we would like to disable and adding it to the info queue's filters. There.
These are not errors. It's just IntelliSense having special needs again.
And now we don't get any errors anymore.
And everything seems to be working as intended.
Let's disable GPU validation again so it's not slowing down the renderer.
Okay, that's it for today's video. Thank you so much for joining me and I'll see you next time.