(Cross posted to realtimecollisiondetection.net)
If you must support immediate mode rendering in your pipeline, here are some tips for maximizing performance.
Make your primitives as high level as possible
- Prefer baked shapes to procedural primitives.
- Prefer procedurals to strips or point batches.
- Prefer strips or point batches to quads, quads to tris.
For example, HUDs are often drawn with immediate or immediate-like rendering. If you can, bake the HUD so you can submit it as a single primitive. If part of the HUD is highly variable, bake everything except that part, and give the variable part a fixed data size so you can just blit in what you need.
If you must support submission of arbitrary lists of primitives, collate the primitives into baked buffers; don't submit individual primitives to the driver except as an extreme last resort (even in OpenGL).
Treat text blocks and particle systems as single primitives. Maintain a single buffer or set of buffers, and recycle it. There will be an optimal size for this buffer that is a tradeoff between your application's requirements and hardware/driver performance. Profile often.
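As a minimal sketch of the recycled buffer idea, the C++ below accumulates vertices on the CPU and hands them off in one call when the buffer fills or when you flush explicitly. The names BatchBuffer, Vertex, and SubmitFn are hypothetical, and the submit callback stands in for whatever your driver upload and draw call actually is; the capacity is the number you would tune by profiling.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct Vertex { float x, y, u, v; };  // hypothetical vertex layout

// One CPU-side buffer that is filled, submitted as a single batch, and reused.
class BatchBuffer {
public:
    // Stand-in for the real driver submission (assumption, not a real API).
    using SubmitFn = std::function<void(const Vertex*, std::size_t)>;

    BatchBuffer(std::size_t capacity, SubmitFn submit)
        : capacity_(capacity), submit_(std::move(submit)) {
        assert(capacity_ > 0);
        storage_.reserve(capacity_);  // allocate once, recycle afterwards
    }

    // Append a run of vertices; flush first if they would not fit.
    void append(const Vertex* verts, std::size_t count) {
        assert(count <= capacity_);
        if (storage_.size() + count > capacity_) flush();
        storage_.insert(storage_.end(), verts, verts + count);
    }

    // Submit everything accumulated so far as one batch.
    void flush() {
        if (storage_.empty()) return;
        submit_(storage_.data(), storage_.size());
        storage_.clear();  // keeps the allocation for the next batch
    }

private:
    std::size_t capacity_;
    SubmitFn submit_;
    std::vector<Vertex> storage_;
};
```

With a capacity of 8 vertices, appending three quads triggers one automatic flush of 8 vertices, and a final flush() at the end of the phase sends the remaining 4.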
If you have a shape that you will re-use for several frames, bake it.
If you frequently use some shape like a disc, parameterize it, and blit it into your render stream as a high level disc primitive.
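One way to picture a parameterized disc: the caller fills in a small fixed-size record, and the renderer expands it into vertices only when it writes the render stream. The C++ below is a rough sketch under that assumption; DiscPrimitive, Vec2, and appendDiscFan are made-up names, and a triangle fan is just one possible expansion.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };

// Hypothetical high-level primitive: a few parameters instead of a vertex list.
struct DiscPrimitive {
    Vec2 center;
    float radius;
    int segments;  // number of rim subdivisions
};

// Expand the disc into triangle-fan order: the center, then segments + 1 rim
// points (the first rim point is repeated at the end to close the fan).
// Returns the number of vertices appended to 'out'.
std::size_t appendDiscFan(const DiscPrimitive& disc, std::vector<Vec2>& out) {
    assert(disc.segments >= 3);
    const float kTwoPi = 6.28318530718f;
    const std::size_t first = out.size();
    out.push_back(disc.center);
    for (int i = 0; i <= disc.segments; ++i) {
        // i % segments makes the closing point bit-identical to the first one.
        const float angle = kTwoPi * static_cast<float>(i % disc.segments) /
                            static_cast<float>(disc.segments);
        out.push_back({disc.center.x + disc.radius * std::cos(angle),
                       disc.center.y + disc.radius * std::sin(angle)});
    }
    return out.size() - first;
}
```

The caller only ever touches the small DiscPrimitive record, so the expansion strategy (fan, strip, or a baked unit disc plus a transform) can change without affecting submission code.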
Use the notion of materials, never the fixed function pipeline (don't use a model of the FFP either!)
- The driver maintains an internal state machine that tracks your operations against its own state, which represents a virtual model of the fixed function pipeline. In general, all this state tracking is expensive.
- You need to minimize the state changes that are costly - and you will need to measure the hardware to find out what is actually slow. Don't trust what you read in a book or found on the web. It changes frequently.
- Create immediate mode materials that are blocks of expensive state change. Submit these blocks to the pipeline in chunks; don't do it in dribs and drabs.
- Order your renders to minimize invocations of these blocks. For typical immediate modes, you might be able to boil these materials down to a small set such as "unlit, untextured", "unlit, textured", "lit, textured", "shiny lit, textured", "unlit, alpha transparency".
- You also need to know the cost of a texture bind. Depending on whether texture bind or other state change is more expensive you will need to order your draws by texture then sub-order by material, or material then sub-order by texture.
Create an "unbind" material that creates a state change block that undoes the immediate material bind so that you don't mung things up for the rest of your pipeline.
- All the materials should touch all of the same state.
- Only issue the unbind material before you revert to non-immediate rendering.
- Assign each material a number. Order the numbers such that moving from one material to the next in sequence touches the fewest pipeline states. (You'll see why when we get to the keys.) Here, we are relying on the driver internal state tracking to help us minimize the cost of material binding.
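To make the material numbering concrete, here is a small C++ sketch that models each material as a bitmask over the same fixed set of states and counts how many states differ between neighbours in a candidate ordering. The state bits and function names are hypothetical, and the count of toggled bits is only a stand-in for real cost, which you would still have to measure on your hardware.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical state bits. Every material sets or clears all of them.
enum StateBit : std::uint32_t {
    kTexture    = 1u << 0,
    kLighting   = 1u << 1,
    kSpecular   = 1u << 2,
    kAlphaBlend = 1u << 3,
};
constexpr std::uint32_t kAllStates = 0xFu;

// Number of pipeline states that differ between two materials.
inline std::size_t stateChanges(std::uint32_t from, std::uint32_t to) {
    assert(((from | to) & ~kAllStates) == 0);  // only known states allowed
    return std::bitset<32>(from ^ to).count();
}

// Total states touched when binding the materials in the given order.
inline std::size_t sequenceCost(const std::vector<std::uint32_t>& materials) {
    std::size_t cost = 0;
    for (std::size_t i = 1; i < materials.size(); ++i) {
        cost += stateChanges(materials[i - 1], materials[i]);
    }
    return cost;
}
```

For the five example materials, the order "unlit untextured, unlit textured, lit textured, shiny lit textured, unlit alpha" toggles 7 states in total, while moving the alpha material to the front toggles only 4, so the second order would be the better numbering under this simple model.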
Figure out what restrictions you can live with, and pick your restrictions for speed.
- If your application can get away with a text draw, and textured screen facing quads, support only that, and make it screamingly fast.
- So many things you think you need immediate mode for, such as manipulator axes, you can trivially do another way with baked geometry. If you can bake it, do!
- The immediate pipeline is your last resort.
Divide your rendering into phases. These phases are highly dependent on what you need your render engine to do to support your game. A typical phase breakdown might be:
- z-depth pre-pass, non-alpha geo, alpha-geo, post-passes, composite, HUD
Assign a key to each batch of immediate renders. A typical 32 bit key might look like this (where 0 is the MSB):
|0-3:phase||4-11: material||12-16: texture||17-31: rough z|
- highest order bits to indicate render phase
- a few bits to indicate material
- a few bits to indicate a texture page in VRAM
- the rest of the bits for rough z-order. If you want to take advantage of early z rejection, give low numbers to geometry near the camera so it draws first. If you need back-to-front ordering for the painter's algorithm, give high numbers to geometry near the camera.
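The key layout above can be sketched in C++ as follows. The field widths (4, 8, 5, and 15 bits) come straight from the example layout; the function names and the linear depth quantization are assumptions for illustration, and a real engine might well quantize depth differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Field widths from the example layout: 4 + 8 + 5 + 15 = 32 bits.
constexpr std::uint32_t kPhaseBits = 4;
constexpr std::uint32_t kMaterialBits = 8;
constexpr std::uint32_t kTextureBits = 5;
constexpr std::uint32_t kDepthBits = 15;
constexpr std::uint32_t kMaxDepth = (1u << kDepthBits) - 1u;

// Pack the fields with phase in the most significant bits, so sorting the
// keys as plain integers orders by phase, then material, then texture, then z.
inline std::uint32_t makeSortKey(std::uint32_t phase, std::uint32_t material,
                                 std::uint32_t texturePage, std::uint32_t roughZ) {
    assert(phase < (1u << kPhaseBits));
    assert(material < (1u << kMaterialBits));
    assert(texturePage < (1u << kTextureBits));
    assert(roughZ <= kMaxDepth);
    return (phase << (kMaterialBits + kTextureBits + kDepthBits)) |
           (material << (kTextureBits + kDepthBits)) |
           (texturePage << kDepthBits) |
           roughZ;
}

// Quantize a view-space depth in [nearZ, farZ] to 15 bits (assumed linear).
// backToFront = false: near objects get low numbers (front-to-back order).
// backToFront = true:  near objects get high numbers (painter's order).
inline std::uint32_t quantizeDepth(float z, float nearZ, float farZ, bool backToFront) {
    assert(farZ > nearZ);
    const float t = std::min(1.0f, std::max(0.0f, (z - nearZ) / (farZ - nearZ)));
    const std::uint32_t q =
        static_cast<std::uint32_t>(t * static_cast<float>(kMaxDepth) + 0.5f);
    return backToFront ? kMaxDepth - q : q;
}
```

Because the phase occupies the top bits, no comparison function is needed beyond integer ordering, which is what makes a radix sort on the key practical.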
Use sorted submission lists keyed on this value, and merge sort them during the just-in-time final push to hardware, as discussed earlier in the Input Latency thread. If you can keep the submission lists persistent from frame to frame, a radix sort works well for the individual lists before the merge sort. I am assuming that the rest of your render engine is similarly broken into phases, so that the immediate lists can be interleaved with incoming draws from other threads.
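A rough C++ sketch of the sort-then-merge step might look like the following. DrawItem and both function names are hypothetical; the radix sort is a plain four-pass byte-wise LSD sort, and the merge uses std::merge on two lists for brevity, whereas a real engine would merge as many per-thread lists as it has.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical submission record: sort key plus a handle to the batch.
struct DrawItem {
    std::uint32_t key;
    std::uint32_t payload;
};

inline bool keyLess(const DrawItem& a, const DrawItem& b) { return a.key < b.key; }

// Stable LSD radix sort on the 32-bit key, one byte per pass.
void radixSortByKey(std::vector<DrawItem>& items) {
    std::vector<DrawItem> scratch(items.size());
    for (std::uint32_t shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 257> count{};
        for (const DrawItem& it : items) ++count[((it.key >> shift) & 0xFFu) + 1];
        // Prefix sum: count[d] becomes the first output slot for digit d.
        for (std::size_t i = 1; i < count.size(); ++i) count[i] += count[i - 1];
        for (const DrawItem& it : items) scratch[count[(it.key >> shift) & 0xFFu]++] = it;
        items.swap(scratch);  // four passes, so the result ends up in 'items'
    }
}

// Merge two already-sorted submission lists into final submission order.
std::vector<DrawItem> mergeSubmissionLists(const std::vector<DrawItem>& a,
                                           const std::vector<DrawItem>& b) {
    assert(std::is_sorted(a.begin(), a.end(), keyLess));
    assert(std::is_sorted(b.begin(), b.end(), keyLess));
    std::vector<DrawItem> out;
    out.reserve(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out), keyLess);
    return out;
}
```

Each producer sorts its own list with radixSortByKey, and the final push walks the merged result in key order.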
Note that not all immediate rendering needs to be accumulated in submissions. The non-alpha geo phase in particular can be submitted interleaved with regular rendering for minimal latency. Unless you want to insert fences and syncs into your pipeline, all other order dependent phases will need to be accumulated into buffers. Fences and syncs can help you minimize RAM usage because the pipeline can demand that the engine cough up render requests at the right time, but you run the risk of stalling everything.
On some architectures, you can chain in render primitives without copying the data into a GPU queue; in that case you're golden because you can just make DMA thread through all your drawing. The major caveat there is to make sure the data isn't discarded or overwritten until the DMA has chased past its end. With careful structuring, you can save large amounts of memory for things like water surfaces if you can queue up the next frame of simulation for the precise moment the DMA is done with the last frame's vertices.