Lost Planet Multithreaded Rendering

Greetz to visitors from Beyond3D!

Capcom has an internally developed tool suite used to ship Lost Planet, Dead Rising, and other games. It is currently being used to develop Devil May Cry 4 and BioHazard 5. The tool runs identically on PC, PS3, and X360, which is no small feat considering the lack of GUI toolkits on those platforms and their limited memory. This article is based on an interview with Satoshi Ishida written by Osamu Nishikawa on Game Watch. If you can read Japanese, I recommend having a look at the article; it goes into fantastic amounts of detail on Lost Planet's rendering subsystems.

Christer Ericson has gone into some detail on the GPU-specific portions of this article. A literal translation of the article can be found on Beyond3D's forums. Beyond3D has also posted write-ups: part 1 and part 2.

Screenshot (c) 2006 Capcom

The primary focus of their engine is to maximize parallelism. It divides work into three categories: "module", "loop", and "task".

  • Modules are things which must execute on the main thread. Modules include rendering, sound, collision, motion, simulation, path-finding, and AI.
  • Loops are independently running jobs that have little coupling with other game systems.
  • Tasks are small discrete jobs that are queued for execution on available execution threads. Tasks execute every frame, and include player logic, enemy AI, fighting logic, the camera, lighting assignment logic, particles, and so on.

The task manager allows the traditionally serial game logic to be broken down into work that runs in parallel. As each job thread finishes a task, it pulls the next one from the queue. There are two queues: one contains tasks that must occur in order, and the other contains tasks that can occur in any sequence. This distinction ensures that data dependencies between tasks are not violated. The PC and X360 work as follows:
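Independent of any particular platform, the two-queue idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not Capcom's code: the `TaskManager` class, its methods, and the task names are all hypothetical, and I've chosen to drain the ordered queue on a single thread (which trivially preserves order) while a pool of workers drains the unordered queue.

```python
import queue
import threading


class TaskManager:
    """Toy per-frame task manager (hypothetical API, for illustration only)."""

    def __init__(self, num_workers=4):
        self.num_workers = num_workers
        self.ordered = queue.Queue()    # tasks that must run in FIFO order
        self.unordered = queue.Queue()  # tasks that may run in any sequence

    def add(self, task, ordered=False):
        (self.ordered if ordered else self.unordered).put(task)

    @staticmethod
    def _drain(q):
        # Pull and run tasks until the queue is empty.
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return
            task()

    def run_frame(self):
        # Assumption: one thread serializes the ordered queue, while
        # num_workers threads share the unordered queue.
        threads = [threading.Thread(target=self._drain, args=(self.ordered,))]
        threads += [
            threading.Thread(target=self._drain, args=(self.unordered,))
            for _ in range(self.num_workers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()


# Usage with made-up task names: each task just records that it ran.
log = []
log_lock = threading.Lock()


def record(name):
    def task():
        with log_lock:
            log.append(name)
    return task


ORDERED_NAMES = ["collision", "ik_foot_plant", "cloth"]
UNORDERED_NAMES = ["particles", "camera", "lighting"]

tm = TaskManager(num_workers=4)
for name in ORDERED_NAMES:
    tm.add(record(name), ordered=True)
for name in UNORDERED_NAMES:
    tm.add(record(name))
tm.run_frame()

# The ordered tasks always appear in submission order; the rest may interleave.
ordered_seen = [n for n in log if n in ORDERED_NAMES]
```

A real engine would keep the worker threads alive across frames rather than spawning them every frame; creating them inside `run_frame` just keeps the sketch short.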

In that article, Ishida claims that in their experience with their game engine, one core on a game console delivers roughly 2/3 of the performance of a Pentium 4 at the same clock speed. He claims his game engine gets roughly the same performance on X360 as on a dual core Pentium 4 Extreme Edition 3.2 GHz. He further claims that their system hides L2 cache refresh very well, and also hides contention for GPU memory.

On X360, they allocate one thread for task management and four threads for tasks, while rendering and sound share one hardware thread as two software threads. Further accommodations are necessary for PS3 due to the greater number of cores, and the possibility of symmetric multiprocessing where tasks can be moved between processors to balance workload. Their PS3 engine attempts to order the processing of tasks to conceal thread switching and SPE job time.

Notice that the CPU Threads don't run concurrently, but the SPU sub threads do. Each PPU thread needs to have a logical PPU thread. There's one PPU, so all DMA marshalling occurs serially on the PPU. (There is an error in the diagram where I show the PPU threads overlapping. I'll fix that as soon as I can...) More information about Cell threads can be found here.

This architecture allows comparable performance on all their target platforms, with a single game-side API. It further allows good load balancing, yielding little dead time on the various cores, no matter which system the code is executed on.

In order to accommodate synthesized animation, collision resolution is done first. All characters get dependent tasks farmed out for IK foot plant and self collision. Local physics for hair, clothes, and accessories are also distributed as tasks. The collision engine is simplified by restricting primitives to capsules and parallelepipeds.

Rendering occurs on a dedicated thread running in parallel with the other threads. Within the rendering thread there are many dependencies to account for. Transparency must be ordered in depth, and reflection maps and shadow maps must be completed before they are needed. Post-processing commands also have their appropriate place in the list. Drawing commands are added to an intermediate drawing command queue, where coherence between commands is managed before conversion to actual draw calls. The structure of a 64-bit command word is as follows:

 

  • Scene refers to parts of the screen such as viewports, overlays, and so on.
  • Sub-scene refers to the dependent components of a scene, such as shadow maps and reflection maps.
  • Pass refers to the render pass - z-prepass, visibility pass, occlusion pass, filtering, etc.
  • Sub-priority indicates the ordering within a pass. This is used by the programmer to optimally schedule materials and other rendering components beforehand.
  • Command identifies the rendering operation to be performed.
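To make the idea of a sortable command word concrete, here is a small sketch of packing those fields into one 64-bit integer. Treat the layout as my assumption: the field order follows the list above, but the bit widths (4/4/8/16/32) are invented for illustration, and `pack`, `unpack`, and `sort_key` are hypothetical helper names. The only property the sketch relies on is that the ordering fields sit in the upper 32 bits and the command in the lower 32.

```python
# Hypothetical layout, most significant field first. The field order follows
# the list above; the bit widths are made up for illustration.
FIELDS = (
    ("scene", 4),
    ("sub_scene", 4),
    ("render_pass", 8),
    ("sub_priority", 16),
    ("command", 32),
)
assert sum(width for _, width in FIELDS) == 64


def pack(**values):
    """Pack the named fields into a single 64-bit integer."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split a 64-bit command word back into its named fields."""
    out = {}
    shift = 64
    for name, width in FIELDS:
        shift -= width
        out[name] = (word >> shift) & ((1 << width) - 1)
    return out


def sort_key(word):
    """Upper 32 bits: scene, sub-scene, pass, and sub-priority."""
    return word >> 32


# A shadow-map command (sub-scene 0, pass 1) and a main-view command
# (sub-scene 1, pass 2). The numeric values are arbitrary examples.
shadow = pack(scene=0, sub_scene=0, render_pass=1, sub_priority=900, command=7)
opaque = pack(scene=0, sub_scene=1, render_pass=2, sub_priority=5, command=0xFFFF)

# Sorting by the key orders by scene, then sub-scene, then pass, then
# sub-priority, regardless of what the command payload contains.
draw_order = sorted([opaque, shadow], key=sort_key)
```

Because the most significant fields dominate the comparison, a plain integer sort on the key gives the scene / sub-scene / pass / sub-priority ordering with no custom comparator.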

Each thread has a private buffer in which draw commands are prepared. First, the private buffers are sorted using the first 32 bits of the command word as a key. The main thread uses a parallel merge sort to collect these buffers into a single buffer. Finally, the draw routine takes the merged list and issues its contents to the GPU. This structure allows maximal utilization of available processors and threads.

In the next stage, the individual queues are further merge sorted to prepare the final rendering list.
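The sort-then-merge flow can be sketched as follows. Two caveats: I'm assuming "the first 32 bits" means the upper 32 bits of each 64-bit word, and where the article describes a parallel merge sort, this sketch substitutes a simple sequential k-way merge from Python's standard library. The function names and sample values are hypothetical.

```python
import heapq


def sort_key(word):
    # Assumption: the sort key is the upper 32 bits of the command word.
    return word >> 32


def merge_thread_buffers(buffers):
    """Sort each thread's private buffer, then merge them into one list."""
    # Step 1: each private buffer is sorted on its own. In an engine this
    # would happen on the thread that owns the buffer.
    sorted_buffers = [sorted(buf, key=sort_key) for buf in buffers]
    # Step 2: a k-way merge of the already sorted buffers. This is a
    # sequential stand-in for the parallel merge sort described above.
    return list(heapq.merge(*sorted_buffers, key=sort_key))


# Three made-up per-thread buffers; the upper 32 bits hold the key and the
# lower bits hold an arbitrary command payload.
thread_buffers = [
    [(3 << 32) | 0xA, (1 << 32) | 0xB],
    [(2 << 32) | 0xC],
    [(1 << 32) | 0xD, (4 << 32) | 0xE],
]

merged = merge_thread_buffers(thread_buffers)
merged_keys = [sort_key(word) for word in merged]
```

The draw routine would then walk `merged` from front to back and issue each command to the GPU.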

Note: I've done my best to translate details accurately. Feel free to post corrections in the comments.


Content by Nick Porcino (c) 1990-2011