December 20, 2020
The Apple M1 available in the MacBook Air, MacBook Pro 13”, and Mac Mini has been the focus of a ton of benchmarking writeups and blog posts about the new chip. The performance overall, and especially performance/watt, that Apple has achieved with the chip is very impressive. As a ray tracing person, what caught my eye the most was the performance AnandTech reported in their CineBench benchmarks. These scores were 1.6x higher than I got on my old Haswell desktop and 2x higher than my new Tiger Lake laptop! I had also been interested in trying out the new ray tracing API for Metal that was announced at WWDC this year, which bears some resemblance to the DirectX, Vulkan, and OptiX GPU ray tracing APIs. So, I decided to pick up a Mac Mini to do some testing on my own interactive path tracing project, ChameleonRT, and to get it running on the new Metal ray tracing API. In this post, we’ll take a look at the new Metal ray tracing API to see how it lines up with DirectX, Vulkan, OptiX and Embree, then we’ll make some fair (and some extremely unfair) ray tracing performance comparisons against the M1.
ChameleonRT is an open source interactive path tracer that I’ve been working on to learn the different ray tracing APIs, and to provide an example or starting point for myself and others working with them in other projects. ChameleonRT provides backends for the GPU ray tracing APIs: DirectX Ray Tracing, Vulkan KHR Ray Tracing, and OptiX 7. Through the work discussed in this post, it also has a new Metal GPU backend. ChameleonRT also has an Embree backend for fast multi-threaded and SIMD accelerated ray tracing on CPUs through Embree, ISPC, and TBB. The rendering code in each backend is nearly identical and they produce almost pixel exact outputs, with some possible small differences due to subtle differences in the ray tracing libraries or shader languages.
ChameleonRT is different from popular ray tracing benchmark applications like CineBench, LuxMark, and Blender Benchmark, in that ChameleonRT is a minimal interactive path tracer. CineBench, LuxMark, and Blender Benchmark are excellent for getting a full picture of what performance to expect from a production film renderer, as they are production renderers. However, there’s a lot more going on in a production renderer than just ray tracing to support the kind of complex geometries, materials, and effects used in film, and the large code bases can be challenging to quickly port to a new architecture or API. ChameleonRT is the exact opposite, supporting just one geometry type, triangle meshes, and one material type, the Disney BRDF. The rest of the code is similarly written to achieve interactive path tracing performance, though I have tried to balance this with how complex the code is to read. For example, the renderers use iterative ray tracing instead of recursion, support only a simple sampling strategy, and don’t support alpha cut-out effects. The use of iterative ray tracing is valuable on both the CPU and GPU, but is especially important on the GPU, as the overhead of recursive ray tracing calls in the pipeline can have a significant performance impact. Similarly, ignoring alpha cut-out effects allows the GPU renderers to skip using an any hit shader (and the Embree one to skip needing an intersection filter). The any hit shader (or intersection filter) would be called during BVH traversal after a candidate intersection with a triangle is found. In the case that the BVH traversal is hardware accelerated, this ends up producing a large amount of back and forth between the fixed function traversal hardware and the shader cores to run the any hit shader during traversal, limiting the hardware’s performance.
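To make the iterative versus recursive distinction concrete, here is a minimal Python sketch on a toy scene where every ray hits a surface that emits and reflects a fixed amount of light. The constants and function names are made up for illustration and this is not ChameleonRT’s code; it only shows how a loop carrying a running path throughput can replace nested calls.

```python
MAX_DEPTH = 5    # maximum number of bounces per path (toy value)
ALBEDO = 0.5     # fraction of light reflected at each bounce (toy value)
EMISSION = 1.0   # light emitted by every surface that is hit (toy value)


def radiance_recursive(depth=0):
    """Recursive form: each bounce is a nested call."""
    if depth >= MAX_DEPTH:
        return 0.0
    return EMISSION + ALBEDO * radiance_recursive(depth + 1)


def radiance_iterative():
    """Iterative form: a loop carries the path throughput instead of a call stack."""
    radiance = 0.0
    throughput = 1.0
    for _ in range(MAX_DEPTH):
        # Light picked up at this bounce, weighted by everything the path
        # has been attenuated by so far.
        radiance += throughput * EMISSION
        # Attenuate the path for the next bounce.
        throughput *= ALBEDO
    return radiance
```

Both functions return the same value. The loop keeps the per-path state (radiance and throughput) in local variables, which is the shape that maps naturally onto a single GPU thread.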
What’s a path tracer? Path tracing is a technique in computer graphics for rendering photo-realistic images by simulating the light transport in the scene. It is a Monte Carlo technique, meaning that it randomly samples light paths emitted from lights in the scene that reach the camera by bouncing off objects in the scene. This is done by tracing the paths in reverse, starting from the camera and tracing the path back through the scene. A large number of light paths need to be sampled to produce a noise-free image. Path tracing is not restricted to photo-realism, and is the core rendering algorithm used in major film renderers, such as Pixar’s RenderMan, Disney’s Hyperion, Autodesk’s Arnold, and Weta’s Manuka. See Wikipedia for more.
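As a small illustration of the Monte Carlo idea, the sketch below estimates a simple hemisphere integral, the integral of cos(theta) over all directions above a surface, whose exact value is pi. It is a hedged stand-in for what a path tracer does at each bounce, not code from ChameleonRT; the function name and the choice of uniform hemisphere sampling are assumptions made for the example.

```python
import math
import random


def estimate_cosine_integral(num_samples, seed=0):
    """Monte Carlo estimate of the hemisphere integral of cos(theta)."""
    rng = random.Random(seed)
    pdf = 1.0 / (2.0 * math.pi)  # density of uniform sampling over the hemisphere
    total = 0.0
    for _ in range(num_samples):
        # For directions sampled uniformly over the hemisphere, cos(theta)
        # (the z coordinate) is itself uniformly distributed in [0, 1).
        cos_theta = rng.random()
        # Standard estimator: integrand divided by the sampling density.
        total += cos_theta / pdf
    return total / num_samples
```

With few samples the estimate is noisy, and it converges toward pi as the sample count grows, which mirrors why a path traced image needs many samples per pixel to become noise-free.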
The new ray tracing API in Metal will be familiar to those who’ve used inline ray tracing in DirectX or Vulkan. For a good introduction to Metal’s API, check out the Discover ray tracing with Metal video from WWDC 2020, and the accompanying sample. Inline ray tracing allows applications to make ray tracing calls in any shader stage, e.g., fragment/pixel shaders, vertex shaders, compute shaders, etc. Inline ray tracing allows ray tracing effects, for example accurate reflections and shadows, to be integrated into the regular rasterization pipeline. Inline ray tracing can also be used from a compute shader to implement a standalone path tracing based renderer. ChameleonRT is a standalone path tracer, and takes the compute shader approach for Metal.
At a high-level, ray tracing in Metal proceeds as follows: you upload your geometry data, build bottom-level primitive acceleration structures over them, then create instances referencing those acceleration structures and build a top-level acceleration structure over the instances. To render the scene you dispatch a compute shader and trace rays against the top-level acceleration structure to get back intersection results. The intersection results provide you with the intersected primitive and geometry IDs, and if using instancing, the instance ID. Your shader can then look up the geometry data for the object that was hit and shade it, after which it can continue tracing the path.
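The lookup step at the end of that flow can be sketched in plain Python. This is only a model of the data relationships (instances referencing bottom-level structures, which contain geometries made of triangles); the class names, the `lookup_triangle` helper, and the translation-only transform are hypothetical simplifications, not Metal API types.

```python
from dataclasses import dataclass


@dataclass
class Geometry:
    vertices: list  # list of (x, y, z) tuples
    indices: list   # list of (i0, i1, i2) vertex index triples


@dataclass
class Instance:
    blas_index: int     # which bottom-level structure this instance references
    translation: tuple  # simplified stand-in for a full transform matrix


@dataclass
class Hit:
    instance_id: int
    geometry_id: int
    primitive_id: int


def lookup_triangle(hit, instances, blas_list):
    """Return the world-space vertices of the triangle a hit refers to."""
    instance = instances[hit.instance_id]
    # A bottom-level structure is modeled as a list of geometries.
    geometry = blas_list[instance.blas_index][hit.geometry_id]
    tx, ty, tz = instance.translation
    triangle = []
    for vertex_index in geometry.indices[hit.primitive_id]:
        x, y, z = geometry.vertices[vertex_index]
        triangle.append((x + tx, y + ty, z + tz))
    return triangle
```

A shader does the same chain of lookups with the IDs in the intersection result, then fetches normals, UVs, and material parameters in the same way before shading.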
At this time Metal only supports inline ray tracing, in contrast to DirectX and Vulkan, which also provide support for ray tracing pipelines. Ray tracing pipelines are used to implement standalone ray tracing renderers, and are the approach ChameleonRT uses in its DirectX and Vulkan backends. OptiX only supports ray tracing pipelines. Ray tracing pipelines require the creation of a Shader Binding Table to provide the API with a table of functions to call when specific objects in the scene are intersected. The SBT can be difficult to set up and debug, and is a common point of difficulty for developers learning these APIs. For more information about the SBT, check out my post “The RTX Shader Binding Table Three Ways”. However, the potential benefit of the SBT is that the GPU can reorder or group function calls to reduce thread divergence. With inline ray tracing, the developer must do this themselves, or do without (check out another video from WWDC20 for information here). Right now, ChameleonRT does not do any reordering to reduce divergence.
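Part of what makes the SBT tricky is the indexing rule that selects which hit group record runs for a given hit. The sketch below follows the rule described in the DirectX Raytracing specification, using Python only to show the arithmetic; the parameter names are my own shorthand for the DXR terms, and the example values are invented.

```python
def hit_group_record_index(ray_contribution, geometry_multiplier,
                           geometry_index, instance_contribution):
    """Index of the hit group record used for a hit, following the DXR rule.

    ray_contribution:      per-TraceRay offset, commonly the ray type
    geometry_multiplier:   per-TraceRay stride, commonly the number of ray types
    geometry_index:        index of the hit geometry within its bottom-level structure
    instance_contribution: per-instance base offset into the table
    """
    return (ray_contribution
            + geometry_multiplier * geometry_index
            + instance_contribution)


def hit_group_record_address(table_start, record_stride, record_index):
    """Byte address of a record given the table start and record stride."""
    return table_start + record_stride * record_index
```

For example, with two ray types (primary = 0, shadow = 1), a shadow ray hitting geometry 2 of an instance whose records start at offset 4 selects record 1 + 2 * 2 + 4 = 9. Getting any one of these terms wrong silently runs the wrong shader, which is why the table is a common source of bugs.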
Those familiar with Metal may know of the previous Metal Performance Shaders ray intersector. The new support for inline ray tracing improves on this by allowing the renderer to move entirely to the GPU. The previous ray intersector worked by taking batches of rays, finding intersections in the scene, and writing these results back out to memory. Multiple compute shaders would need to be dispatched to trace the primary rays and create shadow and secondary rays, then trace the shadow rays and continue the secondary rays while filtering out paths that terminated. This requires significantly more memory traffic and compute dispatches, introducing overhead to the renderer. Such an approach is also not well suited to augmenting traditional rasterization pipelines with ray tracing effects. Approaches based on tracing and sorting rays and intersection information to extract coherence are valuable for complex film renderers, and can be implemented efficiently using inline ray tracing as well. For example, Disney’s Hyperion renderer uses sorting and batching extensively to extract coherent workloads from an otherwise incoherent and divergent distribution of rays. There’s also an excellent higher-level video explaining how this works.
The Metal ray tracing API is quite nice to work with. It bears a lot of similarity to the other ray tracing APIs, but has been streamlined to be easier to use. For example, compare the code required to build and compact a BVH in DirectX or Vulkan with Metal. OptiX’s API provides a similar simplification; here’s the BVH build in OptiX for reference. It’s also nice to have templates and C++ style functionality in the shader language, a feature Metal shares with OptiX, which uses CUDA for the device side code.
This simplicity also has some drawbacks. For example, while in DirectX, Vulkan, and OptiX you have control over where the acceleration structure memory is allocated, Metal makes this allocation for you. As a result, you cannot allocate the acceleration structures on an MTLHeap, and so to make sure they’re available for your rendering pipeline, you must individually mark them as used in a loop instead of with a single call. This can add some overhead if you have a lot of bottom level BVHs, as can be the case in scenes with many instances. It’d be nice to be able to allocate the acceleration structures on a heap, and replace this loop with a single call.
The API is also simplified significantly by not including ray tracing pipelines and only requiring a Shader Binding Table-esque structure when implementing custom geometries or other operations that need to happen during traversal (e.g., alpha cut-outs). Code for managing the SBT setup in DirectX, Vulkan, and OptiX makes up a significant portion of my helper code in each backend, and it’s nice to skip this in the Metal backend, which doesn’t implement custom geometry or alpha cut-out textures. However, in an SBT model the GPU could group or reorder function calls to reduce divergence, but this is less possible with inline ray tracing where this information isn’t available to the driver. Whether the current drivers for DirectX, Vulkan, and OptiX actually do this is a question for the driver teams, but in theory it’s possible. In the end you may find yourself implementing something a bit like the SBT, but a bit simpler, to look up the right data for the primitives/meshes/instances in your scene in argument buffers, as I’ve done in ChameleonRT; and it’s required to implement operations that take place during traversal (custom geometry, alpha cut-out transparency). Overall my comments here on things that could be a bit better are pretty minor.
What caught my interest the most about AnandTech’s CineBench results is that CineBench actually uses Embree for ray tracing, which is a library developed by Intel! Embree is a CPU ray tracing library that provides optimized acceleration structure traversal and primitive intersection kernels, filling a similar demand on the CPU as the new GPU ray tracing APIs do on the GPU. Embree was first released in 2011 and has found widespread adoption across film, scientific visualization, and other domains.
ChameleonRT also implements an Embree backend that makes use of Embree for fast ray traversal and primitive intersection, ISPC for SIMD programming, and TBB for multi-threading. Getting this backend running on the M1 Arm was actually a bit easier than I expected. Syoyo Fujita already maintains an AArch64/NEON port of Embree, Embree-aarch64, which got some updates from the Apple Developer Ecosystem Engineering team to add an AVX2 on NEON backend. Embree is pretty optimized for 8-wide SIMD, so even though this requires using 2 NEON vectors to act as one AVX2 vector, it can provide better performance than a 4-wide backend on NEON.
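The idea of two 4-wide vectors acting as one 8-wide vector can be shown with a conceptual sketch. This is plain Python standing in for SIMD registers, with class names invented for the example; it is not how Embree-aarch64 is implemented, and it says nothing about real performance.

```python
class Vec4:
    """Stand-in for one 4-lane register (e.g., a 128-bit register of floats)."""

    def __init__(self, lanes):
        if len(lanes) != 4:
            raise ValueError("Vec4 needs exactly 4 lanes")
        self.lanes = list(lanes)

    def __add__(self, other):
        return Vec4([a + b for a, b in zip(self.lanes, other.lanes)])

    def __mul__(self, other):
        return Vec4([a * b for a, b in zip(self.lanes, other.lanes)])


class Vec8:
    """8-lane vector emulated with two 4-lane halves."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi

    @classmethod
    def from_lanes(cls, lanes):
        if len(lanes) != 8:
            raise ValueError("Vec8 needs exactly 8 lanes")
        return cls(Vec4(lanes[:4]), Vec4(lanes[4:]))

    def __add__(self, other):
        # One 8-wide add is issued as two independent 4-wide adds.
        return Vec8(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        # Likewise for multiplication: one op per half.
        return Vec8(self.lo * other.lo, self.hi * other.hi)

    def to_list(self):
        return self.lo.lanes + self.hi.lanes
```

Code written against the 8-wide interface, such as traversal kernels tuned for 8-wide BVH nodes, can then run unchanged, with each operation costing two narrower instructions underneath.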
ISPC was the dependency I was most concerned about getting running. ISPC is a compiler for SIMD programming on CPUs, where you write a scalar program that is compiled to run in parallel on the CPU’s SIMD lanes, with one program instance per vector lane. Each program instance processes a different piece of data in parallel, executing in SIMD. It’s somewhat like GLSL or HLSL running on the CPU, where a scalar looking program is executed in parallel using SIMD. However, it turns out that ISPC already ships support for NEON and AArch64, and enabling a macOS + AArch64/NEON target turned out to be as easy as just making this combination of target OS and ISA flags not an error. I have a PR open to merge this support back in to ISPC.
Finally, while TBB is an Intel library for multithreading it doesn’t have any ties to a specific CPU architecture. All that’s needed here is to build TBB from source targeting AArch64.
With ChameleonRT now running on the M1’s CPU and GPU, we’re ready to do some benchmarks and see how the M1 looks on ray traversal performance! I ran these tests on a few different systems that I have access to. On the CPU side, I have an i7-1165G7 in an XPS 13 (thermals can be an issue here), an i7-4790K in an older desktop, and an i9-9920X in an Ubuntu workstation. On the GPU side, I have an RTX 2070. I have the M1 with 16GB RAM, which is in a Mac Mini. Before we begin, here are the CineBench numbers that got me interested in checking out the M1:
|CPU||CineBench R23 Score||Score relative to M1|
|i7-1165G7||4026||0.52x (M1 is ~1.93x faster)|
|i7-4790K||4579||0.59x (M1 is ~1.70x faster)|
|i9-9920X||14793*||1.9x (M1 is ~1.9x slower)|
Table 1: CineBench R23 scores on the systems tested. *The i9-9920X system is running Ubuntu, for which CineBench is not available. This score is reported from CGDirector.com.
For the benchmarks in ChameleonRT we’ll use the two scenes shown below: Sponza and San Miguel. Sponza is a small scene with 262K triangles, San Miguel is a decent size for interactive ray tracing with 9.96M triangles. Both are treated as a triangle soup by the BVH builder, meaning that a single bottom level acceleration structure is built containing all the triangles in the scene. This provides the BVH builder the most information possible about the geometry distribution so that it can build a high-quality acceleration structure. Both scenes are downloaded from Morgan McGuire’s meshes page and reexported in Blender to use OBJ material groups instead of per-triangle materials, as ChameleonRT doesn’t support per-triangle materials. The benchmarks are run rendering a 1280x720 image and run for ~200 frames, after which the average framerate (FPS) and million rays traced per second (MRay/s) are recorded.
Figure 1: The test scenes used in the benchmarks. Sponza (left) has 262K triangles, San Miguel (right) has 9.96M triangles.
The fair comparison to make is against the other mobile CPU, the i7-1165G7 in my XPS 13, using the Embree CPU backend. The comparisons are actually a bit in favor of the M1 here, as the XPS 13 will struggle a bit with thermals during the benchmarks. I also include my old desktop CPU, the i7-4790K, in this comparison since it scores similarly to the i7-1165G7 on CineBench. When running ChameleonRT’s Embree backend for shorter bursts, the XPS 13 is actually able to outperform my desktop, though it slows down after it heats up.
In these benchmarks I show the fastest SIMD configuration for the Intel CPUs. The i7-1165G7 has AVX512, allowing ISPC to run 16 “threads” in parallel and Embree to use up to AVX512. The i7-4790K supports AVX2, allowing ISPC to run 8 “threads” in parallel. On the M1 it is possible to use ISPC to compile a double-pumped NEON target, i.e., 8-wide SIMD execution on the 4-wide NEON registers, however I found that ISPC’s 4-wide target performed best. Embree on the M1 is configured to use AVX2 on NEON, as Embree is quite optimized for 8-wide SIMD.
Table 2: Benchmark results rendering Sponza using the Embree CPU backend.
Table 3: Benchmark results rendering San Miguel using the Embree CPU backend.
For the extremely unfair comparisons, we’ll compare the Metal GPU ray tracing backend on the M1 against DirectX ray tracing on an RTX 2070, and the Embree CPU backend against an i9-9920X. These comparisons are driven by my own curiosity. Since these are the highest end CPU and GPU systems I have access to, it’s interesting to see where the M1 lines up. However, I wouldn’t expect the M1 to provide competitive performance against these systems on this task. The RTX 2070 has a TDP of 175W and hardware accelerated ray tracing, while the entire M1 chip is estimated to be around 20-24W and does not have hardware to accelerate ray tracing. The i9-9920X has a TDP of 165W and support for AVX512 (16-wide SIMD) and AVX2 (8-wide SIMD). As seen in the previous benchmarks, the support for wider SIMD (AVX2) on the old i7-4790K still made it tough competition for the M1. In these benchmarks I found that the i9-9920X performed best when just using AVX2. As discussed by Travis Downs, the use of AVX512 on some CPUs can result in down clocking. Depending on how SIMD-friendly the workload is, it may actually perform better at a higher clock on a narrower SIMD width, which is the case here. Note that on the i7-1165G7, AVX512 performed slightly better than AVX2.
Table 4: Extremely unfair benchmarks on Sponza using the Embree CPU backend.
Table 5: Extremely unfair benchmarks on San Miguel using the Embree CPU backend.
|GPU||FPS||MRay/s|
|RTX 2070 (DirectX Ray Tracing)||135.5||757|
|Apple M1 (Metal)||3.60||20.1|
Table 6: Extremely unfair benchmark results on Sponza using the GPU backends.
|GPU||FPS||MRay/s|
|RTX 2070 (DirectX Ray Tracing)||63.8||362|
|Apple M1 (Metal)||2.06||11.7|
Table 7: Extremely unfair benchmark results on San Miguel using the GPU backends.
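As a sanity check on the GPU numbers in Tables 6 and 7, the two columns can be related to each other with a little arithmetic. The sketch below assumes the first numeric column is the average framerate and the second is MRay/s, and uses the 1280x720 resolution stated earlier; the function names are just for this example.

```python
WIDTH, HEIGHT = 1280, 720

# (frames per second, million rays per second), copied from Tables 6 and 7.
results = {
    ("Sponza", "RTX 2070"): (135.5, 757.0),
    ("Sponza", "Apple M1"): (3.60, 20.1),
    ("San Miguel", "RTX 2070"): (63.8, 362.0),
    ("San Miguel", "Apple M1"): (2.06, 11.7),
}


def rays_per_pixel(fps, mrays_per_sec):
    """Average number of rays traced per pixel in one frame."""
    rays_per_frame = mrays_per_sec * 1e6 / fps
    return rays_per_frame / (WIDTH * HEIGHT)


def speedup(scene):
    """How many times faster the RTX 2070 renders a scene than the M1."""
    return results[(scene, "RTX 2070")][0] / results[(scene, "Apple M1")][0]
```

All four entries work out to roughly 6 rays per pixel per frame, so both GPUs are doing about the same amount of work per frame, and the RTX 2070 comes out roughly 38x faster on Sponza and 31x faster on San Miguel.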
To wrap up, we can look back at the CineBench scores and the performance differences in the fair benchmarks on the CPU. The M1 scored 1.93x higher than the i7-1165G7 on CineBench, but in ChameleonRT’s Embree backend we found it was only 1.16x faster on average across the two scenes tested. Similarly, the M1 scored 1.70x higher than the i7-4790K on CineBench, but was actually 1.05x slower on average across the two scenes. What’s going on here?
It’s important to remember that ChameleonRT is not testing the same thing as CineBench. There’s a lot more going on in a production renderer like CineBench than in a minimal one like ChameleonRT. These other tasks, like intersecting more expensive geometries, evaluating more complex material models, and so on, can add up to a large percentage of the total execution time in a production renderer. ChameleonRT on the other hand has none of this, and is really just benchmarking ray traversal performance. So if we get similar ray traversal performance with Embree on the M1, but other things in CineBench are faster, we can get higher relative scores in CineBench than we do on traversal performance alone.
Overall the M1 is pretty stellar. I’ve been using the Mac Mini as my day to day computer for the past 2 weeks or so, and have been happy with both the performance and the total silence. There’s really something to be said for a practically silent computer: I didn’t hear the fan running across this set of benchmarks or when doing a parallel build of LLVM to build ISPC. What I’d be really excited to see in future M* chips would be support for 8-wide SIMD and hardware accelerated ray tracing. As part of the benchmark setup I ran the Embree backend on each machine with just SSE4 support (4-wide SIMD), and found it to be about 1.6-1.8x slower than when run with AVX2 (8-wide SIMD). Even assuming we get just 1.6x faster on the M1 with 8-wide SIMD, this would put the Embree CPU backend at ~2.75 FPS on Sponza and ~2.06 FPS on San Miguel. Support for hardware accelerated ray tracing on the GPU would also be awesome for improving GPU ray tracing performance. I don’t have a non-RTX GPU around anymore to make a rough performance comparison, but I’d expect pretty substantial speedups (way above 2x). Even the current level of performance is impressive for a lightweight chip, given that it doesn’t hit the same thermal issues as the XPS 13, can provide somewhat better performance on the CPU with 1/4 the SIMD width (4 vs. 16), and has a GPU ray tracing API that provides a 1.6-2x speedup over its CPU in these benchmarks. It’ll be interesting to see what kind of hardware Apple has planned for the high end.
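The back-of-the-envelope estimate above can be written out explicitly. The sketch below works backwards from the projected framerates and the assumed 1.6x scaling to the CPU framerates they imply, then compares those against the Metal results from Tables 6 and 7. The implied CPU values are derived here for illustration and are not separately measured numbers.

```python
SIMD_SCALE = 1.6  # assumed speedup from moving to 8-wide SIMD, as in the text

projected_cpu_fps = {"Sponza": 2.75, "San Miguel": 2.06}  # projections from the text
metal_gpu_fps = {"Sponza": 3.60, "San Miguel": 2.06}      # from Tables 6 and 7


def implied_current_cpu_fps(scene):
    """CPU framerate implied by the projection and the assumed scaling."""
    return projected_cpu_fps[scene] / SIMD_SCALE


def gpu_over_cpu(scene):
    """Speedup of the Metal backend over the implied CPU framerate."""
    return metal_gpu_fps[scene] / implied_current_cpu_fps(scene)
```

This gives implied CPU framerates of about 1.72 FPS on Sponza and 1.29 FPS on San Miguel, and a GPU over CPU speedup of about 2.1x and 1.6x respectively, in line with the 1.6-2x range quoted above.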