Will Usher's Blog

GPU Compute in the Browser at the Speed of Native: WebGPU Marching Cubes

2024-04-22T00:00:00-07:00

WebGPU is a powerful GPU-API for the web, providing support for advanced low-overhead rendering pipelines and GPU compute pipelines. WebGPU’s support for GPU compute shaders and storage buffers are a key distinction from WebGL, which lacks these features, and make it possible to bring powerful GPU-parallel applications to run entirely in the browser. These applications can range from GPGPU (e.g., simulations, data processing/analysis, machine learning, etc.) to GPU compute driven rendering pipelines, and applications across the spectrum.

In this post, we’ll evaluate WebGPU compute performance against native Vulkan by implementing the classic Marching Cubes algorithm in WebGPU. Marching Cubes is a nearly embarrassingly parallel algorithm, with two global reduction steps that must take place to synchronize work items and thread output locations. This makes it a great first GPU-parallel algorithm to try out on a new platform, as it has enough complexity to stress the API in a few different ways beyond simple parallel kernel dispatches but isn’t so complex as to take an substantial amount of time to implement or be bottlenecked by CPU performance.

From 0 to glTF with WebGPU: Rendering the Full glTF Scene

2023-06-24T00:00:00-07:00

Figure 1: In this post, we'll look at how to fix our terribly broken 2 Cylinder Engine. Left: A buggy render of 2CylinderEngine.glb achieved when ignoring the glTF node transformations. Right: The correct rendering with meshes positioned based on the hierarchy of transforms specified in the glTF node tree.

Loading and drawing our first mesh from a glTF file was quite a bit of work in the previous post, but with this core piece in place we can start adding a lot more functionality to our renderer pretty quickly. If you tried loading up glTF files into the renderer from the previous post, you may have noticed that they didn’t look how you expected. This is because glTF files often contain many meshes that make up different parts of the scene geometry, most of which will be missing since we only loaded the first mesh last time. If we just add a simple loop through the meshes to load and draw them all we’ll frequently end up with a scene like the broken engine on the left in the image above. This is because the meshes are reference and transformed by the glTF node hierarchy, and we need to load and handle these nested transformations to render the correct scene shown on the right. The test model we’ll be using for this post is the 2CylinderEngine from the Khronos glTF samples repo, which has nested transformations in its node hierarchy that make it a great test case. So grab 2CylinderEngine.glb and let’s get started!

From 0 to glTF with WebGPU: Rendering the First glTF Mesh

2023-05-16T00:00:00-07:00

Now that we’ve seen how to draw a triangle in the first post and hook up camera controls so we can look around in the second post, we’re at the point where the avocado really hits the screen and we can start drawing our first glTF primitives! I say the avocado hits the screen because that’s the glTF test model we’ll be using. You can grab it from the Khronos glTF samples repo. glTF files come in two flavors (minus other extension specific versions), a standard “.gltf” version that stores the JSON header in one file and binary data and textures in separate files, and a “.glb” version, that combines the JSON header and all binary or texture data into a single file. We’ll be loading .glb files in this series to simplify how many files we have to deal with to get a model into the renderer, so grab the glTF-Binary Avocado.glb and let’s get started!

Figure 1: It takes quite a bit to get Avocado.glb on the screen, but this beautiful image of our expected final (and delicious) result should be enough motivation to keep us going!

From 0 to glTF with WebGPU: Bind Groups - Updated for Chrome 113 Release

2023-04-11T00:00:00-07:00

In this second post of the series we’ll learn about Bind Groups, which let us pass buffers and textures to our shaders. When writing a renderer, we typically have inputs which do not make sense as vertex attributes (e.g., transform matrices, material parameters), or simply cannot be passed as vertex attributes (e.g., textures). Such parameters are instead passed as uniforms in GLSL terms, or root parameters in HLSL terms. The application then associates the desired buffers and textures with the parameters in the shader. In WebGPU, the association of data to parameters is made using Bind Groups. In this post, we’ll use Bind Groups to pass a uniform buffer containing a view transform to our vertex shader, allowing us to add camera controls to our triangle from the previous post. If you haven’t read the updated first post in this series I recommend reading that first, as we’ll continue directly off the code written there.

From 0 to glTF with WebGPU: The First Triangle - Updated for Chrome 113 Release

2023-04-10T00:00:00-07:00

This tutorial is an updated version of my previous one and updates the code listing to match the finalizing WebGPU APIs. If you’ve read the previous version of this tutorial you can skim through the code listings to get up to date. For an easy way to get started, I recommend grabbing my WebGPU + webpack starter template which includes the code from this tutorial. You can start by deleting the code there and rewriting it following this tutorial, or follow along in the code as you read this page. The code for the blog series is also available on GitHub.

WebGPU is a modern graphics API for the web, in development by the major browser vendors. When compared to WebGL, WebGPU provides more direct control over the GPU to allow applications to leverage the hardware more efficiently, similar to Vulkan and DirectX 12. WebGPU also exposes additional GPU capabilities not available in WebGL, such as compute shaders and storage buffers, enabling powerful GPU compute applications to run on the web. As with the switch from OpenGL to Vulkan, WebGPU exposes more complexity to the user than WebGL, though the API strikes a good balance between complexity and usability, and overall is quite nice to work with. In this series, we’ll learn the key aspects of WebGPU from the ground up, with the goal of going from zero to a basic glTF model renderer. This post marks our initial step on this journey, where we’ll setup a WebGPU context and get a triangle on the screen.

A Dive into Ray Tracing Performance on the Apple M1

2020-12-20T00:00:00-08:00

The Apple M1 available in the MacBook Air, MacBook Pro 13”, and Mac Mini has been the focus of a ton of benchmarking writeups and blog posts about the new chip. The performance overall, and especially performance/watt, that Apple has achieved with the chip is very impressive. As a ray tracing person, what caught my eye the most was the performance AnandTech reported in their CineBench benchmarks. These scores were 1.6x higher than I got on my old Haswell desktop and 2x higher than my new Tiger Lake laptop! I had also been interested in trying out the new ray tracing API for Metal that was announced at WWDC this year, which bears some resemblance to the DirectX, Vulkan, and OptiX GPU ray tracing APIs. So, I decided to pick up a Mac Mini to do some testing on my own interactive path tracing project, ChameleonRT, and to get it running on the new Metal ray tracing API. In this post, we’ll take a look at the new Metal ray tracing API to see how it lines up with DirectX, Vulkan, OptiX and Embree, then we’ll make some fair (and some extremely unfair) ray tracing performance comparisons against the M1.

Hardware Accelerated Video Encoding on the Raspberry Pi 4 on Ubuntu 20.04 64-bit

2020-11-15T00:00:00-08:00

I recently picked up a Raspberry Pi 4 8GB model to use for some lightweight server tasks on my home network. After setting up Pi-Hole, OpenVPN, Plex, and Samba, I got curious about using it to re-encode some videos I had. The videos are on an external drive being monitored by Plex and shared on the network by Samba, and some are quite large since they’re at a (likely unnecessarily) high bitrate. Trimming them down would help save a bit of space, and gives me an excuse to play around with Python, FFmpeg, and the Pi’s hardware accelerated video encoder. In this post, I’ll cover how to get FFmpeg setup to use the Pi 4’s video encoding hardware on a 64-bit OS and the little encoding manager/dashboard, FBED, that I put together to monitor the progress of the encoding tasks.

The VIS 2020 Streaming Infrastructure

2020-11-13T00:00:00-08:00

Now that it’s been a bit over a week since VIS 2020 ended I thought I’d write up some information on the streaming infrastructure we used during conference. For context, IEEE VIS 2020, like all conferences this year and likely well into the next, was held as a virtual event. VIS 2020 was “hosted” by the University of Utah, as it was originally planned (pre-COVID) to be held in Salt Lake City. My advisor was one of the co-chairs, and asked if I’d volunteer to be on the Technology Committee for VIS 2020. The role of this committee is to manage the technical aspects of streaming the event. The change to a virtual format brings a lot of challenges, especially when pivoting later in the planning cycle (the past in-person events are typically over a year in the making). However, the virtual format also brings improvements in terms of accessibility, cost to attendees, environmental impact, and archiving.

This post will be one part technical documentation and one part reflection. The feedback we received for VIS 2020 was overwhelmingly positive, and thus I hope that both the technical documentation on how we ran the event and the reflection on what worked and didn’t are helpful to organizers planning virtual events through the next year.

Before we begin, I must of course mention that this was not a solo effort. Alex Bock and Martin Falk were also on the Tech committee and provided valuable advice about their experience running EGEV 2020 as a virtual event earlier this year, which was also well received. We followed the same model for VIS, which aims to keep the feeling of a live conference while reducing surface area for technical issues. I must also mention the amazing work done by Alper Sarikaya, Hendrik Strobelt, Jagoda Walny, and Steve Petruzza on the web committee setting up the virtual conference webpage. The webpage was adapted from mini-conf, originally written by Alexander Rush and Hendrik Strobelt. Alper has written up a blog post about this, so I won’t cover it here. Finally, during the event we had a rotation of about 24 student volunteers who were responsible for managing the streams and assisting presenters with technical issues, without whom the event would not have been possible.

From 0 to glTF with WebGPU: Bind Groups

2020-06-20T00:00:00-07:00

In this second post of the series we’ll learn about Bind Groups, which let us pass buffers and textures to our shaders. When writing a renderer, we typically have inputs which do not make sense as vertex attributes (e.g., transform matrices, material parameters), or simply cannot be passed as vertex attributes (e.g., textures). Such parameters are instead passed as uniforms in GLSL terms, or root parameters in HLSL terms. The application then associates the desired buffers and textures with the parameters in the shader. In WebGPU, the association of data to parameters is made using Bind Groups. In this post, we’ll use Bind Groups to pass a uniform buffer containing a view transform to our vertex shader, allowing us to add camera controls to our triangle from the previous post. If you haven’t read the first post in this series I recommend reading that first, as we’ll continue directly off the code written there.

This post has been updated to reflect changes to the WebGPU API! Please see the updated post.

From 0 to glTF with WebGPU: The First Triangle

2020-06-15T00:00:00-07:00

This post has been updated to reflect changes to the WebGPU API! Please see the updated post.

The RTX Shader Binding Table Three Ways

2019-11-20T00:00:00-08:00

DirectX Ray Tracing, Vulkan’s NV Ray Tracing extension, and OptiX (or collectively, the RTX APIs) build on the same execution model for running user code to trace and process rays. The user creates a Shader Binding Table (SBT), which consists of a set of shader function handles and embedded parameters for these functions. The shaders in the table are executed depending on whether or not a geometry was hit by a ray, and which geometry was hit. When a geometry is hit, a set of parameters specified on both the host and device side of the application combine to determine which shader is executed. The RTX APIs provide a great deal of flexibility in how the SBT can be set up and indexed into during rendering, leaving a number of options open to applications. However, with incorrect SBT access leading to crashes and difficult bugs, sparse examples or documentation, and subtle differences in naming and SBT setup between the APIs, properly setting up and accessing the SBT is an especially thorny part of the RTX APIs for new users.

In this post we’ll look at the similarities and differences of each ray tracing API’s shader binding table to gain a fundamental understanding of the execution model. I’ll then present an interactive tool for constructing the SBT, building a scene which uses it, and executing trace calls on the scene to see which hit groups and miss shaders are called. Finally, we’ll look at how this model can be brought back to the CPU using Embree, to potentially build a unified low-level API for ray tracing.

Updated 5/1/2020: Added discussion on KHR_ray_tracing for Vulkan.

Faster Shadow Rays on RTX

2019-09-06T00:00:00-07:00

To determine if a hit point can be directly lit by a light source in the scene we need to perform a visibility test between the point and the light. In a path tracer we must perform at least one visibility test per hit point to shade the point, or two if we’re using multiple importance sampling (one for the light sample, and one for the BSDF sample). When rendering just ambient occlusion, e.g., for baking occlusion maps, we may send even more shadow rays per hit-point. Fortunately, shadow rays can be relatively cheap to trace, as we don’t care about finding the closest hit point or computing surface shading information, but just whether or not something is intersected by the ray. There are a few options and combinations of ray flags which we can use when deciding how to trace shadow rays on RTX (through DXR, OptiX or Vulkan). I recently learned a method for skipping all hit group shaders (any hit, closest hit) and instead using just the miss shader to determine if the ray is not occluded. This was a bit non-obvious to me, though has been used by others (see Chris Wyman’s Intro to DXR and Sascha Willems’s Ray Tracing Shadows Example After switching to this approach in ChameleonRT I decided to run a small benchmark comparing some of the options for tracing shadow rays. I’ll also discuss an extra trick we can use to simplify the shader binding table setup, which lets us skip creating an occlusion hit group entirely.

Volume Rendering with WebGL

2019-01-13T00:00:00-08:00

Figure 1: Example volume renderings, using the WebGL volume renderer described in this post. Left: A simulation of the spatial probability distribution of electrons in a high potential protein molecule. Right: A CT scan of a Bonsai Tree. Both datasets are from the Open SciVis Datasets repository.

In scientific visualization, volume rendering is widely used to visualize 3D scalar fields. These scalar fields are often uniform grids of values, representing, for example, charge density around a molecule, an MRI or CT scan, air flow around an airplane, etc. Volume rendering is a conceptually straightforward method for turning such data into an image: by sampling the data along rays from the eye and assigning a color and transparency to each sample, we can produce useful and beautiful images of such scalar fields (see Figure 1). In a GPU renderer, these 3D scalar fields are stored as 3D textures; however, in WebGL1 3D textures were not supported, requiring additional hacks to emulate them for volume rendering. Recently, WebGL2 added support for 3D textures, allowing for an elegant and fast volume renderer to be implemented entirely in the browser. In this post we’ll discuss the mathematical background for volume rendering, and how it can be implemented in WebGL2 to create an interactive volume renderer entirely in the browser! Before we start, you can try out the volume renderer described in this post online.

Comments in LaTeX

2018-07-10T00:00:00-07:00

When writing a paper in LaTeX, it’s common to leave notes and comments in the text, either to yourself or your co-authors. I used to write these as just different colored text using \textcolor{...}, with each author assigned a color, or all with the same color. However, with more authors it can get hard to keep picking legible font colors. Futhermore, sometimes just a different color font doesn’t stand out quite as much as I’d like from the rest of the text. More recently I’ve switched to using highlights for the comments, which works well with multiple authors, and helps the comments stand out from the rest of the text. This is easy to do with the soul and xcolor packages.

Distributed Rendering with Rust and Mio

2016-01-02T00:00:00-08:00

In this post we’ll take a look at adding distributed rendering to tray_rust which will let us take advantage of multiple machines when rendering an image, like a compute cluster. To do this we’ll look at options for how to distribute the rendering job across multiple nodes and what sort of communication is needed synchronize their work. We’ll also look into how we can use mio to write an efficient master process that can manage multiple workers effectively.

After implementing a simple technique to distribute the job we’ll discuss the scalability of this approach and possible paths forward to improve it. I’ve also recently written a plugin for Blender so you can easily create your own scenes and will mention a bit on how to run the ray tracer on Google Compute Engine (or AWS EC2) if you want to try out the distributed rendering yourself.

Rendering an Animation in Rust

2015-12-16T00:00:00-08:00

In this post we’ll look at adding a pretty awesome new feature to tray_rust, something I’ve never implemented before: animation! We’ll take a look at a simple way for sampling time in our scene, how we can associate time points with transformations of objects to make them move and how to compute smooth animation paths with B-Splines. Then we’ll wrap up with rendering a really cool animation by using 60 different machines spread across two clusters at my lab.

Porting a Ray Tracer to Rust, part 3

2015-05-15T00:00:00-07:00

It’s been a little while since my last post on tray_rust as I’ve been a busy with classes, but I’ve had a bit of free time to implement some extremely cool features. In this post we’ll look at porting over the path tracing code and adding a bounding volume hierarchy, along with adding support for triangle meshes and measured material data from the MERL BRDF Database introduced by Matusik et al. in 2003 in A Data-Driven Reflectance Model. In the process of implementing the BVH we’ll get a taste of Rust’s generic programming facilities and use them to write a flexible BVH capable of storing any type that can report its bounds. In the spirit of fogleman’s gorgeous Go Gopher in Go we’ll wrap up by rendering the Rust logo in Rust using a model made by Nylithius on BlenderArtists.

If you’ve been following Rust’s development a bit you have probably noticed that the timing of this post is not a coincidence, since Rust 1.0.0 is being released today!

Porting a Ray Tracer to Rust, part 2

2015-01-30T00:00:00-08:00

As mentioned in my previous post I spent the past month-ish working on improving both the rendering capabilities and performance of tray_rust. While it’s not yet capable of path tracing we can at least have light and shadow and shade our objects with diffuse or specularly reflective and/or transmissive materials. Along with this I’ve improved performance by parallelizing the rendering process using Rust’s multithreading capabilities. Although ray tracing is a trivially parallel task there are two pieces of state that must be shared and modified between threads: the pixel/block counter and the framebuffer. With Rust’s strong focus on safety I was worried that I would have to resort to unsafe blocks to share these small pieces of mutable state but I found that the std::sync module provided safe methods for everything I needed and performs quite well. While it’s difficult to compare against tray (my initial C++ version) as the design of tray_rust has diverged quite a bit I’ll put some performance numbers in the multithreading section.

During the past month Rust has also seen some pretty large changes and is currently in its 1.0 alpha release with the first beta fast approaching.

Porting a Ray Tracer to Rust, part 1

2014-12-30T00:00:00-08:00

I’ve decided to port over my physically based ray tracer tray to Rust to finally try out the language with a decent sized project. In the series we’ll work through the implementation of a physically based ray tracer built on the techniques discussed in Physically Based Rendering. I won’t go into a lot of detail about rendering theory or the less exciting implementation details but will focus more on Rust specific concerns and implementation decisions along with comparisons vs. my C++ version. If you’re looking to learn more about ray tracing I highly recommend picking up Physically Based Rendering and working through it. Hopefully throughout the series folks more experienced with Rust can point out mistakes and improvements as well, since I have no experience with Rust prior to this series.

With the intro out of the way, let’s get started! Since it’s the beginning of the series this is my first time really working with Rust and our goal is pretty simple: render a white sphere and save the image.

Postscript 1: Easy Cleanup

2014-08-01T00:00:00-07:00

In this quick postscript we’ll look into a simple way to clean up our various SDL resources with variadic templates and template specialization. This will let us clean up all our resources with a single simple call: cleanup(texA, texB, renderer, window) instead of calling all the corresponding SDL_Destroy/Free* functions, saving ourselves a lot of typing.

We’ll do this by creating a variadic function cleanup that will take the list of SDL resources to be free’d and then define specializations of it for each resource we’ll be passing, eg. for SDL_Window, SDL_Renderer, SDL_Texture and so on.

Postscript 0: Properly Finding Resource Paths

2014-06-16T00:00:00-07:00

In this short postscript we’ll learn how to make use of SDL_GetBasePath to properly resolve the path to our resource directory where we’ll be storing all the assets needed for each lesson. This approach lets us avoid issues with relative paths since it doesn’t depend on where the program working directory is set when it’s run. This functionality was introduced in SDL 2.0.1 so if you haven’t updated to the latest SDL be sure to grab that before getting started.

Lesson 0: CMake

2014-03-06T00:00:00-08:00

CMake is really useful for building the lessons since it lets us generate make files or project files for just about any platform and IDE. It also helps with resolving dependencies (such as SDL2), platform specific configurations and much much more. If you’re unfamiliar with CMake there’s a nice introduction available on their site to help you get started.

Lesson 6: True Type Fonts with SDL_ttf

2013-12-18T00:00:00-08:00

In this lesson we’ll see how to perform basic True Type font rendering with the SDL_ttf extension library. Setting up the library is identical to what we did in Lesson 3 for SDL_image, but just replace “image” with “ttf” (Windows users should also copy the included freetype dll over). So download SDL_ttf, take a peek at the documentation, and let’s get started!

Lesson 5: Clipping Sprite Sheets

2013-08-27T00:00:00-07:00

It’s common in sprite based games to use a larger image file containing many smaller images, such as the tiles for a tileset, instead of having a separate image file for each tile. This type of image is known as a sprite sheet and is very handy to work with since we don’t need to change which texture we’re drawing each time but rather just which subsection of the texture.

Lesson 4: Handling Events

2013-08-20T00:00:00-07:00

In this lesson we’ll learn the basics of reading user input with SDL, in this simple example we’ll interpret any input as the user wanting to quit our application. To read events SDL provides the SDL_Event union and functions to get events from the queue such as SDL_PollEvent. The code for this lesson is built off of the lesson 3 code, if you need that code to start from grab it on Github and let’s get started!

Lesson 3: SDL Extension Libraries

2013-08-18T00:00:00-07:00

Up until now we’ve only been using BMP images as they’re the only type supported by the base SDL library, but being restricted to using BMP images isn’t that great. Fortunately there are a set of SDL extension libraries that add useful features to SDL, such as support for a wide variety of image types through SDL_image. The other available libraries are SDL_ttf which provides TTF rendering support, SDL_net which provides low level networking and SDL_mixer which provides multi-channel audio playback.

Lesson 2: Don't Put Everything in Main

2013-08-17T00:00:00-07:00

In this lesson we’ll begin organizing our texture loading and rendering code from the previous lesson by moving them out of main and placing them into some useful functions. We’ll also write a simple generic SDL error logger and learn how images are positioned and scaled when rendering with SDL.

Lesson 1: Hello World

2013-08-17T00:00:00-07:00

In this lesson we’ll learn how to open a window, create a rendering context and draw an image we’ve loaded to the screen. Grab the BMP we’ll be drawing below and save it somewhere in your project and let’s get started!

Lesson 0: Visual Studio

2013-08-15T00:00:00-07:00

Now that we’ve got the libraries installed we’ll want to create a new project to include and link against SDL. At the end we’ll save this as a template project so in the future we can just load our template and get to work. First we need a new empty C++ project.

Lesson 0: Setting Up SDL

2013-08-15T00:00:00-07:00

The first step is to get the SDL2 development libraries setup on your system, you can download them from the SDL2 downloads page.

Lesson 0: MinGW

2013-08-15T00:00:00-07:00

To build the projects with mingw we’ll be using a lightweight makefile that will set the include and library paths along with linking our dependencies for us. The makefile assumes that you’ve placed the SDL mingw development libraries under C:/SDL2-2.0.0-mingw/ and that you’re using the 32bit version of mingw and the 32bit libraries. You should change this to match your compiler (32/64bit) and the location of your SDL folder. To use makefiles with mingw call mingw32-make.exe in the folder containing the makefile.

If you’re unfamiliar with Makefiles a basic introduction can be found here.

Lesson 0: Mac Command Line

2013-08-15T00:00:00-07:00

To build the projects on OS X we’ll be using a simple makefile that will include the framework for us. The makefile assumes you’ve installed SDL following the instructions in the .dmg file on the SDL2 downloads page and now have it available as a framework.

If you’re unfamiliar with Makefiles a basic introduction can be found here.

Lesson 0: Linux Command Line

2013-08-15T00:00:00-07:00

To build the projects on Linux we’ll be using a simple makefile that will setup the include and library dependencies for us. The makefile assumes that your SDL libraries are installed under /usr/local/lib and the headers are under /usr/local/include. These are the install locations if you built the project through cmake, some more detail on building from source can be found here. If you’ve installed it through your package manager or placed the libraries and headers elsewhere you may need to change these paths to match your installation. You can also check the output of sdl2-config with the --cflags and --libs switches to locate your install, assuming you haven’t moved it.

If you’re unfamiliar with Makefiles a basic introduction can be found here.

Speculative Progressive Raycasting for Memory Constrained Isosurface Visualization of Massive Volumes

2023-10-22T23:00:00-07:00

Fig 1:Interactive full-resolution isosurface visualization of the 2048 × 2048 × 1920 Richtmyer-Meshkov (R-M) data set in the browser. We propose a new GPU algorithm for implicit isosurface rendering that progressively traverses rays through the volume and decompresses data on-demand to minimize its memory footprint. We achieve up to 5.7× reductions in overall memory use and 8.4× reductions in data decompressed compared to the state of the art in memory constrained isosurface extraction, without sacrificing interactivity. At 1280 × 720, the Richtmyer-Meshkov averages 264ms per-pass and 1.2s total on an RTX 3080.

New web technologies have enabled the deployment of powerful GPU-based computational pipelines that run entirely in the web browser, opening a new frontier for accessible scientific visualization applications. However, these new capabilities do not address the memory constraints of lightweight end-user devices encountered when attempting to visualize the massive data sets produced by today’s simulations and data acquisition systems. In this paper, we propose a novel implicit isosurface rendering algorithm for interactive visualization of massive volumes within a small memory footprint. We achieve this by progressively traversing a wavefront of rays through the volume and decompressing blocks of the data on-demand to perform implicit ray-isosurface intersections. The progressively rendered surface is displayed after each pass to improve interactivity. Furthermore, to accelerate rendering and increase GPU utilization, we introduce speculative ray-block intersection into our algorithm, where additional blocks are traversed and intersected speculatively along rays as other rays terminate to exploit additional parallelism in the workload. Our entire pipeline is run in parallel on the GPU to leverage the parallel computing power that is available even on lightweight end-user devices. We compare our algorithm to the state of the art in low-overhead isosurface extraction and demonstrate that it achieves 1.7×–5.7× reductions in memory overhead and up to 8.4× reductions in data decompressed. Continue Reading

Ray Tracing Spherical Harmonics Glyphs

2023-10-05T23:00:00-07:00

Fig 1:Left: We render spherical harmonics glyphs using an efficient and accurate method based on polynomial root finding. A second glyph visualizes uncertainty and we trace secondary rays to render soft shadows. Right: Our method renders large HARDI datasets in real time on a single GPU.

Spherical harmonics glyphs are an established way to visualize high angular resolution diffusion imaging data. Starting from a unit sphere, each point on the surface is scaled according to the value of a linear combination of spherical harmonics basis functions. The resulting glyph visualizes an orientation distribution function. We present an efficient method to render these glyphs using ray tracing. Our method constructs a polynomial whose roots correspond to ray-glyph intersections. This polynomial has degree 2k+2 for spherical harmonics bands 0,2,...,k. We then find all intersections in an efficient and numerically stable fashion through polynomial root finding. Our formulation also gives rise to a simple formula for normal vectors of the glyph. Additionally, we compute a nearly exact axis-aligned bounding box to make ray tracing of these glyphs even more efficient. Since our method finds all intersections for arbitrary rays, it lets us perform sophisticated shading and uncertainty visualization. Compared to prior work, it is faster, more flexible and more accurate. Continue Reading

GraphWaGu: GPU Powered Large Scale Graph Layout Computation and Rendering for the Web

2022-05-08T23:00:00-07:00

Large scale graphs are used to encode data from a variety of application domains such as social networks, the web, biological networks, road maps, and finance. Computing enriching layouts and interactive rendering play an important role in many of these applications. However, producing an efficient and interactive visualization of large graphs remains a major challenge, particularly in the web-browser. Existing state of the art web-based visualization systems such as D3.js, Stardust, and NetV.js struggle to achieve interactive layout and visualization for large scale graphs. In this work, we leverage the latest WebGPU technology to develop GraphWaGu, the first WebGPU-based graph visualization system. WebGPU is a new graphics API that brings the full capabilities of modern GPUs to the web browser. Leveraging the computational capabilities of the GPU using this technology enables GraphWaGu to scale to larger graphs than existing technologies. GraphWaGu embodies both fast parallel rendering and layout creation using modified Frutcherman-Reingold and Barnes-Hut algorithms implemented in WebGPU compute shaders. Experimental results demonstrate that our solution achieves the best performance, scalability, and layout quality when compared to current state of the art web-based graph visualization libraries. All of our source code for the project is available at https://github.com/harp-lab/GraphWaGu . Continue Reading

Design and Evaluation of a GPU Streaming Framework for Visualizing Time-Varying AMR Data

2022-05-08T23:00:00-07:00

Fig 1:The NASA Exajet serves as the motivating use case for our study. The large computational fluid dynamics data set was computed using an adaptive mesh refinement (AMR) code and consists of 656 million cells and 423 time steps. Each time step stores 2.5 GB of data per scalar field. At four scalar fields for density and X/Y/Z velocity components, the full time series occupies over 4 TB. Data sets such as these pose significant challenges for interactive visualization on current GPU workstations. We present and evaluate a prototypical framework targeting GPU workstations that asynchronously streams and renders such data sets at interactive rates and with high quality.

We describe a systematic approach for rendering time-varying simulation data produced by exa-scale simulations, using GPU workstations. The data sets we focus on use adaptive mesh refinement (AMR) to overcome memory bandwidth limitations by representing interesting regions in space with high detail. Particularly, our focus is on data sets where the AMR hierarchy is fixed and does not change over time. Our study is motivated by the NASA Exajet, a large computational fluid dynamics simulation of a civilian cargo aircraft that consists of 423 simulation time steps, each storing 2.5 GB of data per scalar field, amounting to a total of 4 TB. We present strategies for rendering this time series data set with smooth animation and at interactive rates using current generation GPUs. We start with an unoptimized baseline and step by step extend that to support fast streaming updates. Our approach demonstrates how to push current visualization workstations and modern visualization APIs to their limits to achieve interactive visualization of exa-scale time series data sets. Continue Reading

Adaptive Multiresolution Techniques for I/O, Data Layout, and Visualization of Massive Simulations

2021-03-16T23:00:00-07:00

Fig 1:An illustration of how this dissertation's contributions fit together in the broader scope of the end-to-end simulation visualization pipeline. The contributions discussed work together to support massive data sets across the simulation visualization pipeline.

The continuing growth in computational power available on high-performance computing systems has allowed for increasingly higher fidelity simulations. As these simulations grow in scale, the amount of data produced grows correspondingly, challenging existing strategies for I/O and visualization. Moreover, although prior work has sought to achieve high bandwidth I/O at scale or post hoc visualization of massive data sets, treating these two sides of the simulation visualization pipeline independently introduces a bottleneck between them, where data must be converted from the simulation output layout to the layout used for visualization.
The aim of this dissertation is to address key challenges across the simulation visualization pipeline to provide efficient end-to-end support for massive data sets. First, this dissertation proposes an I/O approach for particle data that rebalances the I/O workload on nonuniform distributions by constructing a spatial data structure, improving I/O performance and portability. To eliminate data layout bottlenecks between I/O and visualization, this dissertation proposes a layout for particle data that balances rendering and attribute-query access patterns through a spatial $k$-d tree and fixed size bitmap indices. The layout is constructed quickly when writing the data and requires little additional memory to store. This dissertation proposes a number of approaches to enable visualization of massive data sets at different scales. An asynchronous tile-based processing pipeline is proposed for distributed full-resolution rendering that overlaps compositing and rendering tasks to improve performance. Next, a virtual reality tool for neuron tracing in large connectomics data is proposed. The VR tool employs an intuitive painting metaphor and a real-time page-based data processing system to visualize large data without discomfort. To visualize massive data in compute and memory constrained environments, a GPU parallel isosurface extraction algorithm is proposed for block-compressed data sets. The algorithm is built on a GPU-driven memory management and caching strategy that allows working with compressed data sets entirely on the GPU. Finally, a simulation-oblivious approach to data transfer for loosely coupled in situ visualization is proposed that minimizes the impact of the visualization by offloading data restructuring from the simulation. Continue Reading

Adaptive Spatially Aware I/O for Multiresolution Particle Data Layouts

2021-05-16T23:00:00-07:00

Fig 1:An overview of our adaptive two-phase I/O pipeline. (a) Given the number of particles on each rank, rank 0 constructs the Aggregation Tree to create leaves with similar numbers of particles. Each leaf is assigned to a rank responsible for aggregating the data and writing it to disk. (b) Each rank sends its data to its aggregator. (c) Each aggregator constructs our multiresolution data layout and writes it to disk. (d) The aggregators send the local value ranges and root node bitmaps for each attribute to rank 0, which populates the Aggregation Tree with the bitmaps and writes it out.

Large-scale simulations on nonuniform particle distributions that evolve over time are widely used in cosmology, molecular dynamics, and engineering. Such data are often saved in an unstructured format that neither preserves spatial locality nor provides metadata for accelerating spatial or attribute subset queries, leading to poor performance of visualization tasks. Furthermore, the parallel I/O strategy used typically writes a file per process or a single shared file, neither of which is portable or scalable across different HPC systems. We present a portable technique for scalable, spatially aware adaptive aggregation that preserves spatial locality in the output. We evaluate our approach on two supercomputers, Stampede2 and Summit, and demonstrate that it outperforms prior approaches at scale, achieving up to 2.5× faster writes and reads for nonuniform distributions. Furthermore, the layout written by our method is directly suitable for visual analytics, supporting low-latency reads and attribute-based filtering with little overhead. Continue Reading

Interactive Visualization of Terascale Data in the Browser: Fact or Fiction?

2020-10-24T23:00:00-07:00

Fig 1:Interactive visualization of an isosurface of a ~1TB dataset entirely in the web browser. The full data is a float64 10240×7680×1536 grid computed by a DNS simulation. The isosurface is interactively computed and visualized entirely in the browser using our GPU isosurface computation algorithm for block-compressed data, after applying advanced precision and resolution trade-offs. The surface consists of 137.5M triangles and is computed in 526ms on an RTX 2070 using WebGPU in Chrome and rendered at 30FPS at 1280×720. The original surface, shown in the right split image, consists of 4.3B triangles and was computed with VTK's Flying Edges filter on a quad-socket Xeon server in 78s using 1.3TB of memory.

Information visualization applications have become ubiquitous, in no small part thanks to the ease of wide distribution and deployment to users enabled by the web browser. Scientific visualization applications, relying on native code libraries and parallel processing, have been less suited to such widespread distribution, as browsers do not provide the required libraries or compute capabilities. In this paper, we revisit this gap in visualization technologies and explore how new web technologies, WebAssembly and WebGPU, can be used to deploy powerful visualization solutions for large-scale scientific data in the browser. In particular, we evaluate the programming effort required to bring scientific visualization applications to the browser through these technologies and assess their competitiveness against classic native solutions. As a main example, we present a new GPU-driven isosurface extraction method for block-compressed data sets, that is suitable for interactive isosurface computation on large volumes in resource-constrained environments, such as the browser. We conclude that web browsers are on the verge of becoming a competitive platform for even the most demanding scientific visualization tasks, such as interactive visualization of isosurfaces from a 1TB DNS simulation. We call on researchers and developers to consider investing in a community software stack to ease use of these upcoming browser features to bring accessible scientific visualization to the browser. Continue Reading

Improving the Usability of Virtual Reality Neuron Tracing with Topological Elements

2020-10-26T23:00:00-07:00

Fig 1:Left to right: A connected graph of ridge-like structures is extracted from the Morse-Smale complex (MSC), containing a superset of the possible neuron segments in the data. Our MSC-guided semi-automatic tracing tool enables users to rapidly trace paths and view a live preview as they do so (orange line). When satisfied with the trace, they can add it to the reconstruction (white line).

Researchers in the field of connectomics are working to reconstruct a map of neural connections in the brain in order to understand at a fundamental level how the brain processes information. Constructing this wiring diagram is done by tracing neurons through high-resolution image stacks acquired with fluorescence microscopy imaging techniques. While a large number of automatic tracing algorithms have been proposed, these frequently rely on local features in the data and fail on noisy data or ambiguous cases, requiring time-consuming manual correction. As a result, manual and semi-automatic tracing methods remain the state-of-the-art for creating accurate neuron reconstructions. We propose a new semi-automatic method that uses topological features to guide users in tracing neurons and integrate this method within a virtual reality (VR) framework previously used for manual tracing. Our approach augments both visualization and interaction with topological elements, allowing rapid understanding and tracing of complex morphologies. In our pilot study, neuroscientists demonstrated a strong preference for using our tool over prior approaches, reported less fatigue during tracing, and commended the ability to better understand possible paths and alternatives. Quantitative evaluation of the traces reveals that users' tracing speed increased, while retaining similar accuracy compared to a fully manual approach. Continue Reading

Ray Tracing Structured AMR Data Using ExaBricks

2020-10-26T23:00:00-07:00

Fig 1:The Exajet contains an AMR simulation of air flow around the left side of a plane, and consists of 656M cells (across four refinement levels) plus 63.2M triangles. For rendering we mirror the data set via instancing, resulting in effectively 1.31B instanced cells and 126M instanced triangles. This visualization—rendered with our method—shows flow vorticity and velocity, with an implicitly ray-traced iso-surface of the vorticity (color-mapped by velocity), plus volume ray tracing of the vorticity field. At a resolution of 2500×625, and running on a workstation with two RTX 8000 GPUs, this configuration renders in roughly 252 milliseconds.

Structured Adaptive Mesh Refinement (Structured AMR) enables simulations to adapt the domain resolution to save computation and storage, and has become one of the dominant data representations used by scientific simulations; however, efficiently rendering such data remains a challenge. We present an efficient approach for volume- and iso-surface ray tracing of Structured AMR data on GPU-equipped workstations, using a combination of two different data structures. Together, these data structures allow a ray tracing based renderer to quickly determine which segments along the ray need to be integrated and at what frequency, while also providing quick access to all data values required for a smooth sample reconstruction kernel. Our method makes use of the RTX ray tracing hardware for surface rendering, ray marching, space skipping, and adaptive sampling; and allows for interactive changes to the transfer function and implicit iso-surfacing thresholds. We demonstrate that our method achieves high performance with little memory overhead, enabling interactive high quality rendering of complex AMR data sets on individual GPU workstation. Continue Reading

Accelerating Unstructured Mesh Point Location with RT Cores

2020-11-30T22:00:00-08:00

Fig 1:The Agulhas Current dataset, courtesy Niklas Röber, DKRZ. This image shows simulated ocean currents off the coast of South Africa, represented using cell-centered wedges. When rendered using our hardware accelerated point queries, we see up to a 14.86× performance improvement over a CUDA reference implementation (2.49 FPS vs 37 FPS on an RTX 2080 at 1024×1024).

We present a technique that leverages ray tracing hardware available in recent Nvidia RTX GPUs to solve a problem other than classical ray tracing. Specifically, we demonstrate how to use these units to accelerate the point location of general unstructured elements consisting of both planar and bilinear faces. This unstructured mesh point location problem has previously been challenging to accelerate on GPU architectures; yet, the performance of these queries is crucial to many unstructured volume rendering and compute applications. Starting with a CUDA reference method, we describe and evaluate three approaches that reformulate these point queries to incrementally map algorithmic complexity to these new hardware ray tracing units. Each variant replaces the simpler problem of point queries with a more complex one of ray queries. Initial variants exploit ray tracing cores for accelerated BVH traversal, and subsequent variants use ray-triangle intersections and per-face metadata to detect point-in-element intersections. Although these later variants are more algorithmically complex, they are significantly faster than the reference method thanks to hardware acceleration. Using our approach, we improve the performance of an unstructured volume renderer by up to 4× for tetrahedral meshes and up to 15× for general bilinear element meshes, matching, or out-performing state-of-the-art solutions while simultaneously improving on robustness and ease-of-implementation. Continue Reading

Efficient and Flexible Hierarchical Data Layouts for a Unified Encoding of Scalar Field Precision and Resolution

2020-10-26T23:00:00-07:00

Fig 1:We propose a hierarchical data layout that allows for various forms of progressive decoding that modulate improvements in both precision and resolution. Each progressive decoding traces a monotonic nondecreasing curve in the precision-resolution space from the origin, 0%, to the full data, 100% (shown in (a)). Using a 900 GB turbulent channel flow field (10240×7680×1536, float64) (b), we demonstrate three approximations (c,d,e) of progressively increasing quality decoded along the curve in (a). The time to decode the data and RAM used are shown in the figure; data retrieved values are inclusive of the preceding points along the curve

To address the problem of ever-growing scientific data sizes making data movement a major hindrance to analysis, we introduce a novel encoding for scalar fields: a unified tree of resolution and precision, specifically constructed so that valid cuts correspond to sensible approximations of the original field in the precision-resolution space. Furthermore, we introduce a highly flexible encoding of such trees that forms a parameterized family of data hierarchies. We discuss how different parameter choices lead to different trade-offs in practice, and show how specific choices result in known data representation schemes such as ZFP, IDX, and JPEG2000. Finally, we provide system-level details and empirical evidence on how such hierarchies facilitate common approximate queries with minimal data movement and time, using real-world data sets ranging from a few gigabytes to nearlya terabyte in size. Experiments suggest that our new strategy of combining reductions in resolution and precision is competitive with state-of-the-art compression techniques with respect to data quality, while being significantly more flexible and orders of magnitude faster, and requiring significantly reduced resources Continue Reading

A Virtual Frame Buffer Abstraction for Parallel Rendering of Large Tiled Display Walls

2020-10-26T23:00:00-07:00

Fig 1:Left: The Disney Moana Island rendered remotely with OSPRay's path tracer at full detail using 128 Skylake Xeon (SKX) nodes on Stampede2 and streamed to the 132Mpixel POWERwall display wall, averages 0.2-1.2 FPS. Right: The Boeing 777 model, consisting of 349M triangles, rendered remotely with OSPRay's scivis renderer using 64 Intel Xeon Phi Knight’s Landing nodes on Stampede2 and streamed to the POWERwall, averages 6-7 FPS

We present dw2, a flexible and easy-to-use software infrastructure for interactive rendering of large tiled display walls. Our library represents the tiled display wall as a single virtual screen through a display 'service', which renderers connect to and send image tiles to be displayed, either from an on-site or remote cluster. The display service can be easily configured to support a range of typical network and display hardware configurations; the client library provides a straightforward interface for easy integration into existing renderers. We evaluate the performance of our display wall service in different configurations using a CPU and GPU ray tracer, in both on-site and remote rendering scenarios using multiple display walls. Continue Reading

A Terminology for In Situ Visualization and Analysis Systems

2020-08-13T23:00:00-07:00

The term 'in situ processing' has evolved over the last decade to mean both a specific strategy for visualizing and analyzing data and an umbrella term for a processing paradigm. The resulting confusion makes it difficult for visualization and analysis scientists to communicate with each other and with their stakeholders. To address this problem, a group of over 50 experts convened with the goal of standardizing terminology. This paper summarizes their findings and proposes a new terminology for describing in situ systems. An important finding from this group was that in situ systems are best described via multiple, distinct axes: integration type, proximity, access, division of execution, operation controls, and output type. This paper discusses these axes, evaluates existing systems within the axes, and explores how currently used terms relate to the axes. Continue Reading

Using Hardware Ray Transforms to Accelerate Ray/Primitive Intersections for Long, Thin Primitive Types

2020-07-04T23:00:00-07:00

Fig 1:Three of the models we used for evaluating our method: Left) SciVis2011 contest data set (334.96K rounded cylinders via Quilez-style 'capsules', plus 315.88K triangles). Middle,Right) hair/fur on the Blender Foundation franck and autumn models (2.4M and 3.4M 'phantom' curve segments, plus 249.62K and 904.40K triangles, respectively). For these three models, our method leverages hardware ray transforms to realize a hardware-accelerated OBB culling test, achieving speedup of 1.3×, 2.0×, and 2.1×, respectively, over a traditional (but also hardware-accelerated) AABB-based BVH (both methods use exactly the same primitive intersection codes). Bottom: heat map of number of intersection program evaluations for the two methods, respectively.

With the recent addition of hardware ray tracing capabilities, GPUs have become incredibly efficient at ray tracing both triangular geometry, and instances thereof. However, the bounding volume hierarchies that current ray tracing hardware relies on are known to struggle with long, thin primitives like cylinders and curves, because the axis-aligned bounding boxes that these hierarchies rely on cannot tightly bound such primitives. In this paper, we evaluate the use of RTX ray tracing capabilities to accelerate these primitives by tricking the GPU's instancing units into executing a hardware-accelerated oriented bounding box (OBB) rejection test before calling the user’s intersection program. We show that this can be done with minimal changes to the intersection programs and demonstrate speedups of up to 5.9× on a variety of data sets. Continue Reading

CPU Ray Tracing of Tree-Based Adaptive Mesh Refinement Data

2020-03-24T23:00:00-07:00

Fig 1:High-fidelity visualization (volume and implicit isosurface rendering) of the NASA ExaJet dataset (field: vorticity). This dataset contains 656M cells (1.31B after instancing) of adaptive resolution and 63.2M triangles (126M after instancing). This 2400×600 image is rendered on a workstation with four Intel Xeon E7-8890 v3 CPUs (72 cores, 2.5 GHz) at a framerate of 6.64 FPS. We show that our system has the capability of ray tracing TB-AMR data in combination with advanced shading effects like ambient occlusion and path tracing.

Adaptive mesh refinement (AMR) techniques allow for representing a simulation’s computation domain in an adaptive fashion. Although these techniques have found widespread adoption in high-performance computing simulations, visualizing their data output interactively and without cracks or artifacts remains challenging. In this paper, we present an efficient solution for direct volume rendering and hybrid implicit isosurface ray tracing of tree-based AMR (TB-AMR) data. We propose a novel reconstruction strategy, Generalized Trilinear Interpolation (GTI), to interpolate across AMR level boundaries without cracks or discontinuities in the surface normal. We employ a general sparse octree structure supporting a wide range of AMR data, and use it to accelerate volume rendering, hybrid implicit isosurface rendering and value queries. We demonstrate that our approach achieves artifact-free isosurface and volume rendering and provides higher quality output images compared to existing methods at interactive rendering rates. Continue Reading

A Comparison of Rendering Techniques for 3D Line Sets with Transparency

2020-02-23T22:00:00-08:00

Fig 1:Strengths and weaknesses of transparent line rendering techniques. For each pair, the left image shows the ground truth (GT). Right images show (a) approximate blending using MLABDB, (b) opacity over-estimation of MBOIT, (c) reverse blending order of MLABDB, (d) blur effect of MBOIT. Speed-ups to GT rendering technique: (a) 7, (b) 2, (c) 3.5, (d) 4.5.

This paper presents a comprehensive study of rendering techniques for 3D line sets with transparency. The rendering of transparent lines is widely used for visualizing trajectories of tracer particles in flow fields. Transparency is then used to fade out lines deemed unimportant, based on, for instance, geometric properties or attributes defined along with them. Accurate blending of transparent lines requires rendering the lines in back-to-front or front-to-back order, yet enforcing this order for space-filling 3D line sets with extremely high-depth complexity becomes challenging. In this paper, we study CPU and GPU rendering techniques for transparent 3D line sets. We compare accurate and approximate techniques using optimized implementations and several benchmark data sets. We discuss the effects of data size and transparency on quality, performance, and memory consumption. Based on our study, we propose two improvements to per-pixel fragment lists and multi-layer alpha blending. The first improves the rendering speed via an improved GPU sorting operation, and the second improves rendering quality via transparency-based bucketing. Continue Reading

Scalable Ray Tracing Using the Distributed FrameBuffer

2019-07-09T23:00:00-07:00

Fig 1:Large-scale interactive visualization using the Distributed FrameBuffer. Top left: Image-parallel rendering of two transparent isosurfaces from the Richtmyer-Meshkov (516M triangles), 8FPS with a 2048² framebuffer using 16 Stampede2 Intel Xeon Platinum 8160 SKX nodes. Top right: Data-parallel rendering of the Cosmic Web (29B transparent spheres), 2FPS at 2048² using 128 Theta Intel Xeon Phi Knight's Landing (KNL) nodes. Bottom: Data-parallel rendering of the 951GB DNS volume combined with a transparent isosurface (4.35B triangles), 5FPS at 4096x1024 using 64 Stampede2 Intel Xeon Phi KNL nodes.

Image- and data-parallel rendering across multiple nodes on high-performance computing systems is widely used in visualization to provide higher frame rates, support large data sets, and render data in situ. Specifically for in situ visualization, reducing bottlenecks incurred by the visualization and compositing is of key concern to reduce the overall simulation runtime. Moreover, prior algorithms have been designed to support either image- or data-parallel rendering and impose restrictions on the data distribution, requiring different implementations for each configuration. In this paper, we introduce the Distributed FrameBuffer, an asynchronous image-processing framework for multi-node rendering. We demonstrate that our approach achieves performance superior to the state of the art for common use cases, while providing the flexibility to support a wide range of parallel rendering algorithms and data distributions. By building on this framework, we extend the open-source ray tracing library OSPRay with a data-distributed API, enabling its use in data-distributed and in situ visualization applications. Continue Reading

Ray Tracing Generalized Tube Primitives: Method and Applications

2019-07-09T23:00:00-07:00

Fig 1:Visualizations using our "generalized tube" primitives. (a): DTI tractography data, semi-transparent fixed-radius streamlines (218K line segments). (b): A generated neuron assembly test case, streamlines with varying radii and bifurcations (3.2M l. s.). (c): Aneurysm morphology, semi-transparent streamlines with varying radii and bifurcations (3.9K l. s.) and an opaque center line with fixed radius and bifurcations (3.9K l. s.). (d): A tornado simulation, with radius used to encode the velocity magnitude (3.56M l. s.). (e): Flow past a torus, fixed-radius pathlines (6.5M l. s.). Rendered at: (a) 0.38FPS, (b) 7.2FPS, (c) 0.25FPS, (d) 18.8FPS, with a 2048x2048 framebuffer; (e) 23FPS with a 2048x786 framebuffer. Performance measured on a dual Intel Xeon E5-2640 v4 workstation, with shadows and ambient occlusion.

We present a general high-performance technique for ray tracing generalized tube primitives. Our technique efficiently supports tube primitives with fixed and varying radii, general acyclic graph structures with bifurcations, and correct transparency with interior surface removal. Such tube primitives are widely used in scientific visualization to represent diffusion tensor imaging tractographies, neuron morphologies, and scalar or vector fields of 3D flow. We implement our approach within the OSPRay ray tracing framework, and evaluate it on a range of interactive visualization use cases of fixed- and varying-radius streamlines, pathlines, complex neuron morphologies, and brain tractographies. Our proposed approach provides interactive, high-quality rendering, with low memory overhead. Continue Reading

Efficient Space Skipping and Adaptive Sampling of Unstructured Volumes Using Hardware Accelerated Ray Tracing

2019-10-19T23:00:00-07:00

Fig 1:Performance improvement of our method on the 278 million tetrahedra Japan Earthquake data set. (a) A reference volume ray marcher without our method, at 0.9 FPS (1024² pixels) on an NVIDIA RTX 8000 GPU. (b) A heat map of relative cost per-pixel in (a). (c) and (d), the same, but now with our space skipping and adaptive sampling method, running at 7 FPS (7× faster).

Sample based ray marching is an effective method for direct volume rendering of unstructured meshes. However, sampling such meshes remains expensive, and strategies to reduce the number of samples taken have received relatively little attention. In this paper, we introduce a method for rendering unstructured meshes using a combination of a coarse spatial acceleration structure and hardware-accelerated ray tracing. Our approach enables efficient empty space skipping and adaptive sampling of unstructured meshes, and outperforms a reference ray marcher by up to 7× Continue Reading

RTX Beyond Ray Tracing: Exploring the Use of Hardware Ray Tracing Cores for Tet-Mesh Point Location

2019-07-07T23:00:00-07:00

Fig 1:a-c) Illustrations of the tetrahedral mesh point location kernels evaluated in this paper. a) Our reference method builds a BVH over the tets and performs both BVH traversal and point-in-tet tests in software (black) using CUDA. b) rtx-bvh uses an RTX-accelerated BVH over tets and triggers hardware BVH traversal (green) by tracing infinitesimal rays at the sample points, while still performing point-tet tests in software (black). c) rtx-rep-faces and rtx-shrd-faces use both hardware BVH traversal and triangle intersection (green) by tracing rays against the tetrahedras' faces. d) An image from the unstructured-data volume ray marcher used to evaluate our point location kernels, showing the 35.7M tet Agulhas Current data set rendered interactively on an NVIDIA TITAN RTX (34FPS at 1024² pixels)

We explore a first proof-of-concept example of creatively using the Turing generation's hardware ray tracing cores to solve a problem other than classical ray tracing, specifically, point location in unstructured tetrahedral meshes. Starting with a CUDA reference method, we describe and evaluate three different approaches to reformulate this problem in a manner that allows it to be mapped to these new hardware units. Each variant replaces the simpler problem of point queries with the more complex one of ray queries; however, thanks to hardware acceleration, these approaches are actually faster than the reference method. Continue Reading

Spatially-aware Parallel I/O for Particle Data

2019-08-04T23:00:00-07:00

Fig 1:An illustration of our two-phase I/O approach, which takes spatial locality into consideration.

Particle data are used across a diverse set of large scale simulations, for example, in cosmology, molecular dynamics and combustion. At scale these applications generate tremendous amounts of data, which is often saved in an unstructured format that does not preserve spatial locality; resulting in poor read performance for post-processing analysis and visualization tasks, which typically make spatial queries. In this work, we explore some of the challenges of large scale particle data management, and introduce new techniques to perform scalable, spatially-aware write and read operations. We propose an adaptive aggregation technique to improve the performance of data aggregation, for both uniform and non-uniform particle distributions. Furthermore, we enable efficient read operations by employing a level of detail re-ordering and a multi-resolution layout. Finally, we demonstrate the scalability of our techniques with experiments on large scale simulation workloads up to 256K cores on two different leadership supercomputers, Mira and Theta. Continue Reading

CPU Isosurface Ray Tracing of Adaptive Mesh Refinement Data

2018-12-31T22:00:00-08:00

Fig 1:High-fidelity isosurface visualizations of gigascale block-structured adaptive mesh refinement (BS-AMR) data using our method. Left: a 28GB GR-Chombo simulation of gravitational waves resulting from the collision of two black holes. Middle and Right: a 57GB AMR dataset computed with LAVA at NASA, simulating multiple fields over the landing gear of an aircraft. Middle: isosurface representation of the vorticity, rendered with path tracing. Right: a combined visualization of volume rending and an isosurface of the pressure over the landing gear, rendered with OSPRay's SciVis renderer. Using our approach for ray tracing such AMR data, we can interactively render crack-free implicit isosurfaces in combination with direct volume rendering and advanced shading effects like transparency, ambient occlusion and path tracing.

Adaptive mesh refinement (AMR) is a key technology for large-scale simulations that allows for adaptively changing the simulation mesh resolution, resulting in significant computational and storage savings. However, visualizing such AMR data poses a significant challenge due to the difficulties introduced by the hierarchical representation when reconstructing continuous field values. In this paper, we detail a comprehensive solution for interactive isosurface rendering of block-structured AMR data. We contribute a novel reconstruction strategy—the octant method—which is continuous, adaptive and simple to implement. Furthermore, we present a generally applicable hybrid implicit isosurface ray-tracing method, which provides better rendering quality and performance than the built-in sampling-based approach in OSPRay. Finally, we integrate our octant method and hybrid isosurface geometry into OSPRay as a module, providing the ability to create high-quality interactive visualizations combining volume and isosurface representations of BS-AMR data. We evaluate the rendering performance, memory consumption and quality of our method on two gigascale block-structured AMR datasets. Continue Reading

A Virtual Reality Visualization Tool for Neuron Tracing

2017-12-31T22:00:00-08:00

Fig 1:A screenshot of our VR neuron tracing tool using the isosurface rendering mode. The dark gray floor represents the extent of the tracked space. Users can orient themselves in the dataset via the minimap (right), which shows the world extent in blue, the current focus region in orange, and the previously traced neuronal structures. The focus region is displayed in the center of the space. The 3D interaction and visualization provides an intuitive environment for exploring the data and a natural interface for neuron tracing, resulting in faster, high-quality traces with less fatigue reported by users compared to existing 2D tools.

Tracing neurons in large-scale microscopy data is crucial to establishing a wiring diagram of the brain, which is needed to understand how neural circuits in the brain process information and generate behavior. Automatic techniques often fail for large and complex datasets, and connectomics researchers may spend weeks or months manually tracing neurons using 2D image stacks. We present a design study of a new virtual reality (VR) system, developed in collaboration with trained neuroanatomists, to trace neurons in microscope scans of the visual cortex of primates. We hypothesize that using consumer-grade VR technology to interact with neurons directly in 3D will help neuroscientists better resolve complex cases and enable them to trace neurons faster and with less physical and mental strain. We discuss both the design process and technical challenges in developing an interactive system to navigate and manipulate terabyte-sized image volumes in VR. Using a number of different datasets, we demonstrate that, compared to widely used commercial software, consumer-grade VR presents a promising alternative for scientists. Continue Reading

libIS: A Lightweight Library for Flexible In Transit Visualization

2018-11-11T22:00:00-08:00

Fig 1:Interactive in situ visualization of a 172k atom simulation of silicene formation with 128 LAMMPS ranks sending to 16 OSPRay renderer ranks, all executed on Theta in the mpi-multi configuration. When taking four ambient occlusion samples per-pixel, our viewer averages 7FPS at 1024x1024. Simulation dataset is courtesy of Cherukara et al.

As simulations grow in scale, the need for in situ analysis methods to handle the large data produced grows correspondingly. One desirable approach to in situ visualization is in transit visualization. By decoupling the simulation and visualization code, in transit approaches alleviate common difficulties with regard to the scalability of the analysis, ease of integration, usability, and impact on the simulation. We present libIS, a lightweight, flexible library which lowers the bar for using in transit visualization. Our library works on the concept of abstract regions of space containing data, which are transferred from the simulation to the visualization clients upon request, using a client-server model. We also provide a SENSEI analysis adaptor, which allows for transparent deployment of in transit visualization. We demonstrate the flexibility of our approach on batch analysis and interactive visualization use cases on different HPC resources. Continue Reading

VisIt-OSPRay: Toward an Exascale Volume Visualization System

2018-06-03T23:00:00-07:00

Fig 1:High-quality interactive volume visualization using VisIt-OSPRay: a) volume rendering of O₂ concentration inside a combustion chamber, data courtesy of the University of Utah CCMSC; b) volume rendering of the Richtmyer-Meshkov Instability; c) visualization of a supernova simulation; d) visualization of the aneurysm dataset using volume rendering and streamlines; e) scalable volume rendering of the 966GB DNS data on 64 Stampede2 Intel Xeon Phi Knight's Landing nodes.

Large-scale simulations can easily produce data in excess of what can be efficiently visualized using production visualization software, making it challenging for scientists to gain insights from the results of these simulations. This trend is expected to grow with exascale. To meet this challenge, and run on the highly parallel hardware being deployed on HPC system, rendering systems in production visualization software must be redesigned to perform well at these new scales and levels of parallelism. In this work, we present VisIt-OSPRay, a high-performance, scalable, hybrid-parallel rendering system in VisIt, using OSPRay and IceT, coupled with PIDX for scalable I/O. We examine the scalability and memory efficiency of this system and investigate further areas for improvement to prepare VisIt for upcoming exascale workloads. Continue Reading

Scalable Data Management of the Uintah Simulation Framework for Next-Generation Engineering Problems with Radiation

2018-03-19T23:00:00-07:00

The need to scale next-generation industrial engineering problems to the largest computational platforms presents unique challenges. This paper focuses on data management related problems faced by the Uintah simulation framework at a production scale of 260K processes. Uintah provides a highly scalable asynchronous many-task runtime system, which in this work is used for the modeling of a 1000 megawatt electric (MWe) ultra-supercritical (USC) coal boiler. At 260K processes, we faced both parallel I/O and visualization related challenges, e.g., the default file-per-process I/O approach of Uintah did not scale on Mira. In this paper we present a simple to implement, restructuring based parallel I/O technique. We impose a restructuring step that alters the distribution of data among processes. The goal is to distribute the dataset such that each process holds a larger chunk of data, which is then written to a file independently. This approach finds a middle ground between two of the most common parallel I/O schemes–file per process I/O and shared file I/O–in terms of both the total number of generated files, and the extent of communication involved during the data aggregation phase. To address scalability issues when visualizing the simulation data, we developed a lightweight renderer using OSPRay, which allows scientists to visualize the data interactively at high quality and make production movies. Finally, this work presents a highly efficient and scalable radiation model based on the sweeping method, which significantly outperforms previous approaches in Uintah, like discrete ordinates. The integrated approach allowed the USC boiler problem to run on 260K CPU cores on Mira. Continue Reading

CPU Volume Rendering of Adaptive Mesh Refinement Data

2017-11-26T22:00:00-08:00

Fig 1:Two examples of our method (integrated within the OSPRay ray tracer): Left: 1.8GB Cosmos AMR data, rendered in ParaView. Right: a 57GB NASA Chombo simulation, rendered with ambient occlusion and shadows alongside mesh geometry.

Adaptive Mesh Refinement (AMR) methods are widespread in scientific computing, and visualizing the resulting data with efficient and accurate rendering methods can be vital for enabling interactive data exploration. In this work, we detail a comprehensive solution for directly volume rendering block-structured (Berger-Colella) AMR data in the OSPRay interactive CPU ray tracing framework. In particular, we contribute a general method for representing and traversing AMR data using a kd-tree structure, and four different reconstruction options, one of which in particular (the basis function approach) is novel compared to existing methods. We demonstrate our system on two types of block-structured AMR data and compressed scalar field data, and show how it can be easily used in existing production-ready applications through a prototypical integration in the widely used visualization program ParaView. Continue Reading

Progressive CPU Volume Rendering with Sample Accumulation

2017-06-11T23:00:00-07:00

Fig 1:(a-c) Progressive refinement with Sample-Accumulation Volume Rendering (SAVR) on the 40GB Landing Gear AMR dataset using a prototype AMR sampler. The SAVR algorithm correctly accumulates frames to progressively refine the image. After 16 frames of accumulation the volume is sampled at the Nyquist limit, with some small noise, by 32 frames the noise has been removed. SAVR extends to distributed data, in (d) we show the 1TB DNS dataset, a 10240×7680×1536 uniform grid, rendered interactively across 64 second-generation Intel Xeon Phi "Knights Landing" (KNL) processor nodes on Stampede 1.5 at a 6144×1024 resolution. While interacting, our method achieves around 5.73 FPS.

We present a new method for progressive volume rendering by accumulating object-space samples over successively rendered frames. Existing methods for progressive refinement either use image space methods or average pixels over frames, which can blur features or integrate incorrectly with respect to depth. Our approach stores samples along each ray, accumulates new samples each frame into a buffer, and progressively interleaves and integrates these samples. Though this process requires additional memory, it ensures interactivity and is well suited for CPU architectures with large memory and cache. This approach also extends well to distributed rendering in cluster environments. We implement this technique in Intel’s open source OSPRay CPU ray tracing framework and demonstrate that it is particularly useful for rendering volumetric data with costly sampling functions. Continue Reading

In Situ Exploration of Particle Simulations with CPU Ray Tracing

2016-09-30T23:00:00-07:00

Fig 1:A coal particle combustion simulation in Uintah at three different timesteps with (left to right): 34.61M, 48.46M and 55.39M particles, with attribute based culling showing the full jet (top) and the front in detail (bottom). Using our in situ library to query and send data to our rendering client in OSPRay these images are rendered interactively with ambient occlusion, averaging around 13 FPS at 1920×1080. The renderer is run on 12 nodes of the Stampede supercomputer and pulls data from a Uintah simulation running on 64 processes (4 nodes). Our loosely-coupled in situ approach allows for live exploration at the full temporal fidelity of the simulation, without prohibitive IO cost.

We present a system for interactive in situ visualization of large particle simulations, suitable for general CPU-based HPC architectures. As simulations grow in scale, in situ methods are needed to alleviate IO bottlenecks and visualize data at full spatio-temporal resolution. We use a lightweight loosely-coupled layer serving distributed data from the simulation to a data-parallel renderer running in separate processes. Leveraging the OSPRay ray tracing framework for visualization and balanced P-k-d trees, we can render simulation data in real-time, as they arrive, with negligible memory overhead. This flexible solution allows users to perform exploratory in situ visualization on the same computational resources as the simulation code, on dedicated visualization clusters or remote workstations, via a standalone rendering client that can be connected or disconnected as needed. We evaluate this system on simulations with up to 227M particles in the LAMMPS and Uintah computational frameworks, and show that our approach provides many of the advantages of tightly-coupled systems, with the flexibility to render on a wide variety of remote and co-processing resources. Continue Reading

VTK-m: Accelerating the Visualization Toolkit for Massively Threaded Architectures

2016-05-08T23:00:00-07:00

One of the most critical challenges for high-performance computing (HPC) scientific visualization is execution on massively threaded processors. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Our current production scientific visualization software is not designed for these new types of architectures. To address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architecture. Continue Reading

CPU Ray Tracing Large Particle Data with Balanced P-k-d Trees

2015-10-24T23:00:00-07:00

Fig 1:Full-detail ray tracing of giga-particle data sets. From left to right: CosmicWeb early universe data set from a P3D simulation with 29 billion particles; a 100 million atom molecular dynamics Al₂O₃−SiC materials fracture simulation; and a 1.3 billion particle Uintah MPM detonation simulation. Using a quad-socket, 72-core 2.5GHz Intel Xeon E7-8890 v3 Processor with 3TB RAM and path-tracing with progressive refinement at 1 sample per pixel, these far and close images (above and below) are rendered at 1.6 (far) / 1.0 (close) FPS (left), 2.0 / 1.2 FPS (center), and 1.0 / 0.9 FPS (right), respectively, at 4K (3840×2160) resolution. All examples use our balanced P-k-d tree, an acceleration structure which requires little or no memory cost beyond the original data.

We present a novel approach to rendering large particle data sets from molecular dynamics, astrophysics and other sources. We employ a new data structure adapted from the original balanced k-d tree, which allows for representation of data with trivial or no overhead. In the OSPRay visualization framework, we have developed an efficient CPU algorithm for traversing, classifying and ray tracing these data. Our approach is able to render up to billions of particles on a typical workstation, purely on the CPU, without any approximations or level-of-detail techniques, and optionally with attribute-based color mapping, dynamic range query, and advanced lighting models such as ambient occlusion and path tracing. Continue Reading