Low Latency Rendering Acceleration With Multiple Graphics Pipes

Bill Baxter
Department of Computer Science
University of North Carolina at Chapel Hill
May 8, 1999

Abstract

We present two techniques for dramatically accelerating the rendering of complex models on high end graphics hardware through the use of parallel graphics pipes. One technique is image based, the other polygon based. Both techniques incur a cost in terms of visibility artifacts in order to achieve acceleration. The nature of the trade off between quality and speed is examined.

Motivation

A number of applications depend on the interactive display of models: from entertainment to medical analysis to CAD/CAM design. In entertainment, designers carefully craft their data to be simple enough to render at interactive frame rates, but in other domains, the data to be viewed is either given a priori or crafted according to specifications that do not include rendering speed. The complexity of these models often outstrips the capabilities of current hardware to display them interactively.

Many approaches to accelerate the rendering of massive models have been explored in the literature. One widely exploited technique is the use of image based impostors which stand in for large amounts of geometry. Billboards [ref], sprites [ref], portal textures [Alia97], and layered depth images (LDIs) [Shade99] are all examples of such techniques. Another large body of research exists on the generation and use of simplified geometric levels of detail (LODs) [ref]. But these techniques typically require a great deal of preprocessing which on a large model can take hours to perform and require gigabytes of additional secondary storage [e.g. Alia99, Alia99A], most of which will not be used if the interactive user does not happen to visit that particular region of the model. Even with the hours of preprocessing, the frame rate or latency involved in rendering the simplified model may not be appropriate for a viewing environment such as a head-tracked head-mounted display (HMD) which requires both high frame rates (~20 FPS) and low latency (~50 msec).

SGI's Onyx2 Reality Monster with the DPLEX hardware option enables the multiplexing of frames rendered on different graphics pipes onto a single video output device. Pipelining frames in this way increases frame rate, but does not improve latency.

The works to which the present one bears the strongest resemblance are the post-rendering warping presented in [Mark99] and the visibility server of [Rask97]. Both discuss ways to use a powerful graphics server to accelerate rendering on a weak client across a TCP/IP network. This work adapts concepts from both for use in a highly connected symmetric processing environment where a client is just as powerful as a server and a very high-bandwidth, low-latency shared memory connects the two.

System Architecture

The rendering system developed as part of this research has a simple underlying structure as shown in Figure 1. The client (Interactive Renderer) is connected to the servers (Reference Renderers) via a pair of queues (managed by a Selection Strategy). The reference renderers' main loop consists of requesting a viewpoint (Camera) to render, rendering it, and then submitting that rendered viewpoint to the queue. A client's main loop, on the other hand, consists of asking for a selection of reference views that best represent the current viewpoint, using those reference viewpoints to render the current view, and then sending an update of the user's current viewpoint.
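The two main loops can be sketched as follows. This is a single-process illustration only: the real system runs each renderer as a separate process over shared memory, and every name here (`SelectionStrategy`, `request_camera`, `submit_reference`, `select_references`, `update_viewpoint`) is a hypothetical stand-in rather than the interface actually used.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Camera:
    position: tuple   # (x, y, z) eye position
    direction: tuple  # viewing direction


@dataclass
class ReferenceView:
    camera: Camera
    payload: object   # color and depth samples, or a polygon ID list


class SelectionStrategy:
    """Hypothetical queue pair sitting between the client and the servers."""

    def __init__(self, initial_camera):
        self.cameras = deque([initial_camera])  # client -> servers
        self.references = deque()               # servers -> client
        self.latest = initial_camera

    def request_camera(self):
        # Hand out a pending camera, or fall back to the latest viewpoint.
        return self.cameras.popleft() if self.cameras else self.latest

    def submit_reference(self, ref):
        self.references.append(ref)

    def select_references(self, current, count=1):
        # Simplest possible policy: the most recently submitted views.
        return list(self.references)[-count:]

    def update_viewpoint(self, camera):
        self.latest = camera
        self.cameras.append(camera)


def reference_renderer_step(strategy, render):
    """One iteration of a server: request a camera, render it, submit it."""
    camera = strategy.request_camera()
    strategy.submit_reference(ReferenceView(camera, render(camera)))


def interactive_renderer_step(strategy, current, draw):
    """One iteration of the client: select, draw, report the viewpoint."""
    refs = strategy.select_references(current)
    frame = draw(current, refs)
    strategy.update_viewpoint(current)
    return frame
```

The `render` and `draw` callables are placeholders for the actual graphics work. A real strategy would also bound the reference queue, which this sketch lets grow without limit.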

Figure 1 - System Architecture
Each gray box represents a separate process. The Selection Strategy resides in an address space
shared by all of them.

Both acceleration techniques developed in this research fit nicely into the system architecture described above. They differ only in the type of reference view passed from servers to client. The following sections describe the techniques.

Image Based Acceleration

In the first technique, a reference view consists of the color and depth buffers rendered by a server. These are interpreted as a grid of point samples which are reprojected from a new viewpoint in the interactive renderer. This is very similar to [Mark99] except that there the focus is on rendering remotely on an inexpensive client, so the reprojection is done using general purpose processors. Here we do not have such a restriction on the client and can take full advantage of its fast point-rendering hardware. Furthermore, the client can receive reference frames from many servers, rather than just one.

This approach suffers from all of the same problems that other image based techniques do, namely artifacts from disocclusion and surface undersampling. Approaches taken to alleviate the disocclusion artifacts are discussed in Strategy Choices. The surface undersampling problem is currently handled with fixed-size splats for simplicity.
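The reprojection of a single reference sample can be illustrated with a simple pinhole model. This sketch makes several assumptions that are not from the implementation: depth is taken as linear eye-space distance along +z (not a raw z-buffer value), the new camera differs from the reference camera by a translation only, and the focal length `f` and image center `(cx, cy)` are arbitrary illustrative values. In the system described here the equivalent transformation is performed by the point-rendering hardware.

```python
def unproject(u, v, z, f, cx, cy):
    """Pixel (u, v) with eye-space depth z -> 3D point in the reference camera frame."""
    return ((u - cx) * z / f, (v - cy) * z / f, z)


def project(p, f, cx, cy):
    """3D point in a camera frame -> pixel coordinates, or None if behind the camera."""
    x, y, z = p
    if z <= 0.0:
        return None
    return (f * x / z + cx, f * y / z + cy)


def reproject(u, v, z, offset, f=100.0, cx=256.0, cy=256.0):
    """Warp one reference sample into a camera translated by `offset`."""
    x, y, zz = unproject(u, v, z, f, cx, cy)
    # Express the point relative to the new camera position.
    moved = (x - offset[0], y - offset[1], zz - offset[2])
    return project(moved, f, cx, cy)
```

A sample at the image center with depth 10 shifts 10 pixels to the left when the camera moves one unit to the right: `reproject(256, 256, 10.0, (1.0, 0.0, 0.0))` gives `(246.0, 256.0)`. Nearer samples shift more than distant ones, which is what opens the disocclusion gaps mentioned above.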

Polygon Based Acceleration

The reference viewpoint need not be limited to precomputed color and depth samples. In fact the assumption of a fast shared memory system means that the client has just as much access to the original model as do servers. Instead of sending a color and depth value for every pixel in the reference image, one can send an identifier for the polygon that was visible at that pixel. Alternatively one could mark nodes in the scene graph directly. [Rask97] describes a TCP/IP client/server system which uses this same sort of trick. Here, though, the goal is to exploit parallelism in graphics hardware to perform near-perfect visibility culling.

Using parallel pipes to precompute visibility has some enormous advantages. For one, surface reconstruction is easier than with point samples since most of the time two different polygons sampled by adjacent pixels will actually be adjacent polygons. That means the surface does not disintegrate as the user zooms in closer to the surface than the reference camera. Another advantage is compactness: often a polygon will cover multiple pixels in the reference image, so one polygon identifier can often serve to represent many pixels. Fortunately, on the flip side, when one pixel covers multiple polygons only one of the polygons will be put into the list, so complexity is still capped to the size of the reference grid just as it is with image based techniques. A hybrid reference view that uses point samples to maintain coverage in those cases would be interesting. Finally since the interactive graphics pipe will be rendering actual polygons from the model, it can render the scene with realistic view-dependent lighting. This is possible with point samples only if a normal buffer is added to the color and depth buffers already being transferred. A full precision normal buffer, however, would require as much storage as three depth buffers.

In the current implementation, each reference pipe renders the scene using triangle IDs encoded into the RGB color channels. After rendering the false color image, the challenge is to quickly generate a compact list of unique IDs out of the highly redundant information in the color buffer. The current solution is to scan through the color buffer, hashing each ID into a table to remove redundancies, and then to scan through the resulting hash table to compact the list. Only then is the list handed off to the interactive renderer. It is essential that the interactive renderer receives the most easily digestible list possible, since this directly impacts the interactive frame rate. Note that each reference pipe runs on a separate processor, so CPU time spent compacting the list does not take away from time available to the interactive renderer.
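The hash-and-compact step might look like the following minimal sketch. It assumes 8 bits per color channel (so a 24-bit ID), a color buffer presented as a flat sequence of `(r, g, b)` tuples, and a reserved `BACKGROUND` value for pixels that hit no triangle; none of these details are stated for the actual implementation. A Python `set` stands in for the hash table.

```python
BACKGROUND = 0xFFFFFF  # assumed ID reserved for pixels that hit no triangle


def encode_id(tri_id):
    """Split a 24-bit triangle ID into 8-bit R, G, B channels."""
    return ((tri_id >> 16) & 0xFF, (tri_id >> 8) & 0xFF, tri_id & 0xFF)


def decode_id(r, g, b):
    """Recover the 24-bit triangle ID from 8-bit R, G, B channels."""
    return (r << 16) | (g << 8) | b


def compact_ids(color_buffer):
    """Hash every pixel's ID to drop duplicates, then emit a compact list."""
    seen = set()
    for r, g, b in color_buffer:
        tri_id = decode_id(r, g, b)
        if tri_id != BACKGROUND:
            seen.add(tri_id)
    return sorted(seen)
```

For example, a buffer in which triangle 5 covers four pixels and triangle 70000 covers one compacts to the two-element list `[5, 70000]`. Sorting is a choice made here for reproducible output and is not required by the technique.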

The hash and compact technique works well; however, it does not allow for smooth integration of multiple reference renderers. It is critical to be able to combine the ID lists from multiple pipes, and ideally the cost of such a merging process would not depend on the number of pipes contributing IDs. It is possible to simply let all the processes hash into the same table simultaneously. Although this can result in some duplicates in the hash table from race conditions, the number of duplicates will be relatively small and there is no real harm in accidentally rendering a few triangles multiple times. However, at some point triangles which are no longer in the user's view must be cleared out, otherwise they will just continue to accumulate until they exceed the interactive pipe's ability to render them. One could simply clear the hash table periodically, but this would result in an unnecessary loss of recently viewed polygons. Locality of reference applies here: a polygon recently viewed is likely to be viewed again, so as long as the interactive pipe can handle the polygon load, there is no reason to clear out a recently viewed polygon. The solution developed here is to associate a timestamp with each ID in the hash table, and then instead of completely clearing the table, it is pruned periodically according to an LRU rule. Exactly when this pruning occurs can be based on the current frame rate in the interactive renderer. When the frame rate dips below a specified threshold, the pruning kicks in, which boosts the frame rate back up to acceptable levels.
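The accumulate-and-prune idea can be sketched as below. This is a single-threaded illustration that does not model the concurrent, race-tolerant hashing described above. The class and method names are hypothetical, and `keep_fraction` (how much of the table survives a pruning pass) is an assumed tuning parameter that the text does not specify.

```python
class TimestampedIDTable:
    """Shared table of triangle IDs, each tagged with when it was last seen."""

    def __init__(self):
        self.last_seen = {}  # triangle ID -> logical time of most recent sighting

    def merge(self, ids, now):
        """Add IDs from one reference view; re-sighted IDs get a fresh timestamp."""
        for tri_id in ids:
            self.last_seen[tri_id] = now

    def prune(self, keep):
        """Drop the least recently seen IDs until at most `keep` remain."""
        if len(self.last_seen) <= keep:
            return 0
        newest_first = sorted(self.last_seen.items(),
                              key=lambda item: item[1], reverse=True)
        removed = len(newest_first) - keep
        self.last_seen = dict(newest_first[:keep])
        return removed

    def maybe_prune(self, fps, min_fps=20.0, keep_fraction=0.5):
        """Prune only when the interactive frame rate falls below the threshold."""
        if fps >= min_fps:
            return 0
        return self.prune(int(len(self.last_seen) * keep_fraction))
```

The interactive renderer would call `maybe_prune` once per frame with its measured frame rate, so the table shrinks only while the polygon load is actually too high.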

The polygon ID rendering acceleration technique yields very promising results; nevertheless, it is not without its share of problems. One problem is undersampling. When the scene contains many small polygons it is very easy for some of them to entirely miss the pixel grid in the reference renderers, especially when the small polygons are nearly edge-on to the viewer. Using the accumulate-and-prune technique just described tends to eventually fill in the missing polygons, but still some polygons can stay missing for the time it takes to render several reference frames.

Strategy Choices

The most difficult part in the implementation of this system is the Selection Strategy, which must decide how to manage ever changing sets of cameras and reference views. All of the details of processor allocation and resource management are contained within the selection strategy. Many variations are possible within the interface provided to meet the ultimate goal of providing the interactive graphics pipe with reference data at a sufficient density over its entire viewing frustum for every frame.

Undersampling can be completely avoided if the path of the user can be predicted perfectly. If it were known where the user would be at all future points in time, then a scheduling algorithm could be used to make sure that the appropriate reference view is ready when needed. Unfortunately, if this were true, then the display of the model would no longer be interactive -- a prerendered video would suffice. It is the inherent unpredictability of a user's motions that makes the experience interactive. Still, undersampling artifacts can be reduced in a number of ways by assuming certain coherence in the user's motions. One can predict a user's future positions based on current trajectories, or render a somewhat larger frustum than what the user can see in any one frame. Yet another approach is to speculatively render reference views from several possible future paths and render just the best available references at the last instant when the user's choice of paths is known.

We have implemented several of the ideas mentioned above and have plans to implement several more (see Future Work).

The test bed developed for experimenting with strategies contains a number of parameters that are common to all and can be set with the GUI, shown below.

The key parameters are:

Point reconstruction kernel size (PtSize)
Horizontal field of view (x fov)
Vertical field of view (y fov)
Horizontal resolution (XResolution)
Vertical resolution (YResolution)
Camera Z offsets

Several of these can be set independently for various reference rendering pipes in the "fan" strategy described below.

Basic Selection Strategy

In the most basic selection strategy, both the camera queue and the reference queue have only one slot. This cuts down on the amount of parallelism in the graphics hardware that can be exploited. Every reference renderer just overwrites the reference view left by the last one. This is a waste of effort in some cases, or worse, if the reference renderers have variable frame rates then a reference rendered from a stale camera could overwrite a fresh one. This particular problem can be fixed by including a logical timestamp with cameras that carries through to the reference views, but that does not help the poor utilization of parallel resources. This strategy basically serves as a standard against which other strategies can be judged. One can experiment with the performance and quality of various combinations of reference image size and FOV.
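The logical timestamp fix can be illustrated with a one-slot queue that refuses to let an older reference overwrite a newer one. The class name and the `submit` signature are hypothetical, and a shared-memory version would additionally need the comparison and the store to happen atomically, which this sketch ignores.

```python
class SingleSlotReferenceQueue:
    """One-slot reference queue guarded by the camera's logical timestamp."""

    def __init__(self):
        self.stamp = -1   # timestamp of the camera that produced `view`
        self.view = None

    def submit(self, stamp, view):
        """Store `view` unless it was rendered from a staler camera."""
        if stamp < self.stamp:
            return False  # a fresher reference is already in the slot
        self.stamp = stamp
        self.view = view
        return True
```

If a slow renderer finishes the view for camera 1 after a fast renderer has already delivered the view for camera 2, the late submission is rejected and the slot keeps the fresher view.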

Fan Selection Strategy

In this strategy, all reference renderers render the scene from the same viewpoint origin but each renders a different viewing direction as shown in the figure below. The main point of this strategy is to demonstrate how multiple pipes can be used to overcome the problem of seeing the "edge of the world" whenever one turns around or backs up suddenly. Fanning out the angle rendered by each server pipe helps to increase the total FOV, which makes it less likely that a sudden turn will reveal the edge of the available reference data.

Figure n - Fanning reference renderers.
The diagram shows an overhead view
of the frusta rendered by two
reference renderers with 30 degree
separation.

Figure n - Fan Selection Strategy.
Notice the outline of the two reference
images that were used to generate this
view. (Effect deliberately exaggerated
for explicative purposes)
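As a small worked example of the fan layout, the sketch below computes a yaw offset for each reference renderer and the total horizontal coverage. It assumes the fan is spread symmetrically about the user's heading and that adjacent frusta are separated by a fixed angle, as in the two-renderer, 30 degree case shown in the figure; the function names and the 60 degree per-pipe FOV used in the example are illustrative only.

```python
def fan_yaws(num_renderers, separation_deg):
    """Yaw offsets (degrees) relative to the user's heading, centered on zero."""
    middle = (num_renderers - 1) / 2.0
    return [(i - middle) * separation_deg for i in range(num_renderers)]


def total_fov(num_renderers, separation_deg, fov_deg):
    """Horizontal coverage of the fan, assuming neighboring frusta overlap or touch."""
    return fov_deg + (num_renderers - 1) * separation_deg
```

Two renderers at 30 degree separation are aimed at -15 and +15 degrees. With an assumed 60 degree FOV per pipe they cover 90 degrees in total.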

One observation is that off-center references are not visible much of the time, yet they cost just as much to render as the one at the center of attention. They are rendered speculatively in case the user suddenly decides to steer the walkthrough in that direction. In order to speed up the interactive frame rate, however, we can optimize for the common case and render a lower resolution reference image for the off-center angles. If we only wish to use two reference images, then we can point them both in the same direction, but render one with a lower resolution and wider FOV. The image below demonstrates that technique.

Figure n - Use of a lower resolution reference with wide FOV.
The viewer was turning right as this image was captured, so a band of
lower resolution samples can be seen along the right edge.
The blocky pixels at the top of the window are an artifact of the
technique used to composite the low and high resolution reference views.

We still get the edge coverage of a large reference FOV, without taking the hit of rendering 2-3 full resolution point sets. To composite the low resolution and high resolution images - which cover many of the same pixels - we give preference to high resolution samples by using the stencil buffer. This works well generally, but causes some problems with detailed elements that appear in front of the background, as is the case with the window in the above image. In the example above the problem can be remedied by adding a backdrop image behind the window. This is also desirable for increasing the realism of the experience. A promising variation which would eliminate all "edge of the world" phenomena would be to take the wide FOV to its extreme and use a low resolution environment map pasted onto the sides of a cube to provide coverage of the entire solid angle surrounding the user.
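The stencil-based preference can be mimicked in software as follows. This is only a sketch of the compositing rule, assuming samples arrive as already-projected `(x, y, color)` tuples; it omits the depth test and splat size that the real pipeline also applies, and the function name is hypothetical.

```python
def composite(width, height, high_res, low_res):
    """Draw high resolution samples first, then low resolution only where none landed."""
    color = [[None] * width for _ in range(height)]
    stencil = [[0] * width for _ in range(height)]

    # Pass 1: high resolution samples write color and mark the stencil.
    for x, y, c in high_res:
        color[y][x] = c
        stencil[y][x] = 1

    # Pass 2: low resolution samples pass only where the stencil is still clear.
    for x, y, c in low_res:
        if stencil[y][x] == 0:
            color[y][x] = c

    return color
```

Pixels that no high resolution sample reaches, such as the gaps seen through a window, are exactly where the coarser samples show through, which is consistent with the blockiness visible in the figure.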

Predictive Strategy

A key problem with both of the previous strategies is that they do not anticipate the movements of the user. While it is impossible to predict the user's movements perfectly, one can reduce the steady state error by extrapolating the user's trajectory out one reference frame time into the future. Both of the above strategies could benefit from such prediction. For example if the user's viewpoint were being predicted in the image above, there would be very little low resolution image showing. For the purposes of demonstration, the basic predictive strategy was implemented as a variation of the basic strategy. [Mark99] shows good results with a predictive strategy in which interactive views are generated from two reference images: one predicted future viewpoint and one past viewpoint.
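Extrapolating the trajectory one reference frame time ahead amounts to simple linear prediction. The sketch below assumes positions are plain `(x, y, z)` tuples and that a velocity is either reported directly or estimated from two recent samples; the function names are hypothetical, and orientation prediction is not covered.

```python
def velocity_from_samples(p0, t0, p1, t1):
    """Finite-difference velocity between two timestamped positions."""
    dt = t1 - t0
    return tuple((b - a) / dt for a, b in zip(p0, p1))


def predict_position(position, velocity, lookahead):
    """Linear extrapolation: where the viewer will be `lookahead` seconds from now."""
    return tuple(p + v * lookahead for p, v in zip(position, velocity))
```

With a reference renderer that needs about a quarter of a second per frame, a viewer at `(0.0, 1.5, 0.0)` moving at `(2.0, 0.0, -4.0)` units per second is predicted to be at `(0.5, 1.5, -1.0)`, and that is the camera a reference renderer would be asked to render.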

In this work, references can be generated by multiple pipes, so the Selection Strategy can speculatively render a range of potential future viewing directions and select the best of the bunch when the time comes. This is not necessarily an efficient use of the parallel hardware, but at least with a real headtracked viewer the number of potential future viewing directions is fairly limited and not dependent on the problem size. A small number (fewer than 10) of reference renderers should give good results regardless of the model size.
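One plausible way to "select the best of the bunch" is to pick the speculative reference whose viewing direction makes the smallest angle with the direction the user actually chose. The criterion and the function names below are assumptions made for illustration; the text does not specify how the best reference is scored.

```python
import math


def angle_between(a, b):
    """Angle in radians between two non-zero 3D direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to guard against rounding slightly outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


def best_reference(actual_direction, candidate_directions):
    """Return the speculative direction closest to the actual viewing direction."""
    return min(candidate_directions,
               key=lambda d: angle_between(actual_direction, d))
```

A fuller scoring function would presumably also account for viewpoint position and the age of each reference.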

Results

The system described above was implemented on an Onyx2 with 32 MIPS R10K processors running at 250MHz and 8 InfiniteReality2E graphics pipes. This so-called "Reality Monster" is quite likely the only architecture in the world currently capable of transferring the contents of frame buffers between pipes fast enough to be useful. Tests indicate that the Reality Monster can transfer an astounding ???MB/sec from the frame buffer of one IR2, to a shared main memory, and finally into the frame buffer of a second IR2. Although this configuration is cutting edge today, Moore's law indicates that a machine of this calibre will not be such a rarefied commodity a decade from now. And even now, there are situations in which an interactive frame rate must be achieved no matter what the cost.

The tests were conducted on a radiositized model of a few rooms of Dr. Brooks's house which was instanced three times for a total of 390K triangles. The model was stored in a flat triangle list and could be rendered without any culling at about 4 frames per second (FPS). This base frame rate is both the baseline against which to compare acceleration and the rate at which a reference pipe can deliver new reference frames.

Using the point-based technique, the rendering time depends upon the number of points displayed, which works against the fan strategy. Point rendering should be much faster than triangle rendering, since it requires no edge setup, but unfortunately current hardware is not well optimized for point rendering. Storing the colors for the points in a texture and accessing them via projective texturing almost doubles the frame rate. A grid of 512x512 points colored with projective texturing can be rendered at 17FPS. With two 512x512 references, the rate drops to 10FPS. The combined low-res/high-res technique used one 512x512 reference and one 128x128, and rendered at ~15FPS.

Using the polygon-ID-based technique yielded frame rates as high as 60FPS, probably capped by the display's refresh rate. Excellent results were obtained by setting the minimal frame rate to trigger pruning at about 20FPS. On a 640x480 stereo display (1280x480 total framebuffer), the interactive pipe was able to run at a consistent 30FPS.

While both techniques look promising, for models with significantly fewer than one polygon per pixel in an average view, and until point rendering hardware catches up to triangle hardware, rendering polygon IDs is probably the best choice.

Future Work

To actually implement the predictive strategy described above. I know what to do: just use the velocity vectors handed back from the tracker and use simple linear extrapolation for prediction.
Hybrid techniques using both polygon IDs for quality and point samples (low-res?) to fill in gaps.
Better reconstruction techniques for point samples.
More models. Bigger models. Models with specular lighting. Bigger models that require several seconds to render a single reference will require more complex database management and prefetching. If it takes you 20 seconds to render a frame, you don't discard it lightly.
Try to quantify the badness of various artifacts (without falling into the quagmire of quantifying human perception).
Fix some of the GUI mode switching bugs that cause segfaults and bad memory leaks so I'm not humiliated on demo day.

Bibliography

[Mark99] William R. Mark, "Post-Rendering 3D Image Warping: Visibility, Reconstruction, and Performance for Depth-Image Warping", UNC TR# 99-022 (Dissertation), April 21, 1999.

[Rask97] R. Raskar, "Visibility Culling Using Frame Coherence", UNC Course Project, May 1997. (link withheld by author's request)

[DPLEX] Mark Schwenden, "Onyx2 (TM) DPLEX Option Hardware User's Guide", Document Number 007-3849-001, Silicon Graphics, Inc., 1999.

[Alia99] Daniel G. Aliaga, Anselmo A. Lastra, "Automatic Image Placement to Provide a Guaranteed Frame Rate", to appear in SIGGRAPH '99, 1999.

[Alia99A] D. Aliaga, J. Cohen, A. Wilson, E. Baker, H. Zhang, C. Erikson, K. Hoff, T. Hudson, W. Stuerzlinger, R. Bastos, M. Whitton, F. Brooks, D. Manocha, "MMR: An Integrated Massive Model Rendering System Using Geometric and Image-Based Acceleration", Symposium on Interactive 3D Graphics (I3D), April, 1999.

[Shade99] J. Shade, Gortler, He, Szeliski, "Layered Depth Images", Proceedings of SIGGRAPH '98.

[Alia97] Daniel G. Aliaga, Anselmo A. Lastra, "Architectural Walkthroughs Using Portal Textures", IEEE Visualization '97, pp. 355-362, Oct 19-24, 1997.

Matthew M. Rafferty, Daniel G. Aliaga, Voicu Popescu, Anselmo A. Lastra, "Images for Accelerating Architectural Walkthroughs", Computer Graphics & Applications, November/December, 1998.

Voicu Popescu, Anselmo Lastra, Daniel Aliaga, Manuel Oliveira Neto, "Efficient Warping for Architectural Walkthroughs using Layered Depth Images", IEEE Visualization '98, October 18-23, 1998.

Matthew M. Rafferty, Daniel G. Aliaga, Anselmo A. Lastra, "3D Image Warping in Architectural Walkthroughs", VRAIS '98, pp. 228-233, March 14-18, 1998.

Daniel G. Aliaga, Anselmo A. Lastra, "Smooth Transitions in Texture-based Simplification", Computers & Graphics, Elsevier Science, Vol 22:1, pp. 71-81, 1998.

Daniel G. Aliaga, "Visualization of Complex Models Using Dynamic Texture-based Simplification", IEEE Visualization '96, pp. 101-106, Oct 27-Nov 1, 1996.

D. Aliaga, J. Cohen, A. Wilson, H. Zhang, C. Erikson, K. Hoff, T. Hudson, W. Stuerzlinger, E. Baker, R. Bastos, M. Whitton, F. Brooks, D. Manocha, "A Framework for the Real-time Walkthrough of Massive Models", UNC TR# 98-013, March, 1998.

D. Aliaga, "Automatically Reducing and Bounding Geometric Complexity by Using Images", Dissertation, Computer Science, UNC-Chapel Hill, October 1998.