
A 3D Scene is Worth 1000 Splats

by Doru Arfire, Feb 1, 2024

Ever since they came onto the computer graphics and computer vision scene, NeRFs (Neural Radiance Fields) have attracted a great deal of interest for both existing and novel applications. This article presents the latest approach to Radiance Fields, called 3D Gaussian Splatting (B. Kerbl et al., 2023), explains how it differs from NeRFs, and looks at the opportunities and challenges it presents.

But what’s a NeRF?

Given a collection of images of a 3D scene and the location, orientation and parameters of the cameras that took those images, a Radiance Field encodes the scene so that novel views can be synthesized. In other words, you can build a 3D model of the scene from a relatively small number of images, far fewer than traditional photogrammetry requires. Moreover, this model encodes not only 3D information but also texture and lighting. Using it, you can view and render the scene from new positions and orientations.

The applications for Radiance Fields abound, from faster photogrammetry to manufacturing, entertainment, healthcare and education.

While Radiance Fields are, for the most part, a mathematical concept, a NeRF is a practical way of implementing one. In short, a NeRF is a neural network trained to learn and reproduce a Radiance Field. Its input is a 3D point and a viewing direction; its output is the color and density at that point, as seen from that direction.

A NeRF’s internal structure is relatively simple by contemporary neural network architecture standards. It is trained by sampling 3D points in the volume of interest and pixels in the input images, the goal being to minimize the difference between the original and the re-projected pixel values. The trained model thus associates a color and density with each point in a 3D volume that encompasses the scene of interest.

But while NeRFs give impressive results in synthesizing novel views of a 3D scene, they have their drawbacks. First, they are costly to train: the initial NeRF model had training times measured in days, depending on the GPUs used. Improvements followed, including a model (InstantNGP, T. Müller et al.) that yields acceptable results in minutes, but there has always been a trade-off between training time and reproduction quality.

 Rendering from a NeRF model involves, for each pixel, sampling points along the pixel ray and accumulating their colors, weighted by density.

Second, rendering a view using a trained model is slow and inefficient, and again trade-offs have to be made to achieve real-time rendering.

To simplify, both slow training and slow rendering have their root cause in how NeRFs work: the fact that a NeRF is trained on 3D points selected in a 3D volume means that a lot of training effort is wasted on empty space.

On the other hand, rendering is slow because for each pixel you need to query the NeRF model at many points along the ray that passes through that pixel, to discover which ones actually contribute density (see illustration). While heuristics have been proposed, this method is still costly and the rendering time can vary widely from pixel to pixel. It also doesn’t play nicely with the parallel hardware we use to render 3D graphics.
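The per-pixel cost is easy to see in a minimal sketch of this ray-marching procedure. Here `fake_nerf` is a stand-in for a trained network (a real NeRF is an MLP), but the quadrature and alpha-compositing follow the standard NeRF rendering formula:

```python
import numpy as np

def fake_nerf(points, view_dir):
    """Stand-in for a trained NeRF: returns (rgb, density) per 3D point.
    Density is high only near a thin spherical shell of radius 1."""
    dist = np.abs(np.linalg.norm(points, axis=-1) - 1.0)
    density = 50.0 * np.exp(-(dist / 0.05) ** 2)       # shell of matter
    rgb = np.tile([0.8, 0.3, 0.2], (len(points), 1))   # constant color
    return rgb, density

def render_pixel(origin, direction, near=0.0, far=4.0, n_samples=128):
    """Classic NeRF quadrature: sample the ray, query the model at every
    sample, then alpha-composite front to back."""
    t = np.linspace(near, far, n_samples)
    delta = np.diff(t, append=t[-1] + (t[1] - t[0]))   # spacing between samples
    points = origin + t[:, None] * direction
    rgb, sigma = fake_nerf(points, direction)
    alpha = 1.0 - np.exp(-sigma * delta)               # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)        # composited pixel color

# One pixel costs n_samples network queries, mostly in empty space:
color = render_pixel(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
```

Note that all 128 samples are evaluated even though only a handful near the surface contribute anything, which is exactly the waste described above.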

So what’s this I hear about 3D gaussian splatting?

First let’s clarify the terms. Splatting is a time-tested technique in computer graphics. It involves combining textures with various opacity values to obtain novel effects.

 Texture splatting (image from Wikipedia)

Secondly, a 3D gaussian is a parameterized probability distribution over 3D points. Without going into too much detail, it has the shape of a fuzzy 3D ellipsoid describing the probability that a point will be found at a certain location in space. In the context of the technique presented here, 3D gaussians don’t represent probabilities, but rather 3D graphics primitives that are combined to represent a 3D scene.
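The fuzzy-ellipsoid shape comes from each gaussian’s 3×3 covariance matrix, which the paper factors into a rotation (stored as a quaternion) and per-axis scales, so the matrix stays a valid covariance while being optimized. A small sketch of that factorization (function names and the quaternion convention are mine):

```python
import numpy as np

def quat_to_rot(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance(quaternion, scales):
    """Sigma = R S S^T R^T: symmetric positive semi-definite by
    construction, so the ellipsoid stays well-formed during optimization."""
    R = quat_to_rot(np.asarray(quaternion, dtype=float))
    S = np.diag(scales)
    return R @ S @ S.T @ R.T

# An ellipsoid stretched along x, then rotated 90 degrees about z,
# so its long axis ends up along y:
Sigma = covariance([np.cos(np.pi/4), 0, 0, np.sin(np.pi/4)], [2.0, 0.5, 0.5])
```

Optimizing the rotation and scales separately, instead of the raw covariance entries, is what keeps gradient descent from producing an invalid (non-positive-definite) matrix.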

To sum up, the paper’s main innovation is representing the 3D scene as a collection of 3D gaussians combined using splatting. Each gaussian has its own position, scale, orientation, opacity and color, and can be rendered efficiently by existing graphics hardware.

  Gaussian splatting.

Another big difference between this approach and NeRFs is that it doesn’t use any neural components; rather, it trains and optimizes the parameters of the set of 3D gaussians meant to represent the scene. This also explains the use of 3D gaussians as rendering primitives: their parameters (position, rotation, scaling, opacity and color) are differentiable, meaning we can nudge them based on the difference between the predicted and target output.

And to go a bit deeper into how it is trained:

  • Select one of the source images
  • Project and render the current set of 3D gaussians from the corresponding camera viewpoint
  • Compute a loss function and adjust the 3D gaussian parameters using stochastic gradient descent, much like we train neural networks
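The loop above can be illustrated with a deliberately tiny 1D analogue: one gaussian “splat” fitted to a 1D target signal by gradient descent on a photometric loss. This is not the paper’s renderer or optimizer (which uses SGD on rendered images, with analytic gradients), just the same render-compare-nudge idea in miniature:

```python
import numpy as np

xs = np.linspace(-3, 3, 200)                            # "pixel" coordinates
target = 0.7 * np.exp(-0.5 * ((xs - 1.0) / 0.4) ** 2)   # "ground-truth image"

def render(params):
    """Render one 1D gaussian splat from (amplitude, mean, log-scale)."""
    amp, mu, log_s = params
    return amp * np.exp(-0.5 * ((xs - mu) / np.exp(log_s)) ** 2)

def loss(params):
    """Photometric L2 loss between rendered and target signal."""
    return np.mean((render(params) - target) ** 2)

params = np.array([0.3, -0.5, 0.0])   # poor initial guess
lr, eps = 1.0, 1e-5
for _ in range(2000):
    # Gradient via central finite differences (a real implementation
    # backpropagates analytic gradients through the rasterizer).
    grad = np.array([(loss(params + eps * e) - loss(params - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
    params -= lr * grad               # nudge the gaussian's parameters
```

Because every parameter flows differentiably into the rendered output, the same loss-driven nudging works for millions of 3D gaussians at once.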

Of course, there are extra details, one of the most important being how to adjust the number of gaussians to efficiently cover the entire scene. The paper suggests heuristics to clone or split existing gaussians to increase their number and add more coverage.
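The clone-or-split decision can be sketched as follows. Per the paper, gaussians whose position gradients are large are either cloned (if small, to fill under-reconstructed regions) or split into two smaller copies (if large); the thresholds below are illustrative placeholders, though the scale divisor of 1.6 is the paper’s value. For brevity this operates on 1D gaussians:

```python
import numpy as np

def densify(positions, scales, grads, grad_thresh=0.01, size_thresh=0.05):
    """Illustrative clone/split pass over a flat set of 1D gaussians.
    positions, scales, grads: arrays of shape (N,)."""
    new_pos, new_scale = [], []
    for p, s, g in zip(positions, scales, grads):
        if g <= grad_thresh:                 # well-fit: keep as-is
            new_pos.append(p); new_scale.append(s)
        elif s < size_thresh:                # small + high gradient: clone
            new_pos += [p, p]; new_scale += [s, s]
        else:                                # large + high gradient: split
            new_pos += [p - s / 2, p + s / 2]
            new_scale += [s / 1.6, s / 1.6]  # shrink the two children
    return np.array(new_pos), np.array(new_scale)
```

Run periodically during training, a pass like this grows the gaussian count exactly where the reconstruction error says more coverage is needed.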

And the results?

The approach has clear advantages over NeRFs:

  • Training time is significantly lower (~1 hour vs ~1 day) with comparable rendering quality; this is mostly because training operates on end results (rendered images) rather than sampling mostly empty space in a 3D volume
  • Real-time rendering, since there is no need to sample (mostly empty) points along a ray; pruning and alpha-blending achieve rendering rates greater than 100 fps on a modern GPU
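The alpha-blending mentioned above is ordinary front-to-back compositing of depth-sorted splats, with early termination once a pixel is nearly opaque. A schematic per-pixel version (the real implementation is a tiled GPU rasterizer, not a Python loop):

```python
import numpy as np

def blend_pixel(colors, alphas, depths, stop_at=0.999):
    """Front-to-back alpha compositing of the splats covering one pixel.
    colors: (N, 3); alphas, depths: (N,)."""
    order = np.argsort(depths)              # nearest splat first
    out, accum = np.zeros(3), 0.0
    for i in order:
        w = alphas[i] * (1.0 - accum)       # contribution behind what's in front
        out += w * colors[i]
        accum += w
        if accum > stop_at:                 # early termination saves work
            break
    return out, accum

# A red splat in front of a green one:
color, opacity = blend_pixel(
    np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    np.array([0.6, 0.8]),
    np.array([1.0, 2.0]),
)
```

Unlike the per-ray search a NeRF requires, every splat’s contribution is a fixed, cheap operation, which is why this maps so well onto GPU rasterization.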

  Building a scene out of 3D gaussians.

Since this is a novel approach, expect a quick succession of optimizations and adaptations in the near future. Some of the open questions are:

  • Can the training time be reduced further, to the point of almost instant training of new scenes? (the paper’s authors suggest there is room for improvement here)
  • How do we extract a 3D point cloud or mesh from the set of 3D gaussians, since the end result is a combination of rather fuzzy objects?
  • Can specialized hardware work directly with 3D gaussians as primitives, much like they work with points or triangles?
  • The final model is considerably larger than NeRFs (hundreds of MB versus a few MB); can a more efficient format be found?
  • Finally, are there other more amenable 3D primitives that are both differentiable and can be efficiently rendered?

Conclusion

3D gaussian splatting is a novel approach to learning Radiance Fields from a set of images. It addresses some of the issues NeRFs have and promises faster training and real-time rendering. These advances make Radiance Fields an exciting tool for real-world applications and new kinds of content.

References

3D Gaussian Splatting for Real-Time Radiance Field Rendering (B. Kerbl, G. Kopanas, T. Leimkühler, G. Drettakis, SIGGRAPH 2023)

Texture splatting

Instant Neural Graphics Primitives with a Multiresolution Hash Encoding (T. Müller, A. Evans, C. Schied, A. Keller, SIGGRAPH 2022)

All images and videos generated by the author, unless attributed. Truck images and videos generated from a model trained on the dataset available on the original paper’s website.