Gaussian splatting: A new technique for rendering 3D scenes
A successor to neural radiance fields (NeRF)
In traditional computer graphics, scenes are represented as meshes of polygons. Each polygon has a surface that reflects light, and the GPU calculates the angle at which light hits the polygon and how the polygon's surface affects the reflected light -- color, diffusion, transparency, etc.
In the world of neural radiance fields (NeRF), a neural network, trained on a set of photos of a scene, is asked: from this point in space, with the light ray going in this direction, how should the pixel rendered to the screen be colored? In other words, you are challenging a neural network to learn ray tracing: it has to learn all the details of the scene and store them in its parameters. It's a crazy idea, but it works. It also has limitations -- you can't view the scene from an angle very far from the original photos, it doesn't work for scenes that are too large, and so on. And at no point does it ever translate the scene into traditional computer graphics meshes, so the scene can't be used in a video game or anything like that.
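To make the idea concrete, here is a minimal sketch of that query interface: a toy network that maps a 3D position and a view direction to a color and a density. It is not the actual NeRF architecture (which adds positional encoding and a much deeper network); it just shows what the network is being asked to do.

```python
import torch
import torch.nn as nn

# Minimal sketch of the NeRF idea: a network that maps a 3D position and a
# viewing direction to a color and a density. (The real NeRF uses positional
# encoding and a deeper architecture; this is just the interface.)
class TinyRadianceField(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB + density
        )

    def forward(self, position, direction):
        out = self.net(torch.cat([position, direction], dim=-1))
        rgb = torch.sigmoid(out[..., :3])     # color in [0, 1]
        density = torch.relu(out[..., 3:])    # non-negative density
        return rgb, density

# Query: "from this point, looking in this direction, what do I see?"
model = TinyRadianceField()
rgb, density = model(torch.tensor([[0.1, 0.2, 0.3]]),
                     torch.tensor([[0.0, 0.0, 1.0]]))
```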
The new technique goes by the unglamorous name "Gaussian splatting". This time, instead of asking a neural network to tell you the result of a ray trace, you start by representing the scene as a "point cloud" -- that is to say, simply a collection of points, not even polygons -- obtained from the same structure-from-motion step that calibrates the cameras for the input photos. This is just the first step. Once you have the initial point cloud, you switch to Gaussians.
The concept of a 3D "Gaussian" may take a minute to wrap your brain around. We're all familiar with the "bell curve", also called the normal distribution, also called the Gaussian distribution. That's a function of one dimension: G = f(x). A 3D Gaussian extends the same idea to all three spatial dimensions: G = f(x, y, z).
Not only that, but the paper makes a big deal of the fact that its 3D Gaussians are "anisotropic". What this means is that they are not nice, spherical Gaussians; they are stretched, and have a direction. That matters because, when rendering 3D graphics, many materials are directional, such as wood grain, brushed metal, fabric, and hair. Anisotropic Gaussians also help in other situations: light sources that aren't spherical, very oblique viewing angles, textures seen from a sharp angle, and scenes with sharp edges.
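Here is a small sketch of what evaluating one of these anisotropic Gaussians looks like. The covariance is built from a rotation and three per-axis scales (roughly the parameterization the splatting paper uses); the function name and the Euler-angle convention are just illustrative choices, not the paper's code.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def gaussian_3d(x, mean, rotation, scales):
    """Evaluate an anisotropic 3D Gaussian at point x.

    The covariance is built from a rotation and three per-axis scales:
    a spherical Gaussian stretched along its own axes, then rotated into place.
    """
    R = Rotation.from_euler("xyz", rotation).as_matrix()  # 3x3 rotation
    S = np.diag(scales)                                    # per-axis stretch
    cov = R @ S @ S.T @ R.T                                # covariance matrix
    d = x - mean
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

# A Gaussian stretched 10x along one axis, rotated 45 degrees about z:
value = gaussian_3d(
    x=np.array([0.5, 0.5, 0.0]),
    mean=np.zeros(3),
    rotation=[0.0, 0.0, np.pi / 4],
    scales=[1.0, 0.1, 0.1],
)
```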
At this point you might be thinking: this all sounds a lot more complicated than simple polygons. What does using 3D Gaussians get us? The answer is that, unlike polygons, 3D Gaussians are *differentiable*. That magic word means you can calculate a gradient, which you might remember from your calculus class. Having a gradient means you can train it using stochastic gradient descent -- in other words, you can train it with standard deep learning training techniques. Now you've brought your 3D representation into the world of neural networks.
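A toy example of what "differentiable" buys you: because the Gaussian's density is built from ordinary differentiable operations, autograd can compute gradients of a loss with respect to its parameters, and stochastic gradient descent can nudge them. This is purely an illustration (an axis-aligned Gaussian pulled toward a target point), not anything from the paper.

```python
import torch

# Nudge a Gaussian's parameters by gradient descent so that it puts
# more density on a target point.
mean = torch.zeros(3, requires_grad=True)
log_scale = torch.zeros(3, requires_grad=True)  # optimize scales in log space
target = torch.tensor([1.0, 2.0, -0.5])

optimizer = torch.optim.SGD([mean, log_scale], lr=0.1)
for step in range(200):
    cov_inv = torch.diag(torch.exp(-2 * log_scale))  # axis-aligned for simplicity
    d = target - mean
    density = torch.exp(-0.5 * d @ cov_inv @ d)
    loss = -density                                  # maximize density at the target
    optimizer.zero_grad()
    loss.backward()                                  # gradients via autograd
    optimizer.step()
```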
Even so, more cleverness is required to make the system work. The researchers made a special training system that enables geometry to be created, deleted, or moved within a scene, because, inevitably, geometry gets incorrectly placed due to the ambiguities of the initial 3D to 2D projection. After every 100 iterations, Gaussians that are "essentially transparent" are removed, while new ones are added to "densify" and fill gaps in the scene. To do this, they made an algorithm to detect regions that are not yet well reconstructed.
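Schematically, that adaptive control might look something like the sketch below: prune Gaussians that have become nearly transparent, and clone Gaussians whose positions are receiving large gradients (a sign the region isn't well reconstructed yet). The thresholds and the cloning rule here are illustrative stand-ins, not the paper's exact procedure.

```python
import numpy as np

def adaptive_density_control(means, opacities, pos_grads,
                             opacity_min=0.005, grad_threshold=0.0002):
    """Sketch of the pruning / densification step (not the paper's exact code)."""
    # Prune: remove Gaussians that are essentially transparent.
    keep = opacities > opacity_min
    means, opacities, pos_grads = means[keep], opacities[keep], pos_grads[keep]

    # Densify: clone Gaussians whose positions get large gradients,
    # i.e. regions that are not yet well reconstructed.
    needs_more = np.linalg.norm(pos_grads, axis=1) > grad_threshold
    clones = means[needs_more] + 0.01 * np.random.randn(needs_more.sum(), 3)
    means = np.vstack([means, clones])
    opacities = np.concatenate([opacities, opacities[needs_more]])
    return means, opacities

means, opacities = adaptive_density_control(
    np.random.rand(1000, 3), np.random.rand(1000), 0.001 * np.random.randn(1000, 3))
```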
With these in place, the training system generates 2D images and compares them to the training views provided, and iterates until it can render them well.
To do the rendering, they developed their own fast renderer for Gaussians, which is where the word "splats" in the name comes from. When the Gaussians are rendered, they are called "splats". The reason they took the trouble to create their own rendering system was -- you guessed it -- to make the entire rendering pipeline differentiable.
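For a single pixel, the splats that cover it are blended front to back with standard alpha compositing: each splat contributes its color, weighted by its own opacity and by how much light the splats in front of it let through. The sketch below shows that per-pixel rule; the real renderer does this in screen-space tiles on the GPU and also needs each Gaussian's projected 2D footprint, which is omitted here.

```python
import numpy as np

def composite_splats(colors, alphas, depths):
    """Blend the splats covering one pixel, front to back."""
    order = np.argsort(depths)           # nearest splat first
    pixel = np.zeros(3)
    transmittance = 1.0                  # fraction of light still unblocked
    for i in order:
        pixel += transmittance * alphas[i] * colors[i]
        transmittance *= (1.0 - alphas[i])
        if transmittance < 1e-4:         # early exit once nearly opaque
            break
    return pixel

# Three overlapping splats at different depths:
pixel = composite_splats(
    colors=np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]),
    alphas=np.array([0.6, 0.5, 0.9]),
    depths=np.array([2.0, 1.0, 3.0]),
)
```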
The project website:
https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
DreamGaussian: Gaussian splatting in a generative, rather than reconstructive, context. By "reconstructive", we mean like what I was talking about in my post from yesterday -- taking a bunch of photos and reconstructing a scene. Here, we want to *generate* a scene from nothing -- or not much, just an image or even just a text prompt.
Just as when generating an image with diffusion models, you start with random noise. With diffusion models like DALL-E, Midjourney, or Stable Diffusion, you generate random noise and then, step by step, reverse the process of going from a high-quality image to noise, with the process guided by your input (text prompt or image or both). Here, random 3D Gaussians are generated at random locations in a sphere. At each step, a random camera pose orbiting the object center is chosen, and the RGB image and the transparency (alpha) of that view are rendered. An algorithm called score distillation sampling (SDS) is used to perform the optimization at this step. SDS uses a pretrained 2D diffusion model to optimize a 3D representation from its 2D renderings. Basically it works by using a fully differentiable rendering process from the 3D representation to the 2D image, which allows a loss function to be calculated on the image -- how plausible the diffusion model finds it -- which can then be backpropagated back into the 3D representation.
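Here is a sketch of the SDS idea. The `predict_noise` callable and the noise schedule below are placeholders standing in for a pretrained 2D diffusion model (in practice, Stable Diffusion's denoiser conditioned on the prompt), not a real API. The idea: add noise to the rendered image, ask the diffusion model what noise it thinks was added, and use the disagreement as the direction to push the rendering -- and, through the differentiable renderer, the 3D Gaussians behind it.

```python
import torch

def sds_loss(rendered_image, predict_noise, t_max=1000):
    """Sketch of score distillation sampling (SDS).

    `predict_noise(noisy_image, t)` is a stand-in for a pretrained 2D
    diffusion model; the cosine schedule below is a toy choice.
    """
    t = torch.randint(20, t_max, (1,))
    alpha_bar = torch.cos(0.5 * torch.pi * t / t_max) ** 2   # toy noise schedule
    noise = torch.randn_like(rendered_image)
    noisy = alpha_bar.sqrt() * rendered_image + (1 - alpha_bar).sqrt() * noise

    with torch.no_grad():
        predicted = predict_noise(noisy, t)

    # Gradient flows only through rendered_image; the (predicted - noise)
    # residual acts as the direction to move the rendering in.
    return ((predicted - noise) * rendered_image).sum()

# Demo with a dummy "diffusion model" so the sketch runs end to end:
image = torch.rand(3, 64, 64, requires_grad=True)
loss = sds_loss(image, predict_noise=lambda x, t: torch.zeros_like(x))
loss.backward()
```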
During this optimization there are also periodic "densification" steps, where the density of the 3D Gaussians is increased.
At this point the researchers tried their hand at transferring the 3D Gaussians to a traditional polygon mesh.
The result is a blurry 3D model. At this point another technique kicks in: the final 2D textures of the polygons are refined using a denoising algorithm. They found it helpful to actually add noise before denoising.
The end result is fine polygonal 3D models that you can use in a video game.
https://dreamgaussian.github.io/
"Gsgen: Text-to-3D using Gaussian Splatting". More Gaussian splatting for you all.
The system starts off by using a regular 2D image generation system, in this case Stable Diffusion, to generate an initial image from the text description. To go from the 2D image to the initial 3D, the score distillation sampling (SDS) algorithm is used. The key is to represent the scene with a set of parameters that are related to the image in such a way that the computation from parameters to image is differentiable. Here again, making things differentiable -- which is to say, built from functions that have derivatives, as you learned in calculus -- is what makes them amenable to deep learning techniques. The computation from parameters to image also takes camera pose parameters. The key to the SDS system is its ability to judge whether a rendered image from a new camera angle is a plausible rendition of the object.
To get from this to Gaussian splatting, the key is to get from the parameters used by SDS to a "point cloud". Once this is done, the points are converted to spherical Gaussians. These are called "isotropic" Gaussians. The optimization process that kicks in here will change them to anisotropic Gaussians, which is to say, Gaussians that are not spherical but elongated in some direction.
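A rough sketch of what "convert the points to spherical Gaussians" could look like in practice is below. The specific choices -- taking each Gaussian's radius from its nearest-neighbor distance, starting with a neutral color and low opacity -- are illustrative guesses, not the paper's actual initialization.

```python
import numpy as np

def init_isotropic_gaussians(points, opacity=0.1):
    """Turn a point cloud into initial isotropic (spherical) Gaussians.

    Each point becomes a Gaussian centered on it, with a single radius taken
    from the distance to its nearest neighbor so that neighboring blobs
    roughly touch. The optimizer is then free to stretch each one into an
    anisotropic Gaussian later.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)
    radii = dists.min(axis=1)                            # nearest-neighbor distance

    return {
        "means": points,                                 # one Gaussian per point
        "scales": np.repeat(radii[:, None], 3, axis=1),  # same radius on all axes
        "opacities": np.full(len(points), opacity),
        "colors": np.full((len(points), 3), 0.5),        # neutral gray to start
    }

gaussians = init_isotropic_gaussians(np.random.rand(1024, 3))
```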
So far, more or less the same as yesterday's DreamGaussian post. Here's where the paths diverge. Here they bring in Point-E, a pretrained text-to-point-cloud diffusion model (from OpenAI). Not a text-to-2D model, but straight to 3D in the form of a point cloud.
Instead of directly aligning the Gaussians with a Point-E generated point cloud, they use the SDS algorithm to change the positions of the Gaussians, guided by the 2D image.
After this there is a new "appearance refinement stage". It densifies the Gaussians and optimizes their placement and color, guided by the 2D image. They found the densification threshold is hard to tune under SDS: if it is too small, it can generate an excessive number of Gaussians, but if it is too large, it leads to a blurry appearance. So they came up with a new scheme: for each Gaussian, find its K nearest neighbors; if the distance between the Gaussian and a neighbor is smaller than the sum of their radii, a new Gaussian is added between them, with a radius equal to the residual.
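A sketch of that compactness-based rule is below. The KNN query, the choice to place the new Gaussian at the midpoint, and the exact meaning of "residual" (the overlap left over after subtracting the distance) are my reading of the description, not the paper's code.

```python
import numpy as np
from scipy.spatial import cKDTree

def compactness_densify(means, radii, k=4):
    """Sketch of the KNN-based densification described above.

    For each Gaussian, look at its K nearest neighbors; wherever two Gaussians
    overlap (their distance is smaller than the sum of their radii), add a new
    Gaussian between them, with a radius equal to the residual.
    """
    tree = cKDTree(means)
    dists, idx = tree.query(means, k=k + 1)        # +1 because the first hit is itself
    new_means, new_radii = [], []
    for i in range(len(means)):
        for d, j in zip(dists[i, 1:], idx[i, 1:]):
            if j > i and d < radii[i] + radii[j]:  # j > i avoids counting pairs twice
                new_means.append((means[i] + means[j]) / 2)
                new_radii.append(radii[i] + radii[j] - d)
    if new_means:
        means = np.vstack([means, new_means])
        radii = np.concatenate([radii, new_radii])
    return means, radii

means, radii = compactness_densify(np.random.rand(500, 3), np.full(500, 0.05))
```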
In this way the system is a 2-step process: first you refine the geometry, then you refine the appearance.
Polycam purports to have turned Gaussian splatting into a product anyone can use. The examples look impressive.