To begin the project, I calibrated my camera and captured images of an object from multiple viewpoints to create a 3D scan. As specified in the project instructions, I used a printed ArUco tag with a side length of 0.06 m for calibration, and I made sure the tag was clearly visible in every image to allow accurate pose estimation. I then photographed my chosen object from various angles while keeping the ArUco tag in the frame. For both the calibration and object images, I kept the lighting conditions consistent to minimize shadows and reflections that could degrade calibration and pose accuracy. After running calibration and pose estimation, I visualized the camera frustums in Viser to verify the accuracy of the estimated poses. Below are screenshots of the camera frustum visualization from two different angles.
Side View
Dome View
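For reference, here is a minimal sketch of how the per-image tag pose could be estimated with OpenCV. The dictionary choice, function names, and the assumption that intrinsics `K` and distortion coefficients `dist` come from a prior calibration step are illustrative rather than my exact code (OpenCV ≥ 4.7 ArUco API assumed):

```python
import cv2
import numpy as np

TAG_SIZE = 0.06  # ArUco tag side length in meters, as used in this project

def estimate_tag_pose(image, K, dist, tag_size=TAG_SIZE):
    """Detect a single ArUco marker and return its pose (R, t) in the camera frame."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # OpenCV >= 4.7 detector API; the dictionary here is an illustrative choice.
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_6X6_250)
    detector = cv2.aruco.ArucoDetector(dictionary)
    corners, ids, _ = detector.detectMarkers(gray)
    if ids is None:
        return None
    # 3D tag corners in the tag's own frame (z = 0 plane), matching ArUco corner order.
    half = tag_size / 2.0
    obj_pts = np.array([[-half,  half, 0], [ half,  half, 0],
                        [ half, -half, 0], [-half, -half, 0]], dtype=np.float32)
    img_pts = corners[0].reshape(4, 2).astype(np.float32)
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, dist)
    R, _ = cv2.Rodrigues(rvec)  # tag(world)-to-camera rotation
    return R, tvec
```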
For this part, I implemented a neural field model that renders a 2D image given pixel coordinates as input. This is achieved using a Multilayer Perceptron (MLP) with Sinusoidal Positional Encoding (PE) to capture high-frequency details. The model first runs the pixel coordinates through PE to expand their dimensionality, and then processes the input through four hidden layers with ReLU activations. The final layer uses a Sigmoid activation to output RGB color values in the range [0, 1].
2D NeRF Model Structure
My model is a coordinate-based MLP that maps 2D pixel locations to RGB colors: (x, y) ∈ [0,1]^2 → RGB color
(r, g, b) ∈ [0,1]^3. For the MLP architecture, I used a 4-layer fully-connected network with ReLU activations
in the hidden layers of width 64 or 256. To enable the network to learn high-frequency variations in the image, I applied
Sinusoidal Positional Encoding (PE) to the input coordinates before feeding them into the MLP. The PE function maps each
coordinate to a higher-dimensional space using sine and cosine functions at multiple frequencies. For each network model,
I used a learning rate of 1e-2 with the Adam optimizer and trained using Mean Squared Error (MSE) loss on RGB values.
My batch size was 10000 randomly sampled pixels per iteration. I tracked reconstruction quality using Peak Signal-to-Noise
Ratio (PSNR) computed from the MSE. I also experimented with different numbers of positional encoding frequencies,
specifically \(L=3\) and \(L=10\), to observe their effect on reconstruction quality.
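Below is a minimal PyTorch sketch of this setup; the exact layer count follows the 4-layer description only loosely, and names like SinusoidalPE and NeuralField2D are illustrative rather than my exact code:

```python
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Maps each coordinate x to [x, sin(2^0 pi x), cos(2^0 pi x), ..., sin(2^{L-1} pi x), cos(2^{L-1} pi x)]."""
    def __init__(self, num_freqs):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs) * torch.pi)

    def forward(self, x):                                   # x: (..., D)
        xb = x[..., None] * self.freqs                      # (..., D, L)
        enc = torch.cat([torch.sin(xb), torch.cos(xb)], dim=-1)
        return torch.cat([x, enc.flatten(-2)], dim=-1)      # (..., D * (2L + 1))

class NeuralField2D(nn.Module):
    """Coordinate-based MLP: (x, y) in [0, 1]^2 -> (r, g, b) in [0, 1]^3."""
    def __init__(self, num_freqs=10, width=256):
        super().__init__()
        self.pe = SinusoidalPE(num_freqs)
        in_dim = 2 * (2 * num_freqs + 1)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 3), nn.Sigmoid(),              # RGB in [0, 1]
        )

    def forward(self, xy):                                  # xy: (B, 2)
        return self.mlp(self.pe(xy))
```

Each training iteration samples a batch of pixel coordinates and colors, minimizes the MSE between prediction and ground truth, and reports PSNR as \(10 \log_{10}(1/\mathrm{MSE})\) for colors in [0, 1].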
Below are the training progressions for fitting the neural field to the provided coyote image, along with PSNR curves.
Coyote L3 W64 Iter 1
Coyote L3 W64 Iter 50
Coyote L3 W64 Iter 100
Coyote L3 W64 Iter 500
Coyote L3 W64 Iter 1500
Coyote L3 W64 PSNR Curve
Coyote L10 W256 Iter 1
Coyote L10 W256 Iter 50
Coyote L10 W256 Iter 100
Coyote L10 W256 Iter 500
Coyote L10 W256 Iter 1500
Coyote L10 W256 PSNR Curve
Lighthouse L10 W256 Iter 1
Lighthouse L10 W256 Iter 50
Lighthouse L10 W256 Iter 100
Lighthouse L10 W256 Iter 500
Lighthouse L10 W256 Iter 1500
Lighthouse L10 W256 PSNR Curve
Here are the final reconstruction results for the four combinations of positional encoding frequency and hidden layer width:
Coyote L3 W64
Coyote L3 W256
Coyote L10 W64
Coyote L10 W256
Lighthouse L3 W64
Lighthouse L3 W256
Lighthouse L10 W64
Lighthouse L10 W256
In this part, I implemented a Neural Radiance Field (NeRF) to reconstruct a 3D scene from multiple images taken from different viewpoints. The NeRF model predicts color and density at any point in 3D space by optimizing a continuous volumetric scene representation. To that end, I created helper functions to convert pixel coordinates to camera coordinates and then to rays in world space. The ray origin is the camera position, and the ray direction is computed by transforming pixel coordinates through the camera intrinsics and extrinsics. I sampled rays from multiple images and discretized each ray into points using uniform (stratified) sampling with small random perturbations, so the network does not overfit to a fixed set of sample depths.
The function transform_points takes a 3D point \(\mathbf{x}_c\) in the camera's local coordinate system
and transforms it into the global world coordinate system, using the provided camera-to-world (c2w)
matrix. To do this, we convert the 3D point to homogeneous coordinates by adding a '1' as the fourth component.
Then, we perform a standard matrix multiplication. The resulting 3D world coordinate \(\mathbf{x}_w\) is the first
three components of \(\mathbf{\tilde{x}}_w\).
\[ \mathbf{\tilde{x}}_c = \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} \quad \mathbf{C2W} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \]
\[ \mathbf{\tilde{x}}_w = \mathbf{C2W} \cdot \mathbf{\tilde{x}}_c \]
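A minimal PyTorch sketch of transform_points consistent with the equations above (batched over N points; the exact signature in my code is an assumption):

```python
import torch

def transform_points(c2w, x_c):
    """Transform points from camera coordinates to world coordinates.

    c2w: (4, 4) camera-to-world matrix, x_c: (N, 3) points in camera space.
    """
    ones = torch.ones_like(x_c[..., :1])
    x_h = torch.cat([x_c, ones], dim=-1)   # homogeneous coordinates, (N, 4)
    x_w = (c2w @ x_h.T).T                  # apply [R | t], (N, 4)
    return x_w[..., :3]                    # drop the homogeneous component
```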
The function pixel_to_camera inverts the pinhole camera projection model. We are given a 2D pixel coordinate \((u, v)\) and the camera's intrinsic matrix \(\mathbf{K}\). We want to find the corresponding 3D point \((x_c, y_c, z_c)\) in the camera's coordinate system. The forward projection is:
\[ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \]
The last row gives \(s = z_c\), so we can rearrange to solve for \(x_c\) and \(y_c\):
\[ x_c = \frac{u - c_x}{f_x} \cdot s \] \[ y_c = \frac{v - c_y}{f_y} \cdot s \] \[ z_c = s \]
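Correspondingly, pixel_to_camera can be sketched as follows (signature and batching are assumptions):

```python
import torch

def pixel_to_camera(K, uv, s):
    """Back-project pixels (u, v) at depth s into camera coordinates.

    K: (3, 3) intrinsics, uv: (N, 2) pixel coordinates, s: scalar or (N,) depths.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x_c = (uv[..., 0] - cx) / fx * s
    y_c = (uv[..., 1] - cy) / fy * s
    z_c = s * torch.ones_like(x_c)
    return torch.stack([x_c, y_c, z_c], dim=-1)   # (N, 3)
```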
For each pixel \((u, v)\), I want to recover a 3D ray in world space, written as an origin \(\mathbf{r}_o\) and a direction \(\mathbf{r}_d\). Starting from the world-to-camera extrinsics \([\mathbf{R} \mid \mathbf{t}]\), the camera center in world coordinates is
\[ \mathbf{r}_o = -\mathbf{R}^{-1}\mathbf{t}. \]
I then back-project the pixel through the intrinsics to get a point \(\mathbf{X}_w\) on the ray (e.g., at unit depth), and define the ray direction as the normalized vector from the camera center to that point:
\[ \mathbf{r}_d = \frac{\mathbf{X}_w - \mathbf{r}_o} {\left\lVert \mathbf{X}_w - \mathbf{r}_o \right\rVert_2 }. \]
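Combining the two helpers above, pixel_to_ray can be sketched as follows (using \(\mathbf{R}^{-1} = \mathbf{R}^\top\) for a rotation matrix; the exact signature is an assumption):

```python
import torch

def pixel_to_ray(K, w2c, uv):
    """Convert pixels to world-space rays (origin r_o, unit direction r_d).

    K: (3, 3) intrinsics, w2c: (4, 4) world-to-camera extrinsics, uv: (N, 2) pixels.
    """
    R, t = w2c[:3, :3], w2c[:3, 3]
    r_o = -R.T @ t                          # camera center in world coordinates
    x_c = pixel_to_camera(K, uv, s=1.0)     # point at unit depth in the camera frame
    c2w = torch.linalg.inv(w2c)
    x_w = transform_points(c2w, x_c)        # same point in the world frame
    dirs = x_w - r_o
    r_d = dirs / dirs.norm(dim=-1, keepdim=True)
    return r_o.expand_as(r_d), r_d
```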
I created a RaysData class to handle loading the multi-view images, camera intrinsics,
and extrinsics. This class computes the rays for all pixels in all training images and stores
them for efficient sampling during training. The rays are represented by their origins and directions,
along with the corresponding pixel colors. I visualized a subset of rays and sampled points along them
using Viser, as shown below.
Lego Rays
Ray sampled from one camera
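A rough sketch of what such a dataset class and the stratified point sampling could look like; the field names, flattened-ray layout, and near/far defaults here are assumptions:

```python
import torch

class RaysData:
    """Stores flattened rays and pixel colors for all training images (hypothetical layout)."""
    def __init__(self, rays_o, rays_d, pixels):
        # rays_o, rays_d: (M, 3) precomputed with pixel_to_ray; pixels: (M, 3) RGB in [0, 1]
        self.rays_o, self.rays_d, self.pixels = rays_o, rays_d, pixels

    def sample_rays(self, batch_size):
        idx = torch.randint(0, self.rays_o.shape[0], (batch_size,))
        return self.rays_o[idx], self.rays_d[idx], self.pixels[idx]

def sample_points_along_rays(rays_o, rays_d, near=2.0, far=6.0, n_samples=64, perturb=True):
    """Discretize each ray into n_samples points between near and far (stratified if perturb)."""
    t = torch.linspace(near, far, n_samples)                          # (N,)
    if perturb:
        t = t + torch.rand(rays_o.shape[0], n_samples) * (far - near) / n_samples
    pts = rays_o[:, None, :] + rays_d[:, None, :] * t[..., None]      # (B, N, 3)
    return pts, t
```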
For Part 2.4, I implemented a NeRF MLP that takes 3D world coordinates and view directions and outputs density and RGB color. I used the network architecture depicted below:
NeRF MLP Structure
As shown above, I first applied sinusoidal positional encoding to the 3D points and ray directions. Then, I fed the encoded positions into a deep MLP with a skip connection that concatenates the input back in the middle layers. The network has two output heads: one with ReLU activation to predict density \(\sigma\), and another with Sigmoid activation to predict RGB color in the range [0, 1].
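A sketch of this architecture in PyTorch, reusing the SinusoidalPE module from the 2D sketch above; the specific layer counts, widths, and encoding frequencies shown here are assumptions and my exact network may differ:

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """PE'd position and view direction -> (density sigma, RGB color)."""
    def __init__(self, pos_freqs=10, dir_freqs=4, width=256):
        super().__init__()
        self.pe_pos = SinusoidalPE(pos_freqs)
        self.pe_dir = SinusoidalPE(dir_freqs)
        pos_dim = 3 * (2 * pos_freqs + 1)
        dir_dim = 3 * (2 * dir_freqs + 1)
        self.stage1 = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # Skip connection: the encoded position is concatenated back in mid-network.
        self.stage2 = nn.Sequential(
            nn.Linear(width + pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.density_head = nn.Sequential(nn.Linear(width, 1), nn.ReLU())      # sigma >= 0
        self.rgb_head = nn.Sequential(
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),                            # RGB in [0, 1]
        )

    def forward(self, x, d):                       # x: (..., 3) points, d: (..., 3) view dirs
        x_enc, d_enc = self.pe_pos(x), self.pe_dir(d)
        h = self.stage1(x_enc)
        h = self.stage2(torch.cat([h, x_enc], dim=-1))
        sigma = self.density_head(h)
        rgb = self.rgb_head(torch.cat([h, d_enc], dim=-1))
        return sigma, rgb
```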
In Part 2.5, I implemented the NeRF volume rendering equation in PyTorch. Given per-ray sample densities sigmas of shape (B, N, 1) and colors rgbs of shape (B, N, 3), I first compute the per-sample opacity \(\alpha_i = 1 - \exp(-\sigma_i \, \delta_i)\), where \(\delta_i\) is the step size along the ray. The transmittance \(T_i = \prod_{j < i} (1 - \alpha_j)\) is obtained with a cumulative product, the per-sample weights are \(w_i = T_i \, \alpha_i\), and the rendered pixel color is the weighted sum \(\hat{C} = \sum_i w_i \, \mathbf{c}_i\) over the samples along the ray.
To train the NeRF model, I used the Adam optimizer with a learning rate of \(5 \times 10^{-4}\). I sampled 4096 rays per iteration from the training images and discretized each ray into 64 points using stratified sampling between near and far bounds of 2.0 and 6.0. The loss function was the mean squared error (MSE) between the rendered pixel colors and the ground truth pixel colors. Below are visualizations of the training progression at various iterations, along with the PSNR curve on the validation image.
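Here is a minimal sketch of that volume rendering step, assuming sigmas of shape (B, N, 1), rgbs of shape (B, N, 3), and a scalar (or per-sample) step size:

```python
import torch

def volrend(sigmas, rgbs, step_size):
    """Volume rendering: sigmas (B, N, 1), rgbs (B, N, 3) -> rendered colors (B, 3)."""
    alphas = 1.0 - torch.exp(-sigmas * step_size)                    # per-sample opacity, (B, N, 1)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j); shift so the first sample sees T = 1.
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = alphas * trans                                         # (B, N, 1)
    return (weights * rgbs).sum(dim=1)                               # (B, 3)
```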
Lego Iteration 100
Lego Iteration 500
Lego Iteration 1500
Lego Iteration 3000
Lego PSNR
Lego 360 Render
I also trained a NeRF model on my own object, a soda can, using the multi-view images I captured. I used a similar training setup as before, but sampled 10000 rays per iteration and trained for 10000 iterations. Training again minimized the MSE loss between the rendered pixel colors and the ground truth pixel colors from the images. Below are visualizations of the training progression at various iterations, along with the PSNR curve on the validation set. For this object, I had to adjust the near and far bounds to 0.05 and 1.0 to better capture the scene depth.
Coke Iteration 100
Coke Iteration 500
Coke Iteration 1500
Coke Iteration 5000
Coke Iteration 10000
Coke 360 Render
Coke PSNR
NeRF Loss