To begin the project, I calibrated my camera and captured images of an object from multiple viewpoints to create a 3D scan. As specified in the project instructions, I used a printed ArUco tag with a side length of 0.06 m for calibration, and I made sure the tag was clearly visible in every image to allow accurate pose estimation. I then photographed my chosen object from various angles while keeping the ArUco tag in the frame. For both the calibration and object images, I kept the lighting conditions consistent to minimize shadows and reflections that could degrade calibration and pose accuracy. After running calibration and pose estimation, I visualized the camera frustums in Viser to verify the accuracy of the estimated poses. Below are screenshots of the camera frustum visualization from two different angles.
Side View
Dome View
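For reference, here is a minimal sketch of how the per-image tag pose could be estimated with OpenCV. The dictionary choice, function names, and the assumption that intrinsics `K` and distortion coefficients `dist` come from a prior calibration step are illustrative rather than my exact code (OpenCV ≥ 4.7 ArUco API assumed):

```python
import cv2
import numpy as np

TAG_SIZE = 0.06  # ArUco tag side length in meters, as used in this project

def estimate_tag_pose(image, K, dist, tag_size=TAG_SIZE):
    """Detect a single ArUco marker and return its pose (R, t) in the camera frame."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # OpenCV >= 4.7 detector API; the dictionary here is an illustrative choice.
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_6X6_250)
    detector = cv2.aruco.ArucoDetector(dictionary)
    corners, ids, _ = detector.detectMarkers(gray)
    if ids is None:
        return None
    # 3D tag corners in the tag's own frame (z = 0 plane), matching ArUco corner order.
    half = tag_size / 2.0
    obj_pts = np.array([[-half,  half, 0], [ half,  half, 0],
                        [ half, -half, 0], [-half, -half, 0]], dtype=np.float32)
    img_pts = corners[0].reshape(4, 2).astype(np.float32)
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, dist)
    R, _ = cv2.Rodrigues(rvec)  # tag(world)-to-camera rotation
    return R, tvec
```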
For this part, I implemented a neural field model that renders a 2D image given pixel coordinates as input. This is achieved using a Multilayer Perceptron (MLP) with Sinusoidal Positional Encoding (PE) to capture high-frequency details. The model first runs the pixel coordinates through PE to expand their dimensionality, and then processes the input through four hidden layers with ReLU activations. The final layer uses a Sigmoid activation to output RGB color values in the range [0, 1].
2D NeRF Model Structure
My model is a coordinate-based MLP that maps 2D pixel locations to RGB colors: (x, y) ∈ [0,1]^2 → RGB color
(r, g, b) ∈ [0,1]^3. For the MLP architecture, I used a 4-layer fully-connected network with ReLU activations
in the hidden layers of width 64 or 256. To enable the network to learn high-frequency variations in the image, I applied
Sinusoidal Positional Encoding (PE) to the input coordinates before feeding them into the MLP. The PE function maps each
coordinate to a higher-dimensional space using sine and cosine functions at multiple frequencies. For each network model,
I used a learning rate of 1e-2 with the Adam optimizer and trained using Mean Squared Error (MSE) loss on RGB values.
My batch size was 10000 randomly sampled pixels per iteration. I tracked reconstruction quality using Peak Signal-to-Noise
Ratio (PSNR) computed from the MSE. I also experimented with different numbers of positional encoding frequencies,
specifically \(L=3\) and \(L=10\), to observe their effect on reconstruction quality.
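Below is a minimal PyTorch sketch of this setup; the exact layer count follows the 4-layer description only loosely, and names like SinusoidalPE and NeuralField2D are illustrative rather than my exact code:

```python
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Maps each coordinate x to [x, sin(2^0 pi x), cos(2^0 pi x), ..., sin(2^{L-1} pi x), cos(2^{L-1} pi x)]."""
    def __init__(self, num_freqs):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs) * torch.pi)

    def forward(self, x):                                   # x: (..., D)
        xb = x[..., None] * self.freqs                      # (..., D, L)
        enc = torch.cat([torch.sin(xb), torch.cos(xb)], dim=-1)
        return torch.cat([x, enc.flatten(-2)], dim=-1)      # (..., D * (2L + 1))

class NeuralField2D(nn.Module):
    """Coordinate-based MLP: (x, y) in [0, 1]^2 -> (r, g, b) in [0, 1]^3."""
    def __init__(self, num_freqs=10, width=256):
        super().__init__()
        self.pe = SinusoidalPE(num_freqs)
        in_dim = 2 * (2 * num_freqs + 1)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 3), nn.Sigmoid(),              # RGB in [0, 1]
        )

    def forward(self, xy):                                  # xy: (B, 2)
        return self.mlp(self.pe(xy))
```

Each training iteration samples a batch of pixel coordinates and colors, minimizes the MSE between prediction and ground truth, and reports PSNR as \(10 \log_{10}(1/\mathrm{MSE})\) for colors in [0, 1].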
Below are the training progressions for fitting the neural field to the provided coyote image, along with PSNR curves.
Coyote L3 W64 Iter 1
Coyote L3 W64 Iter 50
Coyote L3 W64 Iter 100
Coyote L3 W64 Iter 500
Coyote L3 W64 Iter 1500
Coyote L3 W64 PSNR Curve
Coyote L10 W256 Iter 1
Coyote L10 W256 Iter 50
Coyote L10 W256 Iter 100
Coyote L10 W256 Iter 500
Coyote L10 W256 Iter 1500
Coyote L10 W256 PSNR Curve
Lighthouse L10 W256 Iter 1
Lighthouse L10 W256 Iter 50
Lighthouse L10 W256 Iter 100
Lighthouse L10 W256 Iter 500
Lighthouse L10 W256 Iter 1500
Lighthouse L10 W256 PSNR Curve
Here are the final reconstruction results for the four combinations of positional encoding frequency and hidden layer width:
Coyote L3 W64
Coyote L3 W256
Coyote L10 W64
Coyote L10 W256
Lighthouse L3 W64
Lighthouse L3 W256
Lighthouse L10 W64
Lighthouse L10 W256
In this part, I implemented a Neural Radiance Field (NeRF) to reconstruct a 3D scene from multiple images taken from different viewpoints. The NeRF model predicts color and density at any point in 3D space by optimizing a continuous volumetric scene representation. To that end, I created helper functions to convert pixel coordinates to camera coordinates and then to rays in world space. The ray origin is the camera position, and the ray direction is computed by transforming pixel coordinates through the camera intrinsics and extrinsics. I sampled rays from multiple images and discretized each ray into points using uniform (stratified) sampling with small random perturbations, so the network does not overfit to a fixed set of sample depths.
The function transform_points takes a 3D point \(\mathbf{x}_c\) in the camera's local coordinate system
and transforms it into the global world coordinate system, using the provided camera-to-world (c2w)
matrix. To do this, we convert the 3D point to homogeneous coordinates by adding a '1' as the fourth component.
Then, we perform a standard matrix multiplication. The resulting 3D world coordinate \(\mathbf{x}_w\) is the first
three components of \(\mathbf{\tilde{x}}_w\).
\[ \mathbf{\tilde{x}}_c = \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} \quad \mathbf{C2W} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \]
\[ \mathbf{\tilde{x}}_w = \mathbf{C2W} \cdot \mathbf{\tilde{x}}_c \]
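A minimal PyTorch sketch of transform_points consistent with the equations above (batched over N points; the exact signature in my code is an assumption):

```python
import torch

def transform_points(c2w, x_c):
    """Transform points from camera coordinates to world coordinates.

    c2w: (4, 4) camera-to-world matrix, x_c: (N, 3) points in camera space.
    """
    ones = torch.ones_like(x_c[..., :1])
    x_h = torch.cat([x_c, ones], dim=-1)   # homogeneous coordinates, (N, 4)
    x_w = (c2w @ x_h.T).T                  # apply [R | t], (N, 4)
    return x_w[..., :3]                    # drop the homogeneous component
```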
The function pixel_to_camera inverts the pinhole camera projection model. We are given a 2D pixel coordinate \((u, v)\) and the camera's intrinsic matrix \(\mathbf{K}\). We want to find the corresponding 3D point \((x_c, y_c, z_c)\) in the camera's coordinate system. The forward projection is:
\[ s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} \]
The last row gives \(s = z_c\), so we can rearrange to solve for \(x_c\) and \(y_c\):
\[ x_c = \frac{u - c_x}{f_x} \cdot s \] \[ y_c = \frac{v - c_y}{f_y} \cdot s \] \[ z_c = s \]
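Correspondingly, pixel_to_camera can be sketched as follows (signature and batching are assumptions):

```python
import torch

def pixel_to_camera(K, uv, s):
    """Back-project pixels (u, v) at depth s into camera coordinates.

    K: (3, 3) intrinsics, uv: (N, 2) pixel coordinates, s: scalar or (N,) depths.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x_c = (uv[..., 0] - cx) / fx * s
    y_c = (uv[..., 1] - cy) / fy * s
    z_c = s * torch.ones_like(x_c)
    return torch.stack([x_c, y_c, z_c], dim=-1)   # (N, 3)
```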
For each pixel \((u, v)\), I want to recover a 3D ray in world space, written as an origin \(\mathbf{r}_o\) and a direction \(\mathbf{r}_d\). Starting from the world-to-camera extrinsics \([\mathbf{R} \mid \mathbf{t}]\), the camera center in world coordinates is
\[ \mathbf{r}_o = -\mathbf{R}^{-1}\mathbf{t}. \]
I then back-project the pixel through the intrinsics to get a point \(\mathbf{X}_w\) on the ray (e.g., at unit depth), and define the ray direction as the normalized vector from the camera center to that point:
\[ \mathbf{r}_d = \frac{\mathbf{X}_w - \mathbf{r}_o} {\left\lVert \mathbf{X}_w - \mathbf{r}_o \right\rVert_2 }. \]
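Combining the two helpers above, pixel_to_ray can be sketched as follows (using \(\mathbf{R}^{-1} = \mathbf{R}^\top\) for a rotation matrix; the exact signature is an assumption):

```python
import torch

def pixel_to_ray(K, w2c, uv):
    """Convert pixels to world-space rays (origin r_o, unit direction r_d).

    K: (3, 3) intrinsics, w2c: (4, 4) world-to-camera extrinsics, uv: (N, 2) pixels.
    """
    R, t = w2c[:3, :3], w2c[:3, 3]
    r_o = -R.T @ t                          # camera center in world coordinates
    x_c = pixel_to_camera(K, uv, s=1.0)     # point at unit depth in the camera frame
    c2w = torch.linalg.inv(w2c)
    x_w = transform_points(c2w, x_c)        # same point in the world frame
    dirs = x_w - r_o
    r_d = dirs / dirs.norm(dim=-1, keepdim=True)
    return r_o.expand_as(r_d), r_d
```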
I created a RaysData class to handle loading the multi-view images, camera intrinsics,
and extrinsics. This class computes the rays for all pixels in all training images and stores
them for efficient sampling during training. The rays are represented by their origins and directions,
along with the corresponding pixel colors. I visualized a subset of rays and sampled points along them
using Viser, as shown below.
Lego Rays
Ray sampled from one camera
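A rough sketch of what such a dataset class and the stratified point sampling could look like; the field names, flattened-ray layout, and near/far defaults here are assumptions:

```python
import torch

class RaysData:
    """Stores flattened rays and pixel colors for all training images (hypothetical layout)."""
    def __init__(self, rays_o, rays_d, pixels):
        # rays_o, rays_d: (M, 3) precomputed with pixel_to_ray; pixels: (M, 3) RGB in [0, 1]
        self.rays_o, self.rays_d, self.pixels = rays_o, rays_d, pixels

    def sample_rays(self, batch_size):
        idx = torch.randint(0, self.rays_o.shape[0], (batch_size,))
        return self.rays_o[idx], self.rays_d[idx], self.pixels[idx]

def sample_points_along_rays(rays_o, rays_d, near=2.0, far=6.0, n_samples=64, perturb=True):
    """Discretize each ray into n_samples points between near and far (stratified if perturb)."""
    t = torch.linspace(near, far, n_samples)                          # (N,)
    if perturb:
        t = t + torch.rand(rays_o.shape[0], n_samples) * (far - near) / n_samples
    pts = rays_o[:, None, :] + rays_d[:, None, :] * t[..., None]      # (B, N, 3)
    return pts, t
```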
For Part 2.4, I implemented a NeRF MLP that takes 3D world coordinates and view directions and outputs density and RGB color. I used the network architecture depicted below:
NeRF MLP Structure
As shown above, I first applied sinusoidal positional encoding to the 3D points and ray directions. Then, I fed the encoded positions into a deep MLP with a skip connection that concatenates the input back in the middle layers. The network has two output heads: one with ReLU activation to predict density \(\sigma\), and another with Sigmoid activation to predict RGB color in the range [0, 1].
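A sketch of this architecture in PyTorch, reusing the SinusoidalPE module from the 2D sketch above; the specific layer counts, widths, and encoding frequencies shown here are assumptions and my exact network may differ:

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    """PE'd position and view direction -> (density sigma, RGB color)."""
    def __init__(self, pos_freqs=10, dir_freqs=4, width=256):
        super().__init__()
        self.pe_pos = SinusoidalPE(pos_freqs)
        self.pe_dir = SinusoidalPE(dir_freqs)
        pos_dim = 3 * (2 * pos_freqs + 1)
        dir_dim = 3 * (2 * dir_freqs + 1)
        self.stage1 = nn.Sequential(
            nn.Linear(pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # Skip connection: the encoded position is concatenated back in mid-network.
        self.stage2 = nn.Sequential(
            nn.Linear(width + pos_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.density_head = nn.Sequential(nn.Linear(width, 1), nn.ReLU())      # sigma >= 0
        self.rgb_head = nn.Sequential(
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),                            # RGB in [0, 1]
        )

    def forward(self, x, d):                       # x: (..., 3) points, d: (..., 3) view dirs
        x_enc, d_enc = self.pe_pos(x), self.pe_dir(d)
        h = self.stage1(x_enc)
        h = self.stage2(torch.cat([h, x_enc], dim=-1))
        sigma = self.density_head(h)
        rgb = self.rgb_head(torch.cat([h, d_enc], dim=-1))
        return sigma, rgb
```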
In Part 2.5, I implemented the NeRF volume rendering equation in PyTorch. Given per-ray sample densities sigmas of shape (B, N, 1) and colors rgbs of shape (B, N, 3), I first compute the per-sample opacity \(\alpha_i = 1 - \exp(-\sigma_i \, \delta_i)\), where \(\delta_i\) is the step size along the ray. The transmittance \(T_i = \prod_{j < i} (1 - \alpha_j)\) is obtained with a cumulative product, the per-sample weights are \(w_i = T_i \, \alpha_i\), and the rendered pixel color is the weighted sum \(\hat{C} = \sum_i w_i \, \mathbf{c}_i\) over the samples along the ray.
To train the NeRF model, I used the Adam optimizer with a learning rate of \(5 \times 10^{-4}\). I sampled 4096 rays per iteration from the training images and discretized each ray into 64 points using stratified sampling between near and far bounds of 2.0 and 6.0. The loss function was the mean squared error (MSE) between the rendered pixel colors and the ground truth pixel colors. Below are visualizations of the training progression at various iterations, along with the PSNR curve on the validation image.
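Here is a minimal sketch of that volume rendering step, assuming sigmas of shape (B, N, 1), rgbs of shape (B, N, 3), and a scalar (or per-sample) step size:

```python
import torch

def volrend(sigmas, rgbs, step_size):
    """Volume rendering: sigmas (B, N, 1), rgbs (B, N, 3) -> rendered colors (B, 3)."""
    alphas = 1.0 - torch.exp(-sigmas * step_size)                    # per-sample opacity, (B, N, 1)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j); shift so the first sample sees T = 1.
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = alphas * trans                                         # (B, N, 1)
    return (weights * rgbs).sum(dim=1)                               # (B, 3)
```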
Lego Iteration 100
Lego Iteration 500
Lego Iteration 1500
Lego Iteration 3000
Lego PSNR
Lego 360 Render
I also trained a NeRF model on my own object, a soda can, using the multi-view images I captured. I used a similar training setup as before, but sampled 10000 rays per iteration and trained for 10000 iterations. Training again minimized the MSE loss between the rendered pixel colors and the ground truth pixel colors from the images. Below are visualizations of the training progression at various iterations, along with the PSNR curve on the validation set. For this object, I had to adjust the near and far bounds to 0.05 and 1.0 to better capture the scene depth.
Coke Iteration 100
Coke Iteration 500
Coke Iteration 1500
Coke Iteration 5000
Coke Iteration 10000
Coke 360 Render
Coke PSNR
NeRF Loss