The first part of project 5 is about using pretrained diffusion models to generate images from text prompts.
I set up the environment as provided in the project instructions and used a random seed of 3036155160
for reproducibility. Here are some sample images generated using the DeepFloyd IF diffusion model with different
numbers of inference steps:
A graphic of a yellow whale (20 steps)
A graphic of a blue starfish (20 steps)
A cheap backpack (20 steps)
A graphic of a yellow whale (120 steps)
A graphic of a blue starfish (120 steps)
A cheap backpack (120 steps)
In this part of the problem set, I wrote my own "sampling loops" that use the pretrained DeepFloyd denoisers. These produce high quality images such as the ones generated above. I then modified these sampling loops to solve different tasks such as inpainting or producing optical illusions.
A key part of diffusion is the forward process, which takes a clean image and adds noise to it. In this part, I wrote a function to implement it. The forward process is defined by \(x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\), where \(\bar\alpha_t\) is the cumulative product of the scheduler's noise coefficients. A sketch of the implementation follows the figures below.
Campanile
Noisy Campanile at \(t=250\)
Noisy Campanile at \(t=500\)
Noisy Campanile at \(t=750\)
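Here is a minimal sketch of that forward function, assuming `alphas_cumprod` is the scheduler's table of cumulative products (in DeepFloyd, something like `stage_1.scheduler.alphas_cumprod`); the helper name is mine, not the project's:

```python
import torch

def forward_process(im, t, alphas_cumprod):
    """x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(im)             # fresh Gaussian noise
    return abar.sqrt() * im + (1 - abar).sqrt() * eps
```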
One simple, non-learned baseline is Gaussian blur filtering, which low-passes the noise away at the cost of blurring real detail too. Here are some results:
Gaussian Blur at \(t=250\)
Gaussian Blur at \(t=500\)
Gaussian Blur at \(t=750\)
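For reference, this baseline is a single torchvision call; the kernel size and sigma below are illustrative placeholders rather than the exact values I used:

```python
import torchvision.transforms.functional as TF

def blur_denoise(noisy_im, kernel_size=7, sigma=2.0):
    # Low-pass filtering suppresses high-frequency noise, and detail with it
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```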
Using the pretrained diffusion model, we can denoise in a single step: the UNet predicts the noise in \(x_t\), and inverting the forward process gives an estimate of the clean image. Here are the results:
One-Step Denoising at \(t=250\)
One-Step Denoising at \(t=500\)
One-Step Denoising at \(t=750\)
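A sketch of that one-step estimate, assuming the diffusers-style call signature for the stage-1 UNet (and that, as in DeepFloyd, the first three output channels are the noise prediction):

```python
import torch

@torch.no_grad()
def one_step_denoise(x_t, t, unet, alphas_cumprod, prompt_embeds):
    abar = alphas_cumprod[t]
    # DeepFloyd's stage-1 UNet outputs extra variance channels; keep the noise estimate
    eps_hat = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    # Invert x_t = sqrt(abar) x_0 + sqrt(1 - abar) eps to recover x_0
    return (x_t - (1 - abar).sqrt() * eps_hat) / abar.sqrt()
```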
One-step denoising degrades as the noise level grows; by iteratively denoising over a descending sequence of strided timesteps, we achieve much better results. Here are intermediate estimates along the way, followed by a comparison against the one-step and Gaussian-blur baselines (a sketch of the loop comes after the comparison):
Iterative Denoising at \(t=660\)
Iterative Denoising at \(t=510\)
Iterative Denoising at \(t=360\)
Iterative Denoising at \(t=210\)
Iterative Denoising at \(t=60\)
Campanile
Iteratively Denoised Campanile
One-Step Denoised Campanile
Gaussian Blur at \(t=750\)
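Here is a sketch of that loop, using the standard DDPM posterior-mean update; the variance term is simplified to \(\sqrt{\beta_t}\) noise for brevity, and all names follow the earlier sketches:

```python
import torch

@torch.no_grad()
def iterative_denoise(x, timesteps, unet, alphas_cumprod, prompt_embeds, i_start=0):
    # timesteps is a descending, strided list, e.g. [990, 960, ..., 30, 0]
    for i in range(i_start, len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha = abar_t / abar_next          # effective alpha for this stride
        beta = 1 - alpha
        eps = unet(x, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
        x0_hat = (x - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()   # clean estimate
        # Posterior mean blends the clean estimate with the current noisy image
        x = (abar_next.sqrt() * beta / (1 - abar_t)) * x0_hat \
            + (alpha.sqrt() * (1 - abar_next) / (1 - abar_t)) * x \
            + beta.sqrt() * torch.randn_like(x)                    # simplified variance
    return x
```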
By starting from pure noise and iteratively denoising, we can generate new images. Here are some samples:
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
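In terms of the earlier sketch, sampling from scratch is just the same loop started from pure noise at the DeepFloyd stage-1 resolution (the other arguments are as defined above):

```python
import torch

device = "cuda"
x = iterative_denoise(torch.randn(1, 3, 64, 64, device=device),
                      timesteps, unet, alphas_cumprod, prompt_embeds)
```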
Using classifier-free guidance (CFG), which pushes the conditional noise estimate further away from the unconditional one, we can markedly improve the quality of generated images. Here are some samples:
CFG Sample 1
CFG Sample 2
CFG Sample 3
CFG Sample 4
CFG Sample 5
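The guided estimate is a one-line change to the loop above. A sketch, with a guidance scale of 7 (the value I recall the project suggesting; treat it as an assumption):

```python
import torch

@torch.no_grad()
def cfg_noise_estimate(x, t, unet, cond_embeds, uncond_embeds, scale=7.0):
    eps_cond = unet(x, t, encoder_hidden_states=cond_embeds).sample[:, :3]
    eps_uncond = unet(x, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    # Extrapolate from the unconditional estimate toward the conditional one
    return eps_uncond + scale * (eps_cond - eps_uncond)
```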
By adding noise to an image and then denoising it with a text prompt (the SDEdit procedure), we can create interesting edits: the smaller \(i_{start}\) is, the noisier the starting point and the larger the edit, so the results gradually approach the original image as \(i_{start}\) grows. Here are some examples, with a sketch of the procedure after them:
Campanile with \(i_{start}=1\)
Campanile with \(i_{start}=3\)
Campanile with \(i_{start}=5\)
Campanile with \(i_{start}=7\)
Campanile with \(i_{start}=10\)
Campanile with \(i_{start}=20\)
Campanile
Backpack with \(i_{start}=1\)
Backpack with \(i_{start}=3\)
Backpack with \(i_{start}=5\)
Backpack with \(i_{start}=7\)
Backpack with \(i_{start}=10\)
Backpack with \(i_{start}=20\)
Backpack
Character with \(i_{start}=1\)
Character with \(i_{start}=3\)
Character with \(i_{start}=5\)
Character with \(i_{start}=7\)
Character with \(i_{start}=10\)
Character with \(i_{start}=20\)
Character
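A sketch of the edit procedure in terms of the earlier helpers (the function name is mine):

```python
def edit_image(im, i_start, timesteps, unet, alphas_cumprod, prompt_embeds):
    # Noise the input up to timesteps[i_start], then run the usual
    # iterative denoising from that point onward
    x = forward_process(im, timesteps[i_start], alphas_cumprod)
    return iterative_denoise(x, timesteps, unet, alphas_cumprod,
                             prompt_embeds, i_start=i_start)
```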
I also repeated the same procedure on images from the web and hand-drawn images.
Santa with \(i_{start}=1\)
Santa with \(i_{start}=3\)
Santa with \(i_{start}=5\)
Santa with \(i_{start}=7\)
Santa with \(i_{start}=10\)
Santa with \(i_{start}=20\)
Santa
Painting with \(i_{start}=1\)
Painting with \(i_{start}=3\)
Painting with \(i_{start}=5\)
Painting with \(i_{start}=7\)
Painting with \(i_{start}=10\)
Painting with \(i_{start}=20\)
Painting
Camera with \(i_{start}=1\)
Camera with \(i_{start}=3\)
Camera with \(i_{start}=5\)
Camera with \(i_{start}=7\)
Camera with \(i_{start}=10\)
Camera with \(i_{start}=20\)
Camera
By using a binary mask to specify the region to edit, and forcing everything outside the mask to agree with an appropriately noised copy of the original after every denoising step, we can inpaint images. Here are some examples, followed by a sketch:
Campanile
Campanile Mask
Campanile Inpainted
Coffee
Coffee Mask
Coffee Inpainted
Emoji
Emoji Mask
Emoji Inpainted
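The sketch below leans on the earlier helpers; `denoise_step` is hypothetical shorthand for a single iteration of the `iterative_denoise` loop:

```python
import torch

@torch.no_grad()
def inpaint(im, mask, timesteps, unet, alphas_cumprod, prompt_embeds):
    # mask == 1 marks the region to regenerate
    x = torch.randn_like(im)
    for i in range(len(timesteps) - 1):
        # denoise_step: one iteration of the iterative_denoise loop (hypothetical helper)
        x = denoise_step(x, timesteps[i], timesteps[i + 1],
                         unet, alphas_cumprod, prompt_embeds)
        # Outside the mask, pin x to the original noised to the same level
        x = mask * x + (1 - mask) * forward_process(im, timesteps[i + 1], alphas_cumprod)
    return x
```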
By guiding the denoising process with text prompts, we can create edits that align with the desired description. In this part, I used the embedding of the text prompt "a childish drawing" to guide the edits. Here are some examples:
Childish Campanile with \(i_{start}=1\)
Childish Campanile with \(i_{start}=3\)
Childish Campanile with \(i_{start}=5\)
Childish Campanile with \(i_{start}=7\)
Childish Campanile with \(i_{start}=10\)
Childish Campanile with \(i_{start}=20\)
Campanile
Childish Character with \(i_{start}=1\)
Childish Character with \(i_{start}=3\)
Childish Character with \(i_{start}=5\)
Childish Character with \(i_{start}=7\)
Childish Character with \(i_{start}=10\)
Childish Character with \(i_{start}=20\)
Character
Childish Emoji with \(i_{start}=1\)
Childish Emoji with \(i_{start}=3\)
Childish Emoji with \(i_{start}=5\)
Childish Emoji with \(i_{start}=7\)
Childish Emoji with \(i_{start}=10\)
Childish Emoji with \(i_{start}=20\)
Emoji
In this part, I created visual anagrams: images that appear as one thing when viewed normally and another when flipped upside down. This works by averaging two noise estimates at each step, one per prompt, with the image flipped for the second (sketched after the examples). Here are some examples:
An oil painting of an old man
An oil painting of people around a campfire
A bowl of noodles
A stadium
An oil painting of a snowy mountain village
A photo of a hipster barista
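A sketch of the combined noise estimate, flipping along the height axis (one common reading of "upside down"):

```python
import torch

@torch.no_grad()
def anagram_noise_estimate(x, t, unet, embeds_a, embeds_b):
    # Estimate noise for prompt A on the image as-is
    eps_a = unet(x, t, encoder_hidden_states=embeds_a).sample[:, :3]
    # Estimate noise for prompt B on the flipped image, then un-flip it
    eps_b = unet(torch.flip(x, dims=[-2]), t,
                 encoder_hidden_states=embeds_b).sample[:, :3]
    return (eps_a + torch.flip(eps_b, dims=[-2])) / 2
```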
In this part, I created hybrid images that change appearance with viewing distance. The technique is similar to visual anagrams, but instead of averaging, each step combines the low-frequency component of one prompt's noise estimate with the high-frequency component of the other's. Here are some examples:
Skull and waterfall
Stadium and noodles
Campfire and Amalfi coast
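A sketch of the hybrid noise estimate; the Gaussian low-pass parameters (kernel 33, \(\sigma=2\)) are the ones I recall the project suggesting, so treat them as assumptions:

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def hybrid_noise_estimate(x, t, unet, embeds_low, embeds_high,
                          kernel_size=33, sigma=2.0):
    eps_low = unet(x, t, encoder_hidden_states=embeds_low).sample[:, :3]
    eps_high = unet(x, t, encoder_hidden_states=embeds_high).sample[:, :3]
    # Low frequencies dominate from far away, high frequencies up close
    lowpass = TF.gaussian_blur(eps_low, kernel_size, sigma)
    highpass = eps_high - TF.gaussian_blur(eps_high, kernel_size, sigma)
    return lowpass + highpass
```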
I implemented the UNet architecture as described in the project instructions using PyTorch. The UNet consists of downsampling and upsampling blocks with skip connections. I defined the necessary operations such as Conv, DownConv, UpConv, Flatten, Unflatten, and Concat, and composed them to create a deeper network.
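As an illustration, here is a sketch of two of those building blocks under my reading of the spec (3x3 convs, BatchNorm, GELU; details may differ from the handout's exact diagram):

```python
import torch.nn as nn

class Conv(nn.Module):
    # 3x3 conv that preserves spatial size, then BatchNorm and GELU
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                 nn.BatchNorm2d(out_ch), nn.GELU())
    def forward(self, x):
        return self.net(x)

class DownConv(nn.Module):
    # Strided 3x3 conv that halves the spatial resolution
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                                 nn.BatchNorm2d(out_ch), nn.GELU())
    def forward(self, x):
        return self.net(x)
```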
I trained the UNet to denoise images from the MNIST dataset. For each training batch, I generated noisy images \(z = x + \sigma\epsilon\), \(\epsilon \sim \mathcal{N}(0, I)\), at varying noise levels \(\sigma\). The model was optimized with an L2 loss between the denoised output and the clean images. I first visualized the noising process on sample MNIST digits:
Number 5 Noise
Number 0 Noise
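The visualization amounts to a couple of lines; the \(\sigma\) values below are the ones I recall from the handout, so treat the exact list as an assumption:

```python
import torch

x = torch.rand(1, 28, 28)                       # stands in for an MNIST digit
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]    # noise levels to visualize
noisy = [x + s * torch.randn_like(x) for s in sigmas]   # z = x + sigma * eps
```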
Then, I trained the UNet for 5 epochs and visualized the denoised results on the test set after the 1st and 5th epochs. I used the model as specified in the project instructions, with a hidden dimension of 128 and the Adam optimizer with a learning rate of 1e-4. Here are the training loss curve and sample results:
Denoise training sample at \(\sigma=0.5\)
Denoise Loss
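A training-loop sketch under those settings; `UnconditionalUNet` is a placeholder name for the project's architecture, and the batch size is illustrative:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
loader = DataLoader(datasets.MNIST("data", train=True, download=True,
                                   transform=transforms.ToTensor()),
                    batch_size=256, shuffle=True)
model = UnconditionalUNet(in_channels=1, num_hiddens=128).to(device)  # placeholder class
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sigma = 0.5                                     # training noise level

for epoch in range(5):
    for x, _ in loader:
        x = x.to(device)
        z = x + sigma * torch.randn_like(x)     # noisy input
        loss = F.mse_loss(model(z), x)          # L2 loss against the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```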
Here are the full denoising results for each digit:
I also tested the denoiser on noise levels it was not trained on, to see how it generalizes out of distribution. Additionally, I trained the model to denoise pure Gaussian noise and visualized the generated outputs after 1 and 5 epochs. Here are the results:
Pure Noise Generation after Epoch 1
Pure Noise Generation after Epoch 5
Pure Noise Generation Training Loss
The generated outputs from pure noise exhibited patterns resembling the average of the training images, which is expected given the MSE loss: the model learns to predict the point that minimizes the sum of squared distances to all plausible clean images, so its outputs drift toward the mean of the training distribution. At higher epochs, the outputs became more refined, with smoother edges.
For this part, I modified the UNet architecture to include time conditioning using FCBlocks. The scalar time variable \(t\) was normalized and embedded using two FCBlocks, which were integrated into the Unflatten and UpConv layers of the UNet as described in the project instructions.
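A sketch of the FCBlock and of how the embedded \(t\) might be injected; the Linear-GELU-Linear layout and the injection arithmetic follow my reading of the spec and should be checked against the handout:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    # Small MLP that embeds a conditioning scalar or vector
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_ch, out_ch),
                                 nn.GELU(),
                                 nn.Linear(out_ch, out_ch))
    def forward(self, x):
        return self.net(x)

# Inside the UNet's forward pass (sketch): t is normalized to [0, 1], embedded,
# and added to intermediate activations, broadcasting over the spatial dims:
#   t1 = self.t_embed1(t)                          # (B, D)
#   unflat = unflat + t1[..., None, None]          # injected after Unflatten
#   up1 = up1 + self.t_embed2(t)[..., None, None]  # injected after the first UpConv
```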
I trained the time-conditioned UNet on the MNIST dataset to predict the flow from noisy images to clean images at various timesteps. The model was optimized using the Adam optimizer with an initial learning rate of 1e-2 and an exponential learning rate decay scheduler. Here is the training loss curve:
Time-conditioned UNet Training Loss
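A sketch of one flow-matching training step, assuming the linear path \(x_t = (1-t)\,x_0 + t\,x_1\) from noise \(x_0\) to clean image \(x_1\), so the regression target is the constant velocity \(x_1 - x_0\):

```python
import torch
import torch.nn.functional as F

def fm_train_step(model, x1, opt):
    x0 = torch.randn_like(x1)                       # pure-noise endpoint
    t = torch.rand(x1.shape[0], 1, device=x1.device)  # t ~ U[0, 1]
    tt = t[..., None, None]                         # broadcast over C, H, W
    xt = (1 - tt) * x0 + tt * x1                    # point on the linear path
    loss = F.mse_loss(model(xt, t), x1 - x0)        # regress the predicted flow
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```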
I then sampled from the time-conditioned UNet after 1, 5, and 10 epochs of training. Legible digits emerged as the number of epochs increased, demonstrating the effectiveness of the flow matching approach. Here are the sampling results:
Time-conditioned UNet Sampling after Epoch 1
Time-conditioned UNet Sampling after Epoch 5
Time-conditioned UNet Sampling after Epoch 10
I further modified the UNet to include class conditioning using one-hot vectors for the digit classes (0-9). I implemented dropout to randomly drop the class conditioning vector during training, allowing the model to learn both conditional and unconditional generation.
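A sketch of the class-conditioning input, with the dropout probability of 0.1 that I recall the project specifying (treat the exact value as an assumption):

```python
import torch
import torch.nn.functional as F

def embed_class(labels, p_uncond=0.1):
    # One-hot encode the digit labels (0-9)
    c = F.one_hot(labels, num_classes=10).float()
    # Zero the whole vector with probability p_uncond so the model also
    # learns unconditional generation (needed later for CFG)
    keep = (torch.rand(c.shape[0], 1, device=c.device) > p_uncond).float()
    return c * keep
```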
I trained the class-conditioned UNet using the same procedure as the time-conditioned UNet, with the addition of class conditioning. The training loss curve is shown below:
Class-Conditioned UNet Training Loss
Finally, I sampled from the class-conditioned UNet after 1, 5, and 10 epochs of training, using classifier-free guidance. Class conditioning led to faster convergence and more accurate digit generation. Here are the sampling results, with a sampling sketch after them:
Class-Conditioned UNet Sampling after Epoch 1
Class-Conditioned UNet Sampling after Epoch 5
Class-Conditioned UNet Sampling after Epoch 10
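A sketch of CFG sampling via Euler integration of the learned flow; `model(x, t, c)` stands for the class-conditioned UNet, a zero class vector gives the unconditional flow, the guidance scale of 5 is the value I recall from the project, and the step count is illustrative:

```python
import torch

@torch.no_grad()
def sample_cfg(model, c, num_steps=300, scale=5.0, shape=(1, 28, 28)):
    x = torch.randn(c.shape[0], *shape, device=c.device)   # start from noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((c.shape[0], 1), i * dt, device=c.device)
        u_cond = model(x, t, c)                      # conditional flow
        u_uncond = model(x, t, torch.zeros_like(c))  # unconditional flow
        u = u_uncond + scale * (u_cond - u_uncond)   # guided flow
        x = x + u * dt                               # Euler step toward t = 1
    return x
```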
As a final experiment, I trained the class-conditioned UNet without the learning rate scheduler: I removed the exponential decay and used a constant learning rate throughout all epochs. To keep training stable without decay, I lowered the learning rate from \(1\times 10^{-2}\) to \(1\times 10^{-3}\), since a larger fixed step size made the optimization noisy and prevented the loss from settling. With the constant learning rate the model still converged, but the improvements were more gradual than in the scheduled run. Below are the sampling results after 10 epochs without the scheduler:
Class-Conditioned UNet Sampling without Scheduler after Epoch 10