Fun With Diffusion Models!

Part A0. Setup

The first part of Project 5 is about using pretrained diffusion models to generate images from text prompts. I set up the environment as described in the project instructions and used a fixed random seed of 3036155160 for reproducibility. Here are some sample images generated with the DeepFloyd IF diffusion model using different numbers of inference steps:
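All of these samples come from the DeepFloyd IF pipeline loaded through Hugging Face `diffusers`. Below is a minimal sketch of how a single stage-1 sample is drawn, assuming the standard `DiffusionPipeline` interface; the model ID and arguments shown are the usual ones, not necessarily my exact call.

```python
import torch
from diffusers import DiffusionPipeline

# Load stage 1 of DeepFloyd IF (assumes the usual Hugging Face model ID).
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

# Fix the random seed for reproducibility and sample with 20 inference steps.
generator = torch.Generator("cuda").manual_seed(3036155160)
image = stage_1(
    "a graphic of yellow whale",
    num_inference_steps=20,
    generator=generator,
).images[0]
```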

Whale (20)

A graphic of yellow whale (20 steps)

Starfish (20)

A graphic of blue starfish (20 steps)

Backpack (20)

A cheap backpack (20 steps)

Sample images generated from text prompts using DeepFloyd IF diffusion model with 20 inference steps.
Whale (120)

A graphic of yellow whale (120 steps)

Starfish (120)

A graphic of blue starfish (120 steps)

Backpack (120)

A cheap backpack (120 steps)

Sample images generated from text prompts using DeepFloyd IF diffusion model with 120 inference steps.

Part A1. Sampling Loops

In this part of the problem set, I wrote my own "sampling loops" that use the pretrained DeepFloyd denoisers. These produce high quality images such as the ones generated above. I then modified these sampling loops to solve different tasks such as inpainting or producing optical illusions.

1.1. Implementing the Forward Process

A key part of diffusion is the forward process, which takes a clean image \(x_0\) and adds noise to produce \(x_t\):

\[
x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
\]

where \(\bar\alpha_t\) comes from the model's noise schedule and larger \(t\) means more noise. In this part, I wrote a function implementing this process and applied it to a test image of the Campanile at \(t = 250, 500, 750\):
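A minimal sketch of the noising function, assuming `alphas_cumprod` holds the \(\bar\alpha_t\) values exposed by the DeepFloyd scheduler:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image x_0 to timestep t: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)          # epsilon ~ N(0, I)
    return abar_t.sqrt() * im + (1 - abar_t).sqrt() * eps
```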

Campanile

Campanile

Campanile (250)

Noisy Campanile at \(t=250\)

Campanile (500)

Noisy Campanile at \(t=500\)

Campanile (750)

Noisy Campanile at \(t=750\)

1.2. Classical Denoising

One simple, classical way to denoise an image is Gaussian blur filtering, which smooths away the high-frequency noise (and high-frequency detail along with it). Here are the results at each noise level:
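The baseline is a single call to a Gaussian filter on each noisy image; the kernel size and \(\sigma\) below are illustrative choices, not necessarily the values used for the figures.

```python
import torchvision.transforms.functional as TF

# noisy_im is the output of forward(...) from 1.1; blur it to suppress the noise.
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```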

Campanile Gaussian (250)

Gaussian Blur at \(t=250\)

Campanile Gaussian (500)

Gaussian Blur at \(t=500\)

Campanile Gaussian (750)

Gaussian Blur at \(t=750\)

1.3. One-Step Denoising

Using the pretrained diffusion model, we can instead denoise in a single step: the UNet predicts the noise in \(x_t\), and rearranging the forward-process equation gives an estimate of the clean image. Here are the results:
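Since \(x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\), solving for \(x_0\) with the UNet's noise estimate \(\hat\epsilon\) gives \(\hat{x}_0 = (x_t - \sqrt{1-\bar\alpha_t}\,\hat\epsilon)/\sqrt{\bar\alpha_t}\). A sketch, with `eps_hat` standing in for the UNet's prediction:

```python
def estimate_x0(x_t, eps_hat, t, alphas_cumprod):
    """Recover an estimate of the clean image from x_t and the predicted noise eps_hat."""
    abar_t = alphas_cumprod[t]
    return (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()
```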

Campanile One-Step Denoise (250)

One-Step Denoising at \(t=250\)

Campanile One-Step Denoise (500)

One-Step Denoising at \(t=500\)

Campanile One-Step Denoise (750)

One-Step Denoising at \(t=750\)

1.4. Iterative Denoising

By iteratively denoising the image over a strided schedule of timesteps, rather than in a single jump, we can achieve much better results. Here are intermediate estimates from the loop, followed by a comparison against the one-step and Gaussian-blur baselines:
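Each iteration moves from timestep \(t\) to the next (smaller) timestep \(t'\) in the strided schedule using the standard DDPM posterior mean, with \(\alpha_t = \bar\alpha_t/\bar\alpha_{t'}\) and \(\beta_t = 1 - \alpha_t\). A sketch of one update (the added-noise term \(v_\sigma\) is omitted for brevity):

```python
def iterative_denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod):
    """One DDPM-style update from timestep t to the next, smaller timestep t_prev."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev
    beta_t = 1 - alpha_t
    return (abar_prev.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x_t
```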

Campanile Iterative Denoise (t=660)

Iterative Denoising at \(t=660\)

Campanile Iterative Denoise (t=510)

Iterative Denoising at \(t=510\)

Campanile Iterative Denoise (t=360)

Iterative Denoising at \(t=360\)

Campanile Iterative Denoise (t=210)

Iterative Denoising at \(t=210\)

Campanile Iterative Denoise (t=60)

Iterative Denoising at \(t=60\)

Campanile

Campanile

Campanile Iterative Denoise (Final)

Iteratively Denoised Campanile

Campanile One-Step Denoise (Final)

One-Step Denoised Campanile

Campanile Gaussian (750)

Gaussian Blur at \(t=750\)

The iteratively denoised Campanile resembles the original image much more closely than the one-step denoised version.

1.5. Diffusion Model Sampling

By starting from pure noise and iteratively denoising, we can generate new images. Here are some samples:
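Concretely, sampling just runs the iterative denoising loop from 1.4 starting at the very first timestep, with the input replaced by pure Gaussian noise. A short sketch (the starting shape and the generic prompt are illustrative):

```python
# Start from pure noise at the stage-1 resolution and denoise all the way down.
x = torch.randn(1, 3, 64, 64, device=device)
sample = iterative_denoise(x, i_start=0, prompt="a high quality photo")
```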

Sample 1

Sample 1

Sample 2

Sample 2

Sample 3

Sample 3

Sample 4

Sample 4

Sample 5

Sample 5

1.6. Classifier-Free Guidance (CFG)

Using Classifier-Free Guidance, we can markedly improve the quality of generated images by combining a conditional and an unconditional noise estimate at every step. Here are some samples generated with CFG:
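At each step I compute two noise estimates, one conditioned on the text prompt and one on the empty prompt, and combine them as \(\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)\) with \(\gamma > 1\). A sketch (the guidance scale of 7 is a typical value, shown here as an assumption):

```python
def cfg_noise_estimate(eps_cond, eps_uncond, guidance_scale=7.0):
    """Classifier-free guidance: push the conditional estimate away from the unconditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```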

Sample 1

CFG Sample 1

Sample 2

CFG Sample 2

Sample 3

CFG Sample 3

Sample 4

CFG Sample 4

Sample 5

CFG Sample 5

1.7. Image-to-Image Translation

By adding noise to an image and then denoising it back with the model, we obtain edits of the original image; the less noise we add, the more faithful the result stays to the original. Here are some examples across different starting indices \(i_{start}\):
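A short sketch of one edit, reusing the pieces above: noise the image to the timestep indexed by \(i_{start}\), then run the CFG denoising loop from there. The names `strided_timesteps` and `iterative_denoise_cfg` refer to the sampling loop sketched earlier and are illustrative.

```python
# SDEdit-style edit: noise to the i_start-th timestep, then denoise back down.
x_noisy = forward(im, strided_timesteps[i_start], alphas_cumprod)
edited = iterative_denoise_cfg(x_noisy, i_start=i_start, prompt="a high quality photo")
```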

Campanile i_start=1

Campanile with \(i_{start}=1\)

Campanile i_start=3

Campanile with \(i_{start}=3\)

Campanile i_start=5

Campanile with \(i_{start}=5\)

Campanile i_start=7

Campanile with \(i_{start}=7\)

Campanile i_start=10

Campanile with \(i_{start}=10\)

Campanile i_start=20

Campanile with \(i_{start}=20\)

Campanile

Campanile

Backpack i_start=1

Backpack with \(i_{start}=1\)

Backpack i_start=3

Backpack with \(i_{start}=3\)

Backpack i_start=5

Backpack with \(i_{start}=5\)

Backpack i_start=7

Backpack with \(i_{start}=7\)

Backpack i_start=10

Backpack with \(i_{start}=10\)

Backpack i_start=20

Backpack with \(i_{start}=20\)

Backpack

Backpack

Character i_start=1

Character with \(i_{start}=1\)

Character i_start=3

Character with \(i_{start}=3\)

Character i_start=5

Character with \(i_{start}=5\)

Character i_start=7

Character with \(i_{start}=7\)

Character i_start=10

Character with \(i_{start}=10\)

Character i_start=20

Character with \(i_{start}=20\)

Character

Character

1.7.1. Editing Web and Hand-Drawn Images

I also repeated the same procedure on images from the web and on hand-drawn images.

Santa i_start=1

Santa with \(i_{start}=1\)

Santa i_start=3

Santa with \(i_{start}=3\)

Santa i_start=5

Santa with \(i_{start}=5\)

Santa i_start=7

Santa with \(i_{start}=7\)

Santa i_start=10

Santa with \(i_{start}=10\)

Santa i_start=20

Santa with \(i_{start}=20\)

Santa

Santa

Painting i_start=1

Painting with \(i_{start}=1\)

Painting i_start=3

Painting with \(i_{start}=3\)

Painting i_start=5

Painting with \(i_{start}=5\)

Painting i_start=7

Painting with \(i_{start}=7\)

Painting i_start=10

Painting with \(i_{start}=10\)

Painting i_start=20

Painting with \(i_{start}=20\)

Painting

Painting

Camera i_start=1

Camera with \(i_{start}=1\)

Camera i_start=3

Camera with \(i_{start}=3\)

Camera i_start=5

Camera with \(i_{start}=5\)

Camera i_start=7

Camera with \(i_{start}=7\)

Camera i_start=10

Camera with \(i_{start}=10\)

Camera i_start=20

Camera with \(i_{start}=20\)

Camera

Camera

1.7.2. Inpainting

By using a mask to specify the region to edit, we can inpaint images: the model only generates content inside the mask, while everything outside it is held fixed to the original. Here are some examples:
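Concretely, after each denoising step the pixels outside the mask are reset to the original image, re-noised to the current timestep, so only the masked region is ever changed. A one-line sketch, with `m` a binary mask (1 = region to inpaint) and `forward` the noising function from 1.1:

```python
# Pin the unmasked region to the original image at the current noise level.
x_t = m * x_t + (1 - m) * forward(original, t, alphas_cumprod)
```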

Campanile

Campanile

Campanile Mask

Campanile Mask

Campanile Inpainted

Campanile Inpainted

Coffee

Coffee

Coffee Mask

Coffee Mask

Coffee Inpainted

Coffee Inpainted

Emoji

Emoji

Emoji Mask

Emoji Mask

Emoji Inpainted

Emoji Inpainted

1.7.3. Text-Conditional Image-to-Image Translation

By guiding the denoising process with text prompts, we can create edits that align with the desired description. In this part, I used the embedding of the prompt "a childish drawing" to guide the edits. Here are some examples:

Childish Campanile

Childish Campanile with \(i_{start}=1\)

Childish Campanile

Childish Campanile with \(i_{start}=3\)

Childish Campanile

Childish Campanile with \(i_{start}=5\)

Childish Campanile

Childish Campanile with \(i_{start}=7\)

Childish Campanile i_start=10

Childish Campanile with \(i_{start}=10\)

Childish Campanile i_start=20

Childish Campanile with \(i_{start}=20\)

Campanile

Campanile

Childish Character

Childish Character with \(i_{start}=1\)

Childish Character

Childish Character with \(i_{start}=3\)

Childish Character

Childish Character with \(i_{start}=5\)

Childish Character

Childish Character with \(i_{start}=7\)

Childish Character i_start=10

Childish Character with \(i_{start}=10\)

Childish Character i_start=20

Childish Character with \(i_{start}=20\)

Character

Character

Childish Emoji

Childish Emoji with \(i_{start}=1\)

Childish Emoji

Childish Emoji with \(i_{start}=3\)

Childish Emoji

Childish Emoji with \(i_{start}=5\)

Childish Emoji

Childish Emoji with \(i_{start}=7\)

Childish Emoji i_start=10

Childish Emoji with \(i_{start}=10\)

Childish Emoji i_start=20

Childish Emoji with \(i_{start}=20\)

Emoji

Emoji

1.8. Visual Anagrams

In this part, I created visual anagrams where the image appears as one thing when viewed normally, and another when flipped upside down. Here are some examples:
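At each denoising step I compute one noise estimate for the first prompt on the image as-is, a second estimate for the second prompt on the vertically flipped image (flipped back afterwards), and average the two. A sketch, where `unet_eps` stands in for the CFG noise estimate from my sampling loop:

```python
import torch

def anagram_noise_estimate(unet_eps, x_t, t, emb_1, emb_2):
    """Average the estimate for prompt 1 with the un-flipped estimate for prompt 2
    computed on the upside-down image."""
    eps_1 = unet_eps(x_t, t, emb_1)
    eps_2 = torch.flip(unet_eps(torch.flip(x_t, dims=[-2]), t, emb_2), dims=[-2])
    return (eps_1 + eps_2) / 2
```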

An oil painting of an old man

An oil painting of an old man

An oil painting of people around a campfire

An oil painting of people around a campfire

A bowl of noodles

A bowl of noodles

A stadium

A stadium

A bowl of noodles

An oil painting of a snowy mountain village

A stadium

A photo of a hipster barista

1.9. Hybrid Images

In this part, I created hybrid images that change appearance with viewing distance. The technique is similar to visual anagrams, but instead of flipping, it blends the low-frequency component of one prompt's noise estimate with the high-frequency component of the other's. Here are some examples:
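I used a Gaussian filter for the frequency split. A sketch of the combined noise estimate (the kernel size and \(\sigma\) are illustrative values, not necessarily the ones used for these results):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(eps_1, eps_2, kernel_size=33, sigma=2.0):
    """Low frequencies from one prompt's noise estimate, high frequencies from the other's."""
    low = TF.gaussian_blur(eps_1, kernel_size=kernel_size, sigma=sigma)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```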

Skull and Waterfall

Skull and waterfall

Stadium and Noodles

Stadium and noodles

Campfire and Amalfi Coast

Campfire and Amalfi coast

Part B1. Training a Single-Step Denoising UNet

1.1. Implementing the UNet

I implemented the UNet architecture as described in the project instructions using PyTorch. The UNet consists of downsampling and upsampling blocks with skip connections. I defined the necessary operations such as Conv, DownConv, UpConv, Flatten, Unflatten, and Concat, and composed them to create a deeper network.
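As an illustration, here is a sketch of one of the simple ops; the exact normalization and activation follow the project spec, so treat the specific choices below (BatchNorm + GELU) as an assumption:

```python
import torch.nn as nn

class Conv(nn.Module):
    """A 3x3 convolution that preserves spatial size, followed by normalization
    and a nonlinearity; one of the building blocks composed into the UNet."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)
```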

1.2. Using the UNet to Train a Denoiser

I trained the UNet to denoise images from the MNIST dataset. For each training batch, I generated noisy images by adding Gaussian noise at varying levels, and optimized the model with an L2 loss between the denoised output and the clean images. I first visualized the noising process on sample MNIST digits:
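A sketch of one training step at a representative noise level \(\sigma = 0.5\) (the level shown in the training sample below; variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def train_step(unet, optimizer, x, sigma=0.5):
    """One denoiser training step: noise the clean batch x, then regress the
    UNet output onto the clean images with an L2 loss."""
    z = x + sigma * torch.randn_like(x)      # noisy input at level sigma
    loss = F.mse_loss(unet(z), x)            # L2 between denoised output and clean x
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```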

Noise 5

Number 5 Noise

Noise 0

Number 0 Noise

Then, I trained the UNet for 5 epochs and visualized the denoised results on the test set after the 1st and 5th epochs. I used the architecture specified in the project instructions with a hidden dimension of 128, and the Adam optimizer with a learning rate of \(1\times10^{-4}\). Here are the training loss curve and sample results:

Denoise

Denoise training sample at \(\sigma=0.5\)

Denoise Loss

Denoise Loss

Here are the full denoising results for each digit:

Denoise 0
Denoise 1
Denoise 2
Denoise 3
Denoise 4
Denoise 5
Denoise 6
Denoise 7
Denoise 8
Denoise 9

1.2.3. Denoising Pure Noise

I also tested the denoiser on out-of-distribution noise levels to see how it generalizes beyond the levels used in training. Additionally, I trained the model to denoise pure Gaussian noise and visualized the generated outputs after 1 and 5 epochs. Here are the results:

Pure Noise Epoch 1

Pure Noise Generation after Epoch 1

Pure Noise Epoch 5

Pure Noise Generation after Epoch 5

Pure Noise Loss

Pure Noise Generation Training Loss

The outputs generated from pure noise exhibit patterns resembling the average of the training images, which is expected given the MSE loss: when the input carries no information about the target, the model's best prediction is the one that minimizes the expected squared distance to the training examples, which is their mean. At higher epochs, the outputs became more refined, with smoother edges.
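This follows from a one-line minimization over a constant prediction \(y\):

\[
\nabla_y\,\mathbb{E}_x\!\left[\lVert y - x\rVert^2\right] = 2\left(y - \mathbb{E}[x]\right) = 0 \quad\Longrightarrow\quad y^\ast = \mathbb{E}[x],
\]

so a denoiser fed pure noise can do no better than outputting (roughly) the mean of the training digits.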

Part B2. Training a Flow Matching Model

2.1. Adding Time Conditioning to UNet

For this part, I modified the UNet architecture to include time conditioning using FCBlocks. The scalar time variable \(t\) was normalized and embedded using two FCBlocks, which were integrated into the Unflatten and UpConv layers of the UNet as described in the project instructions.
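A sketch of the conditioning block, assuming a Linear–GELU–Linear structure; exactly how its output modulates the Unflatten and UpConv features follows the project spec, so the wiring is not shown here.

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Embed a scalar conditioning signal (here the normalized timestep t) so it
    can modulate intermediate UNet features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, t):
        return self.net(t)
```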

2.2. Training the UNet

I trained the time-conditioned UNet on the MNIST dataset to predict the flow from noisy images to clean images at various timesteps. The model was optimized with the Adam optimizer, using an initial learning rate of \(1\times10^{-2}\) and an exponential learning-rate decay scheduler. Here is the training loss curve:
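A sketch of one training step, assuming the convention that \(t=0\) corresponds to pure noise and \(t=1\) to a clean image, so the regression target is the constant flow \(x_1 - x_0\):

```python
import torch
import torch.nn.functional as F

def flow_matching_step(unet, optimizer, x1, device="cuda"):
    """One flow-matching training step: x1 is a batch of clean MNIST images,
    x0 is pure noise, and the UNet regresses onto the flow x1 - x0."""
    x1 = x1.to(device)
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1, 1, 1, device=device)   # t ~ U[0, 1]
    xt = (1 - t) * x0 + t * x1                             # point on the straight path
    loss = F.mse_loss(unet(xt, t.view(-1, 1)), x1 - x0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```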

Time-conditioned UNet Loss

Time-conditioned UNet Training Loss

2.3. Sampling from the Time-conditioned UNet

After training, I sampled from the time-conditioned UNet after 1, 5, and 10 epochs. The results show that legible digits emerge as training progresses, demonstrating the effectiveness of the flow matching approach. Here are the sampling results:
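Sampling integrates the learned flow with simple Euler steps from pure noise at \(t=0\) to an image at \(t=1\). A sketch (the number of steps and the 28×28 MNIST shape are illustrative):

```python
import torch

@torch.no_grad()
def sample(unet, num_samples=40, num_steps=50, device="cuda"):
    """Euler-integrate the learned flow from pure noise to images."""
    x = torch.randn(num_samples, 1, 28, 28, device=device)
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t = ts[i].expand(num_samples, 1)
        x = x + (ts[i + 1] - ts[i]) * unet(x, t)   # x <- x + dt * u(x, t)
    return x
```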

Time UNet Epoch 1

Time-conditioned UNet Sampling after Epoch 1

Time UNet Epoch 5

Time-conditioned UNet Sampling after Epoch 5

Time UNet Epoch 10

Time-conditioned UNet Sampling after Epoch 10

2.4. Adding Class-Conditioning to UNet

I further modified the UNet to include class conditioning using one-hot vectors for the digit classes (0-9). During training, the class-conditioning vector is randomly dropped (set to the zero vector) with a small probability, so the model learns both conditional and unconditional generation.
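A sketch of how the conditioning vector is built; the drop probability of 0.1 is a typical value rather than necessarily my exact setting.

```python
import torch
import torch.nn.functional as F

def make_class_condition(labels, p_uncond=0.1):
    """One-hot encode the digit labels and zero out the conditioning vector
    with probability p_uncond (the unconditional case)."""
    c = F.one_hot(labels, num_classes=10).float()                      # (B, 10)
    keep = (torch.rand(c.shape[0], 1, device=c.device) > p_uncond).float()
    return c * keep
```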

2.5. Training the Class-Conditioned UNet

I trained the class-conditioned UNet using the same procedure as the time-conditioned UNet, with the addition of class conditioning. The training loss curve is shown below:

Class Conditioned Loss

Class-Conditioned UNet Training Loss

2.6. Sampling from the Class-Conditioned UNet

Finally, I sampled from the class-conditioned UNet after 1, 5, and 10 epochs of training, using classifier-free guidance. The results show that class conditioning leads to faster convergence and more accurate digit generation. Here are the sampling results:

Class UNet Epoch 1

Class-Conditioned UNet Sampling after Epoch 1

Class UNet Epoch 5

Class-Conditioned UNet Sampling after Epoch 5

Class UNet Epoch 10

Class-Conditioned UNet Sampling after Epoch 10

As a final experiment, I trained the class-conditioned UNet without the learning rate scheduler: I removed the exponential decay and used a constant learning rate for all epochs. To keep training stable without decay, I lowered the learning rate from \(1\times 10^{-2}\) to \(1\times 10^{-3}\), since a larger fixed step size made the optimization noisy and prevented the loss from settling. With the constant learning rate the model still converged, but the improvements were more gradual than in the scheduled run. Below are the sampling results after 10 epochs without the scheduler:

No Scheduler Epoch 10

Class-Conditioned UNet Sampling without Scheduler after Epoch 10