The first part of project 5 is about using pretrained diffusion models to generate images from text prompts.
I set up the environment as provided in the project instructions and used a random seed of 3036155160
for reproducibility. Here are some sample images generated using the DeepFloyd IF diffusion model with different
numbers of inference steps:
A graphic of a yellow whale (20 steps)
A graphic of a blue starfish (20 steps)
A cheap backpack (20 steps)
A graphic of a yellow whale (120 steps)
A graphic of a blue starfish (120 steps)
A cheap backpack (120 steps)
In this part of the problem set, I wrote my own "sampling loops" that use the pretrained DeepFloyd denoisers. These produce high quality images such as the ones generated above. I then modified these sampling loops to solve different tasks such as inpainting or producing optical illusions.
A key part of diffusion is the forward process, which takes a clean image and adds noise to it. In this part, I wrote a function to implement it. The forward process is defined by \(x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\), where \(\bar\alpha_t\) is the cumulative product of the scheduler's noise coefficients. A sketch of the implementation follows the figures below.
Campanile
Noisy Campanile at \(t=250\)
Noisy Campanile at \(t=500\)
Noisy Campanile at \(t=750\)
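Here is a minimal sketch of that forward function, assuming `alphas_cumprod` is the scheduler's table of cumulative products (in DeepFloyd, something like `stage_1.scheduler.alphas_cumprod`); the helper name is mine, not the project's:

```python
import torch

def forward_process(im, t, alphas_cumprod):
    """x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    abar = alphas_cumprod[t]
    eps = torch.randn_like(im)             # fresh Gaussian noise
    return abar.sqrt() * im + (1 - abar).sqrt() * eps
```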
One simple, non-learned baseline is Gaussian blur filtering, which low-passes the noise away at the cost of blurring real detail too. Here are some results:
Gaussian Blur at \(t=250\)
Gaussian Blur at \(t=500\)
Gaussian Blur at \(t=750\)
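For reference, this baseline is a single torchvision call; the kernel size and sigma below are illustrative placeholders rather than the exact values I used:

```python
import torchvision.transforms.functional as TF

def blur_denoise(noisy_im, kernel_size=7, sigma=2.0):
    # Low-pass filtering suppresses high-frequency noise, and detail with it
    return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)
```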
Using the pretrained diffusion model, we can denoise in a single step: the UNet predicts the noise in \(x_t\), and inverting the forward process gives an estimate of the clean image. Here are the results:
One-Step Denoising at \(t=250\)
One-Step Denoising at \(t=500\)
One-Step Denoising at \(t=750\)
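A sketch of that one-step estimate, assuming the diffusers-style call signature for the stage-1 UNet (and that, as in DeepFloyd, the first three output channels are the noise prediction):

```python
import torch

@torch.no_grad()
def one_step_denoise(x_t, t, unet, alphas_cumprod, prompt_embeds):
    abar = alphas_cumprod[t]
    # DeepFloyd's stage-1 UNet outputs extra variance channels; keep the noise estimate
    eps_hat = unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
    # Invert x_t = sqrt(abar) x_0 + sqrt(1 - abar) eps to recover x_0
    return (x_t - (1 - abar).sqrt() * eps_hat) / abar.sqrt()
```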
One-step denoising degrades as the noise level grows; by iteratively denoising over a descending sequence of strided timesteps, we achieve much better results. Here are intermediate estimates along the way, followed by a comparison against the one-step and Gaussian-blur baselines (a sketch of the loop comes after the comparison):
Iterative Denoising at \(t=660\)
Iterative Denoising at \(t=510\)
Iterative Denoising at \(t=360\)
Iterative Denoising at \(t=210\)
Iterative Denoising at \(t=60\)
Campanile
Iteratively Denoised Campanile
One-Step Denoised Campanile
Gaussian Blur at \(t=750\)
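Here is a sketch of that loop, using the standard DDPM posterior-mean update; the variance term is simplified to \(\sqrt{\beta_t}\) noise for brevity, and all names follow the earlier sketches:

```python
import torch

@torch.no_grad()
def iterative_denoise(x, timesteps, unet, alphas_cumprod, prompt_embeds, i_start=0):
    # timesteps is a descending, strided list, e.g. [990, 960, ..., 30, 0]
    for i in range(i_start, len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        abar_t, abar_next = alphas_cumprod[t], alphas_cumprod[t_next]
        alpha = abar_t / abar_next          # effective alpha for this stride
        beta = 1 - alpha
        eps = unet(x, t, encoder_hidden_states=prompt_embeds).sample[:, :3]
        x0_hat = (x - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()   # clean estimate
        # Posterior mean blends the clean estimate with the current noisy image
        x = (abar_next.sqrt() * beta / (1 - abar_t)) * x0_hat \
            + (alpha.sqrt() * (1 - abar_next) / (1 - abar_t)) * x \
            + beta.sqrt() * torch.randn_like(x)                    # simplified variance
    return x
```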
By starting from pure noise and iteratively denoising, we can generate new images. Here are some samples:
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
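In terms of the earlier sketch, sampling from scratch is just the same loop started from pure noise at the DeepFloyd stage-1 resolution (the other arguments are as defined above):

```python
import torch

device = "cuda"
x = iterative_denoise(torch.randn(1, 3, 64, 64, device=device),
                      timesteps, unet, alphas_cumprod, prompt_embeds)
```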
Using classifier-free guidance (CFG), which pushes the conditional noise estimate further away from the unconditional one, we can markedly improve the quality of generated images. Here are some samples:
CFG Sample 1
CFG Sample 2
CFG Sample 3
CFG Sample 4
CFG Sample 5
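The guided estimate is a one-line change to the loop above. A sketch, with a guidance scale of 7 (the value I recall the project suggesting; treat it as an assumption):

```python
import torch

@torch.no_grad()
def cfg_noise_estimate(x, t, unet, cond_embeds, uncond_embeds, scale=7.0):
    eps_cond = unet(x, t, encoder_hidden_states=cond_embeds).sample[:, :3]
    eps_uncond = unet(x, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    # Extrapolate from the unconditional estimate toward the conditional one
    return eps_uncond + scale * (eps_cond - eps_uncond)
```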
By adding noise to an image and then denoising it with a text prompt (the SDEdit procedure), we can create interesting edits: the smaller \(i_{start}\) is, the noisier the starting point and the larger the edit, so the results gradually approach the original image as \(i_{start}\) grows. Here are some examples, with a sketch of the procedure after them:
Campanile with \(i_{start}=1\)
Campanile with \(i_{start}=3\)
Campanile with \(i_{start}=5\)
Campanile with \(i_{start}=7\)
Campanile with \(i_{start}=10\)
Campanile with \(i_{start}=20\)
Campanile
Backpack with \(i_{start}=1\)
Backpack with \(i_{start}=3\)
Backpack with \(i_{start}=5\)
Backpack with \(i_{start}=7\)
Backpack with \(i_{start}=10\)
Backpack with \(i_{start}=20\)
Backpack
Character with \(i_{start}=1\)
Character with \(i_{start}=3\)
Character with \(i_{start}=5\)
Character with \(i_{start}=7\)
Character with \(i_{start}=10\)
Character with \(i_{start}=20\)
Character
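A sketch of the edit procedure in terms of the earlier helpers (the function name is mine):

```python
def edit_image(im, i_start, timesteps, unet, alphas_cumprod, prompt_embeds):
    # Noise the input up to timesteps[i_start], then run the usual
    # iterative denoising from that point onward
    x = forward_process(im, timesteps[i_start], alphas_cumprod)
    return iterative_denoise(x, timesteps, unet, alphas_cumprod,
                             prompt_embeds, i_start=i_start)
```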
I also repeated the same procedure on images from the web and hand-drawn images.
Santa with \(i_{start}=1\)
Santa with \(i_{start}=3\)
Santa with \(i_{start}=5\)
Santa with \(i_{start}=7\)
Santa with \(i_{start}=10\)
Santa with \(i_{start}=20\)
Santa
Painting with \(i_{start}=1\)
Painting with \(i_{start}=3\)
Painting with \(i_{start}=5\)
Painting with \(i_{start}=7\)
Painting with \(i_{start}=10\)
Painting with \(i_{start}=20\)
Painting
Camera with \(i_{start}=1\)
Camera with \(i_{start}=3\)
Camera with \(i_{start}=5\)
Camera with \(i_{start}=7\)
Camera with \(i_{start}=10\)
Camera with \(i_{start}=20\)
Camera
By using a binary mask to specify the region to edit, and forcing everything outside the mask to agree with an appropriately noised copy of the original after every denoising step, we can inpaint images. Here are some examples, followed by a sketch:
Campanile
Campanile Mask
Campanile Inpainted
Coffee
Coffee Mask
Coffee Inpainted
Emoji
Emoji Mask
Emoji Inpainted
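The sketch below leans on the earlier helpers; `denoise_step` is hypothetical shorthand for a single iteration of the `iterative_denoise` loop:

```python
import torch

@torch.no_grad()
def inpaint(im, mask, timesteps, unet, alphas_cumprod, prompt_embeds):
    # mask == 1 marks the region to regenerate
    x = torch.randn_like(im)
    for i in range(len(timesteps) - 1):
        # denoise_step: one iteration of the iterative_denoise loop (hypothetical helper)
        x = denoise_step(x, timesteps[i], timesteps[i + 1],
                         unet, alphas_cumprod, prompt_embeds)
        # Outside the mask, pin x to the original noised to the same level
        x = mask * x + (1 - mask) * forward_process(im, timesteps[i + 1], alphas_cumprod)
    return x
```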
By guiding the denoising process with text prompts, we can create edits that align with the desired description. In this part, I used the embedding of the text prompt "a childish drawing" to guide the edits. Here are some examples:
Childish Campanile with \(i_{start}=1\)
Childish Campanile with \(i_{start}=3\)
Childish Campanile with \(i_{start}=5\)
Childish Campanile with \(i_{start}=7\)
Childish Campanile with \(i_{start}=10\)
Childish Campanile with \(i_{start}=20\)
Campanile
Childish Character with \(i_{start}=1\)
Childish Character with \(i_{start}=3\)
Childish Character with \(i_{start}=5\)
Childish Character with \(i_{start}=7\)
Childish Character with \(i_{start}=10\)
Childish Character with \(i_{start}=20\)
Character
Childish Emoji with \(i_{start}=1\)
Childish Emoji with \(i_{start}=3\)
Childish Emoji with \(i_{start}=5\)
Childish Emoji with \(i_{start}=7\)
Childish Emoji with \(i_{start}=10\)
Childish Emoji with \(i_{start}=20\)
Emoji
In this part, I created visual anagrams: images that appear as one thing when viewed normally and another when flipped upside down. This works by averaging two noise estimates at each step, one per prompt, with the image flipped for the second (sketched after the examples). Here are some examples:
An oil painting of an old man
An oil painting of people around a campfire
A bowl of noodles
A stadium
An oil painting of a snowy mountain village
A photo of a hipster barista
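A sketch of the combined noise estimate, flipping along the height axis (one common reading of "upside down"):

```python
import torch

@torch.no_grad()
def anagram_noise_estimate(x, t, unet, embeds_a, embeds_b):
    # Estimate noise for prompt A on the image as-is
    eps_a = unet(x, t, encoder_hidden_states=embeds_a).sample[:, :3]
    # Estimate noise for prompt B on the flipped image, then un-flip it
    eps_b = unet(torch.flip(x, dims=[-2]), t,
                 encoder_hidden_states=embeds_b).sample[:, :3]
    return (eps_a + torch.flip(eps_b, dims=[-2])) / 2
```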
In this part, I created hybrid images that change appearance with viewing distance. The technique is similar to visual anagrams, but instead of averaging, each step combines the low-frequency component of one prompt's noise estimate with the high-frequency component of the other's. Here are some examples:
Skull and waterfall
Stadium and noodles
Campfire and Amalfi coast
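A sketch of the hybrid noise estimate; the Gaussian low-pass parameters (kernel 33, \(\sigma=2\)) are the ones I recall the project suggesting, so treat them as assumptions:

```python
import torch
import torchvision.transforms.functional as TF

@torch.no_grad()
def hybrid_noise_estimate(x, t, unet, embeds_low, embeds_high,
                          kernel_size=33, sigma=2.0):
    eps_low = unet(x, t, encoder_hidden_states=embeds_low).sample[:, :3]
    eps_high = unet(x, t, encoder_hidden_states=embeds_high).sample[:, :3]
    # Low frequencies dominate from far away, high frequencies up close
    lowpass = TF.gaussian_blur(eps_low, kernel_size, sigma)
    highpass = eps_high - TF.gaussian_blur(eps_high, kernel_size, sigma)
    return lowpass + highpass
```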
I implemented the UNet architecture as described in the project instructions using PyTorch. The UNet consists of downsampling and upsampling blocks with skip connections. I defined the necessary operations such as Conv, DownConv, UpConv, Flatten, Unflatten, and Concat, and composed them to create a deeper network.
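As an illustration, here is a sketch of two of those building blocks under my reading of the spec (3x3 convs, BatchNorm, GELU; details may differ from the handout's exact diagram):

```python
import torch.nn as nn

class Conv(nn.Module):
    # 3x3 conv that preserves spatial size, then BatchNorm and GELU
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                 nn.BatchNorm2d(out_ch), nn.GELU())
    def forward(self, x):
        return self.net(x)

class DownConv(nn.Module):
    # Strided 3x3 conv that halves the spatial resolution
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                                 nn.BatchNorm2d(out_ch), nn.GELU())
    def forward(self, x):
        return self.net(x)
```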
I trained the UNet to denoise images from the MNIST dataset. For each training batch, I generated noisy images \(z = x + \sigma\epsilon\), \(\epsilon \sim \mathcal{N}(0, I)\), at varying noise levels \(\sigma\). The model was optimized with an L2 loss between the denoised output and the clean images. I first visualized the noising process on sample MNIST digits:
Number 5 Noise
Number 0 Noise
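The visualization amounts to a couple of lines; the \(\sigma\) values below are the ones I recall from the handout, so treat the exact list as an assumption:

```python
import torch

x = torch.rand(1, 28, 28)                       # stands in for an MNIST digit
sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]    # noise levels to visualize
noisy = [x + s * torch.randn_like(x) for s in sigmas]   # z = x + sigma * eps
```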
Then, I trained the UNet for 5 epochs and visualized the denoised results on the test set after the 1st and 5th epochs. I used the model as specified in the project instructions, with a hidden dimension of 128 and the Adam optimizer with a learning rate of 1e-4. Here are the training loss curve and sample results:
Denoise training sample at \(\sigma=0.5\)
Denoise Loss
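A training-loop sketch under those settings; `UnconditionalUNet` is a placeholder name for the project's architecture, and the batch size is illustrative:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
loader = DataLoader(datasets.MNIST("data", train=True, download=True,
                                   transform=transforms.ToTensor()),
                    batch_size=256, shuffle=True)
model = UnconditionalUNet(in_channels=1, num_hiddens=128).to(device)  # placeholder class
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sigma = 0.5                                     # training noise level

for epoch in range(5):
    for x, _ in loader:
        x = x.to(device)
        z = x + sigma * torch.randn_like(x)     # noisy input
        loss = F.mse_loss(model(z), x)          # L2 loss against the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```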
Here are the full denoising results for each digit:
I also tested the denoiser on noise levels it was not trained on, to see how it generalizes out of distribution. Additionally, I trained the model to denoise pure Gaussian noise and visualized the generated outputs after 1 and 5 epochs. Here are the results:
Pure Noise Generation after Epoch 1
Pure Noise Generation after Epoch 5
Pure Noise Generation Training Loss
The generated outputs from pure noise exhibited patterns resembling the average of the training images, which is expected given the MSE loss: the model learns to predict the point that minimizes the sum of squared distances to all plausible clean images, so its outputs drift toward the mean of the training distribution. At higher epochs, the outputs became more refined, with smoother edges.
For this part, I modified the UNet architecture to include time conditioning using FCBlocks. The scalar time variable \(t\) was normalized and embedded using two FCBlocks, which were integrated into the Unflatten and UpConv layers of the UNet as described in the project instructions.
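A sketch of the FCBlock and of how the embedded \(t\) might be injected; the Linear-GELU-Linear layout and the injection arithmetic follow my reading of the spec and should be checked against the handout:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    # Small MLP that embeds a conditioning scalar or vector
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_ch, out_ch),
                                 nn.GELU(),
                                 nn.Linear(out_ch, out_ch))
    def forward(self, x):
        return self.net(x)

# Inside the UNet's forward pass (sketch): t is normalized to [0, 1], embedded,
# and added to intermediate activations, broadcasting over the spatial dims:
#   t1 = self.t_embed1(t)                          # (B, D)
#   unflat = unflat + t1[..., None, None]          # injected after Unflatten
#   up1 = up1 + self.t_embed2(t)[..., None, None]  # injected after the first UpConv
```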
I trained the time-conditioned UNet on the MNIST dataset to predict the flow from noisy images to clean images at various timesteps. The model was optimized using the Adam optimizer with an initial learning rate of 1e-2 and an exponential learning rate decay scheduler. Here is the training loss curve:
Time-conditioned UNet Training Loss
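A sketch of one flow-matching training step, assuming the linear path \(x_t = (1-t)\,x_0 + t\,x_1\) from noise \(x_0\) to clean image \(x_1\), so the regression target is the constant velocity \(x_1 - x_0\):

```python
import torch
import torch.nn.functional as F

def fm_train_step(model, x1, opt):
    x0 = torch.randn_like(x1)                       # pure-noise endpoint
    t = torch.rand(x1.shape[0], 1, device=x1.device)  # t ~ U[0, 1]
    tt = t[..., None, None]                         # broadcast over C, H, W
    xt = (1 - tt) * x0 + tt * x1                    # point on the linear path
    loss = F.mse_loss(model(xt, t), x1 - x0)        # regress the predicted flow
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```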
I then sampled from the time-conditioned UNet after 1, 5, and 10 epochs of training. Legible digits emerged as the number of epochs increased, demonstrating the effectiveness of the flow matching approach. Here are the sampling results:
Time-conditioned UNet Sampling after Epoch 1
Time-conditioned UNet Sampling after Epoch 5
Time-conditioned UNet Sampling after Epoch 10
I further modified the UNet to include class conditioning using one-hot vectors for the digit classes (0-9). I implemented dropout to randomly drop the class conditioning vector during training, allowing the model to learn both conditional and unconditional generation.
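A sketch of the class-conditioning input, with the dropout probability of 0.1 that I recall the project specifying (treat the exact value as an assumption):

```python
import torch
import torch.nn.functional as F

def embed_class(labels, p_uncond=0.1):
    # One-hot encode the digit labels (0-9)
    c = F.one_hot(labels, num_classes=10).float()
    # Zero the whole vector with probability p_uncond so the model also
    # learns unconditional generation (needed later for CFG)
    keep = (torch.rand(c.shape[0], 1, device=c.device) > p_uncond).float()
    return c * keep
```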
I trained the class-conditioned UNet using the same procedure as the time-conditioned UNet, with the addition of class conditioning. The training loss curve is shown below:
Class-Conditioned UNet Training Loss
Finally, I sampled from the class-conditioned UNet after 1, 5, and 10 epochs of training, using classifier-free guidance. Class conditioning led to faster convergence and more accurate digit generation. Here are the sampling results, with a sampling sketch after them:
Class-Conditioned UNet Sampling after Epoch 1
Class-Conditioned UNet Sampling after Epoch 5
Class-Conditioned UNet Sampling after Epoch 10
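A sketch of CFG sampling via Euler integration of the learned flow; `model(x, t, c)` stands for the class-conditioned UNet, a zero class vector gives the unconditional flow, the guidance scale of 5 is the value I recall from the project, and the step count is illustrative:

```python
import torch

@torch.no_grad()
def sample_cfg(model, c, num_steps=300, scale=5.0, shape=(1, 28, 28)):
    x = torch.randn(c.shape[0], *shape, device=c.device)   # start from noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((c.shape[0], 1), i * dt, device=c.device)
        u_cond = model(x, t, c)                      # conditional flow
        u_uncond = model(x, t, torch.zeros_like(c))  # unconditional flow
        u = u_uncond + scale * (u_cond - u_uncond)   # guided flow
        x = x + u * dt                               # Euler step toward t = 1
    return x
```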
As a final experiment, I trained the class-conditioned UNet without the learning rate scheduler: I removed the exponential decay and used a constant learning rate throughout all epochs. To keep training stable without decay, I lowered the learning rate from \(1\times 10^{-2}\) to \(1\times 10^{-3}\), since a larger fixed step size made the optimization noisy and prevented the loss from settling. With the constant learning rate the model still converged, but the improvements were more gradual than in the scheduled run. Below are the sampling results after 10 epochs without the scheduler:
Class-Conditioned UNet Sampling without Scheduler after Epoch 10