The first part of Project 5 is about using pretrained diffusion models to generate images from text prompts.
I set up the environment as described in the project instructions and used a fixed random seed of 3036155160
for reproducibility. Here are some sample images generated with the DeepFloyd IF diffusion model using different
numbers of inference steps:
A graphic of yellow whale (20 steps)
A graphic of blue starfish (20 steps)
A cheap backpack (20 steps)
A graphic of yellow whale (120 steps)
A graphic of blue starfish (120 steps)
A cheap backpack (120 steps)
In this part of the project, I wrote my own sampling loops that use the pretrained DeepFloyd denoisers to produce high-quality images like the ones generated above. I then modified these loops to solve other tasks, such as inpainting and producing optical illusions.
A key part of diffusion is the forward process, which takes a clean image and progressively adds noise to it. In this part, I wrote a function to implement it. The forward process is defined by
\[ x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \]
where \(x_0\) is the clean image and \(\bar\alpha_t\) comes from the model's noise schedule. Applying it to a test image of the Campanile:
Campanile
Noisy Campanile at \(t=250\)
Noisy Campanile at \(t=500\)
Noisy Campanile at \(t=750\)
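A minimal NumPy sketch of this forward step. The linear-beta schedule and the constant "image" below are stand-ins for illustration; the real \(\bar\alpha_t\) values come from DeepFloyd's scheduler.

```python
import numpy as np

# Toy DDPM-style schedule (stand-in; the real abar_t values come from the model)
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

def forward(x0, t, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    abar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps

rng = np.random.default_rng(0)
x0 = np.ones((8, 8))                      # stand-in for a clean image
x250, x500, x750 = (forward(x0, t, rng) for t in (250, 500, 750))
```

As \(t\) grows, \(\bar\alpha_t\) shrinks, so the signal fades and the noise dominates, matching the progression in the images above.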
One simple way to denoise an image is to use Gaussian blur filtering. Here are some results:
Gaussian Blur at \(t=250\)
Gaussian Blur at \(t=500\)
Gaussian Blur at \(t=750\)
Using a pretrained diffusion model, we can instead estimate the noise in an image and remove it in a single step. Here are the results:
One-Step Denoising at \(t=250\)
One-Step Denoising at \(t=500\)
One-Step Denoising at \(t=750\)
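The one-step estimate just inverts the forward equation using the UNet's noise prediction. A sketch, where a perfect "oracle" noise estimate stands in for the DeepFloyd UNet:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)        # toy schedule, stand-in for the model's
alphas_cumprod = np.cumprod(1.0 - betas)

def one_step_denoise(xt, eps_hat, t):
    """Invert the forward process: x0_hat = (x_t - sqrt(1-abar_t)*eps_hat) / sqrt(abar_t)."""
    abar = alphas_cumprod[t]
    return (xt - np.sqrt(1.0 - abar) * eps_hat) / np.sqrt(abar)

# With a perfect noise estimate, the clean image is recovered exactly.
rng = np.random.default_rng(0)
x0 = rng.uniform(size=(8, 8))
t = 500
eps = rng.standard_normal(x0.shape)
xt = np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1 - alphas_cumprod[t]) * eps
x0_hat = one_step_denoise(xt, eps, t)
```

In practice the UNet's estimate is imperfect, and a single jump from high \(t\) must hallucinate a lot of detail at once, which is why the one-step results blur out at large \(t\).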
By iteratively denoising an image, we can achieve better results. Here are the results:
Iterative Denoising at \(t=660\)
Iterative Denoising at \(t=510\)
Iterative Denoising at \(t=360\)
Iterative Denoising at \(t=210\)
Iterative Denoising at \(t=60\)
Campanile
Iteratively Denoised Campanile
One-Step Denoised Campanile
Gaussian Blur at \(t=750\)
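A sketch of the iterative loop, using the DDPM posterior mean over strided timesteps. An oracle noise estimator stands in for the UNet here; the real loop calls DeepFloyd's stage-1 model at each step.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)        # toy schedule, stand-in for the model's
alphas_cumprod = np.cumprod(1.0 - betas)
t_steps = list(range(990, -1, -30))          # strided timesteps, high noise -> low

def iterative_denoise(xt, eps_model, rng):
    for t, t_prev in zip(t_steps[:-1], t_steps[1:]):
        abar, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = abar / abar_prev             # product of the skipped per-step alphas
        beta = 1.0 - alpha
        eps_hat = eps_model(xt, t)
        x0_hat = (xt - np.sqrt(1 - abar) * eps_hat) / np.sqrt(abar)
        # DDPM posterior mean plus fresh noise (the last injection is unused,
        # since we return the final clean estimate x0_hat)
        xt = (np.sqrt(abar_prev) * beta / (1 - abar) * x0_hat
              + np.sqrt(alpha) * (1 - abar_prev) / (1 - abar) * xt
              + np.sqrt(beta) * rng.standard_normal(xt.shape))
    return x0_hat

# Sanity check with an oracle that knows the true noise:
rng = np.random.default_rng(0)
x0 = rng.uniform(size=(8, 8))
oracle = lambda x, t: (x - np.sqrt(alphas_cumprod[t]) * x0) / np.sqrt(1 - alphas_cumprod[t])
result = iterative_denoise(rng.standard_normal((8, 8)), oracle, rng)
```

Each step only needs to remove a small slice of noise, which is why the iterative result above preserves far more detail than the one-step jump from \(t=750\).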
By starting from pure noise and iteratively denoising, we can generate new images. Here are some samples:
Sample 1
Sample 2
Sample 3
Sample 4
Sample 5
Using Classifier-Free Guidance, we can improve the quality of generated images. Here are some samples:
CFG Sample 1
CFG Sample 2
CFG Sample 3
CFG Sample 4
CFG Sample 5
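CFG runs the UNet twice per step, once with the prompt embedding and once with the null prompt, then extrapolates past the conditional estimate. A sketch of the combination step (the default scale here is an arbitrary illustrative value):

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, scale=7.0):
    """Classifier-free guidance: extrapolate from the unconditional estimate
    toward, and past, the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

A scale greater than 1 strengthens prompt adherence at the cost of diversity; a scale of exactly 1 recovers the plain conditional model.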
By adding noise to an image and then denoising it with a text prompt, we can create edits of varying strength: the smaller \(i_{start}\) is, the more noise is added and the further the result drifts from the original. Here are some examples:
Campanile with \(i_{start}=1\)
Campanile with \(i_{start}=3\)
Campanile with \(i_{start}=5\)
Campanile with \(i_{start}=7\)
Campanile with \(i_{start}=10\)
Campanile with \(i_{start}=20\)
Campanile
Backpack with \(i_{start}=1\)
Backpack with \(i_{start}=3\)
Backpack with \(i_{start}=5\)
Backpack with \(i_{start}=7\)
Backpack with \(i_{start}=10\)
Backpack with \(i_{start}=20\)
Backpack
Character with \(i_{start}=1\)
Character with \(i_{start}=3\)
Character with \(i_{start}=5\)
Character with \(i_{start}=7\)
Character with \(i_{start}=10\)
Character with \(i_{start}=20\)
Character
I also repeated the same procedure on images from the web and hand-drawn images.
Santa with \(i_{start}=1\)
Santa with \(i_{start}=3\)
Santa with \(i_{start}=5\)
Santa with \(i_{start}=7\)
Santa with \(i_{start}=10\)
Santa with \(i_{start}=20\)
Santa
Painting with \(i_{start}=1\)
Painting with \(i_{start}=3\)
Painting with \(i_{start}=5\)
Painting with \(i_{start}=7\)
Painting with \(i_{start}=10\)
Painting with \(i_{start}=20\)
Painting
Camera with \(i_{start}=1\)
Camera with \(i_{start}=3\)
Camera with \(i_{start}=5\)
Camera with \(i_{start}=7\)
Camera with \(i_{start}=10\)
Camera with \(i_{start}=20\)
Camera
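These edits follow the SDEdit idea: noise the input image to the timestep indexed by \(i_{start}\), then run the usual iterative denoising loop from there. Only the starting point differs from regular sampling; a sketch of it, with the same stand-in schedule as before:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)    # toy schedule, stand-in for the model's
alphas_cumprod = np.cumprod(1.0 - betas)
t_steps = list(range(990, -1, -30))      # high noise -> low

def sdedit_start(x0, i_start, rng):
    """Noise the input image to t_steps[i_start]; the denoising loop resumes there.
    Small i_start = more noise = larger, less faithful edits."""
    t = t_steps[i_start]
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * x0 + np.sqrt(1 - abar) * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
start = sdedit_start(np.ones((8, 8)), 10, rng)
```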
By using a mask to specify regions to edit, we can inpaint images. Here are some examples:
Campanile
Campanile Mask
Campanile Inpainted
Coffee
Coffee Mask
Coffee Inpainted
Emoji
Emoji Mask
Emoji Inpainted
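Inpainting only changes one line of the loop: after every denoising step, pixels outside the mask are reset to the original image noised to the current timestep, so only the masked region is actually generated. A sketch of that projection step (schedule is the same stand-in as before; `mask` is 1 where new content is generated):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)    # toy schedule, stand-in for the model's
alphas_cumprod = np.cumprod(1.0 - betas)

def inpaint_project(xt, x0_orig, mask, t, rng):
    """Keep generated pixels inside the mask; force everything outside back to
    the original image, noised to timestep t."""
    abar = alphas_cumprod[t]
    noised = np.sqrt(abar) * x0_orig + np.sqrt(1 - abar) * rng.standard_normal(x0_orig.shape)
    return mask * xt + (1 - mask) * noised
```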
By guiding the denoising process with text prompts, we can create edits that align with the desired description. In this part, I used the text embedding "a childish drawing" to guide the edits. Here are some examples:
Childish Campanile with \(i_{start}=1\)
Childish Campanile with \(i_{start}=3\)
Childish Campanile with \(i_{start}=5\)
Childish Campanile with \(i_{start}=7\)
Childish Campanile with \(i_{start}=10\)
Childish Campanile with \(i_{start}=20\)
Campanile
Childish Character with \(i_{start}=1\)
Childish Character with \(i_{start}=3\)
Childish Character with \(i_{start}=5\)
Childish Character with \(i_{start}=7\)
Childish Character with \(i_{start}=10\)
Childish Character with \(i_{start}=20\)
Character
Childish Emoji with \(i_{start}=1\)
Childish Emoji with \(i_{start}=3\)
Childish Emoji with \(i_{start}=5\)
Childish Emoji with \(i_{start}=7\)
Childish Emoji with \(i_{start}=10\)
Childish Emoji with \(i_{start}=20\)
Emoji
In this part, I created visual anagrams where the image appears as one thing when viewed normally, and another when flipped upside down. Here are some examples:
An oil painting of an old man
An oil painting of people around a campfire
A bowl of noodles
A stadium
An oil painting of a snowy mountain village
A photo of a hipster barista
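An anagram is produced by averaging two noise estimates at each step: one for prompt A on the image, and one for prompt B on the flipped image, flipped back so they align. A sketch, where a dummy `eps_model` stands in for the prompt-conditioned UNet:

```python
import numpy as np

def anagram_noise(x, eps_model, prompt_a, prompt_b):
    """Average prompt A's estimate on x with prompt B's estimate on the
    upside-down image, flipped back so the two agree pixelwise."""
    e1 = eps_model(x, prompt_a)
    e2 = np.flipud(eps_model(np.flipud(x), prompt_b))
    return (e1 + e2) / 2

# Dummy stand-in model: prompt "a" predicts x, prompt "b" predicts 2x
eps_model = lambda x, p: x if p == "a" else 2 * x
x = np.arange(16.0).reshape(4, 4)
out = anagram_noise(x, eps_model, "a", "b")
```

Because the averaged estimate must be a plausible denoising direction both ways up, the sampled image reads as one prompt upright and the other inverted.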
In this part, I created hybrid images that change appearance based on viewing distance. This uses a similar technique to visual anagrams, but instead of flipping, it blends the low-frequency content of one prompt's noise estimate with the high-frequency content of another's. Here are some examples:
Skull and waterfall
Stadium and noodles
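A sketch of the frequency-space combination, using an FFT low-pass filter (the cutoff here is an arbitrary choice, and a Gaussian blur could serve as the low-pass just as well):

```python
import numpy as np

def lowpass(img, cutoff=0.1):
    """Ideal low-pass: zero all spatial frequencies above `cutoff` (cycles/pixel)."""
    F = np.fft.fft2(img)
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    keep = np.sqrt(fx**2 + fy**2) <= cutoff
    return np.real(np.fft.ifft2(F * keep))

def hybrid_noise(eps_a, eps_b, cutoff=0.1):
    """Low frequencies follow prompt A's estimate, high frequencies prompt B's."""
    return lowpass(eps_a, cutoff) + (eps_b - lowpass(eps_b, cutoff))
```

Up close the high frequencies dominate perception, so the image reads as prompt B; from far away only the low frequencies survive, and it reads as prompt A.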