Fun with diffusion models
Part A: The Power of Diffusion Models!
Part 0: Setup.
In this project, we are using the DeepFloyd IF diffusion model. This is a two-stage model that first generates a 64x64 image and then uses this image to generate a more fine-grained 256x256 image. The model is text-to-image, meaning it generates images based on text inputs. The parameter num_inference_steps decides how many intermediate denoising steps the model uses. Below are the model's outputs on the three text prompts provided in the assignment, with num_inference_steps = 20.
Below are the same images, but this time with num_inference_steps = 100.
In general, the model does really well at generating images that match the prompts. We see that with num_inference_steps = 100, the images are more detailed and look more realistic, which is especially visible in the image of the man wearing a hat. The other images look like drawings rather than photographs, so the effect of changing num_inference_steps is not as large there. The seed I have used throughout this assignment is 137.
Part 1.1: Implementing the forward process.
To train a diffusion model, we need noisy versions of images at different noise levels. To do this, we sample a noise image and write a function that gradually includes more of the noise and less of the original image. That is done using this formula: \( x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \), where \( x_t \) is the noisy version at time \( t \), \( x_0 \) is the original image, \( \epsilon \) is the noise image, and \( \bar{\alpha}_t \) is the noise factor. Below is the test image, an image of the Berkeley tower, and noisy versions at t = 250, t = 500, and t = 750.
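As a minimal sketch (assuming alphas_cumprod is the cumulative-product noise schedule taken from the model's scheduler; names and shapes are illustrative), the forward process can be implemented like this:

```python
import torch

def forward(x0, t, alphas_cumprod):
    """Produce a noisy image x_t from the clean image x0 at integer timestep t."""
    alpha_bar_t = alphas_cumprod[t]          # \bar{\alpha}_t from the schedule
    eps = torch.randn_like(x0)               # epsilon ~ N(0, I)
    x_t = torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1 - alpha_bar_t) * eps
    return x_t, eps
```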
Part 1.2: Classical denoising.
A very simple method for denoising an image is to Gaussian blur it. However, this does not work well for this purpose, as we can see from the Gaussian blurring of the noisy versions at t = 250, t = 500, and t = 750 below. The images are definitely less noisy, but they are also less sharp and the details are not as visible.
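For reference, the classical baseline is just a blur of the noisy image; a small sketch using torchvision (the kernel size and sigma are illustrative choices, not the exact values used for the figures):

```python
from torchvision.transforms.functional import gaussian_blur

def blur_denoise(x_t, kernel_size=5, sigma=2.0):
    """Denoise by Gaussian-blurring the noisy image; removes noise but also detail."""
    return gaussian_blur(x_t, kernel_size=kernel_size, sigma=sigma)
```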
Part 1.3: Implementing One Step Denoising.
Next up, we will do one-step denoising using a UNet. This is done on the same image of the Berkeley tower, once again three times, at t = 250, t = 500, and t = 750. The noisy images together with the estimated denoised images are shown below.
This method works a lot better than the Gaussian blur method; there seems to be no noise left in any of the images. However, this model also hallucinates, and we can see that the higher the t-value, the further the estimated denoised image is from the original. This makes sense, since a higher t-value means more noise and less preservation of the original image, making it harder to estimate.
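For reference, a sketch of the one-step estimate, obtained by rearranging the forward-process formula (eps_hat stands for the UNet's noise prediction and alphas_cumprod is the same schedule as above):

```python
import torch

def one_step_denoise(x_t, t, eps_hat, alphas_cumprod):
    """Estimate the clean image x0 from the noisy x_t in a single step."""
    alpha_bar_t = alphas_cumprod[t]
    return (x_t - torch.sqrt(1 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_bar_t)
```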
Part 1.4: Iterative Denoising.
Next up, we want to use the diffusion model for what it is really made for: iterative denoising. This is done by, at each time step, estimating the denoised image and then interpolating between the noisy image and the estimate according to the equation below (a code sketch of one update step follows the symbol definitions):
\( x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_\sigma \)
where:
- \( x_t \) is your image at timestep \( t \)
- \( x_{t'} \) is your noisy image at timestep \( t' \) where \( t' < t \) (less noisy)
- \( \bar{\alpha}_t \) is defined by alphas_cumprod, as explained above
- \( \alpha_t = \frac{\bar{\alpha}_t}{\bar{\alpha}_{t'}} \)
- \( \beta_t = 1 - \alpha_t \)
- \( x_0 \) is our current estimate of the clean image using equation A.2 just like in section 1.3
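Putting the definitions together, one update of the iterative denoiser can be sketched as follows (x0_hat is the one-step clean-image estimate from above; v_sigma is set to 0 here for simplicity):

```python
import torch

def iterative_denoise_step(x_t, x0_hat, t, t_prime, alphas_cumprod, v_sigma=0.0):
    """Move from the noisy image at timestep t to the less noisy timestep t' < t."""
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_tp = alphas_cumprod[t_prime]
    alpha_t = alpha_bar_t / alpha_bar_tp
    beta_t = 1 - alpha_t
    return (torch.sqrt(alpha_bar_tp) * beta_t / (1 - alpha_bar_t)) * x0_hat \
         + (torch.sqrt(alpha_t) * (1 - alpha_bar_tp) / (1 - alpha_bar_t)) * x_t \
         + v_sigma
```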
Instead of doing this for all 1000 timesteps, it works well to take strided jumps of 30. Starting at t = 690 (and going down to 0, where 0 is the final, denoised image), the image at every 5th step is shown below.
And finally, the image at t = 0, compared to the original image and the images denoised from the same starting time step using the one-step method and the Gaussian blur method.
As we can see, the iterative method works best, as it provides a much more detailed image than the one-step method. The difference in detail is especially visible if we upsample the images to 256x256:
Part 1.5: Diffusion Model Sampling.
If we set i_start = 0, we effectively generate random images by denoising pure noise. The results from doing this on 5 images with the prompt "a high quality photo" are shown below.
Part 1.6: Classifier-Free Guidance (CFG)
While we did get some good images using the last approach, we also sometimes get completely nonsensical images. To improve the generated images and make them more realistic, we can use the CFG method. This is done by computing two noise estimates, one unconditional (empty prompt) and one conditional, and then combining them into a single noise estimate according to the function \( \epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u) \), where \( \epsilon_u \) is the noise estimate for the unconditional prompt and \( \epsilon_c \) is the noise estimate for the conditional prompt. We get the "magic" by using a \( \gamma \) value larger than one. Below are 5 images created with \( \gamma = 7 \).
As we can see, these images are generally sensible and look like real images!
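The noise-combination step of CFG is small; a sketch, where noise_pred_fn is an illustrative wrapper around the UNet that returns a noise estimate for a given prompt embedding:

```python
def cfg_noise_estimate(noise_pred_fn, x_t, t, cond_emb, uncond_emb, gamma=7.0):
    """Classifier-free guidance: push the conditional estimate away from the unconditional one."""
    eps_c = noise_pred_fn(x_t, t, cond_emb)    # conditional noise estimate
    eps_u = noise_pred_fn(x_t, t, uncond_emb)  # unconditional (empty prompt) noise estimate
    return eps_u + gamma * (eps_c - eps_u)
```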
Part 1.7: Image-to-image Translation
In this part, instead of creating random images by "forcing" noisy images back onto the manifold of natural images, I will try to recreate the image of the Berkeley tower from noisy versions of it. I will test it at the starting indexes [1, 3, 5, 7, 10, 20]. A starting index of 1 means that the image is pure noise, and a start index of 33 means that the image is the original image. The results are shown below.
The results are as expected; the higher the starting index, the more similar the image is to the original, and the easier it is for the model to recreate something close to the original image. The evolution of the images is quite interesting. At starting index 1, the image is basically just noise and the generated image is more or less random. At index 3, the main colors in the image are somewhat reproduced, but there is still no sign of a tower. At index 5, there is still no tower, but the picture shows a woman dressed in a white dress, which is somewhat similar to the white tower. At index 7, we start seeing a tower, and as the index increases, the tower, as well as its surroundings, becomes more and more similar to the Berkeley tower. I also did the same procedure on two of my own images. The results are shown below:
Part 1.7.1 Editing Hand-Drawn and Web Images
Next up, I'm going to use the same approach, but this time with drawings. The aim is to end up with a more realistic version of the drawings. I am going to use one drawing from the web and two hand-drawn images. Here is the result on the web image:
For my own drawing, I wanted to draw something really famous, with the hope that the model had been trained on a lot of pictures of this famous thing. I decided to draw the Eiffel tower. The hope was that the model would at some stage recognize the Eiffel tower, and draw a much more realistic version. Here are the results:
These results were very cool! Especially at t = 10, the model managed to create a realistic photo of the Eiffel tower, even though the prompt does not include anything about the Eiffel tower! For the second drawing, I wanted to use the same image but change the background colors to see if we could get the Eiffel tower in a completely different setting. Here are the results:
I think these results were also really cool, as the Eiffel tower at t = 10 and t = 20 is now placed in completely different sceneries.
Part 1.7.2 Inpainting
Another interesting thing we can do is to generate an image that has the same content as another image where a mask is 0, but new content wherever the mask is 1. We do this by denoising a random image, but at every step we force the denoised image to equal the noisy version of the original image, at the same time step, wherever the mask is 0. In this way, the model has surrounding context, at the correct noise level, for the new content, so that it is capable of generating relevant content within the mask. I first did this on the Berkeley tower, with the mask over the top of the tower.
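The masking trick applied after each denoising update can be sketched like this (x_orig is the clean original image and mask is 1 where new content should appear; names are illustrative):

```python
import torch

def force_known_region(x_t, x_orig, mask, t, alphas_cumprod):
    """Keep the region where mask == 0 equal to the original image, noised to level t."""
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(x_orig)
    x_orig_t = torch.sqrt(alpha_bar_t) * x_orig + torch.sqrt(1 - alpha_bar_t) * eps
    return mask * x_t + (1 - mask) * x_orig_t
```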
Next, I did it on a football kit belonging to Hamburg SV. I thought it would be cool to keep the same kit, but put a mask around the badge to make the model generate a new badge. I chose Hamburg SV because they have a rectangular badge, making it easy to define the mask. Here are the results:
For the last image in this part, I used an image of Pedro Rodriguez in a Barcelona kit. I put the mask over his face, so that the model would generate a new face. My hope was that the model would generate a face that looks like Lionel Messi, as I assume the model is trained on far more pictures of Lionel Messi in the same kit than of Pedro Rodriguez. Here are the results:
Although the model did not come up with a picture of Lionel Messi, it definitely made a new face! In general, the results from this part were great, as the model managed to come up with very realistic, but completely new, content in the masked areas, and the content generally matches the surroundings well.
Part 1.7.3 Text-Conditioned Image-to-image Translation
In this part, we are going to denoise images we already have, but using a prompt that does not match the original image. We will start at different noise levels, as in the previous sections, and see what the model generates. The expectation is that the lower the starting index (more noise), the closer the result looks to the prompt, while the higher the starting index, the more it looks like the original image. The most interesting results will likely be at the intermediate noise levels, where the images hopefully look like a hybrid of the original image and the prompt. Here are the results on the Berkeley tower, with the prompt "a rocket ship":
As expected, the rocket gradually became more and more similar to the Berkeley tower. My next idea was to use the Hamburg kit together with the prompt "Manchester United kit", to hopefully generate some cool new ideas for a Manchester United kit. Here are the results:
I especially liked the last two, and believe they have the potential to become awesome away kits! The last thing I wanted to try was an image of a sailboat with the prompt "shark fin", as the sail looks quite similar to a shark fin. Here are the results:
Although the results were not great, I decided to keep this example rather than change the idea, because I have an interesting theory as to why the results are poor. The model seems to care a lot about the first word, "shark", but not the second, "fin", as it tries to generate images of whole sharks rather than of shark fins. This suggests that there are improvements to be made in the attention layers of the transformer model, as it ideally should understand that the "fin" part of the prompt is also important.
Part 1.8 Visual Anagrams
In this part we are creating optical illusions in the form of visual anagrams. The image looks like one thing when you first look at it, but when you flip it upside down, it looks like something else. This is done by following the same denoising process as above, but with two prompts instead of one. At each time step we compute two noise estimates: one for the image as is with the first prompt, and one for the image flipped upside down with the second prompt (flipping that estimate back afterwards). We then average the two estimates and use the result as the noise estimate for the current time step. Below is this process shown with three different sets of prompts:
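A sketch of the per-step noise combination (noise_pred_fn is the same illustrative UNet wrapper as before; images are assumed to be (B, C, H, W) tensors, so dim -2 is the vertical axis):

```python
import torch

def anagram_noise_estimate(noise_pred_fn, x_t, t, emb_upright, emb_flipped):
    """Average the upright estimate with the flipped-image estimate (flipped back)."""
    eps_1 = noise_pred_fn(x_t, t, emb_upright)
    eps_2 = noise_pred_fn(torch.flip(x_t, dims=[-2]), t, emb_flipped)  # denoise the flipped image
    eps_2 = torch.flip(eps_2, dims=[-2])                               # flip the estimate back
    return (eps_1 + eps_2) / 2
```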
I used the term "oil painting" in all the scenarios because I found that this worked better than trying to create anagrams of realistic photos. The results were quite cool, and I especially liked the cat/fish anagram.
Part 1.9 Hybrid Images
The final part of Part A is about creating hybrid images. A hybrid image, in this sense, is an image that looks like one thing when you are close to it, but another thing when you are far away from it. This is classically done by combining a low-pass filtered version of one image with a high-pass filtered version of another. With the diffusion model, we instead iteratively denoise one image in two ways, using two different prompts. We then low-pass filter the estimated noise from the first prompt and high-pass filter the estimated noise from the second prompt, add the two together, and use the sum as the noise estimate. The results on three different sets of prompts are shown below:
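A sketch of the noise combination for hybrids (the Gaussian-blur kernel size and sigma are illustrative choices; noise_pred_fn is the same assumed UNet wrapper as before):

```python
from torchvision.transforms.functional import gaussian_blur

def hybrid_noise_estimate(noise_pred_fn, x_t, t, emb_far, emb_near, kernel_size=33, sigma=2.0):
    """Low-pass the estimate for the far-away prompt, high-pass the close-up prompt, and add."""
    eps_far = noise_pred_fn(x_t, t, emb_far)
    eps_near = noise_pred_fn(x_t, t, emb_near)
    eps_low = gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
    eps_high = eps_near - gaussian_blur(eps_near, kernel_size=kernel_size, sigma=sigma)
    return eps_low + eps_high
```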
Part B: Diffusion Models from Scratch!
Part 1: Training a Single-Step Denoising UNet
Part 1.1: Implementing the UNet
The first part is about implementing the UNet that we are going to use in this exercise. To implement the UNet, we must first implement the building blocks that the UNet consists of. These are:
- Convolution
- Down convolution
- Up convolution
- Flatten
- Unflatten
- Concatenate
- Convolutional block
- Down convolutional block
- Up convolutional block
After these blocks are implemented, we build the network by composing them according to the architecture provided in the task description. To do this, we need to figure out the correct order of the building blocks and the correct input and output sizes of each block. There are no images to show for this part, but the UNet is used in the next parts.
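As an illustration of what the simplest blocks look like, here is a sketch of Conv and ConvBlock, assuming a standard Conv2d -> BatchNorm -> GELU layout (the exact layers and sizes follow the architecture in the task description and may differ in detail):

```python
import torch.nn as nn

class Conv(nn.Module):
    """3x3 convolution that preserves spatial resolution, followed by norm and GELU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)

class ConvBlock(nn.Module):
    """Two Conv operations in sequence; same input and output resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(Conv(in_ch, out_ch), Conv(out_ch, out_ch))

    def forward(self, x):
        return self.net(x)
```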
Part 1.2: Using the UNet to Train a Denoiser
In this part, we are going to use the UNet to denoise images from the MNIST dataset, a legendary dataset of handwritten digits. To train a model that aims at denoising such images, we first need noisy versions of them. This is done by adding noise to the training set according to the formula \( z = x + \sigma \epsilon, \quad \text{where} \quad \epsilon \sim \mathcal{N}(0, I) \) and \( \sigma \) is the noise level. The results on the MNIST dataset with noise levels 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0 are shown below:
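The noising function is a one-liner; a sketch:

```python
import torch

def add_noise(x, sigma):
    """z = x + sigma * eps, with eps ~ N(0, I); x is a batch of clean MNIST images."""
    return x + sigma * torch.randn_like(x)

# Visualisation grid, one row per noise level:
# for sigma in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
#     noisy = add_noise(images, sigma)
```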
Part 1.2.1 Training
Now that we have the training data, it is time to train the model. The UNet is trained twice, with the following parameters (a sketch of the training loop is given after the lists):

First run:
- Noise level: 0.5
- Batch size: 256
- Learning rate: 0.001
- Number of epochs: 5
- Optimizer: Adam

Second run:
- Noise level: 0.5
- Batch size: 256
- Learning rate: 0.001
- Number of epochs: 1
- Optimizer: Adam
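A minimal sketch of the training loop, assuming UNet is the denoiser from Part 1.1 (taking only the noisy image as input) and using the parameters above; dataset handling is the standard torchvision MNIST loader:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_denoiser(unet, sigma=0.5, batch_size=256, lr=1e-3, num_epochs=5, device="cuda"):
    loader = DataLoader(
        datasets.MNIST(root="./data", train=True, download=True,
                       transform=transforms.ToTensor()),
        batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(unet.parameters(), lr=lr)
    criterion = nn.MSELoss()

    unet.to(device)
    for epoch in range(num_epochs):
        for x, _ in loader:                       # class labels are not used here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)   # noisy input at the training noise level
            loss = criterion(unet(z), x)          # L2 loss against the clean image
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```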
Here is the result after training on 5 epochs:
As we can see, the loss keeps decreasing after 1 epoch and is much lower after 5 epochs than after 1. The results on the test set also show that the model denoises the images better after 5 epochs than after 1, suggesting that it has learned more without overfitting to the data.
Part 1.2.2: Out-of-Distribution Testing
The final part of Part 1 tests how well the model trained on noise level 0.5 denoises images at other noise levels. We test it on the test set with noise levels 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, and 1.0. The images at the different noise levels, together with the results, are shown below:
We see that the model generally performs well across noise levels, but that it gradually struggles more above noise level 0.5. This is expected, since the model is trained on noise level 0.5 and should therefore perform best there, while it should also handle lower noise levels well, as there is simply less noise to remove. At noise level 1.0, the model is not great, as the digit 4 looks more like a 9.
Part 2: Training a Diffusion Model
In this part, we are going to modify our UNet to turn it into a diffusion model. A diffusion model is trained to iteratively denoise an image, which means the model needs to be able to adapt the denoising to the level of noise present in the image. To achieve this, we train the model on a variety of noise levels and give it context about the noise level through a timestep feature t.
Part 2.1: Adding Time Conditioning to UNet
For the model to know which timestep t we are at, we need to add the timestep as a feature. This is done by adding two FCBlocks to our UNet, one after the Unflatten block and one after the first UpBlock. Each FCBlock takes a tensor containing a batch of timesteps and applies a linear layer, followed by a GELU activation and another linear layer. The outputs of the FCBlocks are then added to the output of the Unflatten block and the output of the first UpBlock, respectively.
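A sketch of the FCBlock and of how its output is injected (dimensions and variable names are illustrative; the timestep is assumed to be normalised before it is fed in):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Linear -> GELU -> Linear, mapping a (batch, in_ch) tensor to (batch, out_ch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_ch, out_ch),
            nn.GELU(),
            nn.Linear(out_ch, out_ch),
        )

    def forward(self, t):
        return self.net(t)

# Inside the UNet forward pass (sketch):
#   t_emb = fc(t.view(-1, 1))                                 # (batch, out_ch)
#   features = unflatten_out + t_emb.view(-1, out_ch, 1, 1)   # broadcast-add to the feature map
```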
Part 2.2: Training the UNet
Now that the model is aware of which timestep we are at, it is time to train it. The model is trained to estimate the noise at the current time step by minimizing the distance between the estimated noise and the true noise. The model is trained on the MNIST dataset with batch size 128 and hidden dimension 64. We use the Adam optimizer with an initial learning rate of 1e-3 and train for 20 epochs, using an exponential learning rate decay scheduler with gamma = 0.1^(1/num_epochs). For the training algorithm, we need to compute the parameters alpha, beta, and alpha_bar, using the following recipe (a code sketch follows the list):
- Create a list \( \beta \) of length \( T \) such that \( \beta_0 = 0.0001 \) and \( \beta_T = 0.02 \) and all other elements \( \beta_t \) for \( t \in \{1, \cdots, T-1\} \) are evenly spaced between the two.
- \( \alpha_t = 1 - \beta_t \)
- \( \bar{\alpha}_t = \prod_{s=1}^t \alpha_s \) is a cumulative product of \( \alpha_s \) for \( s \in \{1, \cdots, t\} \).
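A sketch of this schedule, together with one training step, assuming T = 300 timesteps and that the time-conditioned UNet takes a normalised timestep:

```python
import torch
import torch.nn.functional as F

T = 300
betas = torch.linspace(1e-4, 0.02, T)            # beta_0 = 0.0001 ... beta_{T-1} = 0.02, evenly spaced
alphas = 1.0 - betas                             # alpha_t = 1 - beta_t
alphas_cumprod = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{s<=t} alpha_s

# One training step (sketch): pick random timesteps, noise the clean batch x0, regress the noise.
# t = torch.randint(0, T, (x0.shape[0],))
# eps = torch.randn_like(x0)
# ab = alphas_cumprod[t].view(-1, 1, 1, 1)
# x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
# loss = F.mse_loss(unet(x_t, (t / T).view(-1, 1)), eps)
```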
Part 2.3: Sampling from the UNet
After the model is trained, it is time to sample from it. We start from a random noise image and iterate from t = 299 down to t = 0. At each time step we:
- sample random noise \( z \) if \( t > 1 \) (otherwise \( z = 0 \)),
- make a forward pass through the model with the current image and the timestep \( t \) to obtain a noise estimate \( \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \),
- estimate the clean image using \[ \hat{\mathbf{x}}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_t - \sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right) \]
- and compute the image at the next time step using \[ \mathbf{x}_{t-1} = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} \hat{\mathbf{x}}_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \sqrt{\beta_t} \mathbf{z} \]
The training loss curve, together with the results on 5 images after 5 and 20 epochs, are shown below:
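A sketch of this sampling loop, assuming the schedule tensors from above and that unet(x, t_norm) returns a noise estimate for a normalised timestep:

```python
import torch

@torch.no_grad()
def sample(unet, betas, alphas, alphas_cumprod, T=300, shape=(1, 1, 28, 28)):
    x = torch.randn(shape)                                    # start from pure noise
    for t in range(T - 1, 0, -1):                             # t = 299, ..., 1; final update yields x_0
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        eps = unet(x, torch.full((shape[0], 1), t / T))       # noise estimate at this timestep
        a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        x0_hat = (x - torch.sqrt(1 - a_bar_t) * eps) / torch.sqrt(a_bar_t)
        x = (torch.sqrt(a_bar_prev) * betas[t] / (1 - a_bar_t)) * x0_hat \
          + (torch.sqrt(alphas[t]) * (1 - a_bar_prev) / (1 - a_bar_t)) * x \
          + torch.sqrt(betas[t]) * z
    return x
```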
As we can see, the model is able to generate images that look similar to the MNIST dataset. However, not all of the generated images look like actual digits. This is especially the case after 5 epochs; after 20 epochs, the generated images more often resemble actual digits.
Part 2.4: Adding Class-Conditioning to UNet
To improve our diffusion model, we can add one more feature representing the digit we want to generate. We do this by one-hot encoding the digit and feeding it into the same type of FCBlocks as were used for the timestep. The outputs from these are, as for the timesteps, added to the output of the Unflatten block and the output of the first UpBlock. To make the UNet also work without class conditioning, we use dropout: 10% of the time we set the one-hot encoding to zero. Aside from adding c, the training is exactly as in the previous part.
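A sketch of building the class-conditioning vector with the 10% unconditional dropout (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def class_condition_vector(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit labels; zero out roughly 10% of them so the model
    also learns to generate without class information."""
    c = F.one_hot(labels, num_classes).float()                # (batch, 10)
    drop = (torch.rand(labels.shape[0], 1) < p_uncond).float()
    return c * (1.0 - drop)                                   # dropped rows become all zeros
```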
Part 2.5: Sampling from the Class-Conditioned UNet
When we sample from the class-conditioned UNet, we use the same trick as in Part A. At each iteration, we generate two noise estimates: one conditioned on c = the digit we want to generate, and one with c = 0. We then combine them using the formula \[ \epsilon = \epsilon_u + \gamma (\epsilon_c - \epsilon_u) \] with gamma = 5, and use \( \epsilon \) as the noise estimate; the rest of the algorithm is exactly as in the previous part. The results on 40 images, 4 of each digit 0-9, after 5 and 20 epochs, together with the loss curve, are shown below:
As we can see, the results are very good already after 5 epochs: the images clearly show the correct digit. The difference between 5 and 20 epochs is not very big, suggesting that 5 epochs is enough to train the model.