“Enhancing Image Generation Models Using Photogenic Needles in a Haystack,” a.k.a. Emu, is a model released by Meta to generate highly aesthetic images. It achieves this through a fine-tuning technique called quality tuning, which shows that as few as two thousand high-quality images are enough to achieve the objective, hence the paper title, “needles in a haystack.”
In this article, let’s look at the model architecture, the process of curating the quality dataset used to fine-tune Emu, and, lastly, some of the mind-blowing results achieved by the model.
Emu is trained in two stages: pre-training and quality fine-tuning. Though this may be a well-known recipe, the main message of the paper is that it all comes down to quality rather than quantity. The authors not only curated such a high-quality fine-tuning dataset but also showed that fine-tuning on it does not compromise the generality of the model, as measured by the faithfulness metric.
The first contribution of the paper is a set of modifications to the latent diffusion architecture. First comes the autoencoder, which consists of an encoder and a decoder. These autoencoders typically use four channels in their latent space, which compresses the data and restricts their ability to represent fine details. In this work, the authors increase the number of latent channels from 4 to 8 and then 16. To enrich the input further, they apply a Fourier feature transform that lifts the input channel dimension from 3 (RGB) to a higher dimension.
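To make the Fourier feature idea concrete, here is a minimal sketch of lifting a 3-channel RGB image to a higher channel dimension with fixed sinusoidal features. The number of frequency bands and the exact module layout are my own illustrative choices, not the paper’s implementation.

```python
import math
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Lift a 3-channel RGB image to a higher channel dimension with
    sinusoidal features, so the encoder sees more than raw pixel values.
    The number of frequency bands here is an illustrative choice."""

    def __init__(self, num_freqs: int = 4):
        super().__init__()
        # Fixed (not learned) frequencies: pi * 2^0 ... pi * 2^(num_freqs-1)
        self.register_buffer("freqs", math.pi * 2.0 ** torch.arange(num_freqs))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W), values roughly in [-1, 1]
        proj = x.unsqueeze(2) * self.freqs.view(1, 1, -1, 1, 1)   # (B, 3, F, H, W)
        feats = torch.cat([torch.sin(proj), torch.cos(proj)], dim=2)
        feats = feats.flatten(1, 2)           # (B, 3 * 2 * F, H, W)
        return torch.cat([x, feats], dim=1)   # keep the raw RGB channels too

lift = FourierFeatures(num_freqs=4)
img = torch.randn(1, 3, 256, 256)
print(lift(img).shape)  # torch.Size([1, 27, 256, 256]) -> 3 + 3*2*4 channels
```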
The next modification to the architecture is the U-Net used for denoising. To increase the capacity of the U-Net, the authors increase the channel width and the number of stacked residual blocks in each stage. Finally, they modify the decoder to output images at a high resolution of 1024 × 1024.
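As a rough illustration of what this extra capacity looks like in code, one could configure a larger denoising U-Net with the Hugging Face diffusers library as below. The channel widths, block counts, latent size, and text-embedding dimension are placeholder assumptions, not Emu’s actual configuration.

```python
from diffusers import UNet2DConditionModel

# Illustrative configuration: wider channels and more residual blocks per
# stage than the common Stable Diffusion defaults. The exact numbers used
# by Emu are not spelled out like this; these values are placeholders.
unet = UNet2DConditionModel(
    sample_size=128,            # latent resolution for 1024x1024 images, assuming 8x downsampling
    in_channels=16,             # matches the larger 16-channel autoencoder latent
    out_channels=16,
    layers_per_block=3,         # more stacked residual blocks per stage (default is 2)
    block_out_channels=(384, 768, 1536, 1536),  # wider than the default (320, 640, 1280, 1280)
    cross_attention_dim=1024,   # text-embedding dimension (assumed)
)
print(f"{sum(p.numel() for p in unet.parameters()) / 1e9:.2f}B parameters")
```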
The other change is the introduction of a noise offset in the final stages of pre-training, a technique introduced in earlier work on diffusion models that helps the model generate high-contrast images with very dark or very bright regions.
The offset can be introduced with a one-line code modification in PyTorch, where we add a small offset of around 0.1 to the noise generation process.
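Here is a minimal sketch of that modification, following the common noise-offset recipe used in diffusion training scripts; the tensor shapes are illustrative.

```python
import torch

# Standard Gaussian noise used to corrupt the latents during training.
latents = torch.randn(4, 16, 128, 128)   # (batch, channels, height, width), shapes illustrative
noise = torch.randn_like(latents)

# Noise offset: add a small per-channel constant shift so the model also
# learns to move the overall brightness of an image. 0.1 matches the
# magnitude mentioned above; the exact value is a tunable hyperparameter.
noise_offset = 0.1
noise = noise + noise_offset * torch.randn(
    latents.shape[0], latents.shape[1], 1, 1, device=latents.device
)
```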
To curate the quality fine-tuning dataset, the authors employ a two-stage filtering process: automatic filtering followed by manual filtering.
Similarly, portraits taken with professional photography gear are quite appealing compared to ordinary images because the background is blurred, creating what photographers call bokeh. Such a shot is much more aesthetically pleasing than, say, a selfie taken on a mobile phone.
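To give a flavour of what the automatic stage might look like, below is a hypothetical sketch that keeps only large, high-scoring images before they are handed to human reviewers. The metadata fields, scoring models, and thresholds are assumptions for illustration, not the filters Meta actually used.

```python
from dataclasses import dataclass

@dataclass
class ImageMeta:
    path: str
    width: int
    height: int
    aesthetic_score: float   # e.g. from an off-the-shelf aesthetic predictor (assumed)
    clip_score: float        # image-text alignment score (assumed)

def automatic_filter(images: list[ImageMeta],
                     min_side: int = 1024,
                     min_aesthetic: float = 6.0,
                     min_clip: float = 0.28) -> list[ImageMeta]:
    """First-stage filter: keep only large, high-scoring images.
    Thresholds are illustrative; the survivors would still go through
    manual review in the second stage."""
    kept = []
    for img in images:
        if min(img.width, img.height) < min_side:
            continue                      # drop low-resolution images
        if img.aesthetic_score < min_aesthetic:
            continue                      # drop visually unappealing images
        if img.clip_score < min_clip:
            continue                      # drop images whose captions don't match
        kept.append(img)
    return kept
```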
For evaluation, the authors are mainly concerned with two metrics: visual appeal and faithfulness. They ditch FID scores, as many recent papers argue that FID does not correlate well with human assessment of generative model performance.
Visual appeal is subjective, so the generated images are shown to five annotators, usually with images from two models displayed side by side, and each annotator chooses which of the two is more appealing. For example, when comparing the pre-trained model with the quality-tuned model (as shown in the figure above), the two images are shown to the annotators, who pick the more visually appealing one. I would personally choose the quality-tuned one straightaway.
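Since visual appeal is judged by majority preference over side-by-side pairs, the headline numbers reduce to simple win rates. A small sketch of that bookkeeping, with the vote layout being my own assumption:

```python
from collections import Counter

# Each inner list holds the five annotators' choices for one prompt:
# "A" = quality-tuned model preferred, "B" = pre-trained model preferred.
votes_per_prompt = [
    ["A", "A", "B", "A", "A"],
    ["B", "A", "A", "A", "B"],
    ["B", "B", "B", "A", "B"],
]

def win_rate(votes_per_prompt, model="A"):
    """Fraction of prompts where the given model wins the majority vote."""
    wins = 0
    for votes in votes_per_prompt:
        winner, _ = Counter(votes).most_common(1)[0]
        wins += winner == model
    return wins / len(votes_per_prompt)

print(f"quality-tuned win rate: {win_rate(votes_per_prompt):.2f}")  # 0.67
```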
Lastly, the paper shows that quality tuning is not restricted to latent diffusion models but also applies to other architectures such as pixel diffusion and masked generative transformers. As seen in the figure above, quality tuning also improves the visual aesthetics of the images generated by these models.
In my opinion, this paper is a clear eye-opener showing how small a fine-tuning dataset can be when tuning large models such as LLMs or VLMs. Emu’s results are so good that the generated images could be used in commercial applications.