Rich human feedback for text-to-image generation

Recent text-to-image (T2I) generation models, such as Stable Diffusion and Imagen, have made significant progress in generating high-resolution images from text descriptions. However, many generated images still suffer from issues such as artifacts (e.g., distorted objects, text, and body parts), misalignment with the text description, and low aesthetic quality. For example, given the prompt “A panda riding a motorcycle”, a generated image shows two pandas, along with undesired artifacts such as distorted panda noses and wheel spokes.

Inspired by the success of reinforcement learning from human feedback (RLHF) for large language models (LLMs), we explore whether learning from human feedback (LHF) can help improve image generation models. When applied to LLMs, human feedback can range from simple preference ratings (e.g., “thumbs up or down”, “A or B”) to more detailed responses, such as rewriting a problematic answer. However, current work on LHF for T2I mainly focuses on simple responses like preference ratings, since fixing a problematic image often requires advanced skills (e.g., image editing), making detailed feedback difficult and time-consuming to collect.

In “Rich Human Feedback for Text-to-Image Generation”, we design a process for collecting rich human feedback for T2I that is both specific (e.g., telling us what is wrong with the image and where) and easy to obtain. We demonstrate the feasibility and benefits of LHF for T2I. Our main contributions are threefold:

  • We curate and release RichHF-18K, a human feedback dataset covering 18K images generated by Stable Diffusion variants.
  • We train a multimodal transformer model, Rich Automatic Human Feedback (RAHF), to predict different types of human feedback, such as implausibility scores, heatmaps of artifact locations, and missing or misaligned text/keywords (a minimal sketch of this prediction interface appears after this list).
  • We show that the predicted rich human feedback can be leveraged to improve image generation, for example by inpainting the regions a predicted heatmap flags as implausible (see the sketch at the end of this post), and that the improvements generalize to models (such as Muse) beyond those used for data collection (Stable Diffusion variants).
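
To make the second contribution concrete, here is a minimal sketch of what RAHF's prediction interface could look like. The names (`RichFeedback`, `predict_rich_feedback`) are hypothetical, not the paper's API; per the paper, the real model is a multimodal transformer that consumes an (image, prompt) pair, and the stub below only fixes the shape of the feedback it produces.

```python
# Hypothetical sketch of the RAHF prediction interface; only the kinds of
# outputs (scores, heatmaps, keywords) are taken from the paper, the names
# and the stub implementation are illustrative.
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class RichFeedback:
    """Per-image feedback of the kinds RAHF is trained to predict."""
    implausibility_score: float      # how artifact-ridden the image looks overall
    artifact_heatmap: np.ndarray     # HxW map localizing implausible regions
    misalignment_heatmap: np.ndarray # HxW map of regions contradicting the prompt
    misaligned_keywords: List[str] = field(default_factory=list)  # prompt words missing/wrong in the image


def predict_rich_feedback(image: np.ndarray, prompt: str) -> RichFeedback:
    """Stub standing in for a forward pass of the RAHF transformer."""
    h, w = image.shape[:2]
    return RichFeedback(
        implausibility_score=0.0,
        artifact_heatmap=np.zeros((h, w), dtype=np.float32),
        misalignment_heatmap=np.zeros((h, w), dtype=np.float32),
        misaligned_keywords=[],
    )
```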

To the best of our knowledge, this is the first rich feedback dataset and model for state-of-the-art text-to-image generation.
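
As a hedged illustration of the third contribution, one way the predicted feedback could be used to improve generation is region inpainting: threshold the predicted artifact heatmap into a binary mask, then ask an inpainting-capable generator (such as Muse) to regenerate only the flagged region. In the sketch below, the `inpaint` callable is a hypothetical stand-in for such a generator, and the threshold and dilation values are illustrative, not values from the paper.

```python
# Hedged sketch: turn a predicted artifact heatmap into an inpainting mask.
# `inpaint` is a hypothetical stand-in for an inpainting-capable generator.
import numpy as np


def artifact_mask(heatmap: np.ndarray, threshold: float = 0.5,
                  dilate_px: int = 8) -> np.ndarray:
    """Binarize the artifact heatmap and grow it slightly so the
    regenerated patch fully covers the distorted region."""
    mask = heatmap >= threshold
    if dilate_px > 0:
        # Simple box dilation via a sliding-window maximum (no SciPy needed).
        padded = np.pad(mask, dilate_px, mode="constant")
        h, w = mask.shape
        out = np.zeros_like(mask)
        for dy in range(-dilate_px, dilate_px + 1):
            for dx in range(-dilate_px, dilate_px + 1):
                out |= padded[dilate_px + dy: dilate_px + dy + h,
                              dilate_px + dx: dilate_px + dx + w]
        mask = out
    return mask


def repair_image(image, prompt, feedback, inpaint):
    """Regenerate only the regions the predicted feedback flags as implausible."""
    mask = artifact_mask(feedback.artifact_heatmap)
    return inpaint(image=image, mask=mask, prompt=prompt)
```

Dilating the mask slightly before inpainting lets the regenerated patch blend past the exact artifact boundary; the loop-based dilation here just avoids a SciPy dependency in the sketch.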
