UC Berkeley – GPT-4 + Stable-Diffusion = ?: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

TL;DR: Text Prompt -> LLM -> Intermediate Representation (such as an image layout) -> Stable Diffusion -> Image.

Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, despite their impressive capabilities, diffusion models, such as Stable Diffusion, often struggle to accurately follow the prompts when spatial or common sense reasoning is required.

The following figure lists four scenarios in which Stable Diffusion falls short in generating images that accurately correspond to the given prompts, namely negation, numeracy, and attribute assignment, spatial relationships. In contrast, our method, LLM-grounded Diffusion (LMD), delivers much better prompt understanding in text-to-image generation in those scenarios.

Figure 1: LLM-grounded Diffusion enhances the prompt understanding ability of text-to-image diffusion models.

One possible solution to address this issue is of course to gather a vast multi-modal dataset comprising intricate captions and train a large diffusion model with a large language encoder. This approach comes with significant costs: It is time-consuming and expensive to train both large language models (LLMs) and diffusion models.

Our Solution

To efficiently solve this problem with minimal cost (i.e., no training costs), we instead equip diffusion models with enhanced spatial and common sense reasoning by using off-the-shelf frozen LLMs in a novel two-stage generation process.

First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We invite readers to read the paper on arXiv for additional details.

Figure 2: LMD is a text-to-image generative model with a novel two-stage generation process: a text-to-layout generator with an LLM + in-context learning and a novel layout-guided stable diffusion. Both stages are training-free.

LMD’s Additional Capabilities

Additionally, LMD naturally allows dialog-based multi-round scene specification, enabling additional clarifications and subsequent modifications for each prompt. Furthermore, LMD is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

Figure 3: Incorporating an LLM for prompt understanding, our method is able to perform dialog-based scene specification and generation from prompts in a language (Chinese in the example above) that the underlying diffusion model does not support.

Given an LLM that supports multi-round dialog (e.g., GPT-3.5 or GPT-4), LMD allows the user to provide additional information or clarifications to the LLM by querying the LLM after the first layout generation in the dialog and generate images with the updated layout in the subsequent response from the LLM. For example, a user could request to add an object to the scene or change the existing objects in location or descriptions (the left half of Figure 3).

Furthermore, by giving an example of a non-English prompt with a layout and background description in English during in-context learning, LMD accepts inputs of non-English prompts and will generate layouts, with descriptions of boxes and the background in English for subsequent layout-to-image generation. As shown in the right half of Figure 3, this allows generation from prompts in a language that the underlying diffusion models do not support.

Visualizations

We validate the superiority of our design by comparing it with the base diffusion model (SD 2.1) that LMD uses under the hood. We invite readers to our work for more evaluation and comparisons.

Figure 4: LMD outperforms the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. LMD also enables counterfactual text-to-image generation that the base diffusion model is not able to generate (the last row).

For more details about LLM-grounded Diffusion (LMD), visit our website and read the paper on arXiv.

BibTex

If LLM-grounded Diffusion inspires your work, please cite it with:

@article{lian2023llmgrounded,

    title={LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models},

    author={Lian, Long and Li, Boyi and Yala, Adam and Darrell, Trevor},

    journal={arXiv preprint arXiv:2305.13655},

    year={2023}

}

Latest from MIT : Artificial intelligence model can detect Parkinson’s from breathing patterns

Parkinson’s disease is notoriously difficult to diagnose as it relies primarily on the appearance of motor symptoms such as tremors, stiffness, and slowness, but these symptoms often appear several years after the disease onset. Now, Dina Katabi, the Thuan (1990) and Nicole Pham Professor in the Department of Electrical Engineering and Computer Science (EECS) at…

Artificial Intelligence

Latest from MIT Tech Review – Roomba testers feel misled after intimate images ended up on Facebook

When Greg unboxed a new Roomba robot vacuum cleaner in December 2019, he thought he knew what he was getting into. He would allow the preproduction test version of iRobot’s Roomba J series device to roam around his house, let it collect all sorts of data to help improve its artificial intelligence, and provide feedback…

Artificial Intelligence

Latest from IBM Developer : Improve Watson Discovery results using API-based relevancy training

Summary Developers use the IBM Watson Discovery service to rapidly add a cognitive, search, and content analytics engine to applications. With that engine, they can identify patterns, trends, and insights from unstructured data that can drive better decision making. Sometimes, you want to improvise the search results by providing more training details. Relevance training is…

Artificial Intelligence

Latest from Google AI – PRESTO – A multilingual dataset for parsing realistic task-oriented dialogues

Posted by Rahul Goel and Aditya Gupta, Software Engineers, Google Assistant Virtual assistants are increasingly integrated into our daily routines. They can help with everything from setting alarms to giving map directions and can even assist people with disabilities to more easily manage their homes. As we use these assistants, we are also becoming more…

Artificial Intelligence

Latest from MIT Tech Review – Can we repair the internet?

From addictive algorithms to exploitative apps, data mining to misinformation, the internet today can be a hazardous place. Books by three influential figures—the intellect behind “net neutrality,” a former Meta executive, and the web’s own inventor—propose radical approaches to fixing it. But are these luminaries the right people for the job? Though each shows conviction,…

Artificial Intelligence

Latest from Google AI – Google Research embarks on effort to map a mouse brain

Posted by Michał Januszewski, Research Scientist, Google Research The human brain is perhaps the most computationally complex machine in existence, consisting of networks of billions of cells. Researchers currently don’t understand the full picture of how glitches in its network machinery contribute to mental illnesses and other diseases, such as dementia. However, the emerging connectomics…

Our Solution

LMD’s Additional Capabilities

Visualizations

BibTex

Similar Posts