Quality-Diversity through AI Feedback (QDAIF) advances creative text generation by combining language models with the MAP-Elites algorithm. Experiments across diverse domains, including opinions, short stories, and poetry, show QDAIF outperforming baseline methods, and human evaluations confirm strong alignment between AI and human judgments, marking a significant step in the evolution of AI-driven creative processes.
Authors:
(1) Herbie Bradley, CarperAI, CAML Lab, University of Cambridge & EleutherAI;
(2) Andrew Dai, Aleph Alpha;
(3) Hannah Teufel, Aleph Alpha;
(4) Jenny Zhang, Department of Computer Science, University of British Columbia & Vector Institute;
(5) Koen Oostermeijer, Aleph Alpha;
(6) Marco Bellagente, Stability AI;
(7) Jeff Clune, Department of Computer Science, University of British Columbia, Vector Institute & Canada CIFAR AI Chair;
(8) Kenneth Stanley, Maven;
(9) Grégory Schott, Aleph Alpha;
(10) Joel Lehman, Stochastic Labs.
4 EXPERIMENTS ON CREATIVE WRITING DOMAIN
4.1 SETUP: OPINION WRITING, SHORT STORIES
To demonstrate the versatility of QDAIF across different applications of creative text evolution, we evaluated it on two domains: Opinions and Stories. The Opinions domain focuses on generating diverse, realistic opinion pieces about eating vegetables and plant-based foods; the diversity measure is the sentiment expressed on this topic (cf. the example texts in the Figure 2 overview). The Stories domain centers on short stories featuring two characters: a spy and a politician. The diversity of stories is evaluated using a variety of measures based on AI feedback, the main ones being: Stories - Genre (Romance vs Horror) (1D archive), Stories - Ending (Happy vs Tragic) (1D archive), and Stories - Genre and Ending (2D archive). These domains capture the strengths and limitations of all methods, ranging from simple (Opinions) to challenging (Stories - Genre and Ending). We show in Figure 1 that the 2D domain is challenging, yet QDAIF still outperforms the baseline in filling the archive with diverse, high-quality stories. The AI feedback prompts are outlined in Appendix A.23.
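To make the archive mechanics concrete, below is a minimal sketch, with a hypothetical bin count and helper names (the actual configuration is given in the appendices), of how a continuous AI-feedback diversity measure maps to a MAP-Elites bin and how elites are retained:

```python
NUM_BINS = 20  # illustrative bin count, not the paper's exact setting

def to_bin(measure: float, num_bins: int = NUM_BINS) -> int:
    """Map a diversity measure in [0, 1] (e.g. sentiment in Opinions,
    or genre/ending scores in Stories) to a discrete archive bin."""
    return min(int(measure * num_bins), num_bins - 1)

def insert_elite(archive: dict, bins: tuple, text: str, quality: float) -> None:
    """MAP-Elites update: keep a new text only if its bin is empty or
    its AI-feedback quality beats the current elite in that bin."""
    if bins not in archive or quality > archive[bins][1]:
        archive[bins] = (text, quality)

# 2D Stories example: (genre bin, ending bin) index a 2D archive.
archive = {}
insert_elite(archive, (to_bin(0.9), to_bin(0.1)), "A horror story...", 0.8)
```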
Evaluation. To assess the performance of methods in creative writing generation, we compute QD scores (Pugh et al., 2016), a standard metric for measuring the quality-diversity of a discovered corpus of texts. A QD score is defined as the sum of the highest quality values found in each bin. To understand the alignment between AI and human feedback for practical applications of QDAIF, we conducted a human evaluation study on selected elite samples from each method (chosen from the median QD score run out of 5 random seed runs). Using a Likert scale (Allen & Seaman, 2007) for quality assessment, we evaluate the capability of each method to produce a collection of diverse, high-quality texts. To do so, we calculate a "human" QD score, defined as the sum of quality scores given for all diversity categories identified by the annotator within the set. Furthermore, to understand how closely AI feedback aligns with human perspectives on subjective evaluation, we measured the agreement rates between human annotators and AI feedback, and between two human annotators. Details of the human study, which demonstrates the validity and advantages of AI feedback in producing human-like judgments of subjective quality and diversity, are given in Appendix A.1.
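Given the archive layout sketched above, the QD score reduces to a sum of per-bin elite qualities; this minimal sketch continues the example:

```python
def qd_score(archive: dict) -> float:
    """QD score: sum of the highest quality values found in each bin.

    Empty bins contribute nothing, so the score rewards both filling
    more bins (diversity) and raising each bin's elite (quality).
    """
    return sum(quality for _, quality in archive.values())

# e.g. two filled bins with elites of quality 0.9 and 0.7 -> QD score 1.6
print(qd_score({(0,): ("text a", 0.9), (2,): ("text b", 0.7)}))
```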
4.2 COMPARISONS BETWEEN QDAIF AND BASELINES
To evaluate the strengths and limitations of QDAIF in generating high-quality and diverse creative writing texts, we compared our method against the following baseline methods:
• Fixed-Few-Shot: Use a fixed few-shot prompt (cf. Appendix A.21) to sample many completions for creative domain texts (i.e. no iterative search).
• Shuffling-Few-Shot: Shuffle the in-context examples of the Fixed-Few-Shot prompt before sampling the completion from this prompt.
• Random-Search: Create a prompt pool of examples (initialized from examples in Appendix A.21), add all completions from few-shot prompting to the pool, and choose few-shot examples from the growing pool (without a pool size limit).
• LMX, Quality-Only: Maintain a pool as in Random-Search, but keep only up to the 100 highest-quality completions (as evaluated by AI feedback) in the pool (i.e. iterative search focused only on quality). Single-objective LMX as in Meyerson et al. (2023). A minimal sketch of both pool-update rules follows this list.
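The sketch below contrasts the two pool updates, assuming hypothetical `generate` and `quality` helpers (the actual prompting and AI-feedback scoring details are in the appendices):

```python
import heapq
import random
from typing import Callable, List, Tuple

def random_search_step(pool: List[str],
                       generate: Callable[[List[str]], str],
                       k: int = 3) -> None:
    """Random-Search: every new completion joins the unbounded pool."""
    few_shot = random.sample(pool, k)
    pool.append(generate(few_shot))

def quality_only_step(pool: List[Tuple[float, str]],
                      generate: Callable[[List[str]], str],
                      quality: Callable[[str], float],
                      k: int = 3, max_size: int = 100) -> None:
    """LMX, Quality-Only: keep only the 100 highest-quality completions,
    as scored by AI feedback; no pressure toward diversity."""
    few_shot = [text for _, text in random.sample(pool, k)]
    candidate = generate(few_shot)
    heapq.heappush(pool, (quality(candidate), candidate))  # min-heap on quality
    if len(pool) > max_size:
        heapq.heappop(pool)  # evict the current lowest-quality text
```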
We chose a variety of baselines, some highlighting representative alternative approaches (e.g. few-shot prompting), and others serving as ablations of QDAIF to validate our algorithmic choices. For example, Fixed-Few-Shot and Shuffling-Few-Shot enable the generation of different texts (relying on stochastic sampling) while being constrained to the output distribution of the fixed set of in-context examples. Random-Search and LMX, Quality-Only are methods where the (prompt) pool of examples that can be sampled from grows in size, starting from an initial pool. In contrast to QDAIF, Random-Search places no constraints on the prompt pool, in particular no evaluation step to maintain the quality of the growing pool. LMX, Quality-Only adds a quality-based evaluation step with AI feedback that optimizes the pool over time to contain only texts with high quality scores, but unlike QDAIF it is not designed to encourage diversity in the generated texts.
For each domain and baseline described above, we recorded runs for 2000 iterations, repeated with 5 different random seeds. To enable comparisons with baselines, AI feedback is used to compute the quality and diversity measures for all iteration outputs. Similar to QDAIF, the baseline methods use hand-written examples in Appendix A.21, either in a fixed prompt or as the initial prompt pool.
Performance Comparison. We report results comparing the QD score performance of the different methods for generating creative writing texts in Figure 3. We computed the mean and the bootstrapped 95% confidence intervals (CI) from 100k resamples, across 5 random seeds of runs. QDAIF achieves significantly better QD scores than all baseline methods in Opinions and Stories. The broader range of texts generated by QDAIF is also evident qualitatively. For example, in the Stories - Genre and Ending domain, while the baseline methods deliver straightforward and more predictable stories of how the spy "pulled out a knife and stabbed [the politician] to death", QDAIF generates a more dramatic story of how the spy "transformed into the monster and killed everyone". Random-Search is the worst-performing method overall, with significantly lower QD score performance in Opinions and Stories - Genre compared to the best-performing baselines. Interestingly, LMX, Quality-Only does not significantly outperform the methods using a fixed-population prompt pool (Fixed-Few-Shot and Shuffling-Few-Shot). On Stories - Genre, LMX, Quality-Only is often weaker than Fixed-Few-Shot and Shuffling-Few-Shot. These results show that single-objective optimization alone cannot guide the search for diverse, high-quality texts. Furthermore, we found that QDAIF outperforms other diversity-seeking baselines, as well as extensions of those baselines with quality filters that make them more similar to QDAIF (cf. Appendix A.8).
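For reference, the reported statistic corresponds to a standard percentile bootstrap over per-seed QD scores; the helper below is a minimal sketch (the paper's exact resampling details may differ):

```python
import numpy as np

def bootstrap_ci(per_seed_qd_scores, n_resamples=100_000, alpha=0.05, seed=0):
    """Mean and bootstrapped 95% CI of QD scores across random seeds."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_seed_qd_scores, dtype=float)
    # Resample the per-seed scores with replacement and average each draw.
    means = rng.choice(scores, size=(n_resamples, len(scores))).mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)
```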
Human Feedback Evaluation. We report the results of the human study comparing QDAIF and baseline samples in Table 1. Compared to baselines, QDAIF is competitive with or better at discovering diverse, high-quality texts in Opinions and Stories according to human feedback. QDAIF sets also showed high agreement between humans and AI feedback on the diversity categories of presented texts, as well as between two annotators, competitive with Fixed-Few-Shot. Although the average perceived quality of texts is higher for Fixed-Few-Shot, this is not enough to provide high-quality examples across different niches of outputs (i.e. a higher QD score). Furthermore, Shuffling-Few-Shot demonstrates even lower human evaluation scores, despite using the same fixed set of hand-written seeds, indicating lower robustness to the ordering of few-shot examples. Prior work hints at the sensitivity of LMs to few-shot prompt ordering, with task-solving capabilities varying significantly with that ordering (Lu et al., 2021). The gap in human-evaluated performance between Fixed-Few-Shot and Shuffling-Few-Shot indicates that reliance on fixed prompts is less likely to enable reliable search, in contrast to the robustness shown by QDAIF. Additionally, Random-Search and LMX, Quality-Only obtained even lower human evaluation scores, even though these methods either explore different prompts or optimize for the quality of texts. We provide a detailed discussion (for baseline methods) of findings from the subjective study of the discovered texts in Appendix A.15, as well as of the qualitative behavior of the text search over time in Appendix A.16. Through guided evolutionary search, QDAIF surpasses all baseline methods in computed QD score performance, and is competitive with or better than baselines according to human evaluation.
4.3 EXTENSIONS TO AI FEEDBACK AND MUTATION MODEL
In addition to the experiments with QDAIF described in previous sections, we investigated how variations of the method affect performance.
LMX Model Size. We used larger versions of the LMX models (30B and 70B) for mutation and compared them to the performance of the default 13B model. While no relationship was found between model size and QD score, quality ratings from human feedback improved with outputs from larger models (described in detail in Appendix A.3).
Few-Shot AI Feedback. We compared the performance of QDAIF on the Stories - Genre domain when prompting our AI feedback model for diversity measures with 2-shot, 4-shot, and 8-shot prompts. Using more few-shot examples led to improvements in human quality ratings of texts. Further discussion and results are highlighted in Appendix A.4.
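As an illustration of what varying the shot count means here, the following hypothetical sketch assembles a k-shot genre-feedback prompt; the real prompts are in Appendix A.23, and the example stories and labels below are invented purely for illustration:

```python
# Invented (text, label) pairs standing in for the hand-written examples.
EXAMPLES = [
    ("The spy took the politician's hand under the moonlight.", "romance"),
    ("The politician's shadow rose from the floor, hungry.", "horror"),
    ("The spy wrote the politician one last unsent letter.", "romance"),
    ("Something scratched inside the embassy walls all night.", "horror"),
]

def genre_feedback_prompt(story: str, k: int = 2) -> str:
    """Build a k-shot prompt asking the AI feedback model for a genre label."""
    shots = "\n\n".join(
        f"Story: {text}\nGenre: {label}" for text, label in EXAMPLES[:k]
    )
    return f"{shots}\n\nStory: {story}\nGenre:"

# With a larger example list, k in {2, 4, 8} reproduces the comparison above.
```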
Varying Initialization and Mutation Method. Ideally, QDAIF would be simpler if it could be run without seed examples (e.g. requesting a story from an instruction-following LM). We investigated the potential of QDAIF when the solution population is initialized from zero-shot prompted generations and evolved using LMX. Initial results on Opinions and Stories, along with discussion of prior work, are shown in Appendix A.5; they highlight comparable performance in terms of QD score, with some divergence in alignment with human preferences. We also find that QDAIF, in these domains, is robust to the mechanism for generating variation. Appendix A.6 describes an alternative mutation method (based on a more gradual few-shot replacement operation) that is more effective in some circumstances, though it generally offers comparable performance. We provide a detailed discussion (for QDAIF methods) of findings from the subjective study of the discovered texts in Appendix A.13, as well as of the qualitative behavior of the text search over time in Appendix A.14.
There may be many ways of leveraging LMs to generate variation (including finetuning models; see Appendix A.11), and this is an exciting avenue of future research.
4.4 EVOLVING SOLUTIONS THROUGH INSTRUCTION GUIDANCE
This experiment explores scaling up QDAIF to a more capable model, GPT-4 (OpenAI, 2023), on the challenging task of generating poetry, highlighting how QDAIF will benefit from advances in model capabilities. The aim of the Poetry domain is to produce high-quality poems with varying genres and emotional tones, unrestricted by topic. Here, the MAP-Elites archive has two axes of diversity, genre and tone, delineated using categorical labels. The genre axis has the labels "haiku", "sonnet", "ballad", "limerick", and "hymn", while the tone axis has the labels "happy", "dark", "mysterious", "romantic", and "reflective". We created a new mutation operator for this domain, LMX-rewrite, which leverages instruction following to rewrite a parent poem into an offspring with different diversity characteristics. To generate a new solution, we prompt GPT-4 to rewrite the selected poem into an inspired, different poem: "Inspired by this poem, write a new poem of very high, award winning quality with a different poetic genre (format and form) and tone compared to the poem above."
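A minimal sketch of this mutation step, assuming the current openai Python SDK and the "gpt-4" model name (the paper's actual client code and full setup are in Appendix A.30):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_INSTRUCTION = (
    "Inspired by this poem, write a new poem of very high, award winning "
    "quality with a different poetic genre (format and form) and tone "
    "compared to the poem above."
)

def lmx_rewrite(parent_poem: str) -> str:
    """LMX-rewrite mutation: ask GPT-4 to rewrite a parent poem into a
    different-genre, different-tone offspring."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"{parent_poem}\n\n{REWRITE_INSTRUCTION}"}],
    )
    return response.choices[0].message.content
```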
We used GPT-4 to determine the genre and tone of a poem and rate its quality. For quality, we prompt GPT-4 to "Rate the quality of the above poem on a scale from 1 to 10." To determine the diversity attributes, we prompt GPT-4 to determine which genre or tone the poem is closest to. For example, to determine genre, we ask GPT-4: "What genre is this poem closest to from the following list: ["haiku", "sonnet", "ballad", "limerick", "hymn"]?" Appendix A.30 shows the full prompts and setup. We observed high consistency in GPT-4's responses; across multiple LM calls, quality ratings for the same poem fluctuated by no more than a single point. We show qualitative examples of poems with their AI feedback evaluations in Appendix A.19. In this setup, all methods are run for 200 iterations, and QDAIF is benchmarked against the baseline, Random-Poems, as well as an ablation method, Fixed Seed Rewrite. Random-Poems simply generates 200 random poems. Fixed Seed Rewrite rewrites only the seed poem in Appendix A.30, without evolutionary search.
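A corresponding sketch of the evaluation step, mapping a poem to a (genre, tone) archive cell plus a quality score; the prompt wording follows the quotes above, while the parsing and client code are assumptions (cf. Appendix A.30):

```python
import re
from openai import OpenAI

client = OpenAI()

GENRES = ["haiku", "sonnet", "ballad", "limerick", "hymn"]
TONES = ["happy", "dark", "mysterious", "romantic", "reflective"]

def ask_gpt4(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def evaluate_poem(poem: str):
    """Return (genre, tone, quality) for binning a poem into the archive."""
    quality_reply = ask_gpt4(
        f"{poem}\n\nRate the quality of the above poem on a scale from 1 to 10."
    )
    genre_reply = ask_gpt4(
        f"{poem}\n\nWhat genre is this poem closest to from the following list: {GENRES}?"
    )
    tone_reply = ask_gpt4(
        f"{poem}\n\nWhat tone is this poem closest to from the following list: {TONES}?"
    )
    # Naive parsing (an assumption): first integer for quality, first
    # label from each list that appears in the reply for genre/tone.
    match = re.search(r"\d+", quality_reply)
    quality = int(match.group()) if match else None
    genre = next((g for g in GENRES if g in genre_reply.lower()), None)
    tone = next((t for t in TONES if t in tone_reply.lower()), None)
    return genre, tone, quality
```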
We found that QDAIF achieves a higher QD score of 130 (CI: 118 - 145) in comparison to Random-Poems with 76 (CI: 67 - 85) and Fixed Seed Rewrite with 99 (CI: 72 - 117). We observed a similar trend (with a wider performance gap between QDAIF and other methods) when we used GPT-3.5-Turbo for the generation step instead of GPT-4 while keeping GPT-4 as the evaluator (cf. Appendix A.17). QDAIF was shown to have greater QD score performance than the other methods according to the Mann-Whitney U test (p ≤ 0.05; a minimal sketch of this test follows this paragraph). Still, QDAIF with LMX-rewrite fails to discover solutions in some bins (e.g. limericks). To address this, we can adapt QDAIF by adding guidance on the desired (randomly chosen) genre and tone for rewriting (LMX-guided). This version of QDAIF performs on par with a Targeted-Poems approach (generating high-quality poems of randomly chosen genre and tone per step) in terms of QD score, and even better when using an older version of GPT-4 (cf. Appendix A.17). Furthermore, we found that the rewriting step is useful for generating poems that meaningfully preserve inspiration from parent poems, enabling users to control the outcomes of search with more nuance by approximating the iterative refinement typical of human creative processes (Stanley et al., 2017) (see Appendix A.18). Figure 4 highlights the potential of QDAIF with GPT-4, especially in controlling a set of solutions subjectively aligned with AI feedback labels. QDAIF also demonstrates practical applicability to a domain outside of creative writing: solving coding problems (cf. Appendix A.20). Overall, the results highlight that diversity suffers when prompting models without explicit diversity guidance (Renda et al., 2023; Friedrich et al., 2023; Kirk et al., 2023), and that evolution with a rewriting mutation operator can lead to more human-influenceable, diverse, and high-quality solutions.
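As referenced above, the significance comparison can be reproduced with the standard SciPy implementation of the test; the per-seed scores below are placeholders, not the paper's data:

```python
from scipy.stats import mannwhitneyu

# Placeholder per-seed QD scores (NOT the paper's data), illustrating the
# one-sided Mann-Whitney U test comparing QDAIF against a baseline.
qdaif_scores = [130, 118, 145, 127, 133]
baseline_scores = [76, 67, 85, 72, 80]

stat, p = mannwhitneyu(qdaif_scores, baseline_scores, alternative="greater")
print(f"U = {stat:.0f}, p = {p:.4f}")  # QDAIF is better if p <= 0.05
```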