Teaching Orca 2 to be a Cautious Reasoner

Authors:

(1) Arindam Mitra; (2) Luciano Del Corro, work done while at Microsoft; (3) Shweti Mahajan, work done while at Microsoft; (4) Andres Codas, denotes equal contributions; (5) Clarisse Simoes, denotes equal contributions; (6) Sahaj Agarwal; (7) Xuxi Chen, work done while at Microsoft; (8) Anastasia Razdaibiedina, work done while at Microsoft; (9) Erik Jones, work done while at Microsoft; (10) Kriti Aggarwal, work done while at Microsoft; (11) Hamid Palangi; (12) Guoqing Zheng; (13) Corby Rosset; (14) Hamed Khanpour; (15) Ahmed Awadallah.
B. BigBench-Hard Subtask Metrics
C. Evaluation of Grounding in Abstractive Summarization
F. Illustrative Example from Evaluation Benchmarks and Corresponding Model Output
• LLaMA-2 Models: We use both the 70-billion and 13-billion parameter models from the LLaMA 2 series [57], specifically the LLaMA2-70B-hf-chat [6] and LLaMA2-13B-hf-chat [7] checkpoints.
• WizardLM: WizardLM [64] is an instruction-tuned version of LLaMA 2, created through the Evol-Instruct technique, which autonomously generates a diverse array of intricate instruction data. We use both the 13B (V1.2 [8]) and 70B (V1.0 [9]) parameter versions.
• Orca: Orca 1 [42] is a 13-billion parameter model that learns through explanations, step-by-step thought processes, and complex instructions and is based on the LLaMA model [57].
• GPT Models: We show the performance of both ChatGPT (GPT-3.5-Turbo) and GPT-4 [44]. We utilized the Azure OpenAI API version “2023-03-15-preview”.
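As a purely illustrative sketch (not the authors' evaluation harness), the snippet below shows how a single zero-shot query might be issued through the Azure OpenAI API version mentioned above. It assumes the pre-1.0 `openai` Python SDK; the endpoint, API key, deployment name, and the `query_model` helper are placeholders, not details taken from the paper.

```python
# Hypothetical zero-shot evaluation call via Azure OpenAI (pre-1.0 openai SDK).
# Endpoint, key, and deployment name below are placeholders.
import openai

openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com/"  # placeholder endpoint
openai.api_version = "2023-03-15-preview"                      # API version cited above
openai.api_key = "<your-api-key>"                              # placeholder key

def query_model(system_message: str, prompt: str, deployment: str = "gpt-4") -> str:
    """Send one zero-shot query and return the model's text response."""
    response = openai.ChatCompletion.create(
        engine=deployment,  # Azure deployment name (assumption)
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,  # deterministic decoding for evaluation (assumption)
    )
    return response["choices"][0]["message"]["content"]
```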
• AGIEval: AGIEval [69] is a collection of diverse standardized tests, including general college admission tests like the GRE, GMAT, and SAT; law-focused examinations such as the LSAT and lawyer qualification assessments; math competitions; and national civil service examinations.
• CRASS: The CRASS [11] dataset evaluates counterfactual reasoning abilities of LLMs.
• RACE: The RACE dataset [27] is a collection of reading comprehension questions derived from English examinations given to Chinese students aged between 12 and 18 years.
• Big-Bench Hard (BBH): BBH [54] is a subset of 23 of the hardest tasks of BIG-Bench [52], with a focus on challenging tasks such as those requiring multi-step reasoning.
• GSM8K: This is a collection of word problems that test the ability to perform multi-step mathematical reasoning [9].
• Massive Multitask Language Understanding benchmark: MMLU [17] is designed to measure the language understanding, knowledge and reasoning abilities of models and consists of 57 tasks.
• ARC: The AI2 Reasoning Challenge [8] is a benchmark that tests the ability of text models to answer multiple-choice questions from science exams spanning Grade 3 to Grade 9 with two subsets: Easy and Challenge.
• HellaSwag: A dataset [66] for evaluating commonsense natural language inference. It tests the ability of natural language models to complete the description of a physical situation with the most plausible continuation of the scene.
• LAMBADA: This dataset [48] is a collection of 10,022 passages from 2,663 novels that tests the ability of natural language models to perform long-range contextual understanding.
• MT-bench: A benchmark tailored for evaluating the proficiency of chat assistants in multi-turn conversations [67], using GPT-4 as the judge.
• ACI-BENCH: It contains full doctor-patient conversations and associated clinical notes from various medical domains. The task is to generate a clinical note from the dialogue [59].
• MS-MARCO: This dataset [2] is a large-scale collection of natural language questions and answers derived from real web queries and documents.
• QMSum: A benchmark [68] for query-based multi-domain meeting summarization, where models have to select and summarize relevant spans of meetings in response to a query.
• ToxiGen: This is a large-scale, machine-generated dataset [16] of 274,186 toxic and benign statements about 13 minority groups with a focus on implicit hate speech that does not contain slurs or profanity. We use the dataset to test a model’s ability to both identify and generate toxic content.
• HHH: This dataset [53] is a benchmark for evaluating the alignment of language models with respect to helpfulness, honesty, and harmlessness, where a language model is asked to choose the best response among two options.
• TruthfulQA: A benchmark [30] for evaluating the truthfulness of LLMs when generating answers to questions that humans tend to answer falsely due to false beliefs, biases, and misconceptions. The evaluation benchmark contains 817 questions spanning 38 categories (e.g., health, law, finance, and politics). We evaluate the models on a multiple-choice variant of the dataset.
• Automated RAI Measurement Framework: We also use a recently proposed framework [34] for evaluating the safety of a given chat-optimized model in a conversational setting. In particular, one LLM poses as a user and engages in a conversation with the LLM under test to evaluate potential harmful content, IP leakage, and jailbreaks.
Prompts: We use empty system messages and simple prompts for all models to avoid variations in quality due to prompt engineering, except for general guidelines around answer formats for some tasks. To minimize diversity and establish a reliable evaluation process, we often include formatting guidelines in system messages to improve the accuracy of answer extraction. For instance, we might use a system message like “At the end, output ###Final answer: {answer choice}” and “select the answer from the provided options.” Table F shows the prompts used for each dataset. For Orca 2, we report performance with both an “empty” system message and a “cautious” system message. The latter is a generic system message that was described in Section 4.
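The sketch below illustrates how a system message and user prompt of this kind might be assembled for a multiple-choice item. It is an illustrative example rather than the authors' code; the `build_mcq_prompt` name, the option lettering, and the exact wording are assumptions.

```python
# Hypothetical helper for assembling an MCQ prompt with the formatting
# guideline quoted above. Function name and wording are illustrative only.
def build_mcq_prompt(question: str, options: list[str]) -> tuple[str, str]:
    """Return a (system_message, user_prompt) pair for a multiple-choice item."""
    system_message = (
        "Select the answer from the provided options. "
        "At the end, output ###Final answer: {answer choice}"
    )
    letters = "ABCDEFGH"
    formatted_options = "\n".join(
        f"({letters[i]}) {opt}" for i, opt in enumerate(options)
    )
    user_prompt = f"{question}\n\nAnswer choices:\n{formatted_options}"
    return system_message, user_prompt
```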
Answer parsing: Parsing answers from the free-form responses of generative models is a difficult task. Therefore, we divided the evaluation tasks into three categories based on the type of task and the extraction required (an illustrative extraction sketch follows the list), namely:
• MCQ (Multiple-Choice Questions): These tasks require extraction of the option selected as the final answer by the model. We also cast classification tasks into this category, where the classes represent the options for the model to choose from. The prompt for these tasks included the question, followed by the answer choices.
• Exact Match/Span Extraction: These tasks require extraction of the exact final answer in the response or a span from the context provided.
• No extraction required: This category is for tasks that did not require extraction. Open-ended question answering falls into this category.
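The sketch below shows one way extraction could be implemented for the MCQ category, keyed to the “###Final answer:” formatting guideline quoted earlier. The `extract_mcq_answer` helper, its regex, and the fallback heuristic are assumptions, not the parsing logic used in the paper.

```python
# Illustrative MCQ answer extraction; regex and fallback are assumptions.
import re

def extract_mcq_answer(response: str, num_options: int) -> str | None:
    """Return the option letter chosen in a free-form response, if any."""
    letters = "ABCDEFGH"[:num_options]
    # Preferred path: look for the requested "###Final answer:" marker.
    match = re.search(r"###\s*Final answer:\s*\(?([A-H])\)?", response, re.IGNORECASE)
    if match and match.group(1).upper() in letters:
        return match.group(1).upper()
    # Fallback heuristic: take the last standalone option letter in the response.
    candidates = re.findall(rf"\b([{letters}])\b", response)
    return candidates[-1] if candidates else None
```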
In the following sections, we discuss the performance of Orca 2 and other baseline models on the benchmarks described above in a zero-shot setting.