Authors:
(1) Qian Yang, Zhejiang University, Equal contribution. This work was conducted during Qian Yang’s internship at Alibaba Group;
(2) Jin Xu, Alibaba Group, Equal contribution;
(3) Wenrui Liu, Zhejiang University;
(4) Yunfei Chu, Alibaba Group;
(5) Xiaohuan Zhou, Alibaba Group;
(6) Yichong Leng, Alibaba Group;
(7) Yuanjun Lv, Alibaba Group;
(8) Zhou Zhao, Alibaba Group, corresponding author ([email protected]);
(9) Yichong Leng, Zhejiang University;
(10) Chang Zhou, Alibaba Group, corresponding author ([email protected]);
(11) Jingren Zhou, Alibaba Group.
Table of Links
Abstract and 1. Introduction
2 Related Work
3 AIR-Bench and 3.1 Overview
3.2 Foundation Benchmark
3.3 Chat Benchmark
3.4 Evaluation Strategy
4 Experiments
4.1 Models
4.2 Main Results
4.3 Human Evaluation and 4.4 Ablation Study of Positional Bias
5 Conclusion and References
A Detailed Results of Foundation Benchmark
Abstract
Recently, instruction-following audio-language models have received broad attention for human-audio interaction. However, the absence of benchmarks capable of evaluating audio-centric interaction capabilities has impeded advancements in this field. Previous benchmarks primarily focus on assessing different fundamental tasks, such as Automatic Speech Recognition (ASR), and lack an assessment of the open-ended generative capabilities centered around audio. Thus, it is challenging to track progress in the Large Audio-Language Models (LALMs) domain and to provide guidance for future improvement. In this paper, we introduce AIR-Bench (Audio InstRuction Benchmark), the first benchmark designed to evaluate the ability of LALMs to understand various types of audio signals (including human speech, natural sounds, and music) and, furthermore, to interact with humans in textual format. AIR-Bench encompasses two dimensions: foundation and chat benchmarks. The former consists of 19 tasks with approximately 19k single-choice questions, intended to inspect the basic single-task abilities of LALMs. The latter contains 2k instances of open-ended question-and-answer data, directly assessing the model's comprehension of complex audio and its capacity to follow instructions. Both benchmarks require the model to generate hypotheses directly. We design a unified framework that leverages advanced language models, such as GPT-4, to score the generated hypotheses given the meta-information of the audio. Experimental results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation. By revealing the limitations of existing LALMs through evaluation results, AIR-Bench can provide insights into the direction of future research.
1 Introduction
Recent advancements in artificial general intelligence have been significantly driven by the emergence of large language models (LLMs) (Brown et al., 2020; OpenAI, 2022, 2023; Chowdhery et al., 2022; Anil et al., 2023; Touvron et al., 2023a,b; Bai et al., 2023a). These models exhibit remarkable abilities in retaining knowledge, engaging in intricate reasoning, and solving problems following human intent. Motivated by this striking progress, the domain of large audio-language models (LALMs) has undergone a revolutionary transformation. To perceive and comprehend rich audio signals and further generate textual responses following human instructions, many works have been proposed, such as SALMONN (Tang et al., 2023a), BLSP (Wang et al., 2023a), Speech-LLaMA (Wu et al., 2023a), and Qwen-Audio (Chu et al., 2023), showcasing promising capabilities for audio-centric dialogue.
However, previous LALMs (Tang et al., 2023a; Wang et al., 2023a; Wu et al., 2023a; Chu et al., 2023; Huang et al., 2023b; Shen et al., 2023; Gong et al., 2023; Wang et al., 2023b) have predominantly concentrated on evaluation of specific fundamental tasks. The absence of a standardized benchmark for assessing the generative instruction-following abilities of these models has resulted in a reliance on showcasing examples or releasing chat models for public experimentation to demonstrate their conversational skills. This approach poses significant challenges for conducting fair and objective comparisons across different research endeavors. Moreover, it tends to obscure the models’ existing limitations, impeding the ability to monitor advancements within the domain of LALMs.
For evaluation in audio domains, the majority of research efforts have concentrated on the creation of benchmarks tailored to individual tasks, such as LibriSpeech (Panayotov et al., 2015) and the Common Voice benchmark (Ardila et al., 2019) for ASR. Beyond task-specific benchmarks, SUPERB (Yang et al., 2021a) and HEAR (Turian et al., 2021) have been designed to test the versatility of self-supervised learning models across a wide variety of tasks. Regarding the assessment of LALMs’ ability to follow instructions, to the best of our knowledge, Dynamic-SUPERB (Huang et al., 2023a) is the only benchmark devoted to this aspect. Nevertheless, Dynamic-SUPERB focuses only on human speech processing and does not extend to the assessment of models’ capabilities in producing open-ended generations such as dialogues.
In this paper, we present AIR-Bench (Audio InstRuction Benchmark), a novel benchmark designed to evaluate the ability of LALMs to comprehend various audio signals and to interact following instructions. AIR-Bench is characterized by three primary features: 1) Comprehensive audio signal coverage. AIR-Bench covers a broad range of audio signals, including human speech, natural sounds, and music, ensuring a thorough evaluation of LALMs’ capabilities. 2) Hierarchical benchmark structure. The benchmark consists of foundation and chat benchmarks. The foundation benchmark comprises 19 distinct audio tasks with over 19,000 single-choice questions, with each question focusing on a specific foundational ability; GPT-4 (OpenAI, 2023) extends the questions and candidate choices using carefully designed prompts. The chat benchmark consists of over 2,000 audio-prompted open-ended questions. To enhance the complexity of the audio and achieve a closer resemblance to the intricate audio encountered in real-life situations, we propose a novel audio mixing strategy that incorporates loudness control and temporal dislocation. Specifically, we adjust the loudness and introduce different temporal offsets when mixing two audio clips. The resulting variations in relative loudness and temporal location are then recorded as additional meta-information, contributing to a more comprehensive textual representation of the audio. Data quality is upheld through automated filtering by GPT-4, followed by manual verification. 3) Unified, objective, and reproducible evaluation framework. Models are required to generate hypothesis sequences directly across both benchmarks to align more closely with practical scenarios. We then employ GPT-4 to generate reference answers from the meta-information through carefully constructed prompts. Given references and hypotheses, following Liu et al. (2023b) and Bai et al. (2023b), we use GPT-4 (OpenAI, 2023) to judge whether the chosen option is correct for the foundation benchmark and to score hypotheses for the chat benchmark. We further perform a second scoring with the answer positions swapped to eliminate positional bias. Based on comprehensive experiments on 9 LALMs, we observe that existing LALMs have either limited audio understanding or limited instruction-following capabilities, leaving significant room for improvement in this field.
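To make the mixing strategy concrete, the sketch below shows one plausible way to overlay two clips with loudness control and temporal dislocation while recording the mixing parameters as meta-information. It is a minimal illustration under our own assumptions (the `mix_with_offset` name, gain value, offset policy, and peak normalization), not the paper's released implementation.

```python
# Illustrative sketch (assumptions, not the authors' code): mix two audio
# clips with loudness control and a temporal offset, and keep the mixing
# parameters as extra meta-information describing the resulting audio.
import numpy as np


def mix_with_offset(primary: np.ndarray,
                    secondary: np.ndarray,
                    sample_rate: int,
                    gain_db: float = -6.0,
                    offset_sec: float = 1.5):
    """Overlay `secondary` onto `primary` at a temporal offset and reduced loudness."""
    # Loudness control: attenuate the secondary clip by `gain_db` decibels.
    gain = 10.0 ** (gain_db / 20.0)
    secondary = secondary.astype(np.float32) * gain

    # Temporal dislocation: start the secondary clip `offset_sec` seconds
    # into the primary clip.
    offset = int(offset_sec * sample_rate)
    length = max(len(primary), offset + len(secondary))

    mixture = np.zeros(length, dtype=np.float32)
    mixture[:len(primary)] += primary.astype(np.float32)
    mixture[offset:offset + len(secondary)] += secondary

    # Simple peak normalization to avoid clipping after summation.
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        mixture = mixture / peak

    # Relative loudness and temporal location are recorded as meta-information
    # for the textual description of the mixed audio.
    meta = {"relative_loudness_db": gain_db, "temporal_offset_sec": offset_sec}
    return mixture, meta
```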
Our contributions are summarized below:
• AIR-Bench is the first generative evaluation benchmark for large audio-language models, encompassing a wide array of audio such as speech, natural sounds, and music. AIR-Bench is a large and hierarchical benchmark, consisting of a foundation benchmark with 19 audio tasks and over 19k single-choice questions, alongside a chat benchmark with over 2k meticulously curated open-ended audio questions for comprehensive evaluation.
• We propose a novel audio mixing strategy with loudness control and temporal dislocation to enhance the complexity of the audio.
• We develop a unified, objective, and reproducible evaluation framework to assess the quality of generative hypotheses (an illustrative sketch of the position-swapped scoring follows this list).
• We conducted a thorough evaluation of 9 models for the purpose of benchmarking. The evaluation code, datasets, and an open leaderboard will be made publicly available soon.
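The position-swapped scoring can be summarized with the minimal sketch below. The `call_judge` callable is a hypothetical placeholder that wraps the GPT-4 judging prompt and returns a score for each of the two presented answers; the averaging rule is our assumption of how the two passes are combined, not the benchmark's released evaluation code.

```python
# Illustrative sketch (assumptions, not the released evaluation code):
# score a hypothesis twice, swapping its position relative to the
# reference, and average the two scores to reduce positional bias.
from statistics import mean


def score_with_swap(call_judge, question, reference, hypothesis):
    """Score `hypothesis` twice with its position swapped, then average.

    `call_judge(question, first_answer, second_answer)` is assumed to wrap a
    GPT-4 call and return a (score_first, score_second) pair.
    """
    # First pass: the reference is shown first, the hypothesis second.
    _, hyp_first_pass = call_judge(question, reference, hypothesis)
    # Second pass: positions are swapped so the hypothesis comes first.
    hyp_second_pass, _ = call_judge(question, hypothesis, reference)
    # Averaging the two passes cancels a judge's preference for either slot.
    return mean([hyp_first_pass, hyp_second_pass])
```

LLM judges are known to favor answers in a fixed position; scoring both orderings and averaging is a common mitigation for this bias.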
This paper is under CC BY 4.0 DEED license.