Authors:
(1) Shengqiong Wu, NExT++, School of Computing, National University of Singapore;
(2) Hao Fei, NExT++, School of Computing, National University of Singapore (Corresponding author: [email protected]);
(3) Leigang Qu, NExT++, School of Computing, National University of Singapore;
(4) Wei Ji, NExT++, School of Computing, National University of Singapore;
(5) Tat-Seng Chua, NExT++, School of Computing, National University of Singapore.
Cross-modal Understanding and Generation. Our world is replete with multimodal information, wherein we continuously engage in the intricate task of comprehending and producing cross-modal content. The AI community has correspondingly seen the emergence of varied cross-modal learning tasks, such as Image/Video Captioning [99, 16, 56, 27, 49], Image/Video Question Answering [94, 90, 48, 98, 3], Text-to-Image/Video/Speech Synthesis [74, 30, 84, 23, 17, 51, 33], Image-to-Video Synthesis [18, 37], and more, all of which have experienced rapid advancements in the past decades. Researchers have proposed highly effective multimodal encoders with the aim of constructing unified representations encompassing various modalities. Meanwhile, owing to the distinct feature spaces of different modalities, it is essential to undertake modality alignment learning. Moreover, to generate high-quality content, a multitude of strong-performing methods have been proposed, such as Transformers [82, 101, 17, 24], GANs [53, 7, 93, 110], VAEs [81, 67], flow models [73, 6], and the current state-of-the-art diffusion models [31, 64, 57, 22, 68]. In particular, diffusion-based methods have recently delivered remarkable performance in a plethora of cross-modal generation tasks, e.g., DALL-E [66] and Stable Diffusion [68]. While all previous efforts in cross-modal learning are limited to comprehending multimodal inputs only, CoDi [78] has recently presented a groundbreaking development. Leveraging the power of diffusion models, CoDi can generate any combination of output modalities, including language, images, videos, or audio, from any combination of input modalities in parallel. Regrettably, CoDi may still fall short of achieving human-like deep reasoning over input content, as it supports only parallel cross-modal feeding and generation.
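To make the modality alignment learning mentioned above concrete, the snippet below gives a minimal sketch of CLIP-style contrastive alignment that pulls paired image and text embeddings into a shared space. The toy encoders, dimensions, and random data are illustrative assumptions and do not correspond to any of the systems cited here.

```python
# Minimal sketch of contrastive modality alignment (CLIP-style).
# All classes, sizes, and data below are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    """Hypothetical image encoder mapping images into a shared space."""
    def __init__(self, dim=512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, dim))

    def forward(self, images):
        return self.backbone(images)

class ToyTextEncoder(nn.Module):
    """Hypothetical text encoder mapping token ids into the same space."""
    def __init__(self, vocab=30000, dim=512):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab, dim)

    def forward(self, token_ids):
        return self.embed(token_ids)

def alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss pulling paired image/text embeddings together."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))        # i-th image pairs with i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Usage on a random batch of 4 image-text pairs.
images = torch.randn(4, 3, 224, 224)
texts = torch.randint(0, 30000, (4, 16))
loss = alignment_loss(ToyImageEncoder()(images), ToyTextEncoder()(texts))
print(loss.item())
```

The symmetric objective treats the matching pair within each batch as the positive and all other pairings as negatives, which is one common way such cross-modal alignment is realized in practice.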
Multimodal Large Language Models. LLMs have already made profound impacts on, and brought revolutions to, the entire AI community and beyond. The most notable LLMs, i.e., OpenAI's ChatGPT [59] and GPT-4 [60], equipped with alignment techniques such as instruction tuning [61, 47, 104, 52] and reinforcement learning from human feedback (RLHF) [75], have demonstrated remarkable language understanding and reasoning abilities. Furthermore, a series of open-source LLMs, e.g., Flan-T5 [13], Vicuna [12], LLaMA [80], and Alpaca [79], have greatly spurred advancement and contributed to the community [109, 100]. Subsequently, significant efforts have been made to construct LLMs that can deal with multimodal inputs and tasks, leading to the development of MM-LLMs.
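As a minimal sketch of the instruction tuning mentioned above, the example below performs supervised fine-tuning of a small causal language model on a single hand-written instruction-response pair, masking the loss over the prompt so that only response tokens are supervised. The choice of "gpt2" and the example pair are assumptions for illustration, not the actual recipes of the models cited above.

```python
# Sketch of instruction tuning as supervised fine-tuning with prompt masking.
# The model ("gpt2") and the single example below are illustrative stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# One hypothetical instruction-response pair.
prompt = "Instruction: Describe the image of a cat sitting on a mat.\nResponse:"
response = " A small tabby cat is resting on a woven mat."

# Assumes the prompt tokenization is a prefix of the full tokenization.
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Standard next-token objective, but the prompt positions are masked (-100)
# so only the response tokens contribute to the loss.
labels = full_ids.clone()
labels[:, : prompt_ids.size(1)] = -100

outputs = model(input_ids=full_ids, labels=labels)
outputs.loss.backward()   # an optimizer step would follow in real training
print(outputs.loss.item())
```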