Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges
- URL: http://arxiv.org/abs/2412.03220v1
- Date: Wed, 04 Dec 2024 11:14:06 GMT
- Title: Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges
- Authors: Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique,
- Abstract summary: Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language.
An advanced subset of these models, known as Multimodal Large Language Models (MLLMs), extends LLM capabilities to process and interpret multiple data modalities.
This survey provides a comprehensive overview of the recent advancements in LLMs.
- Score: 15.850548556536538
- License:
- Abstract: Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language and generating coherent responses to various prompts or queries. These models far exceed the complexity of conventional neural networks, often encompassing dozens of neural network layers and containing billions to trillions of parameters. They are typically trained on vast datasets, utilizing architectures based on transformer blocks. Present-day LLMs are multi-functional, capable of performing a range of tasks from text generation and language translation to question answering, as well as code generation and analysis. An advanced subset of these models, known as Multimodal Large Language Models (MLLMs), extends LLM capabilities to process and interpret multiple data modalities, including images, audio, and video. This enhancement empowers MLLMs with capabilities like video editing, image comprehension, and captioning for visual content. This survey provides a comprehensive overview of the recent advancements in LLMs. We begin by tracing the evolution of LLMs and subsequently delve into the advent and nuances of MLLMs. We analyze emerging state-of-the-art MLLMs, exploring their technical features, strengths, and limitations. Additionally, we present a comparative analysis of these models and discuss their challenges, potential limitations, and prospects for future development.
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio.
We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods.
We dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z) - A Review of Multi-Modal Large Language and Vision Models [1.9685736810241874]
Large Language Models (LLMs) have emerged as a focal point of research and application.
Recently, LLMs have been extended into multi-modal large language models (MM-LLMs)
This paper provides an extensive review of the current state of those LLMs with multi-modal capabilities as well as the very recent MM-LLMs.
arXiv Detail & Related papers (2024-03-28T15:53:45Z) - Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering.
Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs and models that both accept and produce multimodal inputs and outputs.
We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z) - The Revolution of Multimodal Large Language Models: A Survey [46.84953515670248]
Multimodal Large Language Models (MLLMs) can seamlessly integrate visual and textual modalities.
This paper provides a review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques.
arXiv Detail & Related papers (2024-02-19T19:01:01Z) - Rethinking Interpretability in the Era of Large Language Models [76.1947554386879]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks.
The capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human.
These new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.
arXiv Detail & Related papers (2024-01-30T17:38:54Z) - Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z) - A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Model (MLLM) represented by GPT-4V has been a new rising research hotspot.
This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.