A Review on Explainability in Multimodal Deep Neural Nets
- URL: http://arxiv.org/abs/2105.07878v2
- Date: Tue, 18 May 2021 11:53:33 GMT
- Title: A Review on Explainability in Multimodal Deep Neural Nets
- Authors: Gargi Joshi, Rahee Walambe, Ketan Kotecha
- Abstract summary: Multimodal AI techniques have achieved much success in several application domains.
Despite their outstanding performance, the complex, opaque, and black-box nature of deep neural nets limits their social acceptance and usability.
This paper extensively reviews the present literature to provide a comprehensive survey and commentary on explainability in multimodal deep neural nets.
- Score: 2.3204178451683264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial Intelligence techniques powered by deep neural nets have achieved much success in several application domains, most notably in computer vision and natural language processing tasks. Surpassing human-level performance has propelled research into applications where multiple modalities, such as language, vision, text, and other sensory inputs, play an important role in accurate prediction and identification. Several multimodal fusion methods employing deep learning models have been proposed in the literature. Despite their outstanding performance, the complex, opaque, and black-box nature of deep neural nets limits their social acceptance and usability. This has given rise to the quest for model interpretability and explainability, more so in complex tasks involving multimodal AI methods. This paper extensively reviews the present literature to provide a comprehensive survey and commentary on explainability in multimodal deep neural nets, especially for vision and language tasks. Several topics on multimodal AI and its applications in generic domains are covered, including the significance, datasets, fundamental building blocks of the methods and techniques, challenges, applications, and future trends in this domain.
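Since the survey's scope includes multimodal fusion methods built on deep learning, a minimal late-fusion sketch may help fix ideas. It is an illustrative assumption, not a method from the paper: the class name, feature dimensions (2048-d vision, 768-d text), and concatenation-based fusion are all hypothetical choices.

```python
# Minimal late-fusion sketch (illustrative only; not the paper's method).
# Assumes pre-extracted 2048-d image features and 768-d text features.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, n_classes=10):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)   # project vision features
        self.txt_proj = nn.Linear(txt_dim, hidden)   # project language features
        self.head = nn.Sequential(                   # classify the fused vector
            nn.ReLU(),
            nn.Linear(2 * hidden, n_classes),
        )

    def forward(self, img_feat, txt_feat):
        # Fuse by concatenating the two projected modality embeddings.
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```

Concatenation is only the simplest fusion operator; the surveyed literature also covers attention-based and bilinear variants.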
Related papers
- Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey [46.617998833238126]
Large language models (LLMs) and computer vision (CV) systems are driving advancements in natural language understanding and visual processing.
The convergence of these technologies has catalyzed the rise of multimodal AI, enabling richer, cross-modal understanding that spans text, vision, audio, and video modalities.
Multimodal large language models (MLLMs) have emerged as a powerful framework, demonstrating impressive capabilities in tasks like image-text generation, visual question answering, and cross-modal retrieval.
Despite these advancements, the complexity and scale of MLLMs introduce significant challenges in interpretability and explainability, essential for establishing ...
arXiv Detail & Related papers (2024-12-03T02:54:31Z)
- A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically reviews the applications of MLLMs in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
- Explaining Deep Neural Networks by Leveraging Intrinsic Methods [0.9790236766474201]
This thesis contributes to the field of eXplainable AI, focusing on enhancing the interpretability of deep neural networks.
The core contributions introduce novel techniques that make these networks more interpretable by analyzing their inner workings.
The research further investigates neurons within trained deep neural networks, shedding light on overlooked phenomena related to their activation values.
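The mention of neuron activation values suggests the flavor of such analyses. Below is a minimal sketch, assuming a generic PyTorch model and forward hooks; this is a standard inspection technique, not the thesis's actual procedure, and the toy model and statistics are illustrative.

```python
# Sketch: record per-layer activation values with forward hooks
# (a generic inspection technique; the thesis's own method may differ).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # store this layer's output
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(make_hook(name))

model(torch.randn(8, 16))
for name, act in activations.items():
    # e.g. mean activation and fraction of dead (zero) units,
    # quantities such analyses commonly track
    print(name, act.mean().item(), (act == 0).float().mean().item())
```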
arXiv Detail & Related papers (2024-07-17T01:20:17Z)
- HEMM: Holistic Evaluation of Multimodal Foundation Models [91.60364024897653]
Multimodal foundation models can holistically process text alongside images, video, audio, and other sensory modalities.
It is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains.
arXiv Detail & Related papers (2024-07-03T18:00:48Z)
- A Survey on Vision-Language-Action Models for Embodied AI [71.16123093739932]
Vision-language-action models (VLAs) have become a foundational element in robot learning.
Various methods have been proposed to enhance traits such as versatility, dexterity, and generalizability.
VLAs serve as high-level task planners capable of decomposing long-horizon tasks into executable subtasks.
arXiv Detail & Related papers (2024-05-23T01:43:54Z)
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed at understanding the internal mechanisms of large vision-language models.
Our interface is designed to enhance the interpretability of the image patches that are instrumental in generating an answer.
We present a case study of how our application can aid in understanding failure mechanisms in a popular large multimodal model: LLaVA.
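Since the tool centers on which image patches drive an answer, a hedged sketch of the underlying idea may be useful: score patches by their averaged cross-attention weight. The tensor shapes (12 heads, 20 text tokens, 196 patches on a 14x14 grid) are illustrative assumptions, and LVLM-Interpret's actual implementation is richer than this.

```python
# Sketch: rank image patches by averaged attention weight
# (illustrates the general idea; LVLM-Interpret's internals differ).
import torch

# Assumed shape: [heads, text_tokens, image_patches], e.g. taken from one
# cross-attention layer of a vision-language model (random here).
attn = torch.softmax(torch.randn(12, 20, 196), dim=-1)

# Average over heads and text tokens -> one relevance score per patch.
patch_relevance = attn.mean(dim=(0, 1))            # shape [196]
top = torch.topk(patch_relevance, k=5)
print("most attended patches:", top.indices.tolist())

# Reshape to the 14x14 patch grid for a heatmap overlay on the image.
heatmap = patch_relevance.reshape(14, 14)
```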
arXiv Detail & Related papers (2024-04-03T23:57:34Z)
- A Survey on State-of-the-art Deep Learning Applications and Challenges [0.0]
Building a deep learning model is challenging due to the algorithm's complexity and the dynamic nature of real-world problems.
This study aims to comprehensively review the state-of-the-art deep learning models in computer vision, natural language processing, time series analysis and pervasive computing.
arXiv Detail & Related papers (2024-03-26T10:10:53Z)
- Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions [68.6358773622615]
This paper provides an overview of the computational and theoretical foundations of multimodal machine learning.
We propose a taxonomy of 6 core technical challenges: representation, alignment, reasoning, generation, transference, and quantification.
Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches.
arXiv Detail & Related papers (2022-09-07T19:21:19Z)
- Recent Advances and Trends in Multimodal Deep Learning: A Review [9.11022096530605]
Multimodal deep learning aims to create models that can process and link information using various modalities.
This paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, and physiological signals.
A fine-grained taxonomy of various multimodal deep learning applications is proposed, elaborating on different applications in more depth.
arXiv Detail & Related papers (2021-05-24T04:20:45Z)