Related papers: Video Relationship Detection Using Mixture of Experts

Video Relationship Detection Using Mixture of Experts

URL: http://arxiv.org/abs/2403.03994v1
Date: Wed, 6 Mar 2024 19:08:34 GMT
Title: Video Relationship Detection Using Mixture of Experts
Authors: Ala Shaabana and Zahra Gharaee and Paul Fieguth
Abstract summary: We introduce MoE-VRD, a novel approach to visual relationship detection utilizing a mixture of experts. MoE-VRD identifies language triplets in the form of subject, predicate, object>s to extract relationships from visual processing. Our experimental results demonstrate that the conditional computation capabilities and scalability of the mixture-of-experts approach lead to superior performance in visual relationship detection compared to state-of-the-art methods.
Score: 1.6574413179773761
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Machine comprehension of visual information from images and videos by neural networks faces two primary challenges. Firstly, there exists a computational and inference gap in connecting vision and language, making it difficult to accurately determine which object a given agent acts on and represent it through language. Secondly, classifiers trained by a single, monolithic neural network often lack stability and generalization. To overcome these challenges, we introduce MoE-VRD, a novel approach to visual relationship detection utilizing a mixture of experts. MoE-VRD identifies language triplets in the form of < subject, predicate, object> tuples to extract relationships from visual processing. Leveraging recent advancements in visual relationship detection, MoE-VRD addresses the requirement for action recognition in establishing relationships between subjects (acting) and objects (being acted upon). In contrast to single monolithic networks, MoE-VRD employs multiple small models as experts, whose outputs are aggregated. Each expert in MoE-VRD specializes in visual relationship learning and object tagging. By utilizing a sparsely-gated mixture of experts, MoE-VRD enables conditional computation and significantly enhances neural network capacity without increasing computational complexity. Our experimental results demonstrate that the conditional computation capabilities and scalability of the mixture-of-experts approach lead to superior performance in visual relationship detection compared to state-of-the-art methods.

Related papers

SDR-GNN: Spectral Domain Reconstruction Graph Neural Network for Incomplete Multimodal Learning in Conversational Emotion Recognition [14.645598552036908]
Multimodal Emotion Recognition in Conversations (MERC) aims to classify utterance emotions using textual, auditory, and visual modal features. Most existing MERC methods assume each utterance has complete modalities, overlooking the common issue of incomplete modalities in real-world scenarios. We propose a Spectral Domain Reconstruction Graph Neural Network (SDR-GNN) for incomplete multimodal learning in conversational emotion recognition.
arXiv Detail & Related papers (2024-11-29T16:31:50Z)
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs. In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting [68.37943632270505]
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond categories. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories. We propose an open-vocabulary relationship that leverages the rich semantic knowledge of CLIP to discover novel relationships.
arXiv Detail & Related papers (2024-09-19T06:25:01Z)
Sparse Multitask Learning for Efficient Neural Representation of Motor Imagery and Execution [30.186917337606477]
We introduce a sparse multitask learning framework for motor imagery (MI) and motor execution (ME) tasks. Given a dual-task CNN model for MI-ME classification, we apply a saliency-based sparsification approach to prune superfluous connections. Our results indicate that this tailored sparsity can mitigate the overfitting problem and improve the test performance with small amount of data.
arXiv Detail & Related papers (2023-12-10T09:06:16Z)
Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection. We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection [57.13665112065285]
Human-Object Interaction (HOI) detection is a challenging computer vision task. We present a framework that enhances HOI detection by incorporating structured text knowledge.
arXiv Detail & Related papers (2023-07-25T14:20:52Z)
SeMAIL: Eliminating Distractors in Visual Imitation via Separated Models [22.472167814814448]
We propose a new model-based imitation learning algorithm named Separated Model-based Adversarial Imitation Learning (SeMAIL) Our method achieves near-expert performance on various visual control tasks with complex observations and the more challenging tasks with different backgrounds from expert observations.
arXiv Detail & Related papers (2023-06-19T04:33:44Z)
Weakly-Supervised HOI Detection from Interaction Labels Only and Language/Vision-Language Priors [36.75629570208193]
Human-object interaction (HOI) detection aims to extract interacting human-object pairs and their interaction categories from a given natural image. In this paper, we tackle HOI detection with the weakest supervision setting in the literature, using only image-level interaction labels. We first propose an approach to prune non-interacting human and object proposals to increase the quality of positive pairs within the bag, exploiting the grounding capability of the vision-language model. Second, we use a large language model to query which interactions are possible between a human and a given object category, in order to force the model not to put emphasis
arXiv Detail & Related papers (2023-03-09T19:08:02Z)
Knowledge-augmented Few-shot Visual Relation Detection [25.457693302327637]
Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding. Most existing VRD methods rely on thousands of training samples of each relationship to achieve satisfactory performance. We devise a knowledge-augmented, few-shot VRD framework leveraging both textual knowledge and visual relation knowledge.
arXiv Detail & Related papers (2023-03-09T15:38:40Z)
Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations [103.00383924074585]
Visual relationship detection aims to reason over relationships among salient objects in images. We propose a novel approach named Visual-Linguistic Representations from Transformers (RVL-BERT) RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
arXiv Detail & Related papers (2020-09-10T16:15:09Z)
Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.