Multi-Perspective LSTM for Joint Visual Representation Learning
- URL: http://arxiv.org/abs/2105.02802v1
- Date: Thu, 6 May 2021 16:44:40 GMT
- Title: Multi-Perspective LSTM for Joint Visual Representation Learning
- Authors: Alireza Sepas-Moghaddam, Fernando Pereira, Paulo Lobato Correia, Ali Etemad
- Abstract summary: We present a novel LSTM cell architecture capable of learning both intra- and inter-perspective relationships available in visual sequences captured from multiple perspectives.
Our architecture adopts a novel recurrent joint learning strategy that uses additional gates and memories at the cell level.
We show that by using the proposed cell to create a network, more effective and richer visual representations are learned for recognition tasks.
- Score: 81.21490913108835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel LSTM cell architecture capable of learning both intra- and
inter-perspective relationships available in visual sequences captured from
multiple perspectives. Our architecture adopts a novel recurrent joint learning
strategy that uses additional gates and memories at the cell level. We
demonstrate that by using the proposed cell to create a network, more effective
and richer visual representations are learned for recognition tasks. We
validate the performance of our proposed architecture in the context of two
multi-perspective visual recognition tasks, namely lip reading and face
recognition. Three relevant datasets are considered and the results are
compared against fusion strategies, other existing multi-input LSTM
architectures, and alternative recognition solutions. The experiments show the
superior performance of our solution over the considered benchmarks, both in
terms of recognition accuracy and complexity. We make our code publicly
available at https://github.com/arsm/MPLSTM.
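To make the idea of "additional gates and memories at the cell level" concrete, the snippet below is a minimal PyTorch sketch of a joint recurrent cell for a two-perspective input, assuming one memory per perspective plus a learned gate that exchanges information between them. The class name TwoPerspectiveLSTMCell, the cross_gate layer, and the additive memory mixing are illustrative choices, not the authors' MP-LSTM formulation; the actual cell is available in the repository linked above.

```python
import torch
import torch.nn as nn

class TwoPerspectiveLSTMCell(nn.Module):
    """Illustrative joint recurrent cell for two input perspectives.

    This is a sketch of the general idea only (per-perspective memories plus
    an extra gate for inter-perspective exchange), not the exact MP-LSTM cell;
    see https://github.com/arsm/MPLSTM for the authors' implementation.
    """

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # One standard LSTM cell per perspective (intra-perspective dynamics).
        self.cell_a = nn.LSTMCell(input_size, hidden_size)
        self.cell_b = nn.LSTMCell(input_size, hidden_size)
        # Hypothetical extra gate deciding how much of the other perspective's
        # memory to mix in (inter-perspective dynamics).
        self.cross_gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, x_a, x_b, state_a, state_b):
        h_a, c_a = self.cell_a(x_a, state_a)
        h_b, c_b = self.cell_b(x_b, state_b)
        # Gate values in [0, 1] controlling the inter-perspective exchange.
        g = torch.sigmoid(self.cross_gate(torch.cat([h_a, h_b], dim=-1)))
        mixed_a = c_a + g * c_b  # enrich perspective A's memory with B's
        mixed_b = c_b + g * c_a  # and vice versa
        return (h_a, mixed_a), (h_b, mixed_b)


# Toy usage: two 16-step sequences of 128-d frame features, batch of 4.
cell = TwoPerspectiveLSTMCell(input_size=128, hidden_size=64)
x_a, x_b = torch.randn(16, 4, 128), torch.randn(16, 4, 128)
state_a = (torch.zeros(4, 64), torch.zeros(4, 64))
state_b = (torch.zeros(4, 64), torch.zeros(4, 64))
for t in range(16):
    state_a, state_b = cell(x_a[t], x_b[t], state_a, state_b)
```

In a full recognition network, the final per-perspective hidden states would be combined (e.g., concatenated) and passed to a classifier head; the paper reports that learning the exchange inside the cell outperforms fusing independently trained LSTM streams.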
Related papers
- Relax DARTS: Relaxing the Constraints of Differentiable Architecture Search for Eye Movement Recognition [9.905155497581815]
We introduce automated network search (NAS) algorithms to the field of eye movement recognition.
Relax DARTS is an improvement of the Differentiable Architecture Search (DARTS) to realize more efficient network search and training.
Relax DARTS exhibits adaptability to other multi-feature temporal classification tasks.
arXiv Detail & Related papers (2024-09-18T02:37:04Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Task agnostic continual learning with Pairwise layer architecture [0.0]
We show that we can improve the continual learning performance by replacing the final layer of our networks with our pairwise interaction layer.
The networks using this architecture show competitive performance in MNIST and FashionMNIST-based continual image classification experiments.
arXiv Detail & Related papers (2024-05-22T13:30:01Z) - Handling Data Heterogeneity via Architectural Design for Federated
Visual Recognition [16.50490537786593]
We study 19 visual recognition models from five different architectural families on four challenging FL datasets.
Our findings emphasize the importance of architectural design for computer vision tasks in practical scenarios.
arXiv Detail & Related papers (2023-10-23T17:59:16Z) - Learning Visual Representation from Modality-Shared Contrastive
Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z) - Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection.
Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps.
Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z) - ResNeSt: Split-Attention Networks [86.25490825631763]
We present a modularized architecture, which applies the channel-wise attention on different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations.
Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification.
arXiv Detail & Related papers (2020-04-19T20:40:31Z) - Spatial-Temporal Multi-Cue Network for Continuous Sign Language
Recognition [141.24314054768922]
We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
arXiv Detail & Related papers (2020-02-08T15:38:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.