Multi-Perspective LSTM for Joint Visual Representation Learning
- URL: http://arxiv.org/abs/2105.02802v1
- Date: Thu, 6 May 2021 16:44:40 GMT
- Title: Multi-Perspective LSTM for Joint Visual Representation Learning
- Authors: Alireza Sepas-Moghaddam, Fernando Pereira, Paulo Lobato Correia, Ali Etemad
- Abstract summary: We present a novel LSTM cell architecture capable of learning both intra- and inter-perspective relationships available in visual sequences captured from multiple perspectives.
Our architecture adopts a novel recurrent joint learning strategy that uses additional gates and memories at the cell level.
We show that by using the proposed cell to create a network, more effective and richer visual representations are learned for recognition tasks.
- Score: 81.21490913108835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel LSTM cell architecture capable of learning both intra- and
inter-perspective relationships available in visual sequences captured from
multiple perspectives. Our architecture adopts a novel recurrent joint learning
strategy that uses additional gates and memories at the cell level. We
demonstrate that by using the proposed cell to create a network, more effective
and richer visual representations are learned for recognition tasks. We
validate the performance of our proposed architecture in the context of two
multi-perspective visual recognition tasks, namely lip reading and face
recognition. Three relevant datasets are considered and the results are
compared against fusion strategies, other existing multi-input LSTM
architectures, and alternative recognition solutions. The experiments show the
superior performance of our solution over the considered benchmarks, both in
terms of recognition accuracy and complexity. We make our code publicly
available at https://github.com/arsm/MPLSTM.
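To make the idea of "additional gates and memories at the cell level" concrete, the snippet below is a minimal PyTorch sketch of a joint recurrent cell for a two-perspective input, assuming one memory per perspective plus a learned gate that exchanges information between them. The class name TwoPerspectiveLSTMCell, the cross_gate layer, and the additive memory mixing are illustrative choices, not the authors' MP-LSTM formulation; the actual cell is available in the repository linked above.

```python
import torch
import torch.nn as nn

class TwoPerspectiveLSTMCell(nn.Module):
    """Illustrative joint recurrent cell for two input perspectives.

    This is a sketch of the general idea only (per-perspective memories plus
    an extra gate for inter-perspective exchange), not the exact MP-LSTM cell;
    see https://github.com/arsm/MPLSTM for the authors' implementation.
    """

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # One standard LSTM cell per perspective (intra-perspective dynamics).
        self.cell_a = nn.LSTMCell(input_size, hidden_size)
        self.cell_b = nn.LSTMCell(input_size, hidden_size)
        # Hypothetical extra gate deciding how much of the other perspective's
        # memory to mix in (inter-perspective dynamics).
        self.cross_gate = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, x_a, x_b, state_a, state_b):
        h_a, c_a = self.cell_a(x_a, state_a)
        h_b, c_b = self.cell_b(x_b, state_b)
        # Gate values in [0, 1] controlling the inter-perspective exchange.
        g = torch.sigmoid(self.cross_gate(torch.cat([h_a, h_b], dim=-1)))
        mixed_a = c_a + g * c_b  # enrich perspective A's memory with B's
        mixed_b = c_b + g * c_a  # and vice versa
        return (h_a, mixed_a), (h_b, mixed_b)


# Toy usage: two 16-step sequences of 128-d frame features, batch of 4.
cell = TwoPerspectiveLSTMCell(input_size=128, hidden_size=64)
x_a, x_b = torch.randn(16, 4, 128), torch.randn(16, 4, 128)
state_a = (torch.zeros(4, 64), torch.zeros(4, 64))
state_b = (torch.zeros(4, 64), torch.zeros(4, 64))
for t in range(16):
    state_a, state_b = cell(x_a[t], x_b[t], state_a, state_b)
```

In a full recognition network, the final per-perspective hidden states would be combined (e.g., concatenated) and passed to a classifier head; the paper reports that learning the exchange inside the cell outperforms fusing independently trained LSTM streams.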
Related papers
- Relax DARTS: Relaxing the Constraints of Differentiable Architecture Search for Eye Movement Recognition [9.905155497581815]
We introduce automated network search (NAS) algorithms to the field of eye movement recognition.
Relax DARTS is an improvement of the Differentiable Architecture Search (DARTS) to realize more efficient network search and training.
Relax DARTS exhibits adaptability to other multi-feature temporal classification tasks.
arXiv Detail & Related papers (2024-09-18T02:37:04Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Task agnostic continual learning with Pairwise layer architecture [0.0]
We show that we can improve the continual learning performance by replacing the final layer of our networks with our pairwise interaction layer.
The networks using this architecture show competitive performance in MNIST and FashionMNIST-based continual image classification experiments.
arXiv Detail & Related papers (2024-05-22T13:30:01Z) - Handling Data Heterogeneity via Architectural Design for Federated
Visual Recognition [16.50490537786593]
We study 19 visual recognition models from five different architectural families on four challenging FL datasets.
Our findings emphasize the importance of architectural design for computer vision tasks in practical scenarios.
arXiv Detail & Related papers (2023-10-23T17:59:16Z) - Learning Visual Representation from Modality-Shared Contrastive
Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z) - Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection.
Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps.
Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z) - ResNeSt: Split-Attention Networks [86.25490825631763]
We present a modularized architecture, which applies the channel-wise attention on different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations.
Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification.
arXiv Detail & Related papers (2020-04-19T20:40:31Z) - Spatial-Temporal Multi-Cue Network for Continuous Sign Language
Recognition [141.24314054768922]
We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
arXiv Detail & Related papers (2020-02-08T15:38:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.