Auxiliary Cross-Modal Representation Learning with Triplet Loss
Functions for Online Handwriting Recognition
- URL: http://arxiv.org/abs/2202.07901v3
- Date: Thu, 3 Aug 2023 11:36:06 GMT
- Title: Auxiliary Cross-Modal Representation Learning with Triplet Loss
Functions for Online Handwriting Recognition
- Authors: Felix Ott and David Rügamer and Lucas Heublein and Bernd Bischl and
Christopher Mutschler
- Abstract summary: Cross-modal representation learning learns a shared embedding between two or more modalities to improve performance in a given task.
We present a triplet loss with a dynamic margin for single label and sequence-to-sequence classification tasks.
Our experiments show an improved classification accuracy, faster convergence, and better generalizability due to an improved cross-modal representation.
- Score: 3.071136270246468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal representation learning learns a shared embedding between two or
more modalities to improve performance in a given task compared to using only
one of the modalities. Cross-modal representation learning from different data
types -- such as images and time-series data (e.g., audio or text data) --
requires a deep metric learning loss that minimizes the distance between the
modality embeddings. In this paper, we propose to use the contrastive or
triplet loss, which uses positive and negative identities to create sample
pairs with different labels, for cross-modal representation learning between
image and time-series modalities (CMR-IS). By adapting the triplet loss for
cross-modal representation learning, higher accuracy in the main (time-series
classification) task can be achieved by exploiting additional information of
the auxiliary (image classification) task. We present a triplet loss with a
dynamic margin for single label and sequence-to-sequence classification tasks.
We perform extensive evaluations on synthetic image and time-series data, and
on data for offline handwriting recognition (HWR) and on online HWR from
sensor-enhanced pens for classifying written words. Our experiments show an
improved classification accuracy, faster convergence, and better
generalizability due to an improved cross-modal representation. Furthermore,
the improved generalizability leads to better adaptability across writers for
online HWR.
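To make the loss concrete, the following Python (PyTorch) sketch implements a cross-modal triplet loss with a dynamic margin, treating the time-series embedding as the anchor and image embeddings with the same or a different label as positives and negatives. The row-wise pairing convention, the label_dist argument, and the margin scaling are illustrative assumptions, not the authors' exact formulation.

import torch
import torch.nn.functional as F

def cross_modal_triplet_loss(ts_emb, img_emb, labels, label_dist=None,
                             base_margin=0.2, margin_scale=0.1):
    """Sketch of a cross-modal triplet loss with a dynamic margin.
    ts_emb, img_emb: (B, D) embeddings of the two modalities, paired row-wise.
    labels: (B,) integer class ids shared by each (time-series, image) pair.
    label_dist: optional (B, B) label dissimilarities (e.g., an edit distance
    between word labels) used to widen the margin for very different negatives."""
    ts_emb = F.normalize(ts_emb, dim=1)
    img_emb = F.normalize(img_emb, dim=1)

    # Cross-modal distance matrix: dists[i, j] = ||ts_i - img_j||_2
    dists = torch.cdist(ts_emb, img_emb)

    # Positive distance: anchor i against its paired image i (same label).
    pos = dists.diagonal()

    # Negatives: image embeddings whose label differs from the anchor's label.
    neg_mask = (labels.unsqueeze(1) != labels.unsqueeze(0)).float()

    # Dynamic margin: a base margin, optionally grown with label dissimilarity.
    margin = torch.full_like(dists, base_margin)
    if label_dist is not None:
        margin = margin + margin_scale * label_dist

    # Hinge over all (anchor, negative) pairs, averaged over the valid ones.
    hinge = F.relu(pos.unsqueeze(1) - dists + margin) * neg_mask
    return hinge.sum() / neg_mask.sum().clamp(min=1)

# Example with random embeddings: 8 samples, 4 classes, 128-dimensional space.
ts = torch.randn(8, 128)
img = torch.randn(8, 128)
y = torch.randint(0, 4, (8,))
loss = cross_modal_triplet_loss(ts, img, y)

In an auxiliary setup like the one described above, this loss would be added with a weighting factor to the main time-series classification loss rather than trained on its own.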
Related papers
- Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images [76.47980643420375]
This paper builds on the hypothesis that learning semantic correspondences is inherently data-hungry.
We demonstrate that a simple machine annotator can reliably enrich paired keypoints via machine supervision.
Our models surpass current state-of-the-art models on semantic correspondence learning benchmarks like SPair-71k, PF-PASCAL, and PF-WILLOW.
arXiv Detail & Related papers (2023-11-30T13:22:15Z)
- Feature Decoupling-Recycling Network for Fast Interactive Segmentation [79.22497777645806]
Recent interactive segmentation methods iteratively take the source image, user guidance, and the previously predicted mask as input.
We propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies.
arXiv Detail & Related papers (2023-08-07T12:26:34Z)
- SCMM: Calibrating Cross-modal Representations for Text-Based Person Search [43.17325362167387]
Text-Based Person Search (TBPS) is a crucial task that enables accurate retrieval of target individuals from large-scale galleries.
For cross-modal TBPS tasks, it is critical to obtain well-distributed representations in the common embedding space.
We present a method named Sew and Masked Modeling (SCMM) that calibrates cross-modal representations by learning compact and well-aligned embeddings.
arXiv Detail & Related papers (2023-04-05T07:50:16Z)
- Speech-text based multi-modal training with bidirectional attention for improved speech recognition [26.47071418582507]
We propose a novel bidirectional attention mechanism (BiAM) to jointly learn both the ASR encoder (bottom layers) and the text encoder with a multi-modal learning method.
BiAM facilitates feature sampling-rate exchange, so that the quality of features transformed from one modality can be measured in the other modality's space.
Experimental results on the Librispeech corpus show up to 6.15% word error rate reduction (WERR) with paired data only, and 9.23% WERR when additional unpaired text data is employed.
arXiv Detail & Related papers (2022-11-01T08:25:11Z)
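As a rough illustration of the bidirectional attention idea summarized above, the Python (PyTorch) sketch below lets speech frames attend to text tokens and vice versa, so each modality's features can be expressed on the other's time axis. It is a generic cross-attention stub with assumed shapes and names, not the paper's exact BiAM design.

import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Speech frames attend to text tokens, and text tokens attend to speech frames.
        self.speech_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_speech = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, speech_feats, text_feats):
        """speech_feats: (B, T_s, D) acoustic encoder outputs.
        text_feats: (B, T_t, D) text encoder outputs.
        Returns text features aligned to the speech time axis and speech
        features aligned to the text token axis."""
        text_as_speech, _ = self.speech_to_text(query=speech_feats,
                                                key=text_feats,
                                                value=text_feats)
        speech_as_text, _ = self.text_to_speech(query=text_feats,
                                                key=speech_feats,
                                                value=speech_feats)
        return text_as_speech, speech_as_text

# Example usage with random features: 300 acoustic frames vs. 40 text tokens.
biam = BidirectionalCrossAttention(dim=256, num_heads=4)
speech = torch.randn(2, 300, 256)
text = torch.randn(2, 40, 256)
text_on_speech_axis, speech_on_text_axis = biam(speech, text)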
- CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval [108.48540976175457]
We propose Coupled Diversity-Sensitive Momentum Contrastive Learning (CODER) to improve cross-modal representation.
We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity-sensitiveness is achieved by adaptive negative pair weighting.
Experiments conducted on two popular benchmarks, i.e., MSCOCO and Flickr30K, validate that CODER remarkably outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2022-08-21T08:37:50Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800x faster at inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- S2-Net: Self-supervision Guided Feature Representation Learning for Cross-Modality Images [0.0]
For cross-modality image pairs, existing methods often fail to make the feature representations of correspondences as close as possible.
In this letter, we design a cross-modality feature representation learning network, S2-Net, which is based on the recently successful detect-and-describe pipeline.
We introduce self-supervised learning with a well-designed loss function to guide the training without discarding the original advantages.
arXiv Detail & Related papers (2022-03-28T08:47:49Z)
- SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval [15.522964295287425]
We propose a novel loss function that is based on self-labeling of the unknown classes.
We tested our approach on several real-world cross-modal retrieval problems, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval.
arXiv Detail & Related papers (2021-11-10T17:17:09Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
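The token-wise late interaction mentioned above can be sketched in a few lines of Python (PyTorch): each token in one modality is matched to its most similar token in the other modality, and the matches are averaged into a pairwise image-text similarity. The shapes and the symmetric pooling below are simplifying assumptions, not FILIP's exact recipe.

import torch
import torch.nn.functional as F

def late_interaction_similarity(img_tokens, txt_tokens):
    """img_tokens: (B, N, D) image patch embeddings.
    txt_tokens: (B, M, D) text token embeddings.
    Returns a (B, B) similarity matrix over all image-text pairs."""
    img_tokens = F.normalize(img_tokens, dim=-1)
    txt_tokens = F.normalize(txt_tokens, dim=-1)

    # Token-level cosine similarities for every image-text pair: (B, B, N, M)
    sim = torch.einsum("ind,jmd->ijnm", img_tokens, txt_tokens)

    # Image-to-text: each image token keeps its best-matching text token.
    i2t = sim.max(dim=-1).values.mean(dim=-1)
    # Text-to-image: each text token keeps its best-matching image token.
    t2i = sim.max(dim=-2).values.mean(dim=-1)

    return 0.5 * (i2t + t2i)

# Example usage: the diagonal holds similarities of matched image-text pairs.
scores = late_interaction_similarity(torch.randn(4, 49, 512),
                                     torch.randn(4, 16, 512))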
- Graph Convolution for Re-ranking in Person Re-identification [40.9727538382413]
We propose a graph-based re-ranking method to improve learned features while still keeping Euclidean distance as the similarity metric.
A simple yet effective method is proposed to generate a profile vector for each tracklet in videos, which helps extend our method to video re-ID.
arXiv Detail & Related papers (2021-07-05T18:40:43Z)
- Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)