End-to-End Lip Reading in Romanian with Cross-Lingual Domain Adaptation
and Lateral Inhibition
- URL: http://arxiv.org/abs/2310.04858v1
- Date: Sat, 7 Oct 2023 15:36:58 GMT
- Title: End-to-End Lip Reading in Romanian with Cross-Lingual Domain Adaptation
and Lateral Inhibition
- Authors: Emilian-Claudiu Mănescu, Răzvan-Alexandru Smădu,
Andrei-Marius Avram, Dumitru-Clementin Cercel, Florin Pop
- Abstract summary: We analyze several architectures and optimizations on the underrepresented, small-scale Romanian language dataset called Wild LRRo.
We obtain state-of-the-art results with our proposed method, namely cross-lingual domain adaptation using unlabeled videos from English and German datasets.
We also assess the performance of adding a layer inspired by the neural inhibition mechanism.
- Score: 2.839471733237535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lip reading, or visual speech recognition, has gained significant
attention in recent years, particularly because of hardware development and
innovations in computer vision. While considerable progress has been made, most
models have only been tested on a few large-scale datasets. This work addresses
this shortcoming by analyzing several architectures and optimizations on the
underrepresented, small-scale Romanian language dataset called Wild LRRo. Most
notably, we compare different backend modules, demonstrating the effectiveness
of adding extensive regularization methods. We obtain state-of-the-art results
with our proposed method, namely cross-lingual domain adaptation using
unlabeled videos from English and German datasets, which helps the model learn
language-invariant features. Lastly, we assess the performance of adding a
layer inspired by the neural inhibition mechanism.
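The abstract does not spell out the adaptation objective. One common recipe for learning language-invariant features from unlabeled multilingual video is adversarial training with a gradient-reversal layer (DANN-style); the sketch below illustrates that recipe under this assumption, not the paper's actual implementation, and the names GradReverse and LanguageDiscriminator are hypothetical.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity in the forward pass, negated (scaled)
    gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing into the feature extractor.
        return -ctx.lambd * grad_output, None


class LanguageDiscriminator(nn.Module):
    """Hypothetical head that predicts the source language of a feature
    vector (e.g., Romanian / English / German)."""

    def __init__(self, feat_dim: int, num_languages: int = 3, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_languages),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # The reversed gradient pushes the shared visual frontend toward
        # features the discriminator cannot exploit, i.e., features that
        # do not encode the source language.
        return self.classifier(GradReverse.apply(features, self.lambd))
```

In such a setup, labeled Romanian clips drive the lip-reading loss, while unlabeled English and German clips contribute only to the gradient-reversed language-classification loss.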
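Likewise, the layer "inspired by the neural inhibition mechanism" is only named here. Below is a minimal sketch of one plausible formulation, assuming each unit is gated by a learned function of its neighboring units, with the diagonal zeroed so no unit inhibits itself; the layer actually used in the paper may differ.

```python
import torch
import torch.nn as nn


class LateralInhibition(nn.Module):
    """Illustrative lateral-inhibition-style layer (assumed formulation).

    Each feature is kept or suppressed by a gate computed from the *other*
    features of the same representation, mimicking neighboring neurons
    inhibiting one another.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(dim, dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Zero the diagonal so a unit is inhibited only by its neighbors.
        mask = 1.0 - torch.eye(self.weight.size(0), device=x.device)
        gate = torch.sigmoid(x @ (self.weight * mask) + self.bias)
        return x * gate  # soft, per-feature inhibition gate in (0, 1)
```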
Related papers
- FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
arXiv Detail & Related papers (2024-08-09T02:16:37Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Revisiting Neural Scaling Laws in Language and Vision [43.57394336742374]
We argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting parameters.
We present a recipe for estimating scaling law parameters reliably from learning curves.
We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains.
arXiv Detail & Related papers (2022-09-13T09:41:51Z)
- Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning [25.230786853723203]
We propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages.
We use Machine Translation to construct pseudo-parallel sentence pairs for low-resource languages.
We introduce a multi-view self-distillation method to learn noise-robust target-language representations.
arXiv Detail & Related papers (2022-08-26T09:32:24Z)
- Meta-Learning and Self-Supervised Pretraining for Real World Image Translation [5.469808405577674]
We explore the image-to-image translation problem in order to formulate a novel multi-task few-shot image generation benchmark.
We present several baselines for the few-shot problem and discuss trade-offs between different approaches.
arXiv Detail & Related papers (2021-12-22T14:48:22Z)
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, factorized neural Transducer, by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
- Cross-lingual Approach to Abstractive Summarization [0.0]
Cross-lingual model transfers are successfully applied in low-resource languages.
We used a pretrained English summarization model based on deep neural networks and sequence-to-sequence architecture.
We developed several models with different proportions of target language data for fine-tuning.
arXiv Detail & Related papers (2020-12-08T09:30:38Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.