AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and
Baseline Methods
- URL: http://arxiv.org/abs/2008.00932v2
- Date: Mon, 19 Oct 2020 10:31:48 GMT
- Title: AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and
Baseline Methods
- Authors: Ozge Mercanoglu Sincan and Hacer Yalim Keles
- Abstract summary: We present a new large-scale multi-modal Turkish Sign Language dataset (AUTSL) with a benchmark.
Our dataset consists of 226 signs performed by 43 different signers and 38,336 isolated sign video samples.
We trained several deep learning based models and provide empirical evaluations using the benchmark.
- Score: 6.320141734801679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sign language recognition is a challenging problem where signs are identified
by simultaneous local and global articulations of multiple sources, i.e. hand
shape and orientation, hand movements, body posture, and facial expressions.
Solving this problem computationally for a large vocabulary of signs in real
life settings is still a challenge, even with the state-of-the-art models. In
this study, we present a new largescale multi-modal Turkish Sign Language
dataset (AUTSL) with a benchmark and provide baseline models for performance
evaluations. Our dataset consists of 226 signs performed by 43 different
signers and 38,336 isolated sign video samples in total. Samples contain a wide
variety of backgrounds recorded in indoor and outdoor environments. Moreover,
spatial positions and the postures of signers also vary in the recordings. Each
sample is recorded with Microsoft Kinect v2 and contains RGB, depth, and
skeleton modalities. We prepared benchmark training and test sets for
user-independent assessments of the models. We trained several deep
learning-based models and provide empirical evaluations using the benchmark;
we used CNNs to extract features, and unidirectional and bidirectional LSTM
models to characterize temporal information. We also incorporated feature
pooling modules and temporal attention into our models to improve performance.
We evaluated our baseline models on the AUTSL and Montalbano datasets. Our
models achieved results competitive with state-of-the-art methods on the
Montalbano dataset, i.e., 96.11% accuracy. On random train-test splits of
AUTSL, our models reached up to 95.95% accuracy. On the proposed
user-independent benchmark, our best baseline model achieved 62.02% accuracy.
The performance gaps of the same baseline models across these settings reflect
the challenges inherent in our benchmark dataset. The AUTSL benchmark dataset
is publicly available at https://cvml.ankara.edu.tr.
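The benchmark above is designed for user-independent evaluation, i.e. no signer appears in both the training and test sets. Below is a minimal sketch of how such a signer-disjoint split can be constructed; the file-naming convention, the 80/20 ratio, and the use of scikit-learn's GroupShuffleSplit are illustrative assumptions, not the authors' protocol (the paper ships fixed benchmark splits).

```python
# Minimal sketch of a signer-independent (user-independent) split: samples are
# grouped by signer ID so that no signer contributes to both train and test.
# The (sample_path, label, signer_id) tuples and the 80/20 ratio are
# assumptions; AUTSL itself provides fixed, signer-disjoint split lists.
from sklearn.model_selection import GroupShuffleSplit

samples = [
    ("signer01_sample001_color.mp4", 17, "signer01"),
    ("signer01_sample002_color.mp4", 42, "signer01"),
    ("signer02_sample001_color.mp4", 17, "signer02"),
    ("signer03_sample001_color.mp4", 93, "signer03"),
]
groups = [signer for _, _, signer in samples]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(samples, groups=groups))

train = [samples[i] for i in train_idx]   # signers here never appear in `test`
test = [samples[i] for i in test_idx]
```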
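The baselines described in the abstract combine a frame-level CNN feature extractor with unidirectional or bidirectional LSTMs, feature pooling, and temporal attention. The following is a minimal sketch of that kind of architecture, not the authors' exact models: the PyTorch framework, the ResNet-18 backbone, the hidden size, the 224x224 input resolution, and the RGB-only input are all illustrative assumptions.

```python
# Minimal illustrative sketch (not the authors' exact architecture): a
# frame-level CNN followed by a bidirectional LSTM and temporal-attention
# pooling, classifying 226 isolated signs from RGB clips.
import torch
import torch.nn as nn
import torchvision.models as models


class TemporalAttentionPool(nn.Module):
    """Weights per-frame features with learned attention and sums over time."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                                # x: (batch, time, dim)
        weights = torch.softmax(self.score(x), dim=1)    # (batch, time, 1)
        return (weights * x).sum(dim=1)                  # (batch, dim)


class CnnBiLstmAttention(nn.Module):
    def __init__(self, num_classes=226, hidden=512):
        super().__init__()
        backbone = models.resnet18(weights=None)         # frame-level CNN
        backbone.fc = nn.Identity()                      # expose 512-d features
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.pool = TemporalAttentionPool(2 * hidden)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, clips):                 # clips: (batch, time, 3, 224, 224)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)  # per-frame feats
        seq, _ = self.lstm(feats)                             # temporal model
        return self.head(self.pool(seq))                      # class logits


if __name__ == "__main__":
    model = CnnBiLstmAttention()
    dummy = torch.randn(2, 16, 3, 224, 224)   # 2 clips of 16 RGB frames each
    print(model(dummy).shape)                 # torch.Size([2, 226])
```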
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- SkelCap: Automated Generation of Descriptive Text from Skeleton Keypoint Sequences [2.0257616108612373]
We structured this dataset around AUTSL, a comprehensive isolated Turkish sign language dataset.
We also developed a baseline model, SkelCap, which can generate textual descriptions of body movements.
The model achieved promising results, with a ROUGE-L score of 0.98 and a BLEU-4 score of 0.94 in the signer-agnostic evaluation.
arXiv Detail & Related papers (2024-05-05T15:50:02Z)
- Influence Scores at Scale for Efficient Language Data Sampling [3.072340427031969]
"influence scores" are used to identify important subsets of data.
In this paper, we explore the applicability of influence scores in language classification tasks.
arXiv Detail & Related papers (2023-11-27T20:19:22Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Just several anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z)
- An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- Evaluation of HTR models without Ground Truth Material [2.4792948967354236]
The evaluation of Handwritten Text Recognition models during their development is straightforward.
But the evaluation process becomes tricky as soon as we switch from development to application.
We show that lexicon-based evaluation can compete with evaluation that relies on ground truth material.
arXiv Detail & Related papers (2022-01-17T01:26:09Z)
- Fortunately, Discourse Markers Can Enhance Language Models for Sentiment Analysis [13.149482582098429]
We propose to leverage sentiment-carrying discourse markers to generate large-scale weakly-labeled data.
We show the value of our approach on various benchmark datasets, including the finance domain.
arXiv Detail & Related papers (2022-01-06T12:33:47Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Towards Trustworthy Deception Detection: Benchmarking Model Robustness across Domains, Modalities, and Languages [10.131671217810581]
We evaluate model robustness to out-of-domain data, modality-specific features, and languages other than English.
We find that with additional image content as input, ELMo embeddings yield significantly fewer errors compared to BERT or GloVe.
arXiv Detail & Related papers (2021-04-23T18:05:52Z)
- Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
arXiv Detail & Related papers (2020-09-22T20:19:41Z)