Efficient sign language recognition system and dataset creation method
based on deep learning and image processing
- URL: http://arxiv.org/abs/2103.12233v1
- Date: Mon, 22 Mar 2021 23:36:49 GMT
- Title: Efficient sign language recognition system and dataset creation method
based on deep learning and image processing
- Authors: Alvaro Leandro Cavalcante Carneiro, Lucas de Brito Silva, Denis
Henrique Pinheiro Salvadeo
- Abstract summary: This work investigates techniques of digital image processing and machine learning that can be used to create a sign language dataset effectively.
Different datasets were created to test the hypotheses, containing 14 everyday words recorded by different smartphones in the RGB color system.
We achieved an accuracy of 96.38% on the test set and 81.36% on a validation set with more challenging conditions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: New deep-learning architectures are created every year, achieving
state-of-the-art results in image recognition and leading to the belief that,
in a few years, complex tasks such as sign language translation will be
considerably easier, serving as a communication tool for the hearing-impaired
community. On the other hand, these algorithms still need a lot of data to be
trained and the dataset creation process is expensive, time-consuming, and
slow. Therefore, this work investigates techniques of digital image
processing and machine learning that can be used to create a sign language
dataset effectively. We examine data-acquisition choices, such as the frame
rate at which to capture or subsample the videos, the background type,
preprocessing, and data augmentation, using convolutional neural networks and
object detection to create an image classifier and comparing the results with
statistical tests. Different datasets were created to test the hypotheses,
containing 14 everyday words recorded by different smartphones in the RGB
color system. We achieved an accuracy of 96.38% on the test set and 81.36% on
a validation set with more challenging conditions, showing that 30 FPS is the
best subsampling frame rate for training the classifier, that geometric
transformations work better than intensity transformations, and that
artificial background creation does not improve model generalization. These
trade-offs should be considered in future work as a cost-benefit guideline,
balancing computational cost against accuracy gain when creating a dataset and
training a sign recognition model.
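To make the dataset-creation choices above concrete, here is a minimal sketch (not the authors' implementation) that subsamples a video to a target frame rate with OpenCV and defines the two augmentation families the paper compares, geometric versus intensity transformations, with torchvision; the file path, the 30 FPS target, and all transform parameters are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): subsample a sign-language video
# to a target frame rate and define the two augmentation families compared
# in the paper. File path and parameters are illustrative assumptions.
import cv2
from torchvision import transforms

def subsample_frames(video_path, target_fps=30):
    """Keep roughly target_fps frames per second of the video."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps  # 0.0 -> fallback
    step = max(1, round(native_fps / target_fps))  # keep every step-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    return frames

# Geometric transformations (reported to work better than intensity ones).
geometric_aug = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
])

# Intensity transformations, for comparison.
intensity_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
])
```

Horizontal flips are deliberately left out of the geometric family in this sketch, since mirroring can change which hand performs a sign.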
Related papers
- Deep Image Composition Meets Image Forgery [0.0]
Image forgery has been studied for many years.
Deep learning models require large amounts of labeled data for training.
We use state-of-the-art image composition deep learning models to generate spliced images close to the quality of real-life manipulations.
arXiv Detail & Related papers (2024-04-03T17:54:37Z)
- Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on the resulting synthetic dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- Text recognition on images using pre-trained CNN [2.191505742658975]
The recognizer is trained using the Chars74K dataset, and the best model is then tested on samples of the IIIT-5K-Dataset.
The research model has an accuracy of 97.94% for validation data, 98.16% for test data, and 95.62% for the test data from IIIT-5K-Dataset.
arXiv Detail & Related papers (2023-02-10T08:09:51Z)
- Procedural Image Programs for Representation Learning [62.557911005179946]
We propose training with a large dataset of twenty-one thousand programs, each one generating a diverse set of synthetic images.
These programs are short code snippets, which are easy to modify and fast to execute.
The proposed dataset can be used for both supervised and unsupervised representation learning, and reduces the gap between pre-training with real and procedurally generated images by 38%.
arXiv Detail & Related papers (2022-11-29T17:34:22Z)
- Reading and Writing: Discriminative and Generative Modeling for Self-Supervised Text Recognition [101.60244147302197]
We introduce contrastive learning and masked image modeling to learn discrimination and generation of text images.
Our method outperforms previous self-supervised text recognition methods by 10.2%-20.2% on irregular scene text recognition datasets.
Our proposed text recognizer exceeds previous state-of-the-art text recognition methods by an average of 5.3% on 11 benchmarks, with similar model size.
arXiv Detail & Related papers (2022-07-01T03:50:26Z)
- Prefix Conditioning Unifies Language and Label Supervision [84.11127588805138]
We show that dataset biases negatively affect pre-training by reducing the generalizability of learned representations.
In experiments, we show that this simple technique improves zero-shot image recognition accuracy and robustness to image-level distribution shift.
arXiv Detail & Related papers (2022-06-02T16:12:26Z)
- Learning Co-segmentation by Segment Swapping for Retrieval and Discovery [67.6609943904996]
The goal of this work is to efficiently identify visually similar patterns from a pair of images.
We generate synthetic training pairs by selecting object segments in an image and copy-pasting them into another image (a minimal sketch of this copy-paste step appears after this list).
We show our approach provides clear improvements for artwork details retrieval on the Brueghel dataset.
arXiv Detail & Related papers (2021-10-29T16:51:16Z)
- Exploiting the relationship between visual and textual features in social networks for image classification with zero-shot deep learning [0.0]
In this work, we propose a classifier ensemble based on the transferable learning capabilities of the CLIP neural network architecture (a zero-shot classification sketch appears after this list).
Our experiments, based on image classification tasks according to the labels of the Places dataset, are performed by first considering only the visual part.
Considering the texts associated with the images can help improve accuracy, depending on the goal.
arXiv Detail & Related papers (2021-07-08T10:54:59Z)
- Single Image Texture Translation for Data Augmentation [24.412953581659448]
We propose a lightweight model for translating texture to images from a single source-texture input.
We then explore the use of augmented data in long-tailed and few-shot image classification tasks.
We find the proposed method is capable of translating input data into a target domain, leading to consistently improved image recognition performance.
arXiv Detail & Related papers (2021-06-25T17:59:04Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (see the contrastive-alignment sketch after this list).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- PennSyn2Real: Training Object Recognition Models without Human Labeling [12.923677573437699]
We propose PennSyn2Real - a synthetic dataset consisting of more than 100,000 4K images of more than 20 types of micro aerial vehicles (MAVs).
The dataset can be used to generate arbitrary numbers of training images for high-level computer vision tasks such as MAV detection and classification.
We show that synthetic data generated using this framework can be directly used to train CNN models for common object recognition tasks such as detection and segmentation.
arXiv Detail & Related papers (2020-09-22T02:53:40Z)
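The copy-paste step in the segment-swapping entry above can be summarized in a few lines; the following is a minimal sketch with stand-in arrays, where the object mask would in practice come from a segmentation model rather than a hard-coded square.

```python
# Minimal sketch of copy-paste pair generation, as in the "Learning
# Co-segmentation by Segment Swapping" entry above. Shapes and the mask
# source are assumptions; real masks come from object segments.
import numpy as np

def paste_segment(src_img, src_mask, dst_img):
    """Copy the masked region of src_img onto dst_img.

    src_img, dst_img: (H, W, 3) uint8 arrays of the same size.
    src_mask: (H, W) boolean array marking the object segment.
    The returned composite and src_img share the pasted segment and
    so form one synthetic training pair.
    """
    composite = dst_img.copy()
    composite[src_mask] = src_img[src_mask]
    return composite

# Usage with random stand-in data:
rng = np.random.default_rng(0)
a = rng.integers(0, 255, (64, 64, 3), dtype=np.uint8)
b = rng.integers(0, 255, (64, 64, 3), dtype=np.uint8)
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True  # stand-in for a real object segment
pair = (paste_segment(a, mask, b), a)
```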
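For the CLIP-based entry above, a minimal sketch of the visual-only zero-shot classification step using OpenAI's public clip package follows; the image path and the three labels are stand-ins for the Places labels used in that work.

```python
# Minimal sketch of zero-shot image classification with CLIP, as in the
# "Exploiting the relationship between visual and textual features" entry
# above. "example.jpg" and the label list are illustrative stand-ins.
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["beach", "forest", "kitchen"]  # stand-ins for Places labels
text = clip.tokenize([f"a photo of a {l}" for l in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))  # zero-shot label scores
```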
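The dual-encoder contrastive alignment in the noisy-text-supervision entry above boils down to a symmetric InfoNCE loss; below is a minimal sketch with random stand-in embeddings, where the encoders themselves and the temperature value are assumptions.

```python
# Minimal sketch of dual-encoder contrastive alignment, as in the "Scaling
# Up Visual and Vision-Language Representation Learning" entry above.
# Encoder internals are omitted; shapes and temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders; row i
    of each tensor comes from the same image-text pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; treat alignment as classification
    # in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with random stand-in embeddings:
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```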
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.