Content4All Open Research Sign Language Translation Datasets
- URL: http://arxiv.org/abs/2105.02351v1
- Date: Wed, 5 May 2021 22:14:53 GMT
- Title: Content4All Open Research Sign Language Translation Datasets
- Authors: Necati Cihan Camgoz, Ben Saunders, Guillaume Rochette, Marco
Giovanelli, Giacomo Inches, Robin Nachtrab-Ribback, Richard Bowden
- Abstract summary: We release six datasets comprising 190 hours of footage on the larger domain of news.
From this, 20 hours of footage have been annotated by Deaf experts and interpreters and are made publicly available for research purposes.
- Score: 27.36513138911057
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Computational sign language research lacks the large-scale datasets that
enable the creation of useful real-life applications. To date, most research
has been limited to prototype systems on small domains of discourse, e.g.
weather forecasts. To address this issue and to push the field forward, we
release six datasets comprising 190 hours of footage on the larger domain of
news. From this, 20 hours of footage have been annotated by Deaf experts and
interpreters and are made publicly available for research purposes. In this
paper, we share the dataset collection process and the tools developed to enable
the alignment of sign language video and subtitles, as well as baseline
translation results to underpin future research.
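The abstract mentions tooling for aligning sign language video with subtitles. As a minimal illustration only (not the authors' released tools; the subtitle representation, frame rate, and helper names below are assumptions), the following Python sketch maps a subtitle's time span onto video frame indices:

```python
# Hypothetical sketch: map subtitle time spans to video frame ranges.
# The real Content4All alignment tooling and file formats may differ.
from dataclasses import dataclass


@dataclass
class Subtitle:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # spoken-language subtitle text


def subtitle_to_frames(sub: Subtitle, fps: float = 25.0) -> range:
    """Return the video frame indices covered by a subtitle's time span."""
    first = int(round(sub.start * fps))
    last = int(round(sub.end * fps))
    return range(first, last + 1)


if __name__ == "__main__":
    sub = Subtitle(start=12.40, end=15.88, text="Rain is expected in the north.")
    frames = subtitle_to_frames(sub)
    print(f"'{sub.text}' spans frames {frames.start}-{frames.stop - 1}")
```

In practice, fixed-rate timestamp mapping like this is only a starting point, since interpreters can lag or lead the spoken audio; resolving such misalignments is the kind of problem that dedicated alignment tooling and expert annotation are meant to address.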
Related papers
- Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian [0.0]
We construct the first public Indonesian video-text dataset by translating English sentences from the MSVD dataset to Indonesian sentences.
We then train neural network models, originally developed for the English video-text dataset, on three tasks: text-to-video retrieval, video-to-text retrieval, and video captioning.
arXiv Detail & Related papers (2023-06-20T07:19:36Z)
- Using Large Language Models to Generate Engaging Captions for Data Visualizations [51.98253121636079]
Large language models (LLMs) use sophisticated deep learning technology to produce human-like prose.
A key challenge lies in designing the most effective prompt for the LLM, a task called prompt engineering.
We report on first experiments using the popular LLM GPT-3 and deliver some promising results.
arXiv Detail & Related papers (2022-12-27T23:56:57Z)
- MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions [109.84031235538002]
We present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations.
MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets.
arXiv Detail & Related papers (2021-12-01T11:47:09Z)
- Survey of Low-Resource Machine Translation [65.52755521004794]
There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models.
There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available.
arXiv Detail & Related papers (2021-09-01T16:57:58Z)
- The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the world's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
- A High-Quality Multilingual Dataset for Structured Documentation Translation [101.41835967142521]
This paper presents a high-quality multilingual dataset for the documentation domain.
We collect XML-structured parallel text segments from the online documentation for an enterprise software platform.
arXiv Detail & Related papers (2020-06-24T02:08:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.