Signing Outside the Studio: Benchmarking Background Robustness for
Continuous Sign Language Recognition
- URL: http://arxiv.org/abs/2211.00448v1
- Date: Tue, 1 Nov 2022 13:27:44 GMT
- Title: Signing Outside the Studio: Benchmarking Background Robustness for
Continuous Sign Language Recognition
- Authors: Youngjoon Jang, Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, Joon Son
Chung, In So Kweon
- Abstract summary: We propose a pipeline to automatically generate a benchmark dataset utilizing existing Continuous Sign Language Recognition benchmarks.
Our newly constructed benchmark dataset consists of diverse scenes to simulate a real-world environment.
In this regard, we also propose a simple yet effective training scheme including (1) background randomization and (2) feature disentanglement for CSLR models.
- Score: 79.23777980180755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this work is background-robust continuous sign language
recognition. Most existing Continuous Sign Language Recognition (CSLR)
benchmarks have fixed backgrounds and are filmed in studios with a static
monochromatic background. However, signing is not limited only to studios in
the real world. In order to analyze the robustness of CSLR models under
background shifts, we first evaluate existing state-of-the-art CSLR models on
diverse backgrounds. To synthesize the sign videos with a variety of
backgrounds, we propose a pipeline to automatically generate a benchmark
dataset utilizing existing CSLR benchmarks. Our newly constructed benchmark
dataset consists of diverse scenes to simulate a real-world environment. We
observe even the most recent CSLR method cannot recognize glosses well on our
new dataset with changed backgrounds. In this regard, we also propose a simple
yet effective training scheme including (1) background randomization and (2)
feature disentanglement for CSLR models. The experimental results on our
dataset demonstrate that our method generalizes well to other unseen background
data with minimal additional training images.
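The background replacement and background randomization ideas described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a per-frame foreground mask of the signer (values in [0, 1]) is already available from some segmentation model, and it assumes one background is drawn per video. The function names `composite_background` and `randomize_backgrounds` are hypothetical.

```python
import numpy as np


def composite_background(frame, mask, background):
    """Alpha-blend one signer frame onto a new background.

    frame, background: uint8 arrays of shape (H, W, 3).
    mask: float array of shape (H, W), 1.0 on the signer, 0.0 elsewhere
          (assumed to come from an external segmentation step).
    """
    alpha = mask.astype(np.float32)[..., None]  # (H, W, 1), broadcast over channels
    blended = alpha * frame.astype(np.float32) + (1.0 - alpha) * background.astype(np.float32)
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)


def randomize_backgrounds(frames, masks, backgrounds, rng):
    """Composite every frame of a video onto one randomly chosen background.

    Using a single background per video is an assumption of this sketch.
    """
    background = backgrounds[rng.integers(len(backgrounds))]
    return np.stack(
        [composite_background(f, m, background) for f, m in zip(frames, masks)]
    )


if __name__ == "__main__":
    # Toy data: a 2-frame video whose left half is "signer", right half is background.
    frames = np.full((2, 4, 4, 3), 200, dtype=np.uint8)
    masks = np.zeros((2, 4, 4), dtype=np.float32)
    masks[:, :, :2] = 1.0
    backgrounds = [np.full((4, 4, 3), 50, dtype=np.uint8)]
    out = randomize_backgrounds(frames, masks, backgrounds, np.random.default_rng(0))
    print(out.shape)
```

Calling `randomize_backgrounds` with a fresh random draw at each training step would give a simple form of background randomization; the feature disentanglement part of the training scheme is not covered by this sketch.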
Related papers
- A Chinese Continuous Sign Language Dataset Based on Complex Environments [17.195286118443256]
We have constructed a large-scale dataset for Chinese continuous sign language (CSL) based on complex environments.
This dataset encompasses 5,988 continuous CSL video clips collected from daily life scenes.
We propose a time-frequency network (TFNet) model for continuous sign language recognition.
arXiv Detail & Related papers (2024-09-18T13:11:15Z)
- Contrastive Learning with Synthetic Positives [11.932323457691945]
Contrastive learning with the nearest neighbor has proved to be one of the most efficient self-supervised learning (SSL) techniques.
In this paper, we introduce a novel approach called Contrastive Learning with Synthetic Positives (NCLP).
NCLP utilizes synthetic images, generated by an unconditional diffusion model, as the additional positives to help the model learn from diverse positives.
arXiv Detail & Related papers (2024-08-30T01:47:43Z)
- A Transformer Model for Boundary Detection in Continuous Sign Language [55.05986614979846]
The Transformer model is employed for both Isolated Sign Language Recognition and Continuous Sign Language Recognition.
The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched.
The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos.
arXiv Detail & Related papers (2024-02-22T17:25:01Z)
- Visual Self-supervised Learning Scheme for Dense Prediction Tasks on X-ray Images [3.782392436834913]
Self-supervised learning (SSL) has led to considerable progress in natural language processing (NLP).
However, the incorporation of contrastive learning into existing visual SSL models has led to considerable progress, often surpassing supervised counterparts.
Here, we focus on dense prediction tasks using security inspection x-ray images to evaluate our proposed model, Segment localization (SegLoc).
Based upon the Instance localization (InsLoc) model, SegLoc addresses one of the key challenges of contrastive learning, i.e., false negative pairs of query embeddings.
arXiv Detail & Related papers (2023-10-12T15:42:17Z)
- PRIOR: Prototype Representation Joint Learning from Medical Images and Reports [19.336988866061294]
We present a prototype representation learning framework incorporating both global and local alignment between medical images and reports.
In contrast to standard global multi-modality alignment methods, we employ a local alignment module for fine-grained representation.
A sentence-wise prototype memory bank is constructed, enabling the network to focus on low-level localized visual and high-level clinical linguistic features.
arXiv Detail & Related papers (2023-07-24T07:49:01Z)
- Evaluating The Robustness of Self-Supervised Representations to Background/Foreground Removal [4.007351600492541]
We consider state-of-the-art SSL pretrained models, such as DINOv2, MAE, and SwAV, and analyze changes at the representation level across 4 image classification datasets.
Empirically, we show that not all models lead to representations that separate foreground, background, and complete images.
arXiv Detail & Related papers (2023-06-02T09:46:22Z)
- DETA: Denoised Task Adaptation for Few-Shot Learning [135.96805271128645]
Test-time task adaptation in few-shot learning aims to adapt a pre-trained task-agnostic model for capturing task-specific knowledge.
With only a handful of samples available, the adverse effect of either the image noise (a.k.a. X-noise) or the label noise (a.k.a. Y-noise) from support samples can be severely amplified.
We propose DEnoised Task Adaptation (DETA), a first, unified image- and label-denoising framework orthogonal to existing task adaptation approaches.
arXiv Detail & Related papers (2023-03-11T05:23:20Z)
- Semantic keypoint-based pose estimation from single RGB frames [64.80395521735463]
We present an approach to estimating the continuous 6-DoF pose of an object from a single RGB image.
The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model.
We show that our approach can accurately recover the 6-DoF object pose for both instance- and class-based scenarios.
arXiv Detail & Related papers (2022-04-12T15:03:51Z)
- Rectifying the Shortcut Learning of Background: Shared Object Concentration for Few-Shot Image Recognition [101.59989523028264]
Few-Shot image classification aims to utilize pretrained knowledge learned from a large-scale dataset to tackle a series of downstream classification tasks.
We propose COSOC, a novel Few-Shot Learning framework, to automatically figure out foreground objects at both the pretraining and evaluation stages.
arXiv Detail & Related papers (2021-07-16T07:46:41Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.