Thinking Like an Annotator: Generation of Dataset Labeling Instructions
- URL: http://arxiv.org/abs/2306.14035v1
- Date: Sat, 24 Jun 2023 18:32:48 GMT
- Title: Thinking Like an Annotator: Generation of Dataset Labeling Instructions
- Authors: Nadine Chang, Francesco Ferroni, Michael J. Tarr, Martial Hebert, Deva Ramanan
- Abstract summary: We introduce a new task, Labeling Instruction Generation, to address missing publicly available labeling instructions.
We take a reasonably annotated dataset and: 1) generate a set of examples that are visually representative of each category in the dataset; 2) provide a text label that corresponds to each of the examples.
This framework acts as a proxy for human annotators and can help both to generate a final labeling instruction set and to evaluate its quality.
- Score: 59.603239753484345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale datasets are essential to modern-day deep learning. Advocates
argue that understanding these methods requires dataset transparency (e.g.
"dataset curation, motivation, composition, collection process, etc...").
However, almost no one has suggested the release of the detailed definitions
and visual category examples provided to annotators - information critical to
understanding the structure of the annotations present in each dataset. These
labels are at the heart of public datasets, yet few datasets include the
instructions that were used to generate them. We introduce a new task, Labeling
Instruction Generation, to address missing publicly available labeling
instructions. In Labeling Instruction Generation, we take a reasonably
annotated dataset and: 1) generate a set of examples that are visually
representative of each category in the dataset; 2) provide a text label that
corresponds to each of the examples. We introduce a framework that requires no
model training to solve this task and includes a newly created rapid retrieval
system that leverages a large, pre-trained vision and language model. This
framework acts as a proxy for human annotators, helping both to generate a
final labeling instruction set and to evaluate its quality. Our framework
generates multiple diverse visual and text representations of dataset
categories. The optimized instruction set outperforms our strongest baseline
across 5 folds by 7.06 mAP for NuImages and 12.9 mAP for COCO.
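
As a rough illustration of the rapid-retrieval idea above, the sketch below ranks dataset images against a category name with an off-the-shelf CLIP model. This is not the paper's actual system; the checkpoint, the prompt template, and the `top_k` value are assumptions.

```python
# A minimal sketch of category-representative image retrieval with a
# pre-trained vision-language model (CLIP). NOT the paper's exact system;
# the checkpoint, prompt template, and top_k are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    # Encode dataset images into the joint image-text embedding space.
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def representative_examples(category, image_paths, image_feats, top_k=5):
    # Rank dataset images by similarity to the category's text embedding.
    inputs = processor(text=[f"a photo of a {category}"], return_tensors="pt")
    with torch.no_grad():
        text_feat = model.get_text_features(**inputs)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feats @ text_feat.T).squeeze(-1)
    idx = scores.topk(top_k).indices.tolist()
    return [(image_paths[i], float(scores[i])) for i in idx]
```

Pairing each retrieved image with its category text yields a draft (image, label) instruction set that a human reviewer could then audit, mirroring the proxy-annotator role described in the abstract.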
Related papers
- Scribbles for All: Benchmarking Scribble Supervised Segmentation Across Datasets [51.74296438621836]
We introduce Scribbles for All, a label and training data generation algorithm for semantic segmentation trained on scribble labels.
The main limitation of scribbles as a source of weak supervision is the lack of challenging datasets for scribble segmentation.
Scribbles for All provides scribble labels for several popular segmentation datasets, along with an algorithm that automatically generates scribble labels for any dataset with dense annotations (a toy sketch of the dense-to-scribble idea follows this entry).
arXiv Detail & Related papers (2024-08-22T15:29:08Z)
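
A toy version of deriving scribbles from dense annotations is to skeletonize each class region. The paper's actual generation algorithm is more involved; `mask_to_scribbles` and the `ignore_index` convention are our assumptions.

```python
# A toy sketch of deriving a scribble from a dense segmentation mask by
# skeletonizing each class region; illustrates the dense-to-scribble idea,
# not Scribbles for All's actual algorithm.
import numpy as np
from skimage.morphology import skeletonize

def mask_to_scribbles(label_map, ignore_index=255):
    # label_map: (H, W) integer array of per-pixel class ids.
    scribbles = np.full_like(label_map, ignore_index)
    for cls in np.unique(label_map):
        if cls == ignore_index:
            continue
        skeleton = skeletonize(label_map == cls)  # thin each region to ~1px curves
        scribbles[skeleton] = cls
    return scribbles
```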
- GenQA: Generating Millions of Instructions from a Handful of Prompts [67.54980063851605]
Most public instruction finetuning datasets are relatively small compared to the closed-source datasets used to train industry models.
In this work, we study methods for generating large instruction datasets from a single prompt (a hedged generation sketch follows this entry).
Our dataset meets or exceeds both WizardLM and Ultrachat on knowledge-intensive leaderboard tasks and on conversational evaluations.
arXiv Detail & Related papers (2024-06-14T17:44:08Z)
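
One plausible reading of "many instructions from one prompt" is a meta-prompt with injected randomness. The sketch below assumes a caller-supplied `complete` callable wrapping some LLM API; the meta-prompt text and topic seeding are our inventions, not GenQA's pipeline.

```python
# A hedged sketch of scaling one meta-prompt into many instructions by
# injecting randomness. `complete` is a hypothetical LLM-call wrapper
# supplied by the caller; this is not GenQA's actual method.
import random

META_PROMPT = (
    "Write one unusual, self-contained instruction a user might give an AI "
    "assistant about the topic: {topic}. Output only the instruction."
)

def generate_instructions(complete, topics, n, seed=0):
    # `complete` maps a prompt string to a model response string.
    rng = random.Random(seed)
    return [complete(META_PROMPT.format(topic=rng.choice(topics)))
            for _ in range(n)]
```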
- tasksource: A Dataset Harmonization Framework for Streamlined NLP Multi-Task Learning and Evaluation [2.869669835645836]
We release a dataset annotation framework and dataset annotations for more than 500 English tasks.
These annotations include metadata, such as the names of columns to be used as input or labels for all datasets (a minimal sketch of such a column mapping follows this entry).
We fine-tune a multi-task text encoder on all tasksource tasks, outperforming every publicly available text encoder of comparable size in an external evaluation.
arXiv Detail & Related papers (2023-01-14T16:38:04Z)
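
The harmonization idea can be sketched as a per-dataset metadata record naming the input and label columns, so every task loads through one code path. The field names and example datasets below are our assumptions, not tasksource's actual schema.

```python
# A minimal sketch of tasksource-style harmonization metadata: each dataset
# records which columns are inputs and which is the label. Field names and
# example tasks are assumptions, not the real tasksource schema.
from dataclasses import dataclass

@dataclass
class TaskMetadata:
    dataset_name: str
    input_columns: list[str]
    label_column: str

TASKS = [
    TaskMetadata("glue/mnli", ["premise", "hypothesis"], "label"),
    TaskMetadata("glue/sst2", ["sentence"], "label"),
]

def harmonize(example: dict, meta: TaskMetadata) -> dict:
    # Map any dataset row to a uniform {inputs, label} record.
    return {
        "inputs": " [SEP] ".join(str(example[c]) for c in meta.input_columns),
        "label": example[meta.label_column],
    }
```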
- Explaining Deep Convolutional Neural Networks via Latent Visual-Semantic Filter Attention [7.237370981736913]
We propose a framework to teach any existing convolutional neural network to generate text descriptions of its own latent representations at the filter level (a rough hook-based sketch follows this entry).
We show that our method can generate novel descriptions for learned filters beyond the set of categories defined in the training dataset.
We also demonstrate a novel application of our method for unsupervised dataset bias analysis.
arXiv Detail & Related papers (2022-04-10T04:57:56Z)
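
A crude proxy for filter-level description is to hook a convolutional layer, find the images that most activate a given filter, and then describe those images with any captioning or retrieval model. This is not the paper's latent visual-semantic attention; the layer and filter choices below are examples.

```python
# A rough proxy for filter-level description (not the paper's method): hook
# a conv layer's activations and rank images by how strongly they drive one
# filter. Layer choice and input sizes are examples.
import torch
import torchvision.models as models

model = models.resnet50(weights="IMAGENET1K_V2").eval()
activations = {}

def hook(module, inp, out):
    # Store the mean spatial activation per filter: (batch, channels).
    activations["feat"] = out.mean(dim=(2, 3)).detach()

model.layer4.register_forward_hook(hook)

def top_images_for_filter(image_batch, filter_idx, k=5):
    # image_batch: (N, 3, 224, 224) normalized tensor of dataset images.
    with torch.no_grad():
        model(image_batch)
    scores = activations["feat"][:, filter_idx]
    return scores.topk(k).indices.tolist()  # indices of top-activating images
```

Feeding the top-activating images to a vision-language model approximates the paper's goal of naming what a filter responds to.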
- The Weak Supervision Landscape [5.186945902380689]
We propose a framework for categorising weak supervision settings.
We identify the key elements that characterise weak supervision and devise a series of dimensions that categorise most of the existing approaches.
We show how common settings in the literature fit within the framework and discuss its possible uses in practice.
arXiv Detail & Related papers (2022-03-30T13:19:43Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- DART: A Lightweight Quality-Suggestive Data-to-Text Annotation Tool [15.268017930901332]
The Data AnnotatoR Tool (DART) is an interactive application that reduces human efforts in annotating large quantities of structured data.
By using a sequence-to-sequence model, our system iteratively analyzes the annotated labels in order to better sample unlabeled data.
In a simulation experiment on annotating large quantities of structured data, DART reduced the total number of annotations needed by combining active learning with automatic suggestion of relevant labels (a generic active-learning sketch follows this entry).
arXiv Detail & Related papers (2020-10-08T17:36:34Z)
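
The annotation-saving loop can be sketched generically with margin-based uncertainty sampling. DART's actual sequence-to-sequence sampling criterion is not reproduced here; `train`, `predict_proba`, and `human_annotate` are caller-supplied callables, not DART's API.

```python
# A generic active-learning loop in the spirit of DART's annotation saving:
# repeatedly annotate the items the current model is least sure about.
# `train`, `predict_proba`, and `human_annotate` are hypothetical callables
# supplied by the caller; this is not DART's seq2seq criterion.
import numpy as np

def uncertainty_sampling_loop(train, predict_proba, human_annotate,
                              labeled, unlabeled, rounds=10, batch=20):
    # labeled: list of (item, label) pairs; unlabeled: list of items.
    model = None
    for _ in range(rounds):
        if not unlabeled:
            break
        model = train(labeled)                   # refit on current labels
        probs = predict_proba(model, unlabeled)  # shape (N, num_classes)
        part = np.sort(probs, axis=1)
        margin = part[:, -1] - part[:, -2]       # small margin = uncertain
        pick = np.argsort(margin)[:batch]
        for i in sorted(pick.tolist(), reverse=True):
            item = unlabeled.pop(i)              # pop high indices first
            labeled.append((item, human_annotate(item)))
    return model
```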
- UniT: Unified Knowledge Transfer for Any-shot Object Detection and Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large-scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision levels.
arXiv Detail & Related papers (2020-06-12T22:45:47Z)
- Cross-dataset Training for Class Increasing Object Detection [52.34737978720484]
We present a conceptually simple, flexible and general framework for cross-dataset training in object detection.
By cross-dataset training, existing datasets can be utilized to detect the merged object classes with a single model.
With cross-dataset training, only the new classes need to be labeled on the new dataset (a label-space merging sketch follows this entry).
arXiv Detail & Related papers (2020-01-14T04:40:47Z)
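
The core bookkeeping behind cross-dataset training can be sketched as merging class lists and remapping each dataset's category ids into the union, so one detector trains on all annotations. The dataset and class names below are invented examples.

```python
# A minimal sketch of cross-dataset label-space merging: build the union of
# class names, then remap each dataset's local category ids into the merged
# space. Dataset and class names are invented examples.
DATASET_CLASSES = {
    "dataset_a": ["person", "car", "bicycle"],
    "dataset_b": ["car", "truck", "traffic light"],  # overlaps on "car"
}

# Merged label space: stable order, duplicates collapsed by name.
merged = sorted({c for classes in DATASET_CLASSES.values() for c in classes})
merged_id = {name: i for i, name in enumerate(merged)}

def remap_annotation(dataset, local_class_id):
    # Translate a dataset-local class id into the merged label space.
    name = DATASET_CLASSES[dataset][local_class_id]
    return merged_id[name]

print(remap_annotation("dataset_a", 1), remap_annotation("dataset_b", 0))
# Both calls map "car" to the same merged id.
```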
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.