Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An
Approach
- URL: http://arxiv.org/abs/2108.02399v1
- Date: Thu, 5 Aug 2021 06:28:32 GMT
- Title: Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An
Approach
- Authors: Zeren Sun, Yazhou Yao, Xiu-Shen Wei, Yongshun Zhang, Fumin Shen,
Jianxin Wu, Jian Zhang, Heng-Tao Shen
- Abstract summary: We construct two new benchmark webly supervised fine-grained datasets, termed WebFG-496 and WebiNat-5089.
WebiNat-5089 contains 5,089 sub-categories and more than 1.1 million web training images, making it the largest webly supervised fine-grained dataset to date.
As a minor contribution, we also propose a novel webly supervised method (termed "Peer-learning") for benchmarking these datasets.
- Score: 115.91099791629104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from the web can ease the extreme dependence of deep learning on
large-scale manually labeled datasets. Especially for fine-grained recognition,
which aims to distinguish subordinate categories, leveraging free web data can
significantly reduce labeling costs. Despite its significant
practical and research value, the webly supervised fine-grained recognition
problem is not extensively studied in the computer vision community, largely
due to the lack of high-quality datasets. To fill this gap, in this paper we
construct two new benchmark webly supervised fine-grained datasets, termed
WebFG-496 and WebiNat-5089. Concretely, WebFG-496 consists of
three sub-datasets containing a total of 53,339 web training images with 200
species of birds (Web-bird), 100 types of aircraft (Web-aircraft), and 196
models of cars (Web-car). WebiNat-5089 contains 5,089 sub-categories and
more than 1.1 million web training images, making it the largest webly
supervised fine-grained dataset to date. As a minor contribution, we also propose
a novel webly supervised method (termed "Peer-learning") for benchmarking
these datasets. Comprehensive experimental results and analyses on the two new
benchmark datasets demonstrate that the proposed method achieves superior
performance over competing baseline models and the state-of-the-art. Our
benchmark datasets and the source codes of Peer-learning have been made
available at
https://github.com/NUST-Machine-Intelligence-Laboratory/weblyFG-dataset.
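The abstract does not spell out the Peer-learning algorithm itself. As a rough, hypothetical illustration of one common family of techniques for learning from noisy web labels (co-teaching-style small-loss selection between two peer networks), consider the sketch below. The function names, the keep ratio, and the sample-exchange scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: small-loss sample exchange between two peer networks,
# in the spirit of co-teaching-style methods for noisy web labels.
# This is NOT the paper's exact Peer-learning algorithm.

def small_loss_indices(losses, keep_ratio):
    """Indices of the keep_ratio fraction of samples with the smallest loss,
    treated as likely-clean under the memorization effect."""
    k = max(1, int(len(losses) * keep_ratio))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:k])

def peer_exchange(losses_a, losses_b, keep_ratio):
    """Each network selects its small-loss subset and hands it to the peer:
    A trains on B's selection, B trains on A's selection."""
    train_a = small_loss_indices(losses_b, keep_ratio)
    train_b = small_loss_indices(losses_a, keep_ratio)
    return train_a, train_b

# Toy batch: sample 3 has a large loss for both peers (likely mislabeled web noise).
losses_a = [0.2, 0.5, 0.3, 2.9]
losses_b = [0.4, 0.1, 0.6, 3.1]
train_a, train_b = peer_exchange(losses_a, losses_b, keep_ratio=0.75)
print(train_a)  # -> [0, 1, 2]
print(train_b)  # -> [0, 1, 2]
```

Exchanging selections between two differently initialized networks is one standard way to keep a single network from reinforcing its own mistakes on noisy web data.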
Related papers
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale [30.955171096569618]
FineWeb is a 15-trillion token dataset derived from 96 Common Crawl snapshots.
FineWeb-Edu is a 1.3-trillion token collection of educational text filtered from FineWeb.
arXiv Detail & Related papers (2024-06-25T13:50:56Z)
- From Categories to Classifier: Name-Only Continual Learning by Exploring
the Web [125.75085825742092]
Continual learning often relies on the availability of extensive annotated datasets, an assumption that is unrealistically time-consuming and costly in practice.
We explore a novel paradigm termed name-only continual learning where time and cost constraints prohibit manual annotation.
Our proposed solution leverages the expansive and ever-evolving internet to query and download uncurated webly-supervised data for image classification.
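The name-only setting described above starts from class names alone. A minimal sketch of its first step, forming web-search queries from class names, is below; the query templates are illustrative assumptions, and no actual downloading is performed.

```python
# Minimal sketch of the name-only setting: given only class names, form
# web-search queries to gather uncurated webly supervised training images.
# Template strings here are hypothetical, not from the paper.

def build_queries(class_names, templates=("{} photo", "{} image")):
    """One search query per (class, template) pair."""
    return [t.format(name) for name in class_names for t in templates]

queries = build_queries(["monarch butterfly", "boeing 747"])
print(queries)
# -> ['monarch butterfly photo', 'monarch butterfly image',
#     'boeing 747 photo', 'boeing 747 image']
```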
arXiv Detail & Related papers (2023-11-19T10:43:43Z)
- ELFIS: Expert Learning for Fine-grained Image Recognition Using Subsets [6.632855264705276]
We propose ELFIS, an expert learning framework for Fine-Grained Visual Recognition.
A set of neural networks-based experts are trained focusing on the meta-categories and are integrated into a multi-task framework.
Experiments show accuracy improvements of up to +1.3% on SoTA FGVR benchmarks using both CNNs and transformer-based networks.
arXiv Detail & Related papers (2023-03-16T12:45:19Z)
- GROWN+UP: A Graph Representation Of a Webpage Network Utilizing
Pre-training [0.2538209532048866]
We introduce a task-agnostic deep graph neural network feature extractor that can ingest webpage structures, pre-train self-supervised on massive unlabeled data, and fine-tune to arbitrary tasks on webpages effectively.
We show that our pre-trained model achieves state-of-the-art results using multiple datasets on two very different benchmarks: webpage boilerplate removal and genre classification.
arXiv Detail & Related papers (2022-08-03T13:37:27Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- Data-Free Adversarial Knowledge Distillation for Graph Neural Networks [62.71646916191515]
We propose DFAD-GNN, the first end-to-end framework for data-free adversarial knowledge distillation on graph-structured data.
Specifically, DFAD-GNN employs a generative adversarial setup with three components: a pre-trained teacher model and a student model, regarded as two discriminators, and a generator that derives training graphs for distilling knowledge from the teacher into the student.
Our DFAD-GNN significantly surpasses state-of-the-art data-free baselines in the graph classification task.
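The three-component setup above can be illustrated with a deliberately schematic toy, not the paper's GNN implementation: a "generator" proposes inputs where teacher and student disagree most, and the student then fits the frozen teacher there. The scalar models, candidate grid, and learning rate are all assumptions for illustration.

```python
# Schematic toy of data-free adversarial distillation. Models are 1-D
# scalar functions; the "generator" is a coarse search over candidates.

def teacher(x):
    return 2.0 * x + 1.0           # frozen pre-trained model (assumed form)

def make_student(w, b):
    return lambda x: w * x + b     # student with learnable (w, b)

def generator_step(student, candidates):
    """Adversarial step: pick the input maximizing teacher/student disagreement."""
    return max(candidates, key=lambda x: abs(teacher(x) - student(x)))

def student_step(w, b, x, lr=0.1):
    """Distillation step: gradient descent on (student(x) - teacher(x))^2."""
    err = (w * x + b) - teacher(x)
    return w - lr * 2 * err * x, b - lr * 2 * err

w, b = 0.0, 0.0
candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
for _ in range(200):
    x = generator_step(make_student(w, b), candidates)
    w, b = student_step(w, b, x)

print(round(w, 2), round(b, 2))  # student approaches the teacher: 2.0 1.0
```

The min-max dynamic is the key idea: because no real training data is available, the generator must keep manufacturing inputs that expose the student's remaining disagreement with the teacher.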
arXiv Detail & Related papers (2022-05-08T08:19:40Z)
- The Klarna Product Page Dataset: Web Element Nomination with Graph
Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety.
First, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task.
Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page.
Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z)
- On The State of Data In Computer Vision: Human Annotations Remain
Indispensable for Developing Deep Learning Models [0.0]
High-quality labeled datasets play a crucial role in fueling the development of machine learning (ML).
Since the emergence of the ImageNet dataset and the AlexNet model in 2012, the size of new open-source labeled vision datasets has remained roughly constant.
Only a minority of publications in the computer vision community tackle supervised learning on datasets that are orders of magnitude larger than ImageNet.
arXiv Detail & Related papers (2021-07-31T00:08:21Z)
- Facial Age Estimation using Convolutional Neural Networks [0.0]
This paper is a part of a student project in Machine Learning at the Norwegian University of Science and Technology.
A deep convolutional neural network with five convolutional layers and three fully-connected layers is presented to estimate the ages of individuals based on images.
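To make the five-conv/three-FC architecture concrete, the sketch below counts parameters for such a network. The channel counts, kernel sizes, pooled feature-map size, and single regression output are illustrative assumptions; the paper's exact configuration may differ.

```python
# Hedged sketch: parameter counting for a CNN with five convolutional and
# three fully-connected layers. All layer sizes below are assumptions.

def conv_params(in_ch, out_ch, k):
    """Weights (k*k*in_ch per filter) plus one bias per output channel."""
    return (k * k * in_ch + 1) * out_ch

def fc_params(in_dim, out_dim):
    """Dense weight matrix plus biases."""
    return (in_dim + 1) * out_dim

# Five conv layers (assumed channels), 3x3 kernels throughout.
convs = [(3, 32), (32, 64), (64, 128), (128, 256), (256, 256)]
conv_total = sum(conv_params(i, o, 3) for i, o in convs)

# Three fully-connected layers; 256 channels pooled to assumed 7x7 maps,
# ending in a single age-regression output.
fcs = [(256 * 7 * 7, 1024), (1024, 512), (512, 1)]
fc_total = sum(fc_params(i, o) for i, o in fcs)

print(conv_total, fc_total)  # the FC layers dominate the parameter count
```

Counting this way makes a standard design trade-off visible: the first fully-connected layer after flattening typically holds far more parameters than all convolutional layers combined.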
arXiv Detail & Related papers (2021-05-14T10:09:47Z)
- NWPU-Crowd: A Large-Scale Benchmark for Crowd Counting and Localization [101.13851473792334]
We construct a large-scale congested crowd counting and localization dataset, NWPU-Crowd, consisting of 5,109 images, in a total of 2,133,375 annotated heads with points and boxes.
Compared with other real-world datasets, it contains various illumination scenes and has the largest density range (0 to 20,033).
We describe the data characteristics, evaluate the performance of some mainstream state-of-the-art (SOTA) methods, and analyze the new problems that arise on the new data.
arXiv Detail & Related papers (2020-01-10T09:26:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.