Related papers: Web Archives Metadata Generation with GPT-4o: Challenges and Insights

Web Archives Metadata Generation with GPT-4o: Challenges and Insights

URL: http://arxiv.org/abs/2411.05409v2
Date: Sat, 16 Nov 2024 02:27:09 GMT
Title: Web Archives Metadata Generation with GPT-4o: Challenges and Insights
Authors: Abigail Yongping Huang, Ashwin Nair, Zhen Rong Goh, Tianrui Liu,
Abstract summary: This paper explores the use ofgpt-4o for metadata generation within the Web Singapore Archive. We processed 112 Web ARChive (WARC) files using data reduction techniques, achieving a notable 99.9% reduction in metadata generation costs. The study identifies key challenges including content inaccuracies, hallucinations, and translation issues, suggesting that Large Language Models (LLMs) should serve as complements rather than replacements for human cataloguers.
Score: 2.45723043286596
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current metadata creation for web archives is time consuming and costly due to reliance on human effort. This paper explores the use of gpt-4o for metadata generation within the Web Archive Singapore, focusing on scalability, efficiency, and cost effectiveness. We processed 112 Web ARChive (WARC) files using data reduction techniques, achieving a notable 99.9% reduction in metadata generation costs. By prompt engineering, we generated titles and abstracts, which were evaluated both intrinsically using Levenshtein Distance and BERTScore, and extrinsically with human cataloguers using McNemar's test. Results indicate that while our method offers significant cost savings and efficiency gains, human curated metadata maintains an edge in quality. The study identifies key challenges including content inaccuracies, hallucinations, and translation issues, suggesting that Large Language Models (LLMs) should serve as complements rather than replacements for human cataloguers. Future work will focus on refining prompts, improving content filtering, and addressing privacy concerns through experimentation with smaller models. This research advances the integration of LLMs in web archiving, offering valuable insights into their current capabilities and outlining directions for future enhancements. The code is available at https://github.com/masamune-prog/warc2summary for further development and use by institutions facing similar challenges.

Related papers

Exploring LLM Capabilities in Extracting DCAT-Compatible Metadata for Data Cataloging [0.1424853531377145]
Data catalogs can support and accelerate data exploration by using metadata to answer user queries.<n>This study investigates whether LLMs can automate metadata maintenance of text-based data and generate high-quality DCAT-compatible metadata.<n>Our results show that LLMs can generate metadata comparable to human-created content, particularly on tasks that require advanced semantic understanding.
arXiv Detail & Related papers (2025-07-04T10:49:37Z)
MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic.<n>Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z)
Automated Archival Descriptions with Federated Intelligence of LLMs [2.271344459418284]
This work aims at exploring the potential of agentic AI and large language models (LLMs) in addressing the challenges of implementing a standardized archival description process. We introduce an agentic AI-driven system for automated generation of high-quality metadata descriptions of archival materials.
arXiv Detail & Related papers (2025-04-08T06:11:05Z)
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
Zero-Indexing Internet Search Augmented Generation for Large Language Models [15.138260067336455]
Retrieval augmented generation has emerged as an effective method to enhance large language model performance. This approach typically relies on an internal retrieval module that uses various indexing mechanisms to manage a static pre-processed corpus. In this paper, we explore an alternative approach that leverages standard search engine APIs to dynamically integrate the latest online information.
arXiv Detail & Related papers (2024-11-29T05:31:04Z)
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data. We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings [77.20838441870151]
Commit message generation is a crucial task in software engineering that is challenging to evaluate correctly. We use an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments. Our results indicate that edit distance exhibits the highest correlation, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation.
arXiv Detail & Related papers (2024-10-15T20:32:07Z)
Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long form content. Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning. Our experiments demonstrate on two datasets from different domains, that LLMs fine-tuned with the auxiliary task generate higher quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z)
Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models. We show how MAC can be combined with and improve the performance of popular alternatives such as retrieval augmented generations.
arXiv Detail & Related papers (2024-03-07T08:34:57Z)
Zero-Shot Topic Classification of Column Headers: Leveraging LLMs for Metadata Enrichment [0.0]
We propose a method to support metadata enrichment using topic annotations generated by three Large Language Models (LLMs): ChatGPT-3.5, GoogleBard, and GoogleGemini. We evaluate the impact of contextual information (i.e., dataset description) on the classification outcomes.
arXiv Detail & Related papers (2024-03-01T10:01:36Z)
Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data [22.461461600306688]
This work focuses on using embeddings to identify and remove "low-quality" code data. First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions. We devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset.
arXiv Detail & Related papers (2023-12-05T01:19:30Z)
Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue [18.325675189960833]
In support of open and reproducible research, there has been a rapidly increasing number of datasets made available for research. As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them. This paper proposes to leverage large language models (LLMs) for cost-effective annotation of subject metadata through the LLM-based in-context learning.
arXiv Detail & Related papers (2023-10-17T14:52:33Z)
Can LLMs Augment Low-Resource Reading Comprehension Datasets? Opportunities and Challenges [3.130575840003799]
GPT-4 can be used to augment existing reading comprehension datasets. This work serves to be the first analysis of LLMs as synthetic data augmenters for QA systems.
arXiv Detail & Related papers (2023-09-21T18:48:02Z)
Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? [49.688233418425995]
Struc-Bench is a comprehensive benchmark featuring prominent Large Language Models (LLMs) We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score) Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
arXiv Detail & Related papers (2023-09-16T11:31:58Z)
Improving Language Models via Plug-and-Play Retrieval Feedback [42.786225163763376]
Large language models (LLMs) exhibit remarkable performance across various NLP tasks. They often generate incorrect or hallucinated information, which hinders their practical applicability in real-world scenarios. We introduce ReFeed, a novel pipeline designed to enhance LLMs by providing automatic retrieval feedback in a plug-and-play framework.
arXiv Detail & Related papers (2023-05-23T12:29:44Z)
Advanced Data Augmentation Approaches: A Comprehensive Survey and Future directions [57.30984060215482]
We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique. We also provide comprehensive results of the data augmentation effect on three popular computer vision tasks, such as image classification, object detection and semantic segmentation.
arXiv Detail & Related papers (2023-01-07T11:37:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.