AGGA: A Dataset of Academic Guidelines for Generative AI and Large Language Models
- URL: http://arxiv.org/abs/2501.02063v3
- Date: Tue, 18 Mar 2025 16:45:54 GMT
- Title: AGGA: A Dataset of Academic Guidelines for Generative AI and Large Language Models
- Authors: Junfeng Jiao, Saleh Afroogh, Kevin Chen, David Atkinson, Amit Dhurandhar,
- Abstract summary: This study introduces AGGA, a dataset comprising 80 academic guidelines for the use of Generative AIs (GAIs) and Large Language Models (LLMs) in academic settings. The dataset contains 188,674 words and serves as a valuable resource for natural language processing tasks commonly applied in requirements engineering.
- Score: 8.420666056013685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study introduces AGGA, a dataset comprising 80 academic guidelines for the use of Generative AIs (GAIs) and Large Language Models (LLMs) in academic settings, meticulously collected from official university websites. The dataset contains 188,674 words and serves as a valuable resource for natural language processing tasks commonly applied in requirements engineering, such as model synthesis, abstraction identification, and document structure assessment. Additionally, AGGA can be further annotated to function as a benchmark for various tasks, including ambiguity detection, requirements categorization, and the identification of equivalent requirements. Our methodologically rigorous approach ensured a thorough examination, with a selection of universities that represent a diverse range of global institutions, including top-ranked universities across six continents. The dataset captures perspectives from a variety of academic fields, including humanities, technology, and both public and private institutions, offering a broad spectrum of insights into the integration of GAIs and LLMs in academia.
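The abstract lists ambiguity detection among the requirements-engineering tasks AGGA could support. As a minimal sketch of what such a task might look like, the snippet below scans guideline text for "weak words" that often signal ambiguity in requirements documents. The cue lexicon and the sample guideline sentence are illustrative assumptions, not part of AGGA itself.

```python
import re

# Single-word cues that commonly signal ambiguity in requirements text
# (an assumed, illustrative lexicon -- not taken from the AGGA dataset).
AMBIGUITY_CUES = {"may", "should", "appropriate", "reasonable", "generally"}

def flag_ambiguous_sentences(text):
    """Return (sentence, cues_found) pairs for sentences containing cue words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sent in sentences:
        words = set(re.findall(r"[a-z]+", sent.lower()))
        cues = sorted(AMBIGUITY_CUES & words)
        if cues:
            flagged.append((sent, cues))
    return flagged

# Hypothetical guideline text, for illustration only.
guideline = ("Students may use generative AI tools where appropriate. "
             "All AI-generated content must be cited explicitly.")
for sent, cues in flag_ambiguous_sentences(guideline):
    print(cues, "->", sent)
```

A lexicon scan like this is only a first pass; annotated AGGA text, as the abstract suggests, would allow training and benchmarking proper classifiers for the same task.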
Related papers
- Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks [56.350173737493215]
Environmental, Social, and Governance (ESG) reports are essential for evaluating sustainability practices, ensuring regulatory compliance, and promoting financial transparency. MMESGBench is a first-of-its-kind benchmark dataset for evaluating multimodal understanding and complex reasoning across structurally diverse and multi-source ESG documents. MMESGBench comprises 933 validated QA pairs derived from 45 ESG documents, spanning seven distinct document types and three major ESG source categories.
arXiv Detail & Related papers (2025-07-25T03:58:07Z) - From Query to Explanation: Uni-RAG for Multi-Modal Retrieval-Augmented Learning in STEM [35.20687923222239]
We develop a lightweight, efficient multi-modal retrieval module named Uni-Retrieval. It extracts query-style prototypes and dynamically matches them with tokens from a continually updated Prompt Bank. This Prompt Bank encodes and stores domain-specific knowledge by leveraging a Mixture-of-Expert Low-Rank Adaptation (MoE-LoRA) module. We integrate the original Uni-Retrieval with a compact instruction-tuned language model, forming a complete retrieval-augmented generation pipeline named Uni-RAG.
arXiv Detail & Related papers (2025-07-05T02:44:38Z) - A Comparative Study of Task Adaptation Techniques of Large Language Models for Identifying Sustainable Development Goals [39.71115518041856]
This study analyzes various proprietary and open-source text classification models for a single-label, multi-class text classification task focused on the UN's Sustainable Development Goals. The results reveal that smaller models, when optimized through prompt engineering, can perform on par with larger models like OpenAI's GPT.
arXiv Detail & Related papers (2025-06-18T07:42:32Z) - AI-Generated Game Commentary: A Survey and a Datasheet Repository [4.396546075994102]
We introduce a general framework for AIGGC and present a comprehensive survey of 45 existing game commentary datasets and methods. To support future research and benchmarking, we also provide a structured appendix, which is publicly available in an open repository.
arXiv Detail & Related papers (2025-06-17T07:04:51Z) - IGGA: A Dataset of Industrial Guidelines and Policy Statements for Generative AIs [8.420666056013685]
This paper introduces IGGA, a dataset of 160 industry guidelines and policy statements for the use of Generative AIs (GAIs) and Large Language Models (LLMs) in industry and workplace settings. The dataset contains 104,565 words and serves as a valuable resource for natural language processing tasks commonly applied in requirements engineering.
arXiv Detail & Related papers (2025-01-01T21:31:47Z) - Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset (GIST) [19.91873751674613]
GIST is a large-scale multilingual AI terminology dataset containing 5K terms extracted from top AI conference papers spanning 2000 to 2023.
The terms are translated into Arabic, Chinese, French, Japanese, and Russian using a hybrid framework that combines LLMs for extraction with human expertise for translation.
This work aims to address critical gaps in AI terminology resources and fosters global inclusivity and collaboration in AI research.
arXiv Detail & Related papers (2024-12-24T11:50:18Z) - From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons [85.99268361356832]
We introduce a process of adapting an MLLM to a Generalist Embodied Agent (GEA). GEA is a single unified model capable of grounding itself across varied domains through a multi-embodiment action tokenizer. Our findings reveal the importance of training with cross-domain data and online RL for building generalist agents.
arXiv Detail & Related papers (2024-12-11T15:06:25Z) - Personalized Multimodal Large Language Models: A Survey [127.9521218125761]
Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications.
arXiv Detail & Related papers (2024-12-03T03:59:03Z) - A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications [0.0838491111002084]
We introduce a large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs.
We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards.
arXiv Detail & Related papers (2024-10-08T13:08:07Z) - Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models [153.14575887549088]
We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs).
GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines.
With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with a broad coverage across the entire spectrum of human knowledge and skills.
arXiv Detail & Related papers (2024-02-20T15:00:35Z) - Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but also unearths data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z) - Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation.
We present an extensive overview by categorizing these works in terms of various IE subtasks and techniques.
We empirically analyze the most advanced methods and discover the emerging trend of IE tasks with LLMs.
arXiv Detail & Related papers (2023-12-29T14:25:22Z) - Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey [100.24095818099522]
Large language models (LLMs) have significantly advanced the field of natural language processing (NLP).
They provide a highly useful, task-agnostic foundation for a wide range of applications.
However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles.
arXiv Detail & Related papers (2023-05-30T03:00:30Z) - OAG-BERT: Pre-train Heterogeneous Entity-augmented Academic Language Model [45.419270950610624]
OAG-BERT integrates massive heterogeneous entities including paper, author, concept, venue, and affiliation.
We develop novel pre-training strategies including heterogeneous entity type embedding, entity-aware 2D positional encoding, and span-aware entity masking.
OAG-BERT has been deployed in multiple real-world applications, such as reviewer recommendation for the NSFC (National Natural Science Foundation of China) and paper tagging in the AMiner system.
arXiv Detail & Related papers (2021-03-03T14:00:57Z) - Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embeddings of different levels of linguistic units in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.