Related papers: Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming

URL: http://arxiv.org/abs/2203.01382v1
Date: Wed, 2 Mar 2022 19:57:32 GMT
Title: Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming
Authors: Cheng-Yu Hsieh, Jieyu Zhang, Alexander Ratner
Abstract summary: We present Nemo, an end-to-end interactive system that improves the overall productivity of the WS learning pipeline by an average of 20% (and up to 47% in one task) compared to the prevailing WS approach.
Score: 77.38174112525168
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Weak Supervision (WS) techniques allow users to efficiently create large training datasets by programmatically labeling data with heuristic sources of supervision. While the success of WS relies heavily on the provided labeling heuristics, the process of how these heuristics are created in practice has remained under-explored. In this work, we formalize the development process of labeling heuristics as an interactive procedure, built around the existing workflow where users draw ideas from a selected set of development data for designing the heuristic sources. With the formalism, we study two core problems of how to strategically select the development data to guide users in efficiently creating informative heuristics, and how to exploit the information within the development process to contextualize and better learn from the resultant heuristics. Building upon two novel methodologies that effectively tackle the respective problems considered, we present Nemo, an end-to-end interactive system that improves the overall productivity of WS learning pipeline by an average 20% (and up to 47% in one task) compared to the prevailing WS approach.
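
As a rough illustration of what "programmatically labeling data with heuristic sources of supervision" can look like, here is a minimal Python sketch. It is not Nemo's method or API: the toy spam task, the three heuristic functions, and the majority-vote aggregation are all assumptions chosen for illustration only.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a heuristic may decline to label an example

# Hypothetical labeling heuristics for a toy spam task (1 = spam, 0 = not spam).
def lf_contains_free(text: str) -> Optional[int]:
    return 1 if "free" in text.lower() else ABSTAIN

def lf_contains_meeting(text: str) -> Optional[int]:
    return 0 if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text: str) -> Optional[int]:
    return 1 if text.count("!") >= 2 else ABSTAIN

LabelingFunction = Callable[[str], Optional[int]]

def majority_vote(text: str, lfs: List[LabelingFunction]) -> Optional[int]:
    """Aggregate heuristic votes; abstain when no heuristic fires or on a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tied vote: leave the example unlabeled
    return ranked[0][0]

LFS = [lf_contains_free, lf_contains_meeting, lf_many_exclamations]

if __name__ == "__main__":
    for doc in ["Free prizes!! Claim now", "Team meeting at 10am", "See you tomorrow"]:
        print(repr(doc), "->", majority_vote(doc, LFS))
```

Majority vote is used here only because it is the simplest aggregation rule. In the interactive setting the abstract describes, a user would look at selected development examples and write further heuristics like the ones above.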

Related papers

Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation [54.945281159783896]
We present a scalable pipeline for automatically generating high-quality training data for web agents. We introduce a novel constraint-based evaluation framework that provides fine-grained assessment of progress towards task completion.
arXiv Detail & Related papers (2026-02-13T02:52:18Z)
Data Science and Technology Towards AGI Part I: Tiered Data Management [53.64581824953229]
We argue that the development of artificial intelligence is entering a new phase of data-model co-evolution. We introduce an L0-L4 tiered data management framework, ranging from raw uncurated resources to organized and verifiable knowledge. We validate the effectiveness of the proposed framework through empirical studies.
arXiv Detail & Related papers (2026-02-09T18:47:51Z)
LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.78866929908871]
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data. We present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback.
arXiv Detail & Related papers (2025-06-02T22:36:02Z)
Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning. We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads. We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z)
Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments [33.83610929282721]
Learn-by-interact is a data-centric framework to adapt large language models (LLMs) to any given environments without human annotations. We assess the quality of our synthetic data by using them in both training-based scenarios and training-free in-context learning (ICL). Experiments on SWE-bench, WebArena, OSWorld and Spider2-V spanning across realistic coding, web, and desktop environments show the effectiveness of Learn-by-interact.
arXiv Detail & Related papers (2025-01-18T22:34:41Z)
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models [64.28420991770382]
We present Data-Juicer 2.0, a new system offering fruitful data processing capabilities backed by over a hundred operators. The system is publicly available, actively maintained, and broadly adopted in diverse research endeavors, practical applications, and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
Intelligent Spark Agents: A Modular LangGraph Framework for Scalable, Visualized, and Enhanced Big Data Machine Learning Workflows [1.4582633500696451]
LangGraph framework is designed to enhance machine learning through scalability, visualization, and intelligent process optimization. At its core, the framework introduces Agent AI, a pivotal innovation that leverages Spark's distributed computing capabilities. The framework also incorporates large language models through the LangChain ecosystem, enhancing interaction with unstructured data.
arXiv Detail & Related papers (2024-12-02T13:41:38Z)
Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets. The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method. The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
Process-aware Human Activity Recognition [1.912429179274357]
We propose a novel approach that incorporates process information from context to enhance the HAR performance. Specifically, we align probabilistic events generated by machine learning models with process models derived from contextual information. This alignment adaptively weighs these two sources of information to optimise HAR accuracy.
arXiv Detail & Related papers (2024-11-13T17:53:23Z)
Collaborative Evolving Strategy for Automatic Data-Centric Development [17.962373755266068]
This paper introduces the automatic data-centric development (AD2) task. It outlines its core challenges, which require domain-experts-like task scheduling and implementation capability. We propose an autonomous agent equipped with a strategy named Collaborative Knowledge-STudying-Enhanced Evolution by Retrieval.
arXiv Detail & Related papers (2024-07-26T12:16:47Z)
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations. Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP). ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective. We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
Learning Context-Aware Service Representation for Service Recommendation in Workflow Composition [6.17189383632496]
This paper proposes a novel NLP-inspired approach to recommending services throughout a workflow development process. A workflow composition process is formalized as a step-wise, context-aware service generation procedure. Service embeddings are then learned by applying a deep learning model from the NLP field.
arXiv Detail & Related papers (2022-05-24T04:18:01Z)
SemTUI: a Framework for the Interactive Semantic Enrichment of Tabular Data [0.0]
SemTUI is a framework to make the enrichment process flexible, complete, and effective through the use of semantics. A task-driven user evaluation proved SemTUI to be understandable, usable, and capable of achieving table enrichment with little effort and time.
arXiv Detail & Related papers (2022-03-17T17:14:21Z)
Learning to Continuously Optimize Wireless Resource in a Dynamic Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment. We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes. Our design is based on a novel bilevel optimization formulation which ensures certain "fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
Mining Implicit Entity Preference from User-Item Interaction Data for Knowledge Graph Completion via Adversarial Learning [82.46332224556257]
We propose a novel adversarial learning approach by leveraging user interaction data for the Knowledge Graph Completion task. Our generator is isolated from user interaction data, and serves to improve the performance of the discriminator. To discover implicit entity preference of users, we design an elaborate collaborative learning algorithm based on graph neural networks.
arXiv Detail & Related papers (2020-03-28T05:47:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.