A Case for Computing on Unstructured Data
- URL: http://arxiv.org/abs/2509.14601v1
- Date: Thu, 18 Sep 2025 04:24:41 GMT
- Title: A Case for Computing on Unstructured Data
- Authors: Mushtari Sadia, Amrita Roy Chowdhury, Ang Chen
- Abstract summary: We argue for a new paradigm, which we call computing on unstructured data, built around three stages: extraction of latent structure, transformation of this structure through data processing techniques, and projection back into unstructured formats. This bi-directional pipeline allows unstructured data to benefit from the analytical power of structured computation, while preserving the richness and accessibility of unstructured representations for human and AI consumption.
- Score: 6.425984481490725
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unstructured data, such as text, images, audio, and video, comprises the vast majority of the world's information, yet it remains poorly supported by traditional data systems that rely on structured formats for computation. We argue for a new paradigm, which we call computing on unstructured data, built around three stages: extraction of latent structure, transformation of this structure through data processing techniques, and projection back into unstructured formats. This bi-directional pipeline allows unstructured data to benefit from the analytical power of structured computation, while preserving the richness and accessibility of unstructured representations for human and AI consumption. We illustrate this paradigm through two use cases and present the research components that need to be developed in a new data system called MXFlow.
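The three stages named in the abstract (extraction, transformation, projection) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the expense notes, the function names `extract`, `transform`, `project`, and `run`, and the use of a regular expression as a stand-in for whatever learned extractor a real system would use. It is not MXFlow's design or API, which the abstract does not specify.

```python
import re
from collections import defaultdict

# Hypothetical unstructured input: free-text expense notes.
NOTES = [
    "Paid $12.50 for lunch on 2025-09-01.",
    "Taxi to the airport cost $40.00 on 2025-09-02.",
    "Paid $7.25 for lunch on 2025-09-03.",
]

# Assumption: a regex stands in for a model-based structure extractor.
AMOUNT = re.compile(r"\$(\d+(?:\.\d+)?)")


def extract(note):
    """Stage 1: pull a structured record out of one free-text note."""
    match = AMOUNT.search(note)
    if match is None:
        return None  # no recoverable structure in this note
    category = "lunch" if "lunch" in note.lower() else "other"
    return {"category": category, "amount": float(match.group(1))}


def transform(records):
    """Stage 2: ordinary structured computation (a group-by sum)."""
    totals = defaultdict(float)
    for record in records:
        totals[record["category"]] += record["amount"]
    return dict(sorted(totals.items()))


def project(totals):
    """Stage 3: render the structured result back into plain text."""
    parts = [f"{category} ${total:.2f}" for category, total in totals.items()]
    return "Spending summary: " + "; ".join(parts) + "."


def run(notes):
    """Chain the three stages, skipping notes with no extractable record."""
    records = [r for r in map(extract, notes) if r is not None]
    return project(transform(records))


if __name__ == "__main__":
    print(run(NOTES))
```

The point of the sketch is only the shape of the pipeline: text goes in, a structured intermediate form carries the computation, and text comes out.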
Related papers
- OmniStruct: Universal Text-to-Structure Generation across Diverse Schemas [57.49565459553627]
We introduce OmniStruct, a benchmark for assessing Large Language Models' capabilities on text-to-structure tasks.
We collect high-quality training data via synthetic task generation to facilitate the development of efficient text-to-structure models.
Our experiments demonstrate the possibility of fine-tuning much smaller models on synthetic data into universal structured generation models.
arXiv Detail & Related papers (2025-11-23T08:18:12Z)
- From Chaos to Automation: Enabling the Use of Unstructured Data for Robotic Process Automation [0.6144680854063939]
The UNstructured Document REtrieval SyStem (UNDRESS) is a system that uses fuzzy regular expressions, techniques for natural language processing, and large language models to enable RPA platforms to effectively retrieve information from unstructured documents.
The results demonstrate the effectiveness of UNDRESS in enhancing RPA capabilities for unstructured data, providing a significant advancement in the field.
arXiv Detail & Related papers (2025-07-15T14:32:49Z)
- A Unifying Framework for Robust and Efficient Inference with Unstructured Data [2.07180164747172]
This paper presents a general framework for conducting efficient inference on parameters derived from unstructured data.
We formalize this approach with MAR-S, a framework that unifies and extends existing methods for debiased inference.
Within this framework, we develop robust and efficient estimators for both descriptive and causal estimands.
arXiv Detail & Related papers (2025-05-01T04:11:25Z)
- The Effectiveness of Large Language Models in Transforming Unstructured Text to Standardized Formats [0.0]
This study systematically evaluates Large Language Models' ability to convert unstructured text into structured formats.
Experiments reveal that GPT-4o with few-shot prompting achieves breakthrough performance.
These findings open new possibilities for automated structured data generation across various domains.
arXiv Detail & Related papers (2025-03-04T14:14:28Z)
- Unifying Structured Data as Graph for Data-to-Text Pre-Training [69.96195162337793]
Data-to-text (D2T) generation aims to transform structured data into natural language text.
Data-to-text pre-training has proved to be powerful in enhancing D2T generation.
We propose a structure-enhanced pre-training method for D2T generation by designing a structure-enhanced Transformer.
arXiv Detail & Related papers (2024-01-02T12:23:49Z)
- StructRe: Rewriting for Structured Shape Modeling [60.20359722058389]
We present StructRe, a structure rewriting system, as a novel approach to structured shape modeling.
Given a 3D object represented by points and components, StructRe can rewrite it upward into more concise structures, or downward into more detailed structures.
arXiv Detail & Related papers (2023-11-29T10:35:00Z)
- Cross Modal Data Discovery over Structured and Unstructured Data Lakes [5.270224494298927]
Organizations are collecting increasingly large amounts of data for data-driven decision making.
These data are often dumped into a centralized repository, consisting of thousands of structured and unstructured datasets.
Perversely, such a mixture of datasets makes the problem of discovering elements relevant to a user's query or an analytical task very challenging.
arXiv Detail & Related papers (2023-06-01T17:34:42Z)
- StructGPT: A General Framework for Large Language Model to Reason over Structured Data [117.13986738340027]
We develop an Iterative Reading-then-Reasoning (IRR) approach for solving question answering tasks based on structured data.
Our approach can significantly boost the performance of ChatGPT and achieve comparable performance against the full-data supervised-tuning baselines.
arXiv Detail & Related papers (2023-05-16T17:45:23Z)
- Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
TP-Transformer augments the traditional Transformer architecture to include an additional component to represent structure.
The second method imbues structure at the data level by segmenting the data with morphological tokenization.
We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)
- CateCom: a practical data-centric approach to categorization of computational models [77.34726150561087]
We present an effort aimed at organizing the landscape of physics-based and data-driven computational models.
We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework.
arXiv Detail & Related papers (2021-09-28T02:59:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.