pose-format: Library for Viewing, Augmenting, and Handling .pose Files
- URL: http://arxiv.org/abs/2310.09066v1
- Date: Fri, 13 Oct 2023 12:41:28 GMT
- Title: pose-format: Library for Viewing, Augmenting, and Handling .pose Files
- Authors: Amit Moryossef, Mathias Müller, Rebecka Fahrni
- Abstract summary: This paper presents pose-format, a comprehensive toolkit designed to address pose data challenges.
The library includes a specialized file format that encapsulates various types of pose data, accommodating multiple individuals and an indefinite number of time frames.
pose-format emerges as a one-stop solution, streamlining the complexities of pose data management and analysis.
- Score: 4.606561440859961
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Managing and analyzing pose data is a complex task, with challenges ranging
from handling diverse file structures and data types to facilitating effective
data manipulations such as normalization and augmentation. This paper presents
\texttt{pose-format}, a comprehensive toolkit designed to address these
challenges by providing a unified, flexible, and easy-to-use interface. The
library includes a specialized file format that encapsulates various types of
pose data, accommodating multiple individuals and an indefinite number of time
frames, thus proving its utility for both image and video data. Furthermore, it
offers seamless integration with popular numerical libraries such as NumPy,
PyTorch, and TensorFlow, thereby enabling robust machine-learning applications.
Through benchmarking, we demonstrate that our \texttt{.pose} file format offers
vastly superior performance against prevalent formats like OpenPose, with added
advantages like self-contained pose specification. Additionally, the library
includes features for data normalization, augmentation, and easy-to-use
visualization capabilities, both in Python and Browser environments.
\texttt{pose-format} emerges as a one-stop solution, streamlining the
complexities of pose data management and analysis.
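The normalization and augmentation operations the abstract mentions can be sketched over a plain NumPy tensor shaped the way the format is described: (time frames, people, keypoints, spatial dimensions). The function names below are illustrative stand-ins, not pose-format's actual API.

```python
import numpy as np

# A pose tensor shaped as the abstract describes:
# (time frames, people, keypoints, spatial dimensions).
frames, people, points, dims = 30, 2, 137, 2
pose = np.random.default_rng(0).normal(size=(frames, people, points, dims))

def normalize_by_bone(pose, p1=0, p2=1):
    # Scale all coordinates so the distance between two reference
    # keypoints (e.g. the shoulders) averages 1 over the sequence.
    bone = pose[:, :, p1] - pose[:, :, p2]        # (frames, people, dims)
    scale = np.linalg.norm(bone, axis=-1).mean()
    return pose / scale

def augment_rotate(pose, degrees=10.0):
    # Rotate 2-D keypoints around the origin, a typical augmentation.
    t = np.radians(degrees)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]])
    return pose @ rot.T

normalized = normalize_by_bone(pose)
augmented = augment_rotate(normalized)
```

Because every operation is a whole-tensor NumPy expression, the same pattern transfers directly to the PyTorch and TensorFlow integrations the abstract mentions.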
Related papers
- Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations [39.98860473310998]
ColParse is a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings. Experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains.
arXiv Detail & Related papers (2026-03-02T09:55:00Z) - Text images processing system using artificial intelligence models [0.0]
The device supports a gallery mode, in which users browse files on flash disks, hard disk drives, or microSD cards, and a live mode that renders feeds from cameras connected to it. The system achieved a text recognition rate of about 94.62% when tested over ten hours on the Total-Text dataset.
arXiv Detail & Related papers (2025-12-12T16:15:34Z) - Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing [46.14775667559124]
Document parsing from scanned images remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. We introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, count accuracy, and reading order preservation. We show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities.
arXiv Detail & Related papers (2025-10-17T06:26:59Z) - Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting [20.588630224794976]
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. We present Dolphin, a novel multimodal document image parsing model following an analyze-then-parse paradigm. Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency.
arXiv Detail & Related papers (2025-05-20T08:03:59Z) - QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder.
We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
arXiv Detail & Related papers (2025-04-03T18:47:16Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents [7.358946120326249]
We introduce Éclair, a text-extraction tool specifically designed to process a wide range of document types.
Given an image, Éclair is able to extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes.
Éclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics.
arXiv Detail & Related papers (2025-02-06T17:07:22Z) - Performance Evaluation of Geospatial Images based on Zarr and Tiff [0.0]
This paper evaluates the performance of geospatial image processing using two distinct data storage formats: Zarr and TIFF.
The traditional Tagged Image File Format (TIFF) is widely used because it is simple and broadly compatible, but it can suffer from performance limitations when working with large datasets.
Zarr is a newer format designed for cloud systems that offers scalability and efficient storage through data chunking and compression techniques.
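The chunking idea behind Zarr's efficiency can be illustrated with a small, library-free sketch: a read window only touches the chunks it overlaps, so only those chunks need to be fetched and decompressed. The raster size, chunk size, and window below are arbitrary example values.

```python
import math

def chunks_for_window(shape, chunk, window):
    """Return the (row, col) chunk indices that a 2-D read window touches.

    shape is the full raster size (kept for context; only chunk and
    window determine the result here). window is
    (row_start, row_stop, col_start, col_stop), half-open.
    """
    r0, r1, c0, c1 = window
    rows = range(r0 // chunk[0], math.ceil(r1 / chunk[0]))
    cols = range(c0 // chunk[1], math.ceil(c1 / chunk[1]))
    return [(r, c) for r in rows for c in cols]

# A 10000x10000 raster stored as 1000x1000 chunks (100 chunks total):
touched = chunks_for_window((10000, 10000), (1000, 1000),
                            (2500, 3500, 4500, 5200))
# Only the 4 chunks overlapping the window are read, not the whole file.
```

A monolithic, untiled TIFF has no such index: reading the same window can mean scanning whole scanlines or the entire file, which is the performance gap the benchmark above measures.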
arXiv Detail & Related papers (2024-11-18T05:34:31Z) - Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction [23.47150047875133]
Document parsing is essential for converting unstructured and semi-structured documents into machine-readable data.
Document parsing plays an indispensable role in both knowledge base construction and training data generation.
This paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts.
arXiv Detail & Related papers (2024-10-28T16:11:35Z) - Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
We propose Leopard, an MLLM tailored for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning examples, tailored to text-rich, multi-image scenarios. Second, we proposed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z) - PyTorch-IE: Fast and Reproducible Prototyping for Information Extraction [6.308539010172309]
PyTorch-IE is a framework designed to enable swift, reproducible, and reusable implementations of Information Extraction models.
We propose task modules to decouple the concerns of data representation and model-specific representations.
PyTorch-IE also extends support for widely used libraries such as PyTorch-Lightning for training, HuggingFace datasets for dataset reading, and Hydra for experiment configuration.
arXiv Detail & Related papers (2024-05-16T12:23:37Z) - OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios.
Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks.
In OmniParser, all tasks share the unified encoder-decoder architecture, the unified objective of point-conditioned text generation, and the unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z) - pyvene: A Library for Understanding and Improving PyTorch Models via Interventions [79.72930339711478]
pyvene is an open-source library that supports customizable interventions on a range of different PyTorch modules.
We show how pyvene provides a unified framework for performing interventions on neural models and sharing the intervened-upon models with others.
arXiv Detail & Related papers (2024-03-12T16:46:54Z) - UniSparse: An Intermediate Language for General Sparse Format Customization [13.132033187592349]
We propose UniSparse, an intermediate language that provides a unified abstraction for representing and customizing sparse formats.
Unlike the existing attribute-based frameworks, UniSparse decouples the logical representation of the sparse tensor from its low-level memory layout.
As a result, a rich set of format customizations can be succinctly expressed in a small set of well-defined query, mutation, and layout primitives.
arXiv Detail & Related papers (2024-03-09T05:38:45Z) - COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training [119.03392147066093]
Recent autoregressive vision-language models have excelled in few-shot text generation tasks but face challenges in alignment tasks.
We introduce the contrastive loss into text generation models, partitioning the language model into dedicated unimodal text processing and adept multimodal data handling components.
To bridge this gap, this work introduces VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions.
arXiv Detail & Related papers (2024-01-01T18:58:42Z) - Collection Space Navigator: An Interactive Visualization Interface for Multidimensional Datasets [0.0]
Collection Space Navigator (CSN) is a browser-based visualization tool to explore, research, and curate large collections of visual digital artifacts.
CSN provides a customizable interface that combines two-dimensional projections with a set of multidimensional filters.
Users can reconfigure the interface to fit their own data and research needs, including projections and filter controls.
arXiv Detail & Related papers (2023-05-11T14:03:26Z) - Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z) - Augraphy: A Data Augmentation Library for Document Images [59.457999432618614]
Augraphy is a Python library for constructing data augmentation pipelines.
It provides strategies to produce augmented versions of clean document images that appear to have been altered by standard office operations.
arXiv Detail & Related papers (2022-08-30T22:36:19Z) - DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents [76.19748112897177]
We present a novel task and approach for document-to-slide generation.
We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner.
Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides.
arXiv Detail & Related papers (2021-01-28T03:21:17Z) - GFTE: Graph-based Financial Table Extraction [66.26206038522339]
In financial industry and many other fields, tables are often disclosed in unstructured digital files, e.g. Portable Document Format (PDF) and images.
We publish a standard Chinese dataset named FinTab, which contains more than 1,600 financial tables of diverse kinds.
We propose a novel graph-based convolutional network model named GFTE as a baseline for future comparison.
arXiv Detail & Related papers (2020-03-17T07:10:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.