An Architecture-Led Hybrid Report on Body Language Detection Project
- URL: http://arxiv.org/abs/2512.23028v1
- Date: Sun, 28 Dec 2025 18:03:00 GMT
- Title: An Architecture-Led Hybrid Report on Body Language Detection Project
- Authors: Thomson Tong, Diba Darooneh
- Abstract summary: This report provides an architecture-led analysis of two modern vision-language models (VLMs). It explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates output structure against a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculative internals. Finally, we connect model behavior to system constraints: structured outputs can be syntactically valid yet semantically incorrect, schema validation is structural (it does not check geometric correctness), person identifiers are frame-local in the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON. These distinctions are critical for writing defensible claims, designing robust interfaces, and planning evaluation.
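The gap between structural validity and geometric correctness is easy to make concrete. The sketch below is an assumption-laden illustration, not the project's code: the repository's actual schema is not reproduced in this report, so the schema layout, the field names person_id, bbox, and emotion, and the validate_frame helper are hypothetical. It uses the jsonschema package to show how a VLM response can pass structural validation while still containing a degenerate or out-of-frame bounding box.

```python
# Minimal sketch, not the repository's actual code: the BodyLanguageDetection
# schema is not reproduced in the report, so the field names (person_id, bbox,
# emotion) and the validate_frame helper below are illustrative assumptions.
from jsonschema import ValidationError, validate

# Hypothetical per-frame schema: detected people, each with a frame-local id,
# a pixel-space [x1, y1, x2, y2] bounding box, and a prompt-conditioned attribute.
FRAME_SCHEMA = {
    "type": "object",
    "properties": {
        "people": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "person_id": {"type": "integer"},
                    "bbox": {
                        "type": "array",
                        "items": {"type": "number"},
                        "minItems": 4,
                        "maxItems": 4,
                    },
                    "emotion": {"type": "string"},
                },
                "required": ["person_id", "bbox", "emotion"],
            },
        }
    },
    "required": ["people"],
}


def validate_frame(payload: dict, width: int, height: int) -> list:
    """Structural validation first, then a geometric sanity check the schema cannot express."""
    try:
        validate(instance=payload, schema=FRAME_SCHEMA)
    except ValidationError as err:
        return [f"structural: {err.message}"]

    problems = []
    # A box can parse cleanly yet be degenerate or lie outside the frame:
    # structural validity does not imply geometric correctness.
    for person in payload["people"]:
        x1, y1, x2, y2 = person["bbox"]
        if not (0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height):
            problems.append(
                f"geometric: person {person['person_id']} bbox {person['bbox']} "
                f"is degenerate or outside the {width}x{height} frame"
            )
    return problems


# Syntactically valid output that is geometrically wrong (x1 > x2).
sample = {"people": [{"person_id": 1, "bbox": [500, 80, 200, 400], "emotion": "neutral"}]}
print(validate_frame(sample, width=1280, height=720))
```

Running the example reports a geometric problem even though the payload satisfies the schema, which is exactly the failure mode the abstract flags when it distinguishes structural validation from semantic and geometric correctness.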
Related papers
- Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding [35.429403152845836]
Youtu-Parsing is an efficient and versatile document parsing model designed for high-performance content extraction. The model exhibits strong robustness when handling rare characters, multilingual text, and handwritten content. Youtu-Parsing achieves state-of-the-art (SOTA) performance on the OmniDocBench and olmOCR-bench benchmarks.
arXiv Detail & Related papers (2026-01-28T09:37:13Z)
- PARL: Position-Aware Relation Learning Network for Document Layout Analysis [23.497081928689525]
We argue that effective layout analysis depends not on text-visual fusion, but on a deep understanding of documents' intrinsic visual structure. We propose a novel OCR-free, vision-only framework that models layout through positional sensitivity and relational structure. Experiments show that PARL (65M) is highly efficient, using roughly four times fewer parameters than large multimodal models.
arXiv Detail & Related papers (2026-01-12T15:05:35Z)
- Referring Video Object Segmentation with Cross-Modality Proxy Queries [23.504655272754587]
Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred to by given textual expressions. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism. We propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics.
arXiv Detail & Related papers (2025-11-26T07:45:41Z)
- MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns [80.05126590825121]
MonkeyOCR v1.5 is a unified vision-language framework that enhances both layout understanding and content recognition. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme. Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables.
arXiv Detail & Related papers (2025-11-13T15:12:17Z)
- IUT-Plug: A Plug-in tool for Interleaved Image-Text Generation [23.61167100602915]
IUT-Plug is a module grounded in an Image Understanding Tree (IUT). A dynamic IUT-Plug extraction module parses visual scenes into hierarchical symbolic structures. A coordinated narrative-flow and image synthesis mechanism ensures cross-modal consistency.
arXiv Detail & Related papers (2025-10-13T03:19:45Z)
- EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing [170.71134330650796]
EdiVal-Agent is an object-centric evaluation framework for instruction-based image editing. It is designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. We build EdiVal-Bench, a benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms.
arXiv Detail & Related papers (2025-09-16T17:45:39Z)
- QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder. We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
arXiv Detail & Related papers (2025-04-03T18:47:16Z)
- OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding. Existing solutions often rely on task-specific architectures and objectives for individual tasks. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z)
- Exploiting Contextual Target Attributes for Target Sentiment Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective on leveraging PTLMs for TSC: simultaneously exploiting the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z)
- CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 13 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z)
- VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations [40.721146438291335]
We propose a unified framework VSR for document layout analysis, combining vision, semantics and relations.
On three popular benchmarks, VSR outperforms previous models by large margins.
arXiv Detail & Related papers (2021-05-13T12:20:30Z)