Hierarchical Encoders for Modeling and Interpreting Screenplays
- URL: http://arxiv.org/abs/2004.14532v1
- Date: Thu, 30 Apr 2020 01:15:40 GMT
- Title: Hierarchical Encoders for Modeling and Interpreting Screenplays
- Authors: Gayatri Bhat, Avneesh Saluja, Melody Dye, and Jan Florjanczyk
- Abstract summary: We propose a neural architecture for encoding richly structured texts.
This work specifically tackles screenplays, but we discuss how the underlying approach can be generalized to a range of structured documents.
- Score: 1.4674456578222843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While natural language understanding of long-form documents is still an open
challenge, such documents often contain structural information that can inform
the design of models for encoding them. Movie scripts are an example of such
richly structured text: scripts are segmented into scenes, which are further
decomposed into dialogue and descriptive components. In this work, we propose a
neural architecture for encoding this structure, which performs robustly on a
pair of multi-label tag classification datasets, without the need for
handcrafted features. We add a layer of insight by augmenting the encoder with
an unsupervised "interpretability" module, allowing for the extraction and
visualization of narrative trajectories. Though this work specifically tackles
screenplays, we discuss how the underlying approach can be generalized to a
range of structured documents.
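The hierarchy described in the abstract (a script segmented into scenes, each scene decomposed into dialogue and descriptive components) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy `embed_token` embedding, the `DIM` size, and the use of mean pooling in place of learned encoders are hypothetical and are not taken from the paper, whose actual architecture is not specified in this listing.

```python
from typing import Dict, List

DIM = 4  # toy embedding size (assumption, chosen only for illustration)


def embed_token(token: str) -> List[float]:
    """Deterministic toy embedding; a stand-in for learned word vectors."""
    code = sum(ord(ch) for ch in token.lower())
    return [((code * (i + 1)) % 7) / 7.0 for i in range(DIM)]


def mean_pool(vectors: List[List[float]], dim: int) -> List[float]:
    """Average a list of vectors; return a zero vector for an empty list."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def encode_component(tokens: List[str]) -> List[float]:
    """Lowest level: pool token embeddings into one component vector."""
    return mean_pool([embed_token(t) for t in tokens], DIM)


def encode_scene(scene: Dict[str, List[str]]) -> List[float]:
    """Middle level: encode dialogue and description separately, then concatenate."""
    dialogue = encode_component(scene.get("dialogue", []))
    description = encode_component(scene.get("description", []))
    return dialogue + description


def encode_script(scenes: List[Dict[str, List[str]]]) -> List[float]:
    """Top level: pool scene vectors into a single script vector."""
    return mean_pool([encode_scene(s) for s in scenes], 2 * DIM)


# Hypothetical two-scene script used only to exercise the sketch.
script = [
    {"dialogue": ["where", "are", "we"], "description": ["a", "dark", "room"]},
    {"dialogue": ["run"], "description": ["the", "door", "slams", "shut"]},
]
script_vector = encode_script(script)
print(len(script_vector))  # 8: DIM for dialogue plus DIM for description
```

In a trained model, each pooling step would presumably be replaced by a learned sequence encoder, and the script vector would feed a multi-label tag classifier; the sketch only shows how the document structure shapes the encoder.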
Related papers
- The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation [95.18045807704284]
We introduce an end-to-end agentic framework for dialogue-to-cinematic-video generation.
ScripterAgent is trained to translate coarse dialogue into a fine-grained, executable cinematic script.
Our framework significantly improves script faithfulness and temporal fidelity across all tested video models.
arXiv Detail & Related papers (2026-01-25T08:10:28Z) - Enabling Stroke-Level Structural Analysis of Hieroglyphic Scripts without Language-Specific Priors [13.56721856255538]
Hieroglyphic Stroke Analyzer (HieroSA) is a framework that transforms logographic and ancient hieroglyphs character images into explicit, interpretable line-segment representations.
We show that HieroSA effectively captures character-internal structures and semantics, bypassing the need for language-specific priors.
arXiv Detail & Related papers (2026-01-09T03:30:12Z) - In-Video Instructions: Visual Signals as Generative Control [79.44662698914401]
We investigate whether capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions.
In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories.
Experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions.
arXiv Detail & Related papers (2025-11-24T18:38:45Z) - Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction [23.47150047875133]
Document parsing is essential for converting unstructured and semi-structured documents into machine-readable data.
Document parsing plays an indispensable role in both knowledge base construction and training data generation.
This paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts.
arXiv Detail & Related papers (2024-10-28T16:11:35Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - SceneScript: Reconstructing Scenes With An Autoregressive Structured Language Model [7.707324214953882]
We introduce SceneScript, a method that produces full scene models as a sequence of structured language commands.
Our method infers the set of structured language commands directly from encoded visual data.
Our method gives state-of-the-art results in architectural layout estimation, and competitive results in 3D object detection.
arXiv Detail & Related papers (2024-03-19T18:01:29Z) - From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions [63.11097464396147]
We introduce a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse.
We also introduce an efficient hierarchical segmentation model, MiniSeg, that outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-27T15:59:37Z) - Instruct-SCTG: Guiding Sequential Controlled Text Generation through Instructions [42.67608830386934]
Instruct-SCTG is a sequential framework that harnesses instruction-tuned language models to generate structurally coherent text.
Our framework generates articles in a section-by-section manner, aligned with the desired human structure using natural language instructions.
arXiv Detail & Related papers (2023-12-19T16:20:49Z) - Redundancy-aware Transformer for Video Question Answering [71.98116071679065]
We propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner.
To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in neighboring frames.
As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling, which explicitly differentiates the vision-language interactions.
arXiv Detail & Related papers (2023-08-07T03:16:24Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - DiffuSIA: A Spiral Interaction Architecture for Encoder-Decoder Text Diffusion [40.246665336996934]
A spiral interaction architecture for encoder-decoder text diffusion (DiffuSIA) is proposed.
DiffuSIA is evaluated on four text generation tasks, including paraphrase, text simplification, question generation, and open-domain dialogue generation.
arXiv Detail & Related papers (2023-05-19T08:30:11Z) - Unsupervised Learning of Hierarchical Conversation Structure [50.29889385593043]
Goal-oriented conversations often have meaningful sub-dialogue structure, but it can be highly domain-dependent.
This work introduces an unsupervised approach to learning hierarchical conversation structure, including turn and sub-dialogue segment labels.
The decoded structure is shown to be useful in enhancing neural models of language for three conversation-level understanding tasks.
arXiv Detail & Related papers (2022-05-24T17:52:34Z) - Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes a video semantic representation as input and conditionally modulates the gates and cells of a long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
arXiv Detail & Related papers (2021-12-02T09:24:45Z) - DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents [76.19748112897177]
We present a novel task and approach for document-to-slide generation.
We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner.
Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides.
arXiv Detail & Related papers (2021-01-28T03:21:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.