Text2Traj2Text: Learning-by-Synthesis Framework for Contextual Captioning of Human Movement Trajectories
- URL: http://arxiv.org/abs/2409.12670v1
- Date: Thu, 19 Sep 2024 11:30:09 GMT
- Title: Text2Traj2Text: Learning-by-Synthesis Framework for Contextual Captioning of Human Movement Trajectories
- Authors: Hikaru Asano, Ryo Yonetani, Taiki Sekii, Hiroki Ouchi
- Abstract summary: This paper presents Text2Traj2Text, a novel learning-by-synthesis framework for captioning possible contexts behind shoppers' trajectory data in retail stores.
Our work will impact various retail applications that need better customer understanding, such as targeted advertising and inventory management.
- Score: 11.303926633163117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Text2Traj2Text, a novel learning-by-synthesis framework for captioning possible contexts behind shoppers' trajectory data in retail stores. Our work will impact various retail applications that need better customer understanding, such as targeted advertising and inventory management. The key idea is to leverage large language models to synthesize a diverse and realistic collection of contextual captions as well as the corresponding movement trajectories on a store map. Despite being trained on fully synthesized data, the captioning model generalizes well to trajectories and captions created by real human subjects. Our systematic evaluation confirmed the effectiveness of the proposed framework over competitive approaches in terms of ROUGE and BERTScore metrics.
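The abstract names ROUGE and BERTScore as the evaluation metrics. As a minimal sketch of that final step, the snippet below scores hypothetical captions against references using the rouge-score and bert-score Python packages; the paper does not state which implementations it used.

```python
# Minimal sketch (not the paper's code): score generated trajectory captions
# against references with ROUGE and BERTScore. Package choices and example
# captions are assumptions.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

references = ["The shopper browsed the snack aisle before heading to the checkout."]
predictions = ["The customer looked around the snack shelves and then checked out."]

# ROUGE-1 / ROUGE-L F-measures, averaged over the evaluation set.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = [scorer.score(ref, pred) for ref, pred in zip(references, predictions)]
print("ROUGE-L F1:", sum(s["rougeL"].fmeasure for s in rouge) / len(rouge))

# BERTScore compares candidate and reference tokens in contextual embedding space.
P, R, F1 = bert_score(predictions, references, lang="en")
print("BERTScore F1:", F1.mean().item())
```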
Related papers
- RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm [34.02250139766494]
A substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning.
We establish a Real-World Data Extraction pipeline to extract high-quality images and texts.
Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts.
We construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales.
arXiv Detail & Related papers (2025-02-18T03:58:38Z)
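As a minimal illustration of the retrieval step described in the RealSyn entry above, the sketch below ranks candidate texts for each image by cosine similarity over precomputed embeddings; the embeddings are random placeholders, and the paper's hierarchical method is more involved.

```python
# Minimal sketch of image-to-text retrieval by cosine similarity over
# precomputed embeddings. The embeddings below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
image_embs = rng.normal(size=(4, 512))   # one row per extracted image
text_embs = rng.normal(size=(100, 512))  # one row per candidate realistic text

def top_k_texts(image_emb: np.ndarray, texts: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k texts most similar to the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = texts / np.linalg.norm(texts, axis=1, keepdims=True)
    sims = txt @ img
    return np.argsort(-sims)[:k]

for i, emb in enumerate(image_embs):
    print(f"image {i} -> texts {top_k_texts(emb, text_embs).tolist()}")
```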
- SE-GCL: An Event-Based Simple and Effective Graph Contrastive Learning for Text Representation [23.60337935010744]
We present an event-based, simple, and effective graph contrastive learning method (SE-GCL) for text representation.
Specifically, we extract event blocks from text and construct internal relation graphs to represent inter-semantic interconnections.
In particular, we introduce the concept of an event skeleton for core representation semantics and simplify the typically complex data augmentation techniques.
arXiv Detail & Related papers (2024-12-16T10:53:24Z)
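The SE-GCL entry above centers on a contrastive objective over text-derived graphs. The sketch below shows only the generic NT-Xent loss such methods typically optimize between two views; the event-block and skeleton construction is not reproduced, and the details here are assumptions.

```python
# Minimal sketch of the NT-Xent contrastive objective commonly used in graph
# contrastive learning: matching rows of z1 and z2 are positive pairs.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two views of the same items."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, D)
    sim = z @ z.T / temperature                                  # (2N, 2N)
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))                   # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

print(nt_xent(torch.randn(8, 64), torch.randn(8, 64)).item())
```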
- Multimodal Remote Sensing Scene Classification Using VLMs and Dual-Cross Attention Networks [0.8999666725996978]
We propose a novel remote sensing scene classification (RSSC) framework that integrates text descriptions generated by large vision-language models (VLMs) as an auxiliary modality without incurring expensive manual annotation costs.
Experiments with both quantitative and qualitative evaluation across five RSSC datasets demonstrate that our framework consistently outperforms baseline models.
arXiv Detail & Related papers (2024-12-03T16:24:16Z)
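As a rough illustration of the dual-cross attention fusion mentioned in the remote sensing entry above, the sketch below lets image tokens attend to VLM-generated text tokens and vice versa before pooling for classification; the dimensions, pooling, and head layout are hypothetical.

```python
# Minimal sketch of a dual cross-attention fusion block; the paper's exact
# architecture may differ.
import torch
import torch.nn as nn

class DualCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, num_classes: int = 10):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # Image tokens query the text description, and text tokens query the image.
        img_ctx, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
        txt_ctx, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
        fused = torch.cat([img_ctx.mean(dim=1), txt_ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)

model = DualCrossAttentionFusion()
logits = model(torch.randn(2, 49, 256), torch.randn(2, 20, 256))
print(logits.shape)  # torch.Size([2, 10])
```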
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
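The hyperbolic entry above builds on embeddings in the Poincare ball. The sketch below only computes the standard Poincare distance that such models rely on; the compositional entailment objective itself is not reproduced, and the example points are hypothetical.

```python
# Minimal sketch of the Poincare-ball distance underlying hyperbolic
# embedding methods; illustrative only.
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray, eps: float = 1e-9) -> float:
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / max(denom, eps)))

# Points nearer the boundary are exponentially "far" from the origin,
# which is what lets hierarchies embed with low distortion.
origin = np.zeros(2)
general = np.array([0.10, 0.0])   # e.g. a generic caption embedding
specific = np.array([0.95, 0.0])  # e.g. a more specific image embedding
print(poincare_distance(origin, general), poincare_distance(origin, specific))
```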
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Enhancing Semantic Understanding with Self-supervised Methods for Abstractive Dialogue Summarization [4.226093500082746]
We introduce self-supervised methods to compensate for shortcomings in training a dialogue summarization model.
Our principle is to detect incoherent information flows in pretext dialogue text to enhance BERT's ability to contextualize dialogue representations.
arXiv Detail & Related papers (2022-09-01T07:51:46Z)
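As a minimal illustration of the pretext task described in the dialogue summarization entry above, the sketch below corrupts a dialogue by swapping two utterances and labels original versus corrupted orderings; the paper's exact corruption strategies may differ.

```python
# Minimal sketch of a coherence-detection pretext task: 1 = original order,
# 0 = corrupted order. The example dialogue is hypothetical.
import random

def make_pretext_pair(dialogue: list[str], seed: int = 0) -> list[tuple[list[str], int]]:
    """Return (dialogue, label) pairs for training a coherence classifier."""
    rng = random.Random(seed)
    corrupted = dialogue.copy()
    i, j = rng.sample(range(len(dialogue)), 2)
    corrupted[i], corrupted[j] = corrupted[j], corrupted[i]
    return [(dialogue, 1), (corrupted, 0)]

dialogue = [
    "A: Did you finish the report?",
    "B: Almost, I just need the Q3 numbers.",
    "A: I'll send them after lunch.",
    "B: Great, then I can submit it today.",
]
for text, label in make_pretext_pair(dialogue):
    print(label, text)
```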
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS uses vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the previous state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
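As a simplified illustration of the text-to-pixel alignment used by CRIS above, the sketch below pushes up the similarity between a sentence embedding and pixel features inside the ground-truth mask and down elsewhere; the binary cross-entropy stand-in is an assumption, not the paper's exact loss.

```python
# Minimal sketch of a text-to-pixel alignment objective over random features.
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats, text_feat, mask):
    """pixel_feats: (B, C, H, W); text_feat: (B, C); mask: (B, H, W) in {0, 1}."""
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feat = F.normalize(text_feat, dim=1)
    # Per-pixel similarity to the sentence embedding, supervised by the mask.
    logits = torch.einsum("bchw,bc->bhw", pixel_feats, text_feat)
    return F.binary_cross_entropy_with_logits(logits, mask.float())

loss = text_to_pixel_loss(torch.randn(2, 64, 16, 16), torch.randn(2, 64),
                          torch.randint(0, 2, (2, 16, 16)))
print(loss.item())
```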
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
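As a toy illustration of the style conditioning described in the multi-source captioning entry above, the sketch below builds a decoder prefix from a style token and retrieved keywords; the token names and the keyword retriever are hypothetical placeholders.

```python
# Minimal sketch of conditioning a captioner on style plus retrieved keywords:
# the decoder input is prefixed with a style token and keyword tokens, so the
# model can separate "what to describe" from "how to phrase it".
def build_decoder_prefix(style: str, keywords: list[str]) -> str:
    """Compose the conditioning prefix prepended to the caption during training."""
    style_token = f"<style:{style}>"            # e.g. <style:descriptive>, <style:web>
    keyword_tokens = " ".join(f"<kw:{k}>" for k in keywords)
    return f"{style_token} {keyword_tokens} <caption>"

# Keywords would come from a retrieval component over the training corpora.
print(build_decoder_prefix("descriptive", ["dog", "frisbee", "park"]))
```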
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
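As a minimal illustration related to the BERTSum entry above, the sketch below runs an off-the-shelf abstractive summarizer over a short instructional transcript; the paper fine-tunes BERTSum on narrated-video data, and the model named here is only a commonly used stand-in.

```python
# Minimal sketch: abstractive summary of a spoken-instruction transcript with
# an off-the-shelf summarizer (a stand-in, not the paper's fine-tuned BERTSum).
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

transcript = (
    "Today I'll show you how to change a bike tire. First, release the brake "
    "and flip the bike over. Use a tire lever to pry the tire off the rim, "
    "pull out the old tube, check the tire for sharp objects, then seat the "
    "new tube and inflate it to the pressure printed on the sidewall."
)

summary = summarizer(transcript, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```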
This list is automatically generated from the titles and abstracts of the papers listed on this site.