Text2Traj2Text: Learning-by-Synthesis Framework for Contextual Captioning of Human Movement Trajectories
- URL: http://arxiv.org/abs/2409.12670v1
- Date: Thu, 19 Sep 2024 11:30:09 GMT
- Title: Text2Traj2Text: Learning-by-Synthesis Framework for Contextual Captioning of Human Movement Trajectories
- Authors: Hikaru Asano, Ryo Yonetani, Taiki Sekii, Hiroki Ouchi,
- Abstract summary: This paper presents Text2Traj2Text, a novel learning-by-caption framework for captioning possible contexts behind shopper's trajectory data in retail stores.
Our work will impact various retail applications that need better customer understanding, such as targeted advertising and inventory management.
- Score: 11.303926633163117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents Text2Traj2Text, a novel learning-by-synthesis framework for captioning possible contexts behind shopper's trajectory data in retail stores. Our work will impact various retail applications that need better customer understanding, such as targeted advertising and inventory management. The key idea is leveraging large language models to synthesize a diverse and realistic collection of contextual captions as well as the corresponding movement trajectories on a store map. Despite learned from fully synthesized data, the captioning model can generalize well to trajectories/captions created by real human subjects. Our systematic evaluation confirmed the effectiveness of the proposed framework over competitive approaches in terms of ROUGE and BERT Score metrics.
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z) - Instruction Data Generation and Unsupervised Adaptation for Speech Language Models [21.56355461403427]
We propose three methods for generating synthetic samples to train and evaluate multimodal large language models.
Synthetic data generation emerges as a crucial strategy to enhance the performance of such systems.
We highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with available transcriptions.
arXiv Detail & Related papers (2024-06-18T08:27:00Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - Context Unlocks Emotions: Text-based Emotion Classification Dataset
Auditing with Large Language Models [23.670143829183104]
The lack of contextual information in text data can make the annotation process of text-based emotion classification datasets challenging.
We propose a formal definition of textual context to motivate a prompting strategy to enhance such contextual information.
Our method improves alignment between inputs and their human-annotated labels from both an empirical and human-evaluated standpoint.
arXiv Detail & Related papers (2023-11-06T21:34:49Z) - Enhancing Semantic Understanding with Self-supervised Methods for
Abstractive Dialogue Summarization [4.226093500082746]
We introduce self-supervised methods to compensate shortcomings to train a dialogue summarization model.
Our principle is to detect incoherent information flows using pretext dialogue text to enhance BERT's ability to contextualize the dialogue text representations.
arXiv Detail & Related papers (2022-09-01T07:51:46Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image framework (CRIS)
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Generating More Pertinent Captions by Leveraging Semantics and Style on
Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z) - Neuro-Symbolic Representations for Video Captioning: A Case for
Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z) - Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.