Related papers: Aligning Actions and Walking to LLM-Generated Textual Descriptions

URL: http://arxiv.org/abs/2404.12192v1
Date: Thu, 18 Apr 2024 13:56:03 GMT
Title: Aligning Actions and Walking to LLM-Generated Textual Descriptions
Authors: Radu Chivereanu, Adrian Cosma, Andy Catruna, Razvan Rughinis, Emilian Radoi
Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains. This work explores the use of LLMs to generate rich textual descriptions for motion sequences, encompassing both actions and walking patterns.
Score: 3.1049440318608568
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains, including data augmentation and synthetic data generation. This work explores the use of LLMs to generate rich textual descriptions for motion sequences, encompassing both actions and walking patterns. We leverage the expressive power of LLMs to align motion representations with high-level linguistic cues, addressing two distinct tasks: action recognition and retrieval of walking sequences based on appearance attributes. For action recognition, we employ LLMs to generate textual descriptions of actions in the BABEL-60 dataset, facilitating the alignment of motion sequences with linguistic representations. In the domain of gait analysis, we investigate the impact of appearance attributes on walking patterns by generating textual descriptions of motion sequences from the DenseGait dataset using LLMs. These descriptions capture subtle variations in walking styles influenced by factors such as clothing choices and footwear. Our approach demonstrates the potential of LLMs in augmenting structured motion attributes and aligning multi-modal representations. The findings contribute to the advancement of comprehensive motion understanding and open up new avenues for leveraging LLMs in multi-modal alignment and data augmentation for motion analysis. We make the code publicly available at https://github.com/Radu1999/WalkAndText

Related papers

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding [63.09928907734156]
AlignVLM is a vision-text alignment method that maps visual features to a weighted average of text embeddings. Our experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods.
arXiv Detail & Related papers (2025-02-03T13:34:51Z)
SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation [8.768484848591168]
SceneLLM is a framework that transforms video frames into linguistic signals (scene tokens). Our method achieves state-of-the-art results on the Action Genome (AG) benchmark. Extensive experiments show the effectiveness of SceneLLM in understanding and generating accurate dynamic scene graphs.
arXiv Detail & Related papers (2024-12-15T02:41:31Z)
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs. This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language. This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok). SeTok groups visual features into semantic units via a dynamic clustering algorithm. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z)
Bridging Vision and Language Spaces with Assignment Prediction [47.04855334955006]
VLAP is a novel approach that bridges pretrained vision models and large language models (LLMs). We harness well-established word embeddings to bridge two modality embedding spaces. VLAP achieves substantial improvements over the previous linear transformation-based approaches.
arXiv Detail & Related papers (2024-04-15T10:04:15Z)
LLM Attributor: Interactive Visual Attribution for LLM Generation [29.116016627864095]
LLM Attributor is a Python library that provides interactive visualizations for training data attribution of large language models. Our library offers a new way to quickly attribute an LLM's text generation to training data points.
arXiv Detail & Related papers (2024-04-01T13:16:34Z)
Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references. Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
Large Language Model with Graph Convolution for Recommendation [21.145230388035277]
Text information can sometimes be of low quality, hindering its effectiveness for real-world applications. With knowledge and reasoning capabilities encapsulated in Large Language Models, utilizing LLMs emerges as a promising way for description improvement. We propose a Graph-aware Convolutional LLM method to elicit LLMs to capture high-order relations in the user-item graph.
arXiv Detail & Related papers (2024-02-14T00:04:33Z)
InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
Multimodal large language models (MLLMs) have attracted growing interest. This work delves into enabling LLMs to tackle more vision-language-related tasks. InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z)
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs [101.50522135049198]
BuboGPT is a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language. Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds corresponding masks in the image. Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during interaction with humans.
arXiv Detail & Related papers (2023-07-17T15:51:47Z)
Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks. Our method achieves state-of-the-art results on well-established TAG datasets. Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.