Related papers: SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization

URL: http://arxiv.org/abs/2410.21411v1
Date: Mon, 28 Oct 2024 18:10:26 GMT
Title: SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
Authors: Wanhua Li, Zibin Meng, Jiawei Zhou, Donglai Wei, Chuang Gan, Hanspeter Pfister,
Abstract summary: Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. We first present a simple yet well-crafted framework named name, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework.
Score: 70.11167263638562
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images. While current methods adopt the paradigm of training a dedicated network end-to-end using labeled image data, they are limited in terms of generalizability and interpretability. To address these issues, we first present a simple yet well-crafted framework named {\name}, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework, providing a strong baseline for social relation recognition. Specifically, we instruct VFMs to translate image content into a textual social story, and then utilize LLMs for text-based reasoning. {\name} introduces systematic design principles to adapt VFMs and LLMs separately and bridge their gaps. Without additional model training, it achieves competitive zero-shot results on two databases while offering interpretable answers, as LLMs can generate language-based explanations for the decisions. The manual prompt design process for LLMs at the reasoning phase is tedious and an automated prompt optimization method is desired. As we essentially convert a visual classification task into a generative task of LLMs, automatic prompt optimization encounters a unique long prompt optimization issue. To address this issue, we further propose the Greedy Segment Prompt Optimization (GSPO), which performs a greedy search by utilizing gradient information at the segment level. Experimental results show that GSPO significantly improves performance, and our method also generalizes to different image styles. The code is available at https://github.com/Mengzibin/SocialGPT.

Related papers

MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval [50.062817677022586]
Zero-Shot Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens.<n>We propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI) to construct two complementary training tasks using only unlabeled images.
arXiv Detail & Related papers (2025-05-26T08:56:59Z)
Decoupled Visual Interpretation and Linguistic Reasoning for Math Problem Solving [57.22004912994658]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs)<n>This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z)
Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search [35.20525123189316]
Session search involves a series of interactive queries and actions to fulfill user's complex information need.<n>Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions.<n>We propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches.
arXiv Detail & Related papers (2025-05-20T10:05:06Z)
Rethinking Graph Structure Learning in the Era of LLMs [29.867262599990227]
Large Language and Tree Assistant (LLaTA) leverages tree-based LLM in-context learning to enhance the understanding of topology and text. Extensive experiments on 10 datasets demonstrate that LLaTA enjoys flexibility-incorporated with any backbone.
arXiv Detail & Related papers (2025-03-27T07:28:30Z)
VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making [21.61801132083334]
VIPER is a novel framework for multimodal instruction-based planning. It integrates VLM-based perception with LLM-based reasoning. We show that VIPER significantly outperforms state-of-the-art visual instruction-based planners.
arXiv Detail & Related papers (2025-03-19T11:05:42Z)
DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation [2.7624021966289605]
Few-shot semantic segmentation (FSS) aims to enable models to segment novel/unseen object classes using only a limited number of labeled examples. We propose a novel framework that utilizes large language models (LLMs) to adapt general class semantic information to the query image. Our framework achieves state-of-the-art performance-by a significant margin-demonstrating superior generalization to novel classes and robustness across diverse scenarios.
arXiv Detail & Related papers (2025-03-06T01:42:28Z)
Training Large Recommendation Models via Graph-Language Token Alignment [53.3142545812349]
We propose a novel framework to train Large Recommendation models via Graph-Language Token Alignment. By aligning item and user nodes from the interaction graph with pretrained LLM tokens, GLTA effectively leverages the reasoning abilities of LLMs. Furthermore, we introduce Graph-Language Logits Matching (GLLM) to optimize token alignment for end-to-end item prediction.
arXiv Detail & Related papers (2025-02-26T02:19:10Z)
Filter-then-Generate: Large Language Models with Structure-Text Adapter for Knowledge Graph Completion [20.973071287301067]
Large Language Models (LLMs) present massive inherent knowledge and superior semantic comprehension capability. Empirical evidence suggests that LLMs consistently perform worse than conventional knowledge graph completion approaches. We propose a novel instruction-tuning-based method, namely FtG, to address these challenges.
arXiv Detail & Related papers (2024-12-12T09:22:04Z)
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs. This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL) GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval. Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond [99.73306923465424]
We introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images. By memorizing images in MLLMs, we introduce a new paradigm to cross-modal retrieval, distinct from previous discriminative approaches.
arXiv Detail & Related papers (2024-02-16T16:31:46Z)
Large Language Model with Graph Convolution for Recommendation [21.145230388035277]
Text information can sometimes be of low quality, hindering its effectiveness for real-world applications. With knowledge and reasoning capabilities capsuled in Large Language Models, utilizing LLMs emerges as a promising way for description improvement. We propose a Graph-aware Convolutional LLM method to elicit LLMs to capture high-order relations in the user-item graph.
arXiv Detail & Related papers (2024-02-14T00:04:33Z)
CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted this way. Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions [13.546494268784757]
We propose a framework that integrates large language models (LLMs) and vision-language models (VLMs) to find the optimal class descriptors. Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors.
arXiv Detail & Related papers (2023-11-20T16:37:45Z)
Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions [15.262736501208467]
Large Language Models (LLMs) demonstrate impressive reasoning ability and the maintenance of world knowledge. As images are invisible to LLMs, researchers convert images to text to engage LLMs into the visual question reasoning procedure. We design a framework that enables LLMs to proactively ask relevant questions to unveil more details in the image.
arXiv Detail & Related papers (2023-11-20T08:23:39Z)
InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
multimodal large language models (MLLMs) have attracted growing interest. This work delves into enabling LLMs to tackle more vision-language-related tasks. InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z)
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks. This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.