Related papers: VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models

Related papers

Towards Universal Modal Tracking with Online Dense Temporal Token Learning [66.83607018706519]
We propose a universal video-level modality-awareness tracking model with online dense temporal token learning.<n>We expand the model's inputs to a video sequence level, aiming to see a richer video context from a near-global perspective.
arXiv Detail & Related papers (2025-07-27T08:47:42Z)
R1-Track: Direct Application of MLLMs to Visual Object Tracking via Reinforcement Learning [5.59181512260003]
Single object tracking aims to continuously localize and estimate the scale of a target in subsequent video frames.<n>Qwen2.5-VL struggles with template matching between image pairs.<n>Inspired by deep-seek-R1, we fine-tuned Qwen2.5-VL using the group relative policy optimization (GRPO) reinforcement learning method.<n>The resulting model, R1-Track, achieved notable performance on the GOT-10k benchmark.
arXiv Detail & Related papers (2025-06-27T07:41:15Z)
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning [77.34267241692706]
Vision-Language Navigation (VLN) is a core challenge in embodied AI, requiring agents to navigate real-world environments using natural language instructions.<n>We propose VLN-R1, an end-to-end framework that leverages Large Vision-Language Models (LVLM) to directly translate egocentric video streams into continuous navigation actions.
arXiv Detail & Related papers (2025-06-20T17:59:59Z)
Evaluating Cell Type Inference in Vision Language Models Under Varying Visual Context [0.16385815610837165]
Vision-Language Models (VLMs) have rapidly advanced alongside Large Language Models (LLMs)<n>This study evaluates the capabilities of prominent generative VLMs, such as GPT-4.1 and Gemini 2.5 Pro, for histopathology image classification tasks.
arXiv Detail & Related papers (2025-06-15T01:50:16Z)
Renaissance: Investigating the Pretraining of Vision-Language Encoders [0.6445605125467574]
We seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis. In our first set of experiments, we show that we can save significant compute at no cost to downstream performance, by freezing large parts of vision-language models during pretraining. In our second set of experiments we examine the effect of basing a VL transformer on a vision model versus a text model.
arXiv Detail & Related papers (2024-11-11T01:44:54Z)
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution [82.38677987249348]
We present the Qwen2-VL Series, which redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos.
arXiv Detail & Related papers (2024-09-18T17:59:32Z)
When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective [57.05315507519704]
We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing. Our measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%.
arXiv Detail & Related papers (2024-09-03T12:03:45Z)
Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration [0.40964539027092917]
This study aims to bridge the gap by conducting experiments on the Vietnamese Visual Question Answering dataset. We have developed a model that enhances image representation capabilities, thereby improving overall performance in the ViVQA system. Our experimental findings demonstrate that our model surpasses competing baselines, achieving promising performance.
arXiv Detail & Related papers (2024-07-30T22:32:50Z)
Pretrained Visual Representations in Reinforcement Learning [0.0]
This paper compares the performance of visual reinforcement learning algorithms that train a convolutional neural network (CNN) from scratch with those that utilize pre-trained visual representations (PVRs) We evaluate the Dormant Ratio Minimization (DRM) algorithm, a state-of-the-art visual RL method, against three PVRs: ResNet18, DINOv2, and Visual Cortex (VC)
arXiv Detail & Related papers (2024-07-24T12:53:26Z)
RWKV-CLIP: A Robust Vision-Language Representation Learner [31.501759213619646]
Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks. We introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. We propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs.
arXiv Detail & Related papers (2024-06-11T06:10:46Z)
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources. This paper investigates a straight-forward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z)
Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image. We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence [36.97507697713224]
We present Eagle (RWKV-5) and Finch (RWKV-6), sequence models improving upon the RWKV (RWKV-4) architecture. Our architectural design advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism. We introduce a new multilingual corpus with 1.12 trillion tokens and a fast tokenizer based on greedy matching for enhanced multilinguality.
arXiv Detail & Related papers (2024-04-08T22:20:59Z)
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs. We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors. We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval. Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
The Neglected Tails in Vision-Language Models [51.79913798808725]
We show that vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. We propose REtrieval-Augmented Learning (REAL) to mitigate the imbalanced performance of zero-shot VLMs.
arXiv Detail & Related papers (2024-01-23T01:25:00Z)
Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design more effective vision-language representation for tracking. Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos. We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, where the cores are the proposed asymmetric architecture search and modality mixer (ModaMixer)
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD) Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning. The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
Few Shot Activity Recognition Using Variational Inference [9.371378627575883]
We propose a novel variational inference based architectural framework (HF-AR) for few shot activity recognition. Our framework leverages volume-preserving Householder Flow to learn a flexible posterior distribution of the novel classes. This results in better performance as compared to state-of-the-art few shot approaches for human activity recognition.
arXiv Detail & Related papers (2021-08-20T03:57:58Z)
WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP that adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples in limited GPU resources.
arXiv Detail & Related papers (2021-03-11T09:39:49Z)
Rethinking Recurrent Neural Networks and Other Improvements for Image Classification [1.5990720051907859]
We propose integrating an RNN as an additional layer when designing image recognition models. We also develop end-to-end multimodel ensembles that produce expert predictions using several models. Our model sets a new record on the Surrey dataset.
arXiv Detail & Related papers (2020-07-30T00:40:50Z)
A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning [32.59760685342343]
Probabilistic Latent Variable Models provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. In this work, we propose ConvDMM, a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. When trained on a large scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extracting methods.
arXiv Detail & Related papers (2020-06-03T21:50:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.