LAVIS: A Library for Language-Vision Intelligence
- URL: http://arxiv.org/abs/2209.09019v1
- Date: Thu, 15 Sep 2022 18:04:10 GMT
- Title: LAVIS: A Library for Language-Vision Intelligence
- Authors: Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, Steven
C.H. Hoi
- Abstract summary: LAVIS is an open-source library for LAnguage-VISion research and applications.
It features a unified interface for easy access to state-of-the-art image-language and video-language models and common datasets.
- Score: 98.88477610704938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce LAVIS, an open-source deep learning library for LAnguage-VISion
research and applications. LAVIS aims to serve as a one-stop, comprehensive
library that makes recent advances in the language-vision field accessible
to researchers and practitioners, and that fosters future research and
development. It features a unified interface to easily access state-of-the-art
image-language and video-language models as well as common datasets. LAVIS supports
training, evaluation and benchmarking on a rich variety of tasks, including
multimodal classification, retrieval, captioning, visual question answering,
dialogue and pre-training. At the same time, the library is highly
extensible and configurable, facilitating future development and customization.
In this technical report, we describe design principles, key components and
functionalities of the library, and also present benchmarking results across
common language-vision tasks. The library is available at:
https://github.com/salesforce/LAVIS.
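As a concrete illustration of the unified interface described in the abstract, the following captioning sketch follows the usage shown in the LAVIS repository's README; the model name ("blip_caption" with the "base_coco" checkpoint) and the image path are placeholders and may differ across library versions.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a pre-trained BLIP captioning model together with its matching
# image preprocessors through LAVIS's unified loader.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

# Preprocess a raw image and generate a caption ("example.jpg" is a placeholder path).
raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))
```

The README also documents a symmetric dataset loader (lavis.datasets.builders.load_dataset), so models and datasets can be swapped by name when benchmarking across tasks.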
Related papers
- A Library Perspective on Supervised Text Processing in Digital Libraries: An Investigation in the Biomedical Domain [3.9519587827662397]
We focus on relation extraction and text classification, using eight biomedical benchmarks as a showcase.
We consider trade-offs between accuracy and application costs, dive into training-data generation through distant supervision and large language models such as ChatGPT, Llama, and OLMo, and discuss how to design final pipelines; a small distant-supervision sketch follows this entry.
arXiv Detail & Related papers (2024-11-06T07:54:10Z)
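The distant supervision mentioned in the entry above can be illustrated with a tiny, self-contained sketch: sentences that mention an entity pair already linked in a knowledge base inherit that relation as a noisy label. The knowledge-base triples and sentences below are invented for illustration and are not from the paper.

```python
# Distant supervision in miniature: a sentence mentioning an entity pair
# already related in a knowledge base inherits that relation as a noisy label.
# The KB triples and sentences are illustrative placeholders.
KB = {
    ("aspirin", "pain"): "treats",
    ("metformin", "type 2 diabetes"): "treats",
}

sentences = [
    "Aspirin is commonly prescribed for mild pain.",
    "Metformin intake was recorded for the whole cohort.",
]

def distant_labels(sentence):
    """Return (head, tail, relation) triples heuristically matched in a sentence."""
    text = sentence.lower()
    return [
        (head, tail, relation)
        for (head, tail), relation in KB.items()
        if head in text and tail in text
    ]

for s in sentences:
    print(s, "->", distant_labels(s) or "no_relation")
```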
- Visual Navigation of Digital Libraries: Retrieval and Classification of Images in the National Library of Norway's Digitised Book Collection [0.3277163122167433]
We present a proof-of-concept image search application for exploring images in the National Library of Norway's pre-1900 books.
We compare Vision Transformer (ViT), Contrastive Language-Image Pre-training (CLIP), and Sigmoid loss for Language-Image Pre-training (SigLIP) embeddings for image retrieval and classification; a minimal CLIP retrieval sketch follows this entry.
arXiv Detail & Related papers (2024-10-19T04:20:23Z)
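As a rough illustration of the embedding-based retrieval compared in the entry above, the sketch below scores a text query against a handful of images with a pre-trained CLIP checkpoint from Hugging Face transformers. The checkpoint name and image paths are placeholders, and this is a generic sketch rather than the paper's own pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint; the paper also compares ViT and SigLIP embeddings.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder scanned-page images and a free-text query.
images = [Image.open(p).convert("RGB") for p in ["page_001.jpg", "page_002.jpg"]]
query = "an engraving of a sailing ship"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity logits between the query and each image, ranked highest first.
scores = outputs.logits_per_text.squeeze(0)
ranking = torch.argsort(scores, descending=True)
print(ranking.tolist(), scores.tolist())
```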
- SCOPE: Sign Language Contextual Processing with Embedding from LLMs [49.5629738637893]
Sign languages, used by around 70 million Deaf individuals globally, are visual languages that convey visual and contextual information.
Current methods in vision-based sign language recognition (SLR) and translation (SLT) struggle with dialogue scenes due to limited dataset diversity and the neglect of contextually relevant information.
We introduce SCOPE, a novel context-aware vision-based SLR and SLT framework.
arXiv Detail & Related papers (2024-09-02T08:56:12Z)
- ViSpeR: Multilingual Audio-Visual Speech Recognition [9.40993779729177]
This work presents an extensive and detailed study on Audio-Visual Speech Recognition for five widely spoken languages.
We collected large-scale datasets for each language except English and trained supervised learning models on them.
Our model, ViSpeR, is trained in a multi-lingual setting, resulting in competitive performance on newly established benchmarks for each language.
arXiv Detail & Related papers (2024-05-27T14:48:51Z)
- Lightweight Syntactic API Usage Analysis with UCov [0.0]
We present a novel conceptual framework designed to assist library maintainers in understanding the interactions allowed by their APIs.
The framework's customizable models of API interactions enable library maintainers to improve their design ahead of release, reducing friction during evolution.
We implement these models for Java libraries in a new tool, UCov, and demonstrate its capabilities on three libraries exhibiting diverse styles of interaction.
arXiv Detail & Related papers (2024-02-19T10:33:41Z)
- CodeTF: One-stop Transformer Library for State-of-the-art Code LLM [72.1638273937025]
We present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence.
Our library supports a collection of pretrained Code LLMs and popular code benchmarks.
We hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering.
arXiv Detail & Related papers (2023-05-31T05:24:48Z)
- VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [81.32968995346775]
VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions.
Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
arXiv Detail & Related papers (2023-05-18T17:59:42Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Leveraging Language to Learn Program Abstractions and Search Heuristics [66.28391181268645]
We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis.
When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization.
arXiv Detail & Related papers (2021-06-18T15:08:47Z)
- CLEVR Parser: A Graph Parser Library for Geometric Learning on Language Grounded Image Scenes [2.750124853532831]
The CLEVR dataset has been used extensively for language-grounded visual reasoning in the Machine Learning (ML) and Natural Language Processing (NLP) communities.
We present a graph library for CLEVR that provides functionality for extracting object-centric attributes and relationships and for constructing structural graph representations for both modalities.
We discuss downstream uses of the library and how it accelerates research for the NLP community; a small scene-graph construction sketch follows this entry.
arXiv Detail & Related papers (2020-09-19T03:32:37Z)
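To make the idea of structural graph representations concrete, here is a small sketch that builds a scene graph with networkx from a CLEVR-style scene annotation. The scene dictionary is a hand-written stand-in, and the code is not the CLEVR Parser library's actual API.

```python
import networkx as nx

# Hand-written stand-in for a CLEVR scene annotation (objects + spatial relations).
scene = {
    "objects": [
        {"color": "red", "shape": "cube", "size": "large"},
        {"color": "blue", "shape": "sphere", "size": "small"},
    ],
    # For each relation, entry i lists the objects standing in that relation to object i,
    # i.e. object 0 is to the left of object 1.
    "relationships": {"left": [[], [0]]},
}

# Nodes carry object-centric attributes; directed edges carry spatial relations.
graph = nx.DiGraph()
for idx, obj in enumerate(scene["objects"]):
    graph.add_node(idx, **obj)

for relation, per_object in scene["relationships"].items():
    for target, sources in enumerate(per_object):
        for source in sources:
            graph.add_edge(source, target, relation=relation)

print(graph.nodes(data=True))
print(graph.edges(data=True))
```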
This list is generated automatically from the titles and abstracts of the papers on this site; the site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.