SCITUNE: Aligning Large Language Models with Scientific Multimodal
Instructions
- URL: http://arxiv.org/abs/2307.01139v1
- Date: Mon, 3 Jul 2023 16:25:49 GMT
- Title: SCITUNE: Aligning Large Language Models with Scientific Multimodal
Instructions
- Authors: Sameera Horawalavithana, Sai Munikoti, Ian Stewart, Henry Kvinge
- Abstract summary: In this work, we present SciTune as a tuning framework to improve the ability of LLMs to follow scientific multimodal instructions.
To test our methodology, we use a human-generated scientific instruction tuning dataset and train a large multimodal model LLaMA-SciTune.
In comparison to the models that are finetuned with machine generated data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories on the ScienceQA benchmark.
- Score: 0.7264378254137809
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction finetuning is a popular paradigm to align large language models
(LLMs) with human intent. Despite its popularity, this idea is less explored for
aligning existing foundation models with scientific disciplines, concepts and
goals. In this work, we present SciTune as a tuning
framework to improve the ability of LLMs to follow scientific multimodal
instructions. To test our methodology, we use a human-generated scientific
instruction tuning dataset and train a large multimodal model LLaMA-SciTune
that connects a vision encoder and LLM for science-focused visual and language
understanding. In comparison to the models that are finetuned with machine
generated data only, LLaMA-SciTune surpasses human performance on average and
in many sub-categories on the ScienceQA benchmark.
Related papers
- Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression [40.4998607679863]
Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data.
This paper focuses on TG-SFT, which can synthetically generate SFT data for the instruction tuning steps.
arXiv Detail & Related papers (2024-06-17T09:17:40Z)
- MotionLLM: Multimodal Motion-Language Learning with Large Language Models [69.5875073447454]
We propose MotionLLM to achieve single-human, multi-human motion generation and motion captioning.
Specifically, we encode and quantize motions into discrete LLM-understandable tokens, which results in a unified vocabulary consisting of both motion and text tokens.
Our approach is scalable and flexible, allowing easy extension to multi-human motion generation through autoregressive generation of single-human motions.
arXiv Detail & Related papers (2024-05-27T09:57:51Z)
- Can LLMs' Tuning Methods Work in Medical Multimodal Domain? [14.659849302397433]
While Large Language Models (LLMs) excel in world knowledge understanding, adapting them to specific subfields requires precise adjustments.
New Parameter-Efficient Fine-Tuning (PEFT) methods have emerged and achieved remarkable success in both LLMs and Large Vision-Language Models (LVLMs).
Can the fine-tuning methods for large models be transferred to the medical field to enhance transfer learning efficiency?
arXiv Detail & Related papers (2024-03-11T03:38:48Z)
- MMToM-QA: Multimodal Theory of Mind Question Answering [80.87550820953236]
Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence.
Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding.
Human ToM, on the other hand, is more than video or text understanding.
People can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
arXiv Detail & Related papers (2024-01-16T18:59:24Z)
- SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning [60.14510984576027]
SciGLM is a suite of scientific language models able to conduct college-level scientific reasoning.
We apply a self-reflective instruction annotation framework to generate step-by-step reasoning for unlabelled scientific questions.
We fine-tuned the ChatGLM family of language models with SciInstruct, enhancing their scientific and mathematical reasoning capabilities.
arXiv Detail & Related papers (2024-01-15T20:22:21Z)
- AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for invoking Large Language Models.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z)
- Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics [32.123919380959485]
Multi-modal large language models (MLLMs) are trained based on large language models (LLMs).
While they excel in multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested.
We show that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment.
arXiv Detail & Related papers (2023-09-13T17:57:21Z)
- Aligning Large Language Models through Synthetic Feedback [43.84431341195111]
We propose a novel alignment learning framework with synthetic feedback not dependent on extensive human annotations.
In human evaluation, our model is preferred to Alpaca and Dolly-v2, 55.0% and 58.5% of the time, respectively.
arXiv Detail & Related papers (2023-05-23T06:41:16Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.