Beyond Language Models: Byte Models are Digital World Simulators
- URL: http://arxiv.org/abs/2402.19155v1
- Date: Thu, 29 Feb 2024 13:38:07 GMT
- Title: Beyond Language Models: Byte Models are Digital World Simulators
- Authors: Shangda Wu, Xu Tan, Zili Wang, Rui Wang, Xiaobing Li, Maosong Sun
- Abstract summary: bGPT is a model that uses next byte prediction to simulate the digital world.
It matches specialized models in performance across various modalities, including text, audio, and images.
It has almost flawlessly replicated the process of converting symbolic music data, achieving a low error rate of 0.0011 bits per byte.
- Score: 68.91268999567473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional deep learning often overlooks bytes, the basic units of the
digital world, where all forms of information and operations are encoded and
manipulated in binary format. Inspired by the success of next token prediction
in natural language processing, we introduce bGPT, a model with next byte
prediction to simulate the digital world. bGPT matches specialized models in
performance across various modalities, including text, audio, and images, and
offers new possibilities for predicting, simulating, and diagnosing algorithm
or hardware behaviour. It has almost flawlessly replicated the process of
converting symbolic music data, achieving a low error rate of 0.0011 bits per
byte in converting ABC notation to MIDI format. In addition, bGPT demonstrates
exceptional capabilities in simulating CPU behaviour, with an accuracy
exceeding 99.99% in executing various operations. Leveraging next byte
prediction, models like bGPT can directly learn from vast binary data,
effectively simulating the intricate patterns of the digital world.
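As a rough illustration of the next-byte-prediction objective described in the abstract, the sketch below trains a tiny causal Transformer over the 256 possible byte values and converts the cross-entropy loss into the bits-per-byte figure the abstract quotes. The architecture, sizes, and training setup here are illustrative assumptions, not bGPT's actual configuration.

```python
import math
import torch
import torch.nn as nn


class TinyByteLM(nn.Module):
    """Small causal Transformer over raw bytes (vocabulary size 256)."""

    def __init__(self, d_model=256, n_layers=4, n_heads=4, max_len=1024):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)       # one embedding per byte value
        self.pos = nn.Embedding(max_len, d_model)     # learned positional embeddings
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 256)           # logits for the next byte

    def forward(self, x):                             # x: (batch, seq) of byte ids
        positions = torch.arange(x.size(1), device=x.device)
        h = self.embed(x) + self.pos(positions)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.blocks(h, mask=causal_mask)
        return self.head(h)


model = TinyByteLM()
data = torch.randint(0, 256, (2, 128))                # stand-in for raw file bytes
logits = model(data[:, :-1])                          # predict byte t+1 from bytes <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 256), data[:, 1:].reshape(-1)
)
# Cross-entropy is in nats; dividing by ln(2) gives bits per byte, the metric
# the abstract reports (0.0011 bpb for ABC-to-MIDI conversion).
bits_per_byte = loss.item() / math.log(2)
print(f"bits per byte: {bits_per_byte:.4f}")
```

The same objective applies to any file type, since every modality (text, audio, images, executable traces) is ultimately a byte stream.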
Related papers
- Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z)
- T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text [59.57676466961787]
We propose a novel dynamic vector quantization (DVA-VAE) model that can adjust the encoding length based on the information density in sign language.
Experiments conducted on the PHOENIX14T dataset demonstrate the effectiveness of our proposed method.
We propose a new large German sign language dataset, PHOENIX-News, which contains 486 hours of sign language videos, audio, and transcription texts.
arXiv Detail & Related papers (2024-06-11T10:06:53Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers [78.85346970193518]
Megabyte is a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes.
Experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling.
Results establish the viability of tokenization-free autoregressive sequence modeling at scale.
arXiv Detail & Related papers (2023-05-12T00:55:41Z)
- Adapting Transformer Language Models for Predictive Typing in Brain-Computer Interfaces [3.3961243538813837]
This paper adapts several wordpiece-level Transformer LMs to make character predictions and evaluates them on typing tasks.
GPT-2 fares best on clean text, but different LMs react differently to noisy histories.
arXiv Detail & Related papers (2023-05-05T19:47:41Z)
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which instead optimizes task-specific decoder networks on the output side.
With gradient-based optimization, DecT can be trained within several seconds and requires only one PLM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
- Hidden Schema Networks [3.4123736336071864]
We introduce a novel neural language model that enforces, via inductive biases, explicit relational structures.
The model encodes sentences into sequences of symbols, which correspond to nodes visited by biased random walkers.
We show that the model is able to uncover ground-truth graphs from artificially generated datasets of random token sequences.
arXiv Detail & Related papers (2022-07-08T09:26:19Z)
- Neural Machine Translation without Embeddings [44.129310924201604]
Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and subword induction algorithms.
A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8 (see the short sketch after this list).
Experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models.
arXiv Detail & Related papers (2020-08-21T09:54:11Z)
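The byte-level representation mentioned in the last entry can be shown in a few lines: text is serialized into its UTF-8 bytes (values 0 to 255), so no learned subword vocabulary is needed, and outputs are recovered by the inverse mapping. This is a generic illustration of UTF-8 byte serialization, not code from the cited paper.

```python
# Text -> UTF-8 byte ids (0-255): the "embedding-free" input representation.
text = "héllo"
byte_ids = list(text.encode("utf-8"))   # [104, 195, 169, 108, 108, 111]
print(byte_ids)

# Model outputs (byte ids) are turned back into text by the inverse mapping.
decoded = bytes(byte_ids).decode("utf-8")
assert decoded == text
```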