BriLLM: Brain-inspired Large Language Model
- URL: http://arxiv.org/abs/2503.11299v6
- Date: Tue, 05 Aug 2025 11:19:51 GMT
- Title: BriLLM: Brain-inspired Large Language Model
- Authors: Hai Zhao, Hongqiu Wu, Dongjie Yang, Anni Zou, Jiale Hong,
- Abstract summary: BriLLM is a brain-inspired large language model that redefines the foundations of generative language modeling. We release initial Chinese and English BriLLM versions with ~2B and ~1B parameters, respectively, achieving performance comparable to GPT-1.
- Score: 51.849486186292914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce BriLLM, a brain-inspired large language model that redefines the foundations of generative language modeling. Departing from Transformer architectures, GPT frameworks, and traditional input-output constrained paradigms, BriLLM is built on the Signal Fully-connected flowing (SiFu) mechanism - a directed graph-based neural network design that enables full interpretability across all nodes, in contrast to conventional models limited to input-output interpretability. In this framework, tokens are represented as graph nodes, with signal flows - either randomly initialized or user-defined - propagating along paths following a "least resistance" principle. The next token to be generated emerges as the target of this signal flow. Theoretically, BriLLM supports infinitely long n-gram modeling, with model size decoupled from input and prediction length. Its signal propagation dynamics mimic human-like cognitive patterns, enabling recall activation and inherent multi-modal compatibility. We release initial Chinese and English BriLLM versions (4000 tokens, 32-dimensional nodes, 32-token sequence prediction capacity) with sizes ~2B and ~1B parameters, respectively, achieving performance comparable to GPT-1.
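To make the signal-flow description above concrete, below is a minimal, purely illustrative sketch of "least resistance" next-token selection over a token graph. The edge parameterization, the tanh activation, and the reading of least resistance as maximal signal energy are assumptions for illustration, not the released BriLLM implementation.

```python
# Toy sketch of SiFu-style "least resistance" next-token selection.
# Assumptions (not from the released model): each token is a graph node
# carrying a 32-d signal, each directed edge (u -> v) owns a 32x32 weight
# matrix plus a bias, and "least resistance" is read as "largest signal
# energy after traversing the edge".
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "the", "cat", "sat", "mat", "</s>"]
DIM = 32  # node signal dimensionality, per the abstract

# Randomly initialized edge parameters; training would tune these.
W = {(u, v): rng.normal(scale=0.1, size=(DIM, DIM))
     for u in vocab for v in vocab if u != v}
b = {edge: rng.normal(scale=0.01, size=DIM) for edge in W}

def step(node: str, signal: np.ndarray) -> tuple[str, np.ndarray]:
    """Propagate the signal from `node` to the neighbor with maximal energy."""
    best_v, best_sig, best_energy = None, None, -np.inf
    for v in vocab:
        if v == node:
            continue
        out = np.tanh(W[(node, v)] @ signal + b[(node, v)])
        energy = float(np.linalg.norm(out))  # higher energy = lower "resistance"
        if energy > best_energy:
            best_v, best_sig, best_energy = v, out, energy
    return best_v, best_sig

node, signal = "<s>", rng.normal(size=DIM)
for _ in range(5):  # generate a short path of tokens
    node, signal = step(node, signal)
    print(node, end=" ")
```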
Related papers
- Synergy: End-to-end Concept Model [0.0]
We present Synergy, a language model that bridges different levels of abstraction in an end-to-end fashion. Our model spontaneously learns to tokenize bytes, producing fewer concept tokens than byte-level Byte Pair Encoding (BPE) tokenizers.
arXiv Detail & Related papers (2025-07-17T04:01:28Z) - Token Communication in the Era of Large Models: An Information Bottleneck-Based Approach [55.861432910722186]
UniToCom is a unified token communication paradigm that treats tokens as the fundamental units for both processing and wireless transmission. We propose a generative information bottleneck (GenIB) principle, which facilitates the learning of tokens that preserve essential information. We employ a causal Transformer-based multimodal large language model (MLLM) at the receiver to unify the processing of both discrete and continuous tokens.
arXiv Detail & Related papers (2025-07-02T14:03:01Z) - Neural Networks as Universal Finite-State Machines: A Constructive Feedforward Simulation Framework for NFAs [0.0]
This work establishes a new bridge between symbolic automata theory and modern neural architectures. We show that feedforward networks can perform precise, interpretable, and trainable symbolic computation.
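As a concrete (if generic) illustration of simulating an NFA step with a feedforward layer, the sketch below encodes the active-state set as a 0/1 vector and applies a transition matrix followed by a threshold. The toy automaton and encoding are a textbook-style construction, not the paper's specific framework.

```python
# Sketch of simulating one NFA transition with a feedforward layer:
# the active-state set is a 0/1 vector, and for each input symbol the
# transition relation is a 0/1 matrix; a matvec followed by a threshold
# gives the next active-state set.
import numpy as np

# NFA over {a, b} with states {0, 1, 2}, accepting strings containing "ab".
delta = {
    "a": np.array([[1, 1, 0],   # from state 0 on 'a': stay in 0 or move to 1
                   [0, 0, 0],
                   [0, 0, 1]]),
    "b": np.array([[1, 0, 0],
                   [0, 0, 1],   # from state 1 on 'b': move to accepting 2
                   [0, 0, 1]]),
}
accepting = np.array([0, 0, 1])

def nfa_step(active: np.ndarray, symbol: str) -> np.ndarray:
    # "Neural" step: linear map by the transition matrix, then threshold.
    pre = delta[symbol].T @ active          # ways to reach each state
    return (pre >= 1).astype(int)           # hard-threshold nonlinearity

def accepts(word: str) -> bool:
    active = np.array([1, 0, 0])            # start in state 0
    for ch in word:
        active = nfa_step(active, ch)
    return bool(active @ accepting)

print(accepts("aab"))   # True
print(accepts("bba"))   # False
```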
arXiv Detail & Related papers (2025-05-30T01:18:35Z) - Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding [55.2480439325792]
In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. We show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation.
arXiv Detail & Related papers (2025-04-29T06:33:13Z) - Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformers. Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
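A rough sketch of the token-merging idea: fold an s x s window of spatially local tokens into the channel dimension before the expensive Transformer blocks and unfold afterwards, cutting the token count by a factor of s*s. Shapes and the shuffle factor below are illustrative assumptions, not the paper's exact operators.

```python
# Token-shuffle / unshuffle as pure reshape-permute operations.
import torch

def token_shuffle(x: torch.Tensor, s: int) -> torch.Tensor:
    # x: (batch, H, W, C) -> (batch, H/s, W/s, C*s*s)
    b, h, w, c = x.shape
    x = x.reshape(b, h // s, s, w // s, s, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h // s, w // s, c * s * s)

def token_unshuffle(x: torch.Tensor, s: int) -> torch.Tensor:
    # Inverse of token_shuffle.
    b, hs, ws, cs = x.shape
    c = cs // (s * s)
    x = x.reshape(b, hs, ws, s, s, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(b, hs * s, ws * s, c)

x = torch.randn(1, 32, 32, 16)          # 1024 visual tokens
y = token_shuffle(x, s=2)               # 256 merged tokens, 4x fewer
assert torch.allclose(token_unshuffle(y, s=2), x)
print(x.shape, "->", y.shape)
```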
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models [4.7936447642295406]
In Transformer language models, activation vectors transform from current token embeddings to next token predictions as they pass through the model.
To isolate a minimal form of this transformation, we identify language model subnetworks that make bigram predictions, naive next-token predictions based only on the current token.
We find that bigram subnetworks can be found in fully trained language models up to 1B parameters, and these subnetworks are critical for model performance even when they consist of less than 0.2% of model parameters.
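The isolated bigram behaviour can be mimicked by a tiny standalone baseline that maps the current token straight to next-token logits through an embedding/unembedding pair; this is a conceptual analogue of the extracted subnetworks, not the paper's extraction procedure.

```python
# A standalone bigram "model": next-token logits depend only on the current
# token, via embedding -> unembedding.
import torch
import torch.nn as nn

class BigramLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) -> logits: (batch, seq, vocab)
        return self.unembed(self.embed(tokens))

vocab_size = 100
model = BigramLM(vocab_size)
tokens = torch.randint(0, vocab_size, (2, 16))
logits = model(tokens)                       # each position sees only itself
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predict token t+1 from token t
    tokens[:, 1:].reshape(-1),
)
print(logits.shape, float(loss))
```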
arXiv Detail & Related papers (2025-04-21T22:41:00Z) - Recurrent Diffusion for Large-Scale Parameter Generation [52.98888368644455]
We introduce Recurrent Diffusion for Large-Scale Parameter Generation (RPG), a novel framework that generates full neural network parameters, up to hundreds of millions, on a single GPU. RPG serves as a critical advance in AI generating AI, potentially enabling efficient weight generation at scales previously deemed infeasible.
arXiv Detail & Related papers (2025-01-20T16:46:26Z) - Concept Bottleneck Language Models For protein design [33.62561223760279]
We introduce Concept Bottleneck Protein Language Models (CB-pLM). CB-pLM is a generative masked language model with a layer where each neuron corresponds to an interpretable concept. We scale our CB-pLM from 24 million to 3 billion parameters, making them the largest Concept Bottleneck Models trained and the first capable of generative language modeling.
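A minimal sketch of a concept-bottleneck head for a masked language model, where a small set of concept neurons is the only path from the encoder to the token decoder; sizes, concept targets, and the loss weighting are illustrative assumptions.

```python
# Concept bottleneck head: hidden states -> concept neurons -> token decoder.
import torch
import torch.nn as nn

class ConceptBottleneckHead(nn.Module):
    def __init__(self, d_model: int, n_concepts: int, vocab_size: int):
        super().__init__()
        self.to_concepts = nn.Linear(d_model, n_concepts)   # one neuron per concept
        self.decoder = nn.Linear(n_concepts, vocab_size)    # predicts masked tokens

    def forward(self, hidden: torch.Tensor):
        concepts = self.to_concepts(hidden)     # interpretable, supervisable
        logits = self.decoder(concepts)         # generation goes through concepts only
        return concepts, logits

head = ConceptBottleneckHead(d_model=256, n_concepts=8, vocab_size=30)
hidden = torch.randn(4, 10, 256)                       # (batch, seq, d_model)
concept_labels = torch.rand(4, 10, 8)                  # e.g. protein property targets
token_labels = torch.randint(0, 30, (4, 10))

concepts, logits = head(hidden)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), token_labels) \
     + 0.1 * nn.functional.mse_loss(concepts, concept_labels)
print(concepts.shape, logits.shape, float(loss))
```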
arXiv Detail & Related papers (2024-11-09T06:46:16Z) - Interpretable Language Modeling via Induction-head Ngram Models [74.26720927767398]
We propose Induction-head ngram models (Induction-Gram) to bolster modern ngram models with a hand-engineered "induction head".
This induction head uses a custom neural similarity metric to efficiently search the model's input context for potential next-word completions.
Experiments show that this simple method significantly improves next-word prediction over baseline interpretable models.
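The induction mechanism can be illustrated in a few lines: scan the context for earlier occurrences of the current n-gram suffix and vote for the tokens that followed them. Exact string matching stands in here for the paper's learned neural similarity metric.

```python
# Generic "induction" next-word proposal: find earlier positions whose
# preceding n-gram matches the current suffix and vote for the tokens that
# followed them.
from collections import Counter

def induction_candidates(context: list[str], n: int = 2) -> Counter:
    suffix = tuple(context[-n:])
    votes = Counter()
    for i in range(len(context) - n):
        if tuple(context[i:i + n]) == suffix:
            votes[context[i + n]] += 1       # token that followed this match
    return votes

ctx = "the cat sat on the mat and the cat sat".split()
print(induction_candidates(ctx, n=2))        # Counter({'on': 1})
```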
arXiv Detail & Related papers (2024-10-31T12:33:26Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
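A schematic version of iteratively-refined parallel (mask-predict) decoding: predict all masked positions at once, commit the most confident ones, re-mask the rest, and repeat. The predictor below is a random placeholder and the commit schedule is an assumption, not the paper's T5 recipe.

```python
# Schematic mask-predict decoding loop with a placeholder predictor.
import numpy as np

MASK = -1
rng = np.random.default_rng(0)

def dummy_predict(tokens: np.ndarray, vocab: int) -> np.ndarray:
    """Stand-in for a masked LM: returns per-position probabilities."""
    logits = rng.normal(size=(len(tokens), vocab))
    return np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

def mask_predict(length: int, vocab: int, steps: int = 4) -> np.ndarray:
    tokens = np.full(length, MASK)
    for step in range(steps):
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        probs = dummy_predict(tokens, vocab)
        conf = probs[masked].max(axis=-1)
        # Commit a growing fraction of the most confident masked positions.
        k = max(1, int(np.ceil(masked.size * (step + 1) / steps)))
        keep = masked[np.argsort(-conf)[:k]]
        tokens[keep] = probs[keep].argmax(axis=-1)
    return tokens

print(mask_predict(length=8, vocab=20))
```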
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - This Probably Looks Exactly Like That: An Invertible Prototypical Network [8.957872207471311]
Prototypical neural networks represent an exciting way forward in realizing human-comprehensible machine learning without concept annotations.
We find that reliance on indirect interpretation functions for prototypical explanations imposes a severe limit on prototypes' informative power.
We propose one such model, called ProtoFlow, by composing a normalizing flow with Gaussian mixture models.
arXiv Detail & Related papers (2024-07-16T21:51:02Z) - Power Failure Cascade Prediction using Graph Neural Networks [4.667031410586657]
We propose a flow-free model that predicts grid states at every generation of a cascade process given an initial contingency and power injection values.
We show that the proposed model reduces the computational time by almost two orders of magnitude.
arXiv Detail & Related papers (2024-04-24T18:45:50Z) - Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks [12.7259425362286]
We investigate how multilingual models might leverage key-value memories.
For autoregressive models trained on two or more languages, do all neurons (across layers) respond equally to all languages?
Our findings reveal that the layers closest to the network's input or output tend to exhibit more language-specific behaviour compared to the layers in the middle.
arXiv Detail & Related papers (2023-10-24T06:45:00Z) - Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model [58.9100867327305]
Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformer model size for pretraining large language models.
We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method.
We found a simpler selection method -- Avg-K -- that selects blocks through their mean aggregated hidden states, achieving lower perplexity in language model pretraining.
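One plausible reading of the Avg-K rule, sketched below: score each memory block by the dot product between the incoming hidden state and the mean of that block's key vectors, then run only the top-scoring blocks. Shapes and the top-k value are illustrative assumptions.

```python
# Sparse FFN with Avg-K-style block routing (illustrative reading).
import torch

d_model, block_size, n_blocks, top_k = 64, 128, 16, 2
keys = torch.randn(n_blocks, block_size, d_model)      # FFN "key" weights per block
values = torch.randn(n_blocks, block_size, d_model)    # FFN "value" weights per block

def sparse_ffn(h: torch.Tensor) -> torch.Tensor:
    # Route with Avg-K: mean key per block dotted with the hidden state.
    block_scores = keys.mean(dim=1) @ h                 # (n_blocks,)
    chosen = block_scores.topk(top_k).indices
    out = torch.zeros_like(h)
    for b in chosen:                                     # only selected blocks run
        acts = torch.relu(keys[b] @ h)                   # (block_size,)
        out = out + values[b].T @ acts
    return out

h = torch.randn(d_model)
print(sparse_ffn(h).shape)   # torch.Size([64])
```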
arXiv Detail & Related papers (2023-05-23T12:28:37Z) - Residual Learning of Neural Text Generation with $n$-gram Language Model [41.26228768053928]
We learn a neural LM that fits the residual between an $n$-gram LM and the real-data distribution.
Our approach consistently attains additional performance gains over popular standalone neural models.
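The residual idea can be written down directly: the final distribution combines the n-gram log-probabilities with neural residual logits, so the network only models what the n-gram LM misses. The exact parameterization in the paper may differ; this sketch shows the combination step.

```python
# Combining an n-gram LM with a neural "residual" in log space.
import torch
import torch.nn.functional as F

vocab, batch = 1000, 4

ngram_probs = torch.full((batch, vocab), 1.0 / vocab)   # stand-in n-gram LM
residual_logits = torch.randn(batch, vocab)             # neural residual head

# Final distribution: softmax(log p_ngram + residual logits).
log_probs = F.log_softmax(torch.log(ngram_probs + 1e-10) + residual_logits, dim=-1)

targets = torch.randint(0, vocab, (batch,))
loss = F.nll_loss(log_probs, targets)                   # train only the residual
print(log_probs.shape, float(loss))
```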
arXiv Detail & Related papers (2022-10-26T02:42:53Z) - Augmenting Interpretable Models with LLMs during Training [73.40079895413861]
We propose Augmented Interpretable Models (Aug-imodels) to build efficient and interpretable models.
Aug-imodels use LLMs during fitting but not during inference, allowing complete transparency.
We explore two instantiations of Aug-imodels in natural-language processing: (i) Aug-GAM, which augments a generalized additive model with decoupled embeddings from an LLM and (ii) Aug-Tree, which augments a decision tree with LLM feature expansions.
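A sketch of the Aug-GAM instantiation: embed n-gram features with an LLM during fitting, fit a linear model over those embeddings, then cache per-ngram scores so inference is a transparent lookup-and-sum with no LLM calls. The random embedder below is a placeholder for a real LLM encoder, and the fitting details are assumptions.

```python
# Aug-GAM-style pipeline with a placeholder embedder standing in for an LLM.
import numpy as np
from itertools import chain

DIM = 16

def fake_llm_embed(ngram: str) -> np.ndarray:
    """Placeholder for an LLM embedding of an n-gram (deterministic within a run)."""
    local = np.random.default_rng(abs(hash(ngram)) % (2**32))
    return local.normal(size=DIM)

def ngrams(text: str, n: int = 2):
    toks = text.lower().split()
    return [" ".join(toks[i:i + j]) for j in (1, n) for i in range(len(toks) - j + 1)]

texts = ["great movie loved it", "terrible movie hated it"]
labels = np.array([1.0, -1.0])

# Fit: average the LLM embeddings of each text's n-grams, then least squares.
X = np.stack([np.mean([fake_llm_embed(g) for g in ngrams(t)], axis=0) for t in texts])
w, *_ = np.linalg.lstsq(X, labels, rcond=None)

# Cache interpretable per-ngram contributions; inference never calls the LLM.
vocab_ngrams = set(chain.from_iterable(ngrams(t) for t in texts))
score_table = {g: float(fake_llm_embed(g) @ w) for g in vocab_ngrams}

def predict(text: str) -> float:
    gs = [g for g in ngrams(text) if g in score_table]
    return float(np.mean([score_table[g] for g in gs])) if gs else 0.0

print(predict("loved it"), predict("hated it"))
```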
arXiv Detail & Related papers (2022-09-23T18:36:01Z) - Hidden Schema Networks [3.4123736336071864]
We introduce a novel neural language model that enforces, via inductive biases, explicit relational structures.
The model encodes sentences into sequences of symbols, which correspond to nodes visited by biased random walkers.
We show that the model is able to uncover ground-truth graphs from artificially generated datasets of random token sequences.
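A toy illustration of the biased-random-walker view: a sentence is encoded as the sequence of schema-symbol nodes visited by a walker on a small graph. In the paper the graph and the biases are learned end-to-end; here they are fixed by hand only to show the data structure.

```python
# Encode as the node sequence visited by a biased random walk over symbols.
import numpy as np

rng = np.random.default_rng(0)
symbols = ["ENTITY", "ACTION", "OBJECT", "MODIFIER"]

# Row-stochastic adjacency: walk biases between schema symbols.
P = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.3, 0.2, 0.1, 0.4],
    [0.4, 0.3, 0.2, 0.1],
])

def encode(length: int, start: int = 0) -> list[str]:
    """Encode as the node sequence visited by a biased random walk."""
    node, walk = start, [symbols[start]]
    for _ in range(length - 1):
        node = rng.choice(len(symbols), p=P[node])
        walk.append(symbols[node])
    return walk

print(encode(6))   # e.g. ['ENTITY', 'ACTION', 'OBJECT', ...]
```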
arXiv Detail & Related papers (2022-07-08T09:26:19Z) - GN-Transformer: Fusing Sequence and Graph Representation for Improved Code Summarization [0.0]
We propose a novel method, GN-Transformer, to learn end-to-end on a fused sequence and graph modality.
The proposed method achieves state-of-the-art performance on two code summarization datasets and across three automatic code summarization metrics.
arXiv Detail & Related papers (2021-11-17T02:51:37Z) - Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
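A small sketch of lattice construction: every character is a unit, and every lexicon word matching a span adds a second unit with its positions, so the transformer can attend over both granularities. The tiny lexicon here is illustrative.

```python
# Build a character/word lattice: each unit is (text, start, end).
lexicon = {"北京", "北京大学", "大学", "语言", "模型"}

def build_lattice(sentence: str, max_word_len: int = 4):
    units = [(ch, i, i + 1) for i, ch in enumerate(sentence)]      # char units
    for i in range(len(sentence)):
        for j in range(i + 2, min(i + max_word_len, len(sentence)) + 1):
            if sentence[i:j] in lexicon:
                units.append((sentence[i:j], i, j))                # word units
    return units

for unit in build_lattice("北京大学语言模型"):
    print(unit)   # all units are fed jointly into the transformer
```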
arXiv Detail & Related papers (2021-04-15T02:36:49Z) - Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z) - Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
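The gating-timescale relation can be made explicit: with a roughly constant forget gate f, cell memory decays like f^t, giving an effective timescale T = -1/ln(f). The sketch below initializes forget-gate biases so unit timescales follow a heavy-tailed target distribution; the specific distribution is an assumption, not the paper's exact recipe.

```python
# Map target timescales to forget-gate biases: f = exp(-1/T), bias = logit(f).
import numpy as np

rng = np.random.default_rng(0)
n_units = 512

# Heavy-tailed (inverse-gamma-like) target timescales, clipped for stability.
timescales = np.clip(1.0 / rng.gamma(shape=2.0, scale=0.2, size=n_units), 1.0, 1e6)

forget_gate = np.exp(-1.0 / timescales)                    # steady-state gate value
forget_bias = np.log(forget_gate / (1.0 - forget_gate))    # logit(f), used to init biases

print("median timescale:", np.median(timescales))
print("bias range:", forget_bias.min(), forget_bias.max())
```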
arXiv Detail & Related papers (2020-09-27T02:13:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.