Probing Semantic Grounding in Language Models of Code with
Representational Similarity Analysis
- URL: http://arxiv.org/abs/2207.07706v1
- Date: Fri, 15 Jul 2022 19:04:43 GMT
- Title: Probing Semantic Grounding in Language Models of Code with
Representational Similarity Analysis
- Authors: Shounak Naik, Rajaswa Patil, Swati Agarwal, Veeky Baths
- Abstract summary: We propose using Representational Similarity Analysis to probe the semantic grounding in language models of code.
We probe representations from the CodeBERT model for semantic grounding by using the data from the IBM CodeNet dataset.
Our experiments with semantic perturbations in code reveal that CodeBERT is able to robustly distinguish between semantically correct and incorrect code.
- Score: 0.11470070927586018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Representational Similarity Analysis is a method from cognitive neuroscience,
which helps in comparing representations from two different sources of data. In
this paper, we propose using Representational Similarity Analysis to probe the
semantic grounding in language models of code. We probe representations from
the CodeBERT model for semantic grounding by using the data from the IBM
CodeNet dataset. Through our experiments, we show that current pre-training
methods do not induce semantic grounding in language models of code, and
instead focus on optimizing form-based patterns. We also show that even a
small amount of fine-tuning on semantically relevant tasks increases the
semantic grounding in CodeBERT significantly. Our ablations with the input
modality to the CodeBERT model show that using bimodal inputs (code and natural
language) over unimodal inputs (only code) gives better semantic grounding and
sample efficiency during semantic fine-tuning. Finally, our experiments with
semantic perturbations in code reveal that CodeBERT is able to robustly
distinguish between semantically correct and incorrect code.
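The comparison that Representational Similarity Analysis performs can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the paper's actual pipeline: random arrays stand in for CodeBERT embeddings and for the second data source, correlation distance is assumed as the dissimilarity measure, and the helper names (`rdm`, `upper_triangle`, `ranks`, `rsa_score`) are hypothetical. Each source is reduced to a representational dissimilarity matrix over the same set of samples, and the two matrices are then compared with a rank correlation.

```python
import numpy as np


def rdm(reps: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix: 1 - Pearson correlation between rows."""
    return 1.0 - np.corrcoef(reps)


def upper_triangle(matrix: np.ndarray) -> np.ndarray:
    """Flatten the strictly upper triangle (each sample pair counted once)."""
    rows, cols = np.triu_indices(matrix.shape[0], k=1)
    return matrix[rows, cols]


def ranks(values: np.ndarray) -> np.ndarray:
    """Ordinal ranks. Ties are broken by position, which is a simplification."""
    return np.argsort(np.argsort(values)).astype(float)


def rsa_score(reps_a: np.ndarray, reps_b: np.ndarray) -> float:
    """Spearman-style correlation between the two sources' dissimilarity structures."""
    a = ranks(upper_triangle(rdm(reps_a)))
    b = ranks(upper_triangle(rdm(reps_b)))
    return float(np.corrcoef(a, b)[0, 1])


# Stand-in data (assumption): 6 samples, 16 features per source.
rng = np.random.default_rng(0)
model_reps = rng.normal(size=(6, 16))      # placeholder for model embeddings
rescaled_reps = 2.0 * model_reps + 3.0     # same geometry, different scale and offset
unrelated_reps = rng.normal(size=(6, 16))  # placeholder for an unrelated source

same_geometry = rsa_score(model_reps, rescaled_reps)
different_geometry = rsa_score(model_reps, unrelated_reps)
print(f"same geometry: {same_geometry:.3f}")
print(f"unrelated source: {different_geometry:.3f}")
```

Because the score depends only on the pairwise dissimilarity structure, the two sources do not need the same dimensionality, only the same set of samples in the same order.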
Related papers
- A test-free semantic mistakes localization framework in Neural Code Translation [32.5036379897325]
We present EISP, a static analysis framework based on the Large Language Model (LLM).
The framework generates a semantic mapping between source code and translated code.
EISP connects each pair of sub-code fragments with fine-grained knowledge hints through an AI chain.
arXiv Detail & Related papers (2024-10-30T08:53:33Z)
- Meaning Representations from Trajectories in Autoregressive Models [106.63181745054571]
We propose to extract meaning representations from autoregressive language models by considering the distribution of all possible trajectories extending an input text.
This strategy is prompt-free, does not require fine-tuning, and is applicable to any pre-trained autoregressive model.
We empirically show that the representations obtained from large models align well with human annotations, outperform other zero-shot and prompt-free methods on semantic similarity tasks, and can be used to solve more complex entailment and containment tasks that standard embeddings cannot handle.
arXiv Detail & Related papers (2023-10-23T04:35:58Z)
- Agentività e telicità in GilBERTo: implicazioni cognitive [77.71680953280436]
The goal of this study is to investigate whether a Transformer-based neural language model infers lexical semantics.
The semantic properties considered are telicity (also combined with definiteness) and agentivity.
arXiv Detail & Related papers (2023-07-06T10:52:22Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Constructing Word-Context-Coupled Space Aligned with Associative Knowledge Relations for Interpretable Language Modeling [0.0]
The black-box structure of the deep neural network in pre-trained language models seriously limits the interpretability of the language modeling process.
A Word-Context-Coupled Space (W2CSpace) is proposed by introducing the alignment processing between uninterpretable neural representation and interpretable statistical logic.
Our language model can achieve better performance and highly credible interpretable ability compared to related state-of-the-art methods.
arXiv Detail & Related papers (2023-05-19T09:26:02Z)
- Towards Computationally Verifiable Semantic Grounding for Language Models [18.887697890538455]
The paper conceptualizes the LM as a conditional model generating text given a desired semantic message formalized as a set of entity-relationship triples.
It embeds the LM in an auto-encoder by feeding its output to a semantic parser whose output is in the same representation domain as the input message.
We show that our proposed approaches significantly improve on the greedy search baseline.
arXiv Detail & Related papers (2022-11-16T17:35:52Z)
- Few-Shot Semantic Parsing with Language Models Trained On Code [52.23355024995237]
We find that Codex performs better at semantic parsing than equivalent GPT-3 models.
We find that, unlike GPT-3, Codex performs similarly when targeting meaning representations directly, perhaps because the meaning representations used in semantic parsing are structured similarly to code.
arXiv Detail & Related papers (2021-12-16T08:34:06Z)
- Multimodal Representation for Neural Code Search [18.371048875103497]
We introduce tree-serialization methods on a simplified form of AST and build the multimodal representation for the code data.
Our results show that both our tree-serialized representations and multimodal learning model improve the performance of neural code search.
arXiv Detail & Related papers (2021-07-02T12:08:19Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
- SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change [58.87961226278285]
This paper describes SChME, a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change.
SChME uses a model ensemble combining signals of distributional models (word embeddings) and word frequency models, where each model casts a vote indicating the probability that a word suffered semantic change according to that feature.
arXiv Detail & Related papers (2020-12-02T23:56:34Z)
- Logic Constrained Pointer Networks for Interpretable Textual Similarity [11.142649867439406]
We introduce a novel pointer network based model with a sentinel gating function to align constituent chunks.
We improve this base model with a loss function to equally penalize misalignments in both sentences, ensuring the alignments are bidirectional.
The model achieves F1 scores of 97.73 and 96.32 on the benchmark SemEval datasets for the chunk alignment task.
arXiv Detail & Related papers (2020-07-15T13:01:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.