The boundaries of meaning: a case study in neural machine translation
- URL: http://arxiv.org/abs/2210.00613v1
- Date: Sun, 2 Oct 2022 20:26:20 GMT
- Title: The boundaries of meaning: a case study in neural machine translation
- Authors: Yuri Balashov
- Abstract summary: Subword segmentation algorithms have been widely employed in language modeling, machine translation, and other tasks since 2016.
These algorithms often cut words into semantically opaque pieces, such as 'period', 'on', 't', and 'ist' in 'period|on|t|ist'.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The success of deep learning in natural language processing raises intriguing
questions about the nature of linguistic meaning and ways in which it can be
processed by natural and artificial systems. One such question has to do with
subword segmentation algorithms widely employed in language modeling, machine
translation, and other tasks since 2016. These algorithms often cut words into
semantically opaque pieces, such as 'period', 'on', 't', and 'ist' in
'period|on|t|ist'. The system then represents the resulting segments in a dense
vector space, which is expected to model grammatical relations among them. This
representation may in turn be used to map 'period|on|t|ist' (English) to
'par|od|ont|iste' (French). Thus, instead of being modeled at the lexical
level, translation is reformulated more generally as the task of learning the
best bilingual mapping between the sequences of subword segments of two
languages; and sometimes even between pure character sequences:
'p|e|r|i|o|d|o|n|t|i|s|t' $\rightarrow$ 'p|a|r|o|d|o|n|t|i|s|t|e'. Such subword
segmentations and alignments are at work in highly efficient end-to-end machine
translation systems, despite their allegedly opaque nature. The computational
value of such processes is unquestionable. But do they have any linguistic or
philosophical plausibility? I attempt to cast light on this question by
reviewing the relevant details of the subword segmentation algorithms and by
relating them to important philosophical and linguistic debates, in the spirit
of making artificial intelligence more transparent and explainable.
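The segmentation the abstract describes can be sketched with a toy greedy longest-match tokenizer. This is a minimal sketch, not the algorithm the paper analyzes: real systems (BPE, unigram LM) learn their vocabularies from data, whereas the hand-picked vocabularies below are hypothetical stand-ins chosen to reproduce the paper's example.

```python
# Toy sketch of subword segmentation: greedy longest-match against a
# fixed vocabulary. Real systems (BPE, unigram LM) learn the vocabulary
# from data; these hand-picked pieces are hypothetical stand-ins.
EN_VOCAB = {"period", "on", "t", "ist"}
FR_VOCAB = {"par", "od", "ont", "iste"}

def segment(word, vocab):
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character: fall back to itself
            i += 1
    return pieces

print("|".join(segment("periodontist", EN_VOCAB)))  # period|on|t|ist
print("|".join(segment("parodontiste", FR_VOCAB)))  # par|od|ont|iste
```

Translation in such a system then amounts to learning a mapping between the two segment sequences rather than between whole lexical items.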
Related papers
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Word class representations spontaneously emerge in a deep neural network trained on next word prediction [7.240611820374677]
How do humans learn language, and can the first language be learned at all?
These fundamental questions are still hotly debated.
To probe this, we train an artificial deep neural network to predict the next word.
We find that the internal representations of nine-word input sequences cluster according to the word class of the tenth word to be predicted as output.
arXiv Detail & Related papers (2023-02-15T11:02:50Z)
- On the Role of Morphological Information for Contextual Lemmatization [7.106986689736827]
We investigate the role of morphological information in developing contextual lemmatizers for six languages: Basque, Turkish, Russian, Czech, Spanish, and English.
Experiments suggest that the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology.
arXiv Detail & Related papers (2023-02-01T12:47:09Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- Context based lemmatizer for Polish language [0.0]
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item.
In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning.
The model achieves the best results for the Polish lemmatisation task.
arXiv Detail & Related papers (2022-07-23T18:02:16Z)
- A Paradigm Change for Formal Syntax: Computational Algorithms in the Grammar of English [0.0]
We turn to programming languages as models for a process-based syntax of English.
We chose the combination of a function word and a content word as the target of modeling.
The fit of the model was tested by deriving three functional characteristics crucial for the algorithm and checking their presence in English grammar.
arXiv Detail & Related papers (2022-05-24T07:28:47Z)
- Generalized Optimal Linear Orders [9.010643838773477]
The sequential structure of language, and the order of words in a sentence specifically, plays a central role in human language processing.
In designing computational models of language, the de facto approach is to present sentences to machines with the words ordered in the same order as in the original human-authored sentence.
The very essence of this work is to question the implicit assumption that this is desirable and inject theoretical soundness into the consideration of word order in natural language processing.
arXiv Detail & Related papers (2021-08-13T13:10:15Z)
- Provable Limitations of Acquiring Meaning from Ungrounded Form: What will Future Language Models Understand? [87.20342701232869]
We investigate the abilities of ungrounded systems to acquire meaning.
We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence.
We find that assertions enable semantic emulation if all expressions in the language are referentially transparent.
However, if the language uses non-transparent patterns like variable binding, we show that emulation can become an uncomputable problem.
arXiv Detail & Related papers (2021-04-22T01:00:17Z)
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
- Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z)
- Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information.
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
arXiv Detail & Related papers (2020-04-07T01:06:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.