Demo of the Linguistic Field Data Management and Analysis System -- LiFE
- URL: http://arxiv.org/abs/2203.11443v1
- Date: Tue, 22 Mar 2022 03:34:10 GMT
- Title: Demo of the Linguistic Field Data Management and Analysis System -- LiFE
- Authors: Siddharth Singh and Ritesh Kumar and Shyam Ratan and Sonal Sinha
- Abstract summary: LiFE is an open-source, web-based linguistic data management and analysis application.
It allows users to store lexical items, sentences, paragraphs, audio-visual content with rich glossing / annotation.
It can generate interactive and print dictionaries, and also train and use natural language processing tools and models.
- Score: 1.2139158398361864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the proposed demo, we will present a new software tool - the Linguistic Field Data
Management and Analysis System - LiFE (https://github.com/kmi-linguistics/life)
- an open-source, web-based linguistic data management and analysis application
that allows for systematic storage, management, sharing and usage of linguistic
data collected from the field. The application allows users to store lexical
items, sentences, paragraphs, audio-visual content with rich glossing /
annotation; generate interactive and print dictionaries; and also train and use
natural language processing tools and models for various purposes using this
data. Since it is a web-based application, it also allows for seamless
collaboration among multiple users and for sharing data, models, etc. with
each other.
The system uses the Python-based Flask framework and MongoDB in the backend,
and HTML, CSS and JavaScript in the frontend. The interface allows the creation
of multiple projects that can be shared with other users. At the backend, the
application stores the data in RDF format so as to allow its release as Linked
Data over the web using semantic web technologies: as of now it uses the
OntoLex-Lemon model for storing the lexical data and Ligt for storing the
interlinear glossed text, and internally links these to other linked lexicons
and databases such as DBpedia and WordNet. Furthermore, it provides support for
training NLP systems using the scikit-learn and HuggingFace Transformers
libraries, as well as for using any model trained with these libraries: while
the user interface itself provides limited options for tuning a system, an
externally-trained model can easily be incorporated within the application.
Similarly, the dataset itself can easily be exported into a standard
machine-readable format such as JSON or CSV for consumption by other programs
and pipelines.
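To make the storage model concrete, here is a minimal sketch, using rdflib, of the kind of OntoLex-Lemon graph a lexical item could map to; the namespace, entry identifier, written form and DBpedia link are illustrative assumptions, not LiFE's actual schema.

```python
# Minimal OntoLex-Lemon sketch with rdflib; identifiers and the DBpedia
# link are hypothetical examples, not taken from LiFE itself.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
EX = Namespace("http://example.org/lexicon/")  # hypothetical project namespace

g = Graph()
g.bind("ontolex", ONTOLEX)

entry = EX["pani-n"]        # hypothetical lexical entry
form = EX["pani-n-form"]
sense = EX["pani-n-sense1"]

g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, RDF.type, ONTOLEX.Form))
g.add((form, ONTOLEX.writtenRep, Literal("pani", lang="hi")))
g.add((entry, ONTOLEX.sense, sense))
# Link the sense to an external Linked Data resource (here, DBpedia):
g.add((sense, ONTOLEX.reference, URIRef("http://dbpedia.org/resource/Water")))

print(g.serialize(format="turtle"))
```

Once entries are modelled this way, serialising to Turtle (or any other RDF syntax) is what makes releasing the lexicon as Linked Data straightforward.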
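The JSON/CSV export path can be pictured with a short Flask + PyMongo sketch; the database, collection and field names below are hypothetical stand-ins, and LiFE's actual routes will differ.

```python
# Hypothetical JSON/CSV export endpoints in the Flask + MongoDB style the
# abstract describes; names below are illustrative, not LiFE's real API.
import csv
import io

from flask import Flask, Response, jsonify
from pymongo import MongoClient

app = Flask(__name__)
lexicon = MongoClient("mongodb://localhost:27017")["life_demo"]["lexicon"]

@app.route("/export/json")
def export_json():
    # Drop MongoDB's internal _id so the output is plain, portable JSON.
    return jsonify(list(lexicon.find({}, {"_id": 0})))

@app.route("/export/csv")
def export_csv():
    rows = list(lexicon.find({}, {"_id": 0}))
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["lemma", "gloss", "pos"],
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return Response(buf.getvalue(), mimetype="text/csv")

if __name__ == "__main__":
    app.run(debug=True)
```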
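An exported CSV can then feed a standard scikit-learn pipeline outside the application; the file name and column layout below (lemma, pos) are assumed for illustration only.

```python
# Sketch of training a small classifier on data exported from the tool;
# the file name and column layout are assumptions for illustration.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("lexicon_export.csv")  # hypothetical export file

X_train, X_test, y_train, y_test = train_test_split(
    df["lemma"], df["pos"], test_size=0.2, random_state=42
)

# Character n-grams are a reasonable default for small field-collected lexicons.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

A model trained this way (or with HuggingFace Transformers) could then be plugged back into the application, mirroring the externally-trained-model workflow the abstract mentions.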
Related papers
- Statically Contextualizing Large Language Models with Typed Holes [4.180458188910334]
Large language models (LLMs) have reshaped the landscape of program synthesis.
LLMs often hallucinate broken code because they lack appropriate context.
This paper demonstrates that tight integration with the type and binding structure of a language can address this contextualization problem.
arXiv Detail & Related papers (2024-09-02T03:29:00Z)
- Text-like Encoding of Collaborative Information in Large Language Models for Recommendation [58.87865271693269]
We introduce BinLLM, a novel method to seamlessly integrate collaborative information with Large Language Models for Recommendation (LLMRec).
BinLLM converts collaborative embeddings from external models into binary sequences.
BinLLM provides options to compress the binary sequence using dot-decimal notation to avoid excessively long lengths.
arXiv Detail & Related papers (2024-06-05T12:45:25Z)
- CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend (CMULAB), an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Learning from What is Already Out There: Few-shot Sign Language Recognition with Online Dictionaries [0.0]
We open-source the UWB-SL-Wild few-shot dataset, the first training resource of its kind, consisting of dictionary-scraped videos.
We introduce a novel approach to training sign language recognition models in a few-shot scenario, resulting in state-of-the-art results.
arXiv Detail & Related papers (2023-01-10T03:21:01Z)
- Offline RL for Natural Language Generation with Implicit Language Q Learning [87.76695816348027]
Large language models can be inconsistent when it comes to completing user-specified tasks.
We propose a novel RL method that combines the flexible utility framework of RL with the ability of supervised learning to leverage previously collected data.
In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings.
arXiv Detail & Related papers (2022-06-05T18:38:42Z)
- Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- Automated Source Code Generation and Auto-completion Using Deep Learning: Comparing and Discussing Current Language-Model-Related Approaches [0.0]
This paper compares different deep learning architectures to create and use language models based on programming code.
We discuss each approach's different strengths and weaknesses and what gaps we find to evaluate the language models or apply them in a real programming context.
arXiv Detail & Related papers (2020-09-16T15:17:04Z)
- Efficient Deployment of Conversational Natural Language Interfaces over Databases [45.52672694140881]
We propose a novel method for accelerating training dataset collection for developing natural-language-to-query-language machine learning models.
Our system allows one to generate conversational multi-turn data, where multiple turns define a dialogue session.
arXiv Detail & Related papers (2020-05-31T19:16:27Z)
- Language-agnostic Multilingual Modeling [23.06484126933893]
We build a language-agnostic multilingual ASR system which transforms all languages to one writing system through a many-to-one transliteration transducer.
We show with four Indic languages, namely, Hindi, Bengali, Tamil and Kannada, that the language-agnostic multilingual model achieves up to 10% relative reduction in Word Error Rate (WER) over a language-dependent multilingual model.
arXiv Detail & Related papers (2020-04-20T18:57:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.