Demo of the Linguistic Field Data Management and Analysis System -- LiFE
- URL: http://arxiv.org/abs/2203.11443v1
- Date: Tue, 22 Mar 2022 03:34:10 GMT
- Title: Demo of the Linguistic Field Data Management and Analysis System -- LiFE
- Authors: Siddharth Singh and Ritesh Kumar and Shyam Ratan and Sonal Sinha
- Abstract summary: LiFE is an open-source, web-based linguistic data management and analysis application.
It allows users to store lexical items, sentences, paragraphs, audio-visual content with rich glossing / annotation.
It can generate interactive and print dictionaries, and also train and use natural language processing tools and models.
- Score: 1.2139158398361864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the proposed demo, we will present a new software tool - the Linguistic Field Data
Management and Analysis System - LiFE (https://github.com/kmi-linguistics/life)
- an open-source, web-based linguistic data management and analysis application
that allows for systematic storage, management, sharing and usage of linguistic
data collected from the field. The application allows users to store lexical
items, sentences, paragraphs, audio-visual content with rich glossing /
annotation; generate interactive and print dictionaries; and also train and use
natural language processing tools and models for various purposes using this
data. Since it is a web-based application, it also allows for seamless
collaboration among multiple users and for sharing data, models, etc. with
each other.
The system uses the Python-based Flask framework and MongoDB in the backend,
and HTML, CSS and JavaScript in the frontend. The interface allows the creation
of multiple projects that can be shared with other users. At the backend, the
application stores the data in RDF format so as to allow its release as Linked
Data over the web using semantic web technologies: as of now it uses the
OntoLex-Lemon model for storing the lexical data and Ligt for storing the
interlinear glossed text, and internally links these to other linked lexicons
and databases such as DBpedia and WordNet. Furthermore, it provides support for
training NLP systems using the scikit-learn and HuggingFace Transformers
libraries, as well as for using any model trained with these libraries: while
the user interface itself provides limited options for tuning a system, an
externally-trained model can easily be incorporated within the application.
Similarly, the dataset itself can easily be exported into a standard
machine-readable format such as JSON or CSV for consumption by other programs
and pipelines.
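To make the storage model concrete, here is a minimal sketch, using rdflib, of the kind of OntoLex-Lemon graph a lexical item could map to; the namespace, entry identifier, written form and DBpedia link are illustrative assumptions, not LiFE's actual schema.

```python
# Minimal OntoLex-Lemon sketch with rdflib; identifiers and the DBpedia
# link are hypothetical examples, not taken from LiFE itself.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
EX = Namespace("http://example.org/lexicon/")  # hypothetical project namespace

g = Graph()
g.bind("ontolex", ONTOLEX)

entry = EX["pani-n"]        # hypothetical lexical entry
form = EX["pani-n-form"]
sense = EX["pani-n-sense1"]

g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, RDF.type, ONTOLEX.Form))
g.add((form, ONTOLEX.writtenRep, Literal("pani", lang="hi")))
g.add((entry, ONTOLEX.sense, sense))
# Link the sense to an external Linked Data resource (here, DBpedia):
g.add((sense, ONTOLEX.reference, URIRef("http://dbpedia.org/resource/Water")))

print(g.serialize(format="turtle"))
```

Once entries are modelled this way, serialising to Turtle (or any other RDF syntax) is what makes releasing the lexicon as Linked Data straightforward.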
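The JSON/CSV export path can be pictured with a short Flask + PyMongo sketch; the database, collection and field names below are hypothetical stand-ins, and LiFE's actual routes will differ.

```python
# Hypothetical JSON/CSV export endpoints in the Flask + MongoDB style the
# abstract describes; names below are illustrative, not LiFE's real API.
import csv
import io

from flask import Flask, Response, jsonify
from pymongo import MongoClient

app = Flask(__name__)
lexicon = MongoClient("mongodb://localhost:27017")["life_demo"]["lexicon"]

@app.route("/export/json")
def export_json():
    # Drop MongoDB's internal _id so the output is plain, portable JSON.
    return jsonify(list(lexicon.find({}, {"_id": 0})))

@app.route("/export/csv")
def export_csv():
    rows = list(lexicon.find({}, {"_id": 0}))
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["lemma", "gloss", "pos"],
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return Response(buf.getvalue(), mimetype="text/csv")

if __name__ == "__main__":
    app.run(debug=True)
```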
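An exported CSV can then feed a standard scikit-learn pipeline outside the application; the file name and column layout below (lemma, pos) are assumed for illustration only.

```python
# Sketch of training a small classifier on data exported from the tool;
# the file name and column layout are assumptions for illustration.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("lexicon_export.csv")  # hypothetical export file

X_train, X_test, y_train, y_test = train_test_split(
    df["lemma"], df["pos"], test_size=0.2, random_state=42
)

# Character n-grams are a reasonable default for small field-collected lexicons.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

A model trained this way (or with HuggingFace Transformers) could then be plugged back into the application, mirroring the externally-trained-model workflow the abstract mentions.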
Related papers
- Statically Contextualizing Large Language Models with Typed Holes [4.180458188910334]
Large language models (LLMs) have reshaped the landscape of program synthesis.
LLMs often hallucinate broken code because they lack appropriate context.
This paper demonstrates that tight integration with the type and binding structure of a language can address this contextualization problem.
arXiv Detail & Related papers (2024-09-02T03:29:00Z)
- Text-like Encoding of Collaborative Information in Large Language Models for Recommendation [58.87865271693269]
We introduce BinLLM, a novel method to seamlessly integrate collaborative information with Large Language Models for Recommendation (LLMRec).
BinLLM converts collaborative embeddings from external models into binary sequences.
BinLLM provides options to compress the binary sequence using dot-decimal notation to avoid excessively long lengths.
arXiv Detail & Related papers (2024-06-05T12:45:25Z)
- CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend (CMULAB), an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Learning from What is Already Out There: Few-shot Sign Language Recognition with Online Dictionaries [0.0]
We open-source the UWB-SL-Wild few-shot dataset, the first training resource of its kind, consisting of dictionary-scraped videos.
We introduce a novel approach to training sign language recognition models in a few-shot scenario, resulting in state-of-the-art results.
arXiv Detail & Related papers (2023-01-10T03:21:01Z)
- Offline RL for Natural Language Generation with Implicit Language Q Learning [87.76695816348027]
Large language models can be inconsistent when it comes to completing user-specified tasks.
We propose a novel RL method that combines the flexible utility framework of RL with the ability of supervised learning to leverage previously collected data.
In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings.
arXiv Detail & Related papers (2022-06-05T18:38:42Z)
- Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- Automated Source Code Generation and Auto-completion Using Deep Learning: Comparing and Discussing Current Language-Model-Related Approaches [0.0]
This paper compares different deep learning architectures to create and use language models based on programming code.
We discuss each approach's different strengths and weaknesses and what gaps we find to evaluate the language models or apply them in a real programming context.
arXiv Detail & Related papers (2020-09-16T15:17:04Z)
- Efficient Deployment of Conversational Natural Language Interfaces over Databases [45.52672694140881]
We propose a novel method for accelerating training dataset collection for developing natural-language-to-query-language machine learning models.
Our system allows one to generate conversational multi-turn data, where multiple turns define a dialogue session.
arXiv Detail & Related papers (2020-05-31T19:16:27Z)
- Language-agnostic Multilingual Modeling [23.06484126933893]
We build a language-agnostic multilingual ASR system which transforms all languages to one writing system through a many-to-one transliteration transducer.
We show with four Indic languages, namely, Hindi, Bengali, Tamil and Kannada, that the language-agnostic multilingual model achieves up to 10% relative reduction in Word Error Rate (WER) over a language-dependent multilingual model.
arXiv Detail & Related papers (2020-04-20T18:57:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.