HuixiangDou-CR: Coreference Resolution in Group Chats
- URL: http://arxiv.org/abs/2405.02817v1
- Date: Sun, 5 May 2024 05:43:20 GMT
- Title: HuixiangDou-CR: Coreference Resolution in Group Chats
- Authors: Huanjun Kong
- Abstract summary: In this work, we preprocessed 58k entries of authentic chat data and manually annotated 2.3k questions.
We then fine-tuned Qwen models ranging from 0.5B to 32B parameters.
This confirms the viability of fine-tuning Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: How can pronominal references in group chats be eliminated? In this work, we preprocessed 58k entries of authentic chat data and manually annotated 2.3k questions. The reliability of this annotation was confirmed by the scaling law. We then fine-tuned Qwen models ranging from 0.5B to 32B parameters; the optimal version improved the F1 score by 29.07 points. This confirms the viability of fine-tuning Large Language Models (LLMs) for downstream Natural Language Processing (NLP) tasks. Our contributions are: 1) Supervised Fine-Tuning (SFT) training data in alpaca format, along with a set of Low-Rank Adaptation (LoRA) weights, and 2) a method for acquiring high-quality data that leverages the scaling law principle. The scripts, raw data in alpaca format, and experiment tracking are open-sourced on GitHub https://github.com/InternLM/HuixiangDou/tree/main/web/tools, HuggingFace https://huggingface.co/tpoisonooo and WandB https://wandb.ai/tpoisonooo/huixiangdou-cr/table?nw=nwusertpoisonooo . Users have authorized use of the data involved with respect to privacy.
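As a rough sketch of the released artifacts, the snippet below builds one alpaca-format SFT record for the coreference-rewriting task and attaches generic LoRA adapters to a small Qwen model with the peft library. The instruction wording, example dialogue, and LoRA hyperparameters are illustrative assumptions, not the authors' exact data or settings.

```python
import json

# One alpaca-format SFT record for coreference rewriting in group chats.
# The instruction wording and dialogue are illustrative, not from the dataset.
record = {
    "instruction": ("Rewrite the final question so it is self-contained, "
                    "replacing pronouns with the entities they refer to."),
    "input": ("A: The deploy script fails on Windows.\n"
              "B: Does it support Python 3.11?"),
    "output": "Does the deploy script support Python 3.11?",
}
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Generic LoRA adapters on a small Qwen model; rank, alpha, and target
# modules are common defaults, not the paper's reported settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-0.5B")
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()
```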
Related papers
- OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data [0.0]
In this work, using OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the OpenBezoar family of models.
We first generate synthetic instruction fine-tuning data using an open and commercially non-restrictive instruction fine-tuned variant of the Falcon-40B model.
We then perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme.
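A minimal sketch of the QLoRA-style step described above, assuming a 4-bit quantized OpenLLaMA 3Bv2 base loaded via bitsandbytes; the hyperparameters are placeholders rather than the OpenBezoar recipe's actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit (QLoRA-style) and attach LoRA adapters.
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b_v2", quantization_config=bnb)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))
```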
arXiv Detail & Related papers (2024-04-18T13:57:18Z)
- Unify word-level and span-level tasks: NJUNLP's Participation for the WMT2023 Quality Estimation Shared Task [59.46906545506715]
We introduce the NJUNLP team to the WMT 2023 Quality Estimation (QE) shared task.
Our team submitted predictions for the English-German language pair on both sub-tasks.
Our models achieved the best results in English-German for both word-level and fine-grained error span detection sub-tasks.
arXiv Detail & Related papers (2023-09-23T01:52:14Z)
- Deep Learning Approach for Classifying the Aggressive Comments on Social Media: Machine Translated Data Vs Real Life Data [15.813222387547357]
This paper works with Hindi, Bangla, and English datasets to detect aggressive comments.
A fully machine-translated English dataset has been analyzed with models such as the Long Short-Term Memory model (LSTM), the Bidirectional Long Short-Term Memory model (BiLSTM), word2vec, Bidirectional Encoder Representations from Transformers (BERT), and the Generative Pre-trained Transformer (GPT-2).
We compared the performance on the noisy data with two further datasets: raw data, which contains no noise, and semi-noisy data, which contains a certain amount of noise.
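A minimal sketch of one of the compared model families: a multilingual BERT classifier scoring comments as aggressive or not. The checkpoint, labels, and examples are assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Untrained 2-way head on multilingual BERT; fine-tuning on labeled comments
# would happen before real use.
name = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

batch = tok(["You are an idiot.", "Thanks for the help!"],
            padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
print(logits.argmax(dim=-1))  # 0/1 class ids, illustrative only
```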
arXiv Detail & Related papers (2023-03-13T21:43:08Z)
- Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent [97.64313409741614]
We propose to enforce a consistency property which states that predictions of the model on its own generated data are consistent across time.
We show that our novel training objective yields state-of-the-art results for conditional and unconditional generation in CIFAR-10 and baseline improvements in AFHQ and FFHQ.
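A schematic reading of that consistency property as a training regularizer: the model's denoised prediction from a noisier state should match its prediction from a state reached by its own sampler. `model` and `sampler_step` are placeholders; this is an interpretation of the stated property, not the paper's implementation.

```python
import torch

def consistency_loss(model, sampler_step, x_t, t, s):
    """Penalize disagreement between predictions at two noise levels."""
    with torch.no_grad():
        x_s = sampler_step(model, x_t, t, s)  # model-generated state at time s
        target = model(x_s, s)                # denoised prediction from x_s
    pred = model(x_t, t)                      # denoised prediction from x_t
    return torch.mean((pred - target) ** 2)
```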
arXiv Detail & Related papers (2023-02-17T18:45:04Z)
- Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference [72.61732440246954]
Large pre-trained language models often lack logical consistency across test inputs.
We propose a framework, ConCoRD, for boosting the consistency and accuracy of pre-trained NLP models.
We show that ConCoRD consistently boosts accuracy and consistency of off-the-shelf closed-book QA and VQA models.
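A sketch of the NLI building block ConCoRD relies on: scoring whether one model output entails or contradicts another with an off-the-shelf NLI model. ConCoRD additionally solves a global inference problem over these pairwise scores, which is omitted here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

# Does one answer entail or contradict another?
enc = tok("The Eiffel Tower is in Paris.",
          "The Eiffel Tower is located in France.", return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**enc).logits, dim=-1)[0]
print(dict(zip(["contradiction", "neutral", "entailment"], probs.tolist())))
```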
arXiv Detail & Related papers (2022-11-21T21:58:30Z)
- Content Popularity Prediction Based on Quantized Federated Bayesian Learning in Fog Radio Access Networks [76.16527095195893]
We investigate the content popularity prediction problem in cache-enabled fog radio access networks (F-RANs).
In order to predict the content popularity with high accuracy and low complexity, we propose a Gaussian process based regressor to model the content request pattern.
We utilize Bayesian learning to train the model parameters, which is robust to overfitting.
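A minimal, non-federated sketch of a Gaussian process regressor over a toy request-count series, standing in for the content request pattern model; the kernel and data are illustrative assumptions, not the paper's quantized federated variant.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

t = np.arange(24).reshape(-1, 1)  # hours
requests = 50 + 10 * np.sin(t.ravel() / 24 * 2 * np.pi) + np.random.randn(24)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel())
gp.fit(t, requests)
mean, std = gp.predict([[24]], return_std=True)  # next-hour popularity
print(mean, std)
```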
arXiv Detail & Related papers (2022-06-23T03:05:12Z)
- An Improved Normed-Deformable Convolution for Crowd Counting [70.02434289611566]
Deformable convolution has been proposed to exploit the scale-adaptive capabilities of CNN features around human heads.
An improved Normed-Deformable Convolution (i.e., NDConv) is proposed in this paper.
Our method outperforms state-of-the-art methods on the ShanghaiTech A, ShanghaiTech B, UCF_QNRF, and UCF_CC_50 datasets.
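For orientation, the sketch below runs plain deformable convolution via torchvision, the operator NDConv builds on; the paper's normed offset constraint is specific to NDConv and omitted here.

```python
import torch
from torchvision.ops import DeformConv2d

x = torch.randn(1, 16, 32, 32)
# Predict (dx, dy) offsets for every position of a 3x3 kernel.
offset_pred = torch.nn.Conv2d(16, 2 * 3 * 3, kernel_size=3, padding=1)
deform = DeformConv2d(16, 32, kernel_size=3, padding=1)
y = deform(x, offset_pred(x))
print(y.shape)  # torch.Size([1, 32, 32, 32])
```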
arXiv Detail & Related papers (2022-06-16T10:56:26Z)
- PSG@HASOC-Dravidian CodeMixFIRE2021: Pretrained Transformers for Offensive Language Identification in Tanglish [0.0]
This paper describes the system submitted to Dravidian-Codemix-HASOC2021: Hate Speech and Offensive Language Identification in Dravidian languages.
This task aims to identify offensive content in code-mixed comments/posts in Dravidian languages collected from social media.
arXiv Detail & Related papers (2021-10-06T15:23:40Z)
- Identifying non-natural language artifacts in bug reports [1.464410818828473]
We present a machine learning-based approach, implemented in Python, to classify content into natural language and artifacts at the line level.
We show how data from GitHub issue trackers can be used for automated training set generation.
Our model scores at 0.95 ROC-AUC and 0.93 F1 against our manually annotated validation set, and classifies 10k lines in 0.72 seconds.
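A minimal classifier in the same spirit: character n-gram features with logistic regression separating natural-language lines from artifact lines. The training lines are toy examples, not the GitHub-derived training set the paper generates.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

lines = ["The app crashes when I click save.",
         "Traceback (most recent call last):",
         '  File "main.py", line 12, in <module>',
         "Steps to reproduce are listed below."]
labels = ["natural", "artifact", "artifact", "natural"]

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LogisticRegression())
clf.fit(lines, labels)
print(clf.predict(["NullPointerException at com.example.Foo"]))
```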
arXiv Detail & Related papers (2021-10-04T11:33:51Z)
- Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation [98.11249019844281]
Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models.
We propose reverse KD to rejuvenate more alignments for low-frequency target words.
Results demonstrate that the proposed approach can significantly and universally improve translation quality.
arXiv Detail & Related papers (2021-06-02T02:41:40Z)
- [Re] Don't Judge an Object by Its Context: Learning to Overcome Contextual Bias [15.701707809084715]
We implement the entire pipeline from scratch in PyTorch 1.7.0.
We find that both proposed methods in the original paper help mitigate contextual bias.
arXiv Detail & Related papers (2021-04-28T06:21:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.