Zero-Shot Text Matching for Automated Auditing using Sentence
Transformers
- URL: http://arxiv.org/abs/2211.07716v1
- Date: Fri, 28 Oct 2022 11:52:16 GMT
- Title: Zero-Shot Text Matching for Automated Auditing using Sentence
Transformers
- Authors: David Biesner, Maren Pielka, Rajkumar Ramamurthy, Tim Dilmaghani,
Bernd Kliem, Rüdiger Loitz, Rafet Sifa
- Abstract summary: We study the efficiency of unsupervised text matching using Sentence-BERT, a transformer-based model, by applying it to the semantic similarity of financial passages.
Experimental results show that this model is robust to documents from in- and out-of-domain data.
- Score: 0.3078691410268859
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language processing methods have several applications in automated
auditing, including document or passage classification, information retrieval,
and question answering. However, training such models requires a large amount
of annotated data which is scarce in industrial settings. At the same time,
techniques like zero-shot and unsupervised learning allow for application of
models pre-trained using general domain data to unseen domains.
In this work, we study the efficiency of unsupervised text matching using
Sentence-BERT, a transformer-based model, by applying it to the semantic
similarity of financial passages. Experimental results show that this model is
robust to documents from in- and out-of-domain data.
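As an illustration of the approach described in the abstract, here is a minimal sketch of zero-shot passage matching using the public sentence-transformers library. The model checkpoint, example passages, and similarity threshold are illustrative assumptions, not details taken from the paper.
```python
# A minimal, hypothetical sketch of zero-shot passage matching with Sentence-BERT.
# The model checkpoint, example passages, and similarity threshold are illustrative
# assumptions, not details taken from the paper.
from sentence_transformers import SentenceTransformer, util

# Any pre-trained SBERT checkpoint can be used without task-specific fine-tuning.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Passages from an audited document (e.g. notes to a financial report) ...
report_passages = [
    "Revenue is recognized when control of the goods passes to the customer.",
    "The company leases office space under non-cancellable operating leases.",
]
# ... and the requirement texts they should be matched against.
requirements = [
    "Describe the revenue recognition policy.",
    "Disclose the future minimum lease payments.",
]

# Encode both sides; no labelled training data is needed (zero-shot).
report_emb = model.encode(report_passages, convert_to_tensor=True)
requirement_emb = model.encode(requirements, convert_to_tensor=True)

# Cosine-similarity matrix: rows are requirements, columns are report passages.
scores = util.cos_sim(requirement_emb, report_emb)

# For each requirement, report the best-matching passage above an assumed threshold.
for i, requirement in enumerate(requirements):
    best = int(scores[i].argmax())
    score = float(scores[i][best])
    if score > 0.3:  # threshold chosen arbitrarily for illustration
        print(f"{requirement} -> {report_passages[best]} (score={score:.2f})")
```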
Related papers
- Self-Train Before You Transcribe [3.17829719401032]
We investigate the benefit of performing noisy student teacher training on recordings in the test set as a test-time adaptation approach.
A range of in-domain and out-of-domain datasets are used for experiments demonstrating large relative gains of up to 32.2%.
arXiv Detail & Related papers (2024-06-17T09:21:00Z) - Seed-Guided Fine-Grained Entity Typing in Science and Engineering
Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains.
We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus.
It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z) - Domain-Specific NER via Retrieving Correlated Samples [37.98414661072985]
In this paper, we suggest enhancing NER models with correlated samples.
To explicitly simulate the human reasoning process, we perform a training-free entity type calibration by majority voting.
Empirical results on datasets of the above two domains show the efficacy of our methods.
arXiv Detail & Related papers (2022-08-27T12:25:24Z) - Actuarial Applications of Natural Language Processing Using
Transformers: Case Studies for Using Text Features in an Actuarial Context [0.0]
This tutorial demonstrates how to incorporate text data into actuarial classification and regression tasks.
The main focus is on methods employing transformer-based models.
The case studies tackle challenges related to a multi-lingual setting and long input sequences.
arXiv Detail & Related papers (2022-06-04T15:39:30Z) - Multiple-Source Domain Adaptation via Coordinated Domain Encoders and
Paired Classifiers [1.52292571922932]
We present a novel model for text classification under domain shift.
It exploits the updated representations to dynamically integrate domain encoders.
It also employs a probabilistic model to infer the error rate in the target domain.
arXiv Detail & Related papers (2022-01-28T00:50:01Z) - Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-agnostic manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z) - Interpretable Entity Representations through Large-Scale Typing [61.4277527871572]
We present an approach to creating entity representations that are human readable and achieve high performance out of the box.
Our representations are vectors whose values correspond to posterior probabilities over fine-grained entity types.
We show that it is possible to reduce the size of our type set in a learning-based way for particular domains.
arXiv Detail & Related papers (2020-04-30T23:58:03Z) - Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains (a minimal clustering sketch follows this list).
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
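The last entry above ("Unsupervised Domain Clusters in Pretrained Language Models") is closest in spirit to the in-/out-of-domain robustness question studied in the main paper. As a purely illustrative sketch, assuming sentence embeddings from a pretrained encoder and an off-the-shelf clustering step (k-means here; that paper's exact clustering and data-selection setup may differ), unsupervised domain clustering could look like this:
```python
# Hypothetical sketch: group sentences by domain using only pretrained embeddings.
# The encoder checkpoint and the use of k-means are assumptions for illustration;
# the referenced paper's actual clustering and data-selection setup may differ.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

sentences = [
    "The patient was administered 50 mg of the drug twice daily.",
    "Symptoms resolved after a two-week course of antibiotics.",
    "The defendant shall pay damages as set out in clause 4.",
    "This agreement is governed by the laws of the State of New York.",
    "Quarterly revenue increased by 12% compared to the prior year.",
    "Net lease obligations are disclosed in note 18 of the annual report.",
]

# Encode with a general-domain pretrained model; no domain labels are used.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences)

# Cluster into an assumed number of domains (here: medical, legal, financial).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings)

for sentence, cluster in zip(sentences, kmeans.labels_):
    print(cluster, sentence)

# Data-selection idea: keep only sentences assigned to the same cluster as a
# small in-domain seed set, then use them as additional training data.
```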
This list is automatically generated from the titles and abstracts of the papers in this site.