Semantic Similarity Matching for Patent Documents Using Ensemble
BERT-related Model and Novel Text Processing Method
- URL: http://arxiv.org/abs/2401.06782v1
- Date: Sat, 6 Jan 2024 02:35:49 GMT
- Title: Semantic Similarity Matching for Patent Documents Using Ensemble
BERT-related Model and Novel Text Processing Method
- Authors: Liqiang Yu, Bo Liu, Qunwei Lin, Xinyu Zhao, Chang Che
- Abstract summary: This paper introduces an ensemble approach that incorporates four BERT-related models, enhancing semantic similarity accuracy through weighted averaging.
It also introduces a novel text preprocessing method tailored for patent documents, featuring a distinctive input structure with token scoring that aids in capturing semantic relationships during CPC context training.
- Score: 4.313626569907121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the realm of patent document analysis, assessing semantic similarity
between phrases presents a significant challenge, notably amplifying the
inherent complexities of Cooperative Patent Classification (CPC) research.
This study addresses these challenges, recognizing early CPC work while
acknowledging past struggles with language barriers and document intricacy, and
it underscores the difficulties that persist in CPC research.
To overcome these challenges and bolster the CPC system, this paper presents
two key innovations. Firstly, it introduces an ensemble approach that
incorporates four BERT-related models, enhancing semantic similarity accuracy
through weighted averaging. Secondly, a novel text preprocessing method
tailored for patent documents is introduced, featuring a distinctive input
structure with token scoring that aids in capturing semantic relationships
during CPC context training, utilizing BCELoss. Our experimental findings
conclusively establish the effectiveness of both our Ensemble Model and novel
text processing strategies when deployed on the U.S. Patent Phrase to Phrase
Matching dataset.
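The abstract is concrete about two steps: a weighted average over the similarity scores of several BERT-related models, and training against gold similarity labels with BCELoss. The snippet below is a minimal sketch of that ensembling step only, not the authors' implementation; the checkpoint names, ensemble weights, the CPC-context input format, and the example phrases are assumptions, and the paper's token-scoring input structure is not reproduced.

```python
import torch
from torch.nn import BCELoss
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical ensemble members and weights; the paper uses four BERT-related
# models, but these specific checkpoints and weights are placeholders.
MEMBERS = [
    ("anferico/bert-for-patents", 0.4),
    ("microsoft/deberta-v3-base", 0.3),
    ("roberta-base", 0.2),
    ("bert-base-uncased", 0.1),
]

def ensemble_similarity(anchor: str, target: str, cpc_context: str) -> torch.Tensor:
    """Weighted average of per-model similarity scores in [0, 1]."""
    # Illustrative input structure: CPC context prepended to the anchor phrase.
    first = f"{cpc_context} [SEP] {anchor}"
    scores, weights = [], []
    for name, weight in MEMBERS:
        tokenizer = AutoTokenizer.from_pretrained(name)
        # num_labels=1 yields a single regression-style logit per phrase pair.
        model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=1)
        model.eval()
        inputs = tokenizer(first, target, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logit = model(**inputs).logits.squeeze(-1)
        scores.append(torch.sigmoid(logit))  # map the logit to a [0, 1] similarity
        weights.append(weight)
    stacked = torch.stack(scores)               # (n_models, 1)
    w = torch.tensor(weights).unsqueeze(-1)     # (n_models, 1)
    return (w * stacked).sum(dim=0) / w.sum()   # weighted average, shape (1,)

if __name__ == "__main__":
    pred = ensemble_similarity("sealing member", "gasket ring", "F16J sealing devices")
    gold = torch.tensor([1.0])            # gold similarity treated as a probability
    loss = BCELoss()(pred, gold)          # BCELoss on the ensembled score
    print(float(pred), float(loss))
```

In practice each member would first be fine-tuned on the U.S. Patent Phrase to Phrase Matching data; only the weighted averaging and the BCELoss on the blended score are taken from the abstract.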
Related papers
- Localizing Factual Inconsistencies in Attributable Text Generation [91.981439746404]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation.
We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation.
We then implement several methods for automatically detecting localized factual inconsistencies.
arXiv Detail & Related papers (2024-10-09T22:53:48Z) - Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts.
We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
arXiv Detail & Related papers (2024-09-05T09:10:38Z) - Comparing Feature-based and Context-aware Approaches to PII Generalization Level Prediction [0.6138671548064356]
Protecting PII in text data is crucial for privacy, but current generalization methods face challenges such as uneven data distributions and limited context awareness.
We propose two approaches: a feature-based method using machine learning to improve performance on structured inputs, and a novel context-aware framework that considers the broader context and semantic relationships between the original text and generalized candidates.
Experiments on the WikiReplace dataset demonstrate the effectiveness of both methods, with the context-aware approach outperforming the feature-based one across different scales.
arXiv Detail & Related papers (2024-07-03T06:32:03Z) - Mind Your Neighbours: Leveraging Analogous Instances for Rhetorical Role Labeling for Legal Documents [1.2562034805037443]
This study introduces novel techniques to enhance Rhetorical Role Labeling (RRL) performance.
For inference-based methods, we explore techniques that bolster label predictions without re-training.
For training-based methods, we integrate learning with a novel discourse-aware contrastive method that works directly on embedding spaces.
arXiv Detail & Related papers (2024-03-31T08:10:45Z) - Cross-domain Chinese Sentence Pattern Parsing [67.1381983012038]
Sentence Pattern Structure (SPS) parsing is a syntactic analysis method primarily employed in language teaching.
Existing SPS parsers rely heavily on textbook corpora for training and lack cross-domain capability.
This paper proposes an innovative approach leveraging large language models (LLMs) within a self-training framework.
arXiv Detail & Related papers (2024-02-26T05:30:48Z) - Noise Contrastive Estimation-based Matching Framework for Low-Resource
Security Attack Pattern Recognition [49.536368818512116]
Tactics, Techniques and Procedures (TTPs) represent sophisticated attack patterns in the cybersecurity domain.
We formulate the problem in a different learning paradigm, where the assignment of a text to a TTP label is decided by the direct semantic similarity between the two.
We propose a neural matching architecture with an effective sampling-based learn-to-compare mechanism.
arXiv Detail & Related papers (2024-01-18T19:02:00Z) - CLIP-based Synergistic Knowledge Transfer for Text-based Person
Retrieval [66.93563107820687]
We introduce a CLIP-based Synergistic Knowledge Transfer (CSKT) approach for Text-based Person Retrieval (TPR).
To explore CLIP's knowledge on the input side, we first propose a Bidirectional Prompts Transferring (BPT) module constructed from text-to-image and image-to-text bidirectional prompts and coupling projections.
CSKT outperforms the state-of-the-art approaches across three benchmark datasets when the training parameters merely account for 7.4% of the entire model.
arXiv Detail & Related papers (2023-09-18T05:38:49Z) - From Judgement's Premises Towards Key Points [1.648438955311779]
Key Point Analysis is a relatively new task in NLP that combines summarization and classification.
We focus on the legal domain and develop methods that identify and extract key points (KPs) from premises derived from the texts of judgments.
arXiv Detail & Related papers (2022-12-23T10:20:58Z) - PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense
Passage Retrieval [87.68667887072324]
We propose a novel approach that leverages query-centric and PAssage-centric sImilarity Relations (called PAIR) for dense passage retrieval.
To implement our approach, we make three major technical contributions by introducing formal formulations of the two kinds of similarity relations.
Our approach significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions datasets.
arXiv Detail & Related papers (2021-08-13T02:07:43Z) - Cross-lingual Word Sense Disambiguation using mBERT Embeddings with
Syntactic Dependencies [0.0]
Cross-lingual word sense disambiguation (WSD) tackles the challenge of disambiguating ambiguous words across languages given context.
The BERT embedding model has proven effective at capturing the contextual information of words.
This project investigates how syntactic information can be added into the BERT embeddings to result in both semantics- and syntax-incorporated word embeddings.
arXiv Detail & Related papers (2020-12-09T20:22:11Z) - Exploring Cross-sentence Contexts for Named Entity Recognition with BERT [1.4998865865537996]
We present a study exploring the use of cross-sentence information for NER using BERT models in five languages.
We find that adding context in the form of additional sentences to BERT input increases NER performance on all of the tested languages and models.
We propose a straightforward method, Contextual Majority Voting (CMV), to combine the different predictions for a sentence and demonstrate that it further increases NER performance with BERT (see the sketch below).
arXiv Detail & Related papers (2020-06-02T12:34:52Z)
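As a small illustration of the Contextual Majority Voting idea summarized in the last entry above, the sketch below takes token-aligned label sequences produced for the same sentence under different context windows and keeps the majority label per token. The example labels and the function name are illustrative assumptions, not the authors' code.

```python
from collections import Counter
from typing import List

def contextual_majority_vote(predictions: List[List[str]]) -> List[str]:
    """predictions[i] is the label sequence for one sentence produced under the
    i-th context window; all sequences are token-aligned."""
    voted = []
    for token_labels in zip(*predictions):  # labels for one token across contexts
        label, _count = Counter(token_labels).most_common(1)[0]
        voted.append(label)
    return voted

if __name__ == "__main__":
    # Three context windows tagging the same four-token sentence.
    runs = [
        ["B-ORG", "O", "B-LOC", "O"],
        ["B-ORG", "O", "O",     "O"],
        ["B-ORG", "O", "B-LOC", "O"],
    ]
    print(contextual_majority_vote(runs))  # ['B-ORG', 'O', 'B-LOC', 'O']
```

Ties simply fall to whichever label appears first in the counter; the original method may resolve them differently.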