AspirinSum: an Aspect-based utility-preserved de-identification Summarization framework
- URL: http://arxiv.org/abs/2406.13947v1
- Date: Thu, 20 Jun 2024 02:29:46 GMT
- Title: AspirinSum: an Aspect-based utility-preserved de-identification Summarization framework
- Authors: Ya-Lun Li,
- Abstract summary: The goal of this proposal is to develop a text de-identification framework that can be easily adapted to a specific domain.
We propose an aspect-based utility-preserved de-identification summarization framework, AspirinSum, which learns to align experts' aspects from existing comment data.
We envision that the de-identified text can then be used in data publishing, eventually publishing our de-identified dataset for downstream task use.
- Score: 1.9489823192518083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the rapid advancement of Large Language Models (LLMs), the whole community eagerly consumes any available text data in order to train them. Currently, a large portion of the available text data is collected from the internet, which has been regarded as a cheap source of training data. However, when people try to extend an LLM's capability to personal domains, such as healthcare or education, the lack of public datasets in these domains makes adapting the LLM much slower. The reason such domains lack publicly available datasets is that they usually contain personally sensitive information. In order to comply with privacy law, data in such domains must be de-identified before any kind of dissemination. Much research has tried to address this problem for image or tabular data; however, there has been limited research on efficient and general de-identification methods for text data. Most methods are based on human annotation or predefined category lists and usually cannot be easily adapted to specific domains. The goal of this proposal is to develop a text de-identification framework that can be easily adapted to a specific domain, leveraging existing expert knowledge without further human annotation. We propose an aspect-based utility-preserved de-identification summarization framework, AspirinSum, which learns to align experts' aspects from existing comment data. It can efficiently summarize a personally sensitive document by extracting sub-sentences related to personally sensitive aspects and de-identifying them by substituting sub-sentences with similar aspects. We envision that the de-identified text can then be used in data publishing, and we eventually plan to publish our de-identified dataset for downstream task use.
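The extract-and-substitute idea in the abstract can be illustrated with a minimal sketch. This is not the paper's actual method (which learns aspect alignment from expert comment data); it is a simplified stand-in that scores sub-sentences against a sensitive aspect with bag-of-words cosine similarity and swaps matches for the closest sub-sentence from a pre-collected, identity-free pool. All names and thresholds here are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

def _vec(text):
    # Bag-of-words vector for a sub-sentence.
    return Counter(re.findall(r"[a-z']+", text.lower()))

def _cos(a, b):
    # Cosine similarity between two bag-of-words vectors.
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def deidentify(document, aspect_keywords, substitute_pool, threshold=0.2):
    """Replace sub-sentences that match a sensitive aspect with the most
    similar sub-sentence from a pool of identity-free substitutes."""
    aspect_vec = _vec(" ".join(aspect_keywords))
    out = []
    for sub in re.split(r"(?<=[.;])\s+", document):
        if _cos(_vec(sub), aspect_vec) >= threshold:
            # Sensitive: swap in the closest substitute by similarity.
            best = max(substitute_pool, key=lambda s: _cos(_vec(s), _vec(sub)))
            out.append(best)
        else:
            out.append(sub)
    return " ".join(out)
```

In the real framework, the aspect scoring would come from a learned alignment with experts' aspects rather than keyword overlap, and the substitute pool would be drawn from existing comment data.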
Related papers
- From Attributes to Natural Language: A Survey and Foresight on Text-based Person Re-identification [4.400729890122927]
The aim of text-based person Re-ID is to recognize specific pedestrians by scrutinizing attributes/natural language descriptions.
There is a notable absence of comprehensive reviews dedicated to summarizing the text-based person Re-ID from a technical perspective.
We introduce a taxonomy spanning Evaluation, Strategy, Architecture, and Optimization dimensions, providing a comprehensive survey of the text-based person Re-ID task.
arXiv Detail & Related papers (2024-07-31T18:16:18Z) - UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity [50.91030850662369]
Existing text-based person retrieval datasets often have relatively coarse-grained text annotations.
This hinders the model from comprehending the fine-grained semantics of query texts in real scenarios.
We contribute a new benchmark named UFineBench for text-based person retrieval with ultra-fine granularity.
arXiv Detail & Related papers (2023-12-06T11:50:14Z) - DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z) - The Fellowship of the Authors: Disambiguating Names from Social Network Context [2.3605348648054454]
Authority lists with extensive textual descriptions for each entity are lacking, and named entities are often ambiguous.
We combine BERT-based mention representations with a variety of graph induction strategies and experiment with supervised and unsupervised cluster inference methods.
We find that in-domain language model pretraining can significantly improve mention representations, especially for larger corpora.
arXiv Detail & Related papers (2022-08-31T21:51:55Z) - Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means for encoding structured knowledge for knowledge-intensive applications, i.e., open-domain question answering (QA).
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
arXiv Detail & Related papers (2021-10-16T00:11:21Z) - DSSL: Deep Surroundings-person Separation Learning for Text-based Person Retrieval [40.70100506088116]
We propose a novel Deep Surroundings-person Separation Learning (DSSL) model in this paper.
A surroundings-person separation and fusion mechanism plays the key role in realizing an accurate and effective surroundings-person separation.
Extensive experiments are carried out to evaluate the proposed DSSL on CUHK-PEDES.
arXiv Detail & Related papers (2021-09-12T15:09:09Z) - De-identification of Privacy-related Entities in Job Postings [10.751883216434717]
De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data.
We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stackoverflow.
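The detection step described above can be sketched with simple regular-expression patterns. This is purely illustrative: the categories and patterns below are hypothetical and are not JobStack's actual annotation scheme, which relies on annotated corpora and learned sequence taggers rather than rules.

```python
import re

# Illustrative patterns for a few privacy-related entity categories.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
    "URL": re.compile(r"https?://\S+"),
}

def detect_entities(text):
    """Return (category, matched_span) pairs for privacy-related matches."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.group()))
    return hits

def mask(text):
    # Replace each detected entity with its category placeholder.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Rule-based detectors like this handle well-formatted contact data but miss person names and context-dependent identifiers, which is why corpora such as JobStack and learned models are needed.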
arXiv Detail & Related papers (2021-05-24T12:01:22Z) - DomainMix: Learning Generalizable Person Re-Identification Without Human Annotations [89.78473564527688]
This paper shows how to use a labeled synthetic dataset and an unlabeled real-world dataset to train a universal model.
In this way, human annotations are no longer required, and it is scalable to large and diverse real-world datasets.
Experimental results show that the proposed annotation-free method is roughly comparable to its counterpart trained with full human annotations.
arXiv Detail & Related papers (2020-11-24T08:15:53Z) - CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search [89.48123965553098]
This paper presents a search system to alleviate the special-domain adaptation problem.
The system utilizes the domain-adaptive pretraining and few-shot learning technologies to help neural rankers mitigate the domain discrepancy.
Our system performs the best among the non-manual runs in Round 2 of the TREC-COVID task.
arXiv Detail & Related papers (2020-11-03T09:10:48Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.