A New NLP Method Based on Data Enhancement [16th Published Paper in the Appl. Sci. Special Issue]
Zhou Tao  |  2023-03-25  |  ScienceNet (科学网)  |  318 reads

I have organized a Special Issue in Applied Sciences (a comprehensive, interdisciplinary journal; CiteScore = 3.70, IF = 2.84) under the deliberately broad title "Advances in Big Data Analysis". The Special Issue was launched in response to the major impact that the rapid growth of accessible data, and of data-analysis platforms and tools, is having on the natural and social sciences. We particularly welcome (but are not limited to) the following four types of submissions:

(1) Fundamental theoretical analyses in data analysis, e.g., the predictability of a system (such as the predictability of time series), minimum-error analysis for classification problems, and analyses of the stability and credibility of data-mining results;

(2) New methods for data analysis, e.g., new methods for mining causal relationships (related to Topic 1), for multimodal analysis, and for privacy-preserving computation;

(3) Introductions of new, high-value datasets, data-analysis platforms, and data-analysis tools;

(4) Applications of big-data analysis methods to the various branches of the natural and social sciences (yielding new insights); we are especially keen on applications to disciplines that have so far not been highly quantitative.

Submission link: https://www.mdpi.com/journal/applsci/special_issues/75Y7F7607U

The submission deadline is June 30, 2023. We handle manuscripts very quickly, and your submissions are warmly welcome.


The sixteenth paper in this Special Issue has now been formally published:


A Joint Domain-Specific Pre-Training Method Based on Data Enhancement

Abstract

State-of-the-art performance on natural language processing tasks is achieved by supervised learning, specifically by fine-tuning pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers). As models become increasingly accurate, the pre-training corpora behind the fine-tuned models grow larger and larger. However, very few studies have explored how to select the pre-training corpus. Therefore, this paper proposes a data enhancement-based domain pre-training method. First, a pre-training task and a downstream fine-tuning task are trained jointly to alleviate the catastrophic forgetting problem caused by existing classical pre-training methods. Then, guided by the hard-to-classify texts identified through the downstream task's feedback, the pre-training corpus is reconstructed by selecting the texts in it that are similar to those hard examples. Learning from the reconstructed pre-training corpus deepens the model's understanding of hard-to-determine text expressions, thereby strengthening its feature extraction for domain texts. Without any pre-processing of the pre-training corpus, experiments are conducted on two tasks, named entity recognition (NER) and text classification (CLS). The results show that learning the domain corpus selected by the proposed method supplements the model's understanding of domain-specific information and improves the performance of the basic pre-training model, achieving the best results among the compared benchmark methods.
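For readers who want a concrete picture, below is a minimal Python sketch, not the authors' released code, of the two ideas the abstract describes: reconstructing the pre-training corpus around hard-to-classify downstream examples, and a joint objective that mixes the pre-training (MLM) loss with the downstream task loss. It assumes Hugging Face transformers and PyTorch; the model name bert-base-uncased, the mean-pooling embedding, the confidence threshold, top_k, and the weight alpha are all illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (illustrative, not the authors' code), assuming Hugging Face
# transformers + PyTorch. Model name, pooling choice, threshold, top_k, and
# alpha are assumptions made for this example.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled BERT embeddings; batch this call for large corpora."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # (B, H)

def hard_examples(texts, probs, threshold=0.6):
    """Downstream texts the classifier is least confident about."""
    confidence = probs.max(dim=1).values
    return [t for t, c in zip(texts, confidence) if c < threshold]

def select_similar(pretrain_corpus, hard_texts, top_k=100):
    """Reconstruct the pre-training corpus: keep the texts most similar
    (by cosine similarity) to the hard-to-classify downstream examples."""
    corpus_emb = F.normalize(embed(pretrain_corpus), dim=-1)
    hard_emb = F.normalize(embed(hard_texts), dim=-1)
    # Best similarity of each corpus text to any hard example.
    sims = (corpus_emb @ hard_emb.T).max(dim=1).values
    idx = sims.topk(min(top_k, len(pretrain_corpus))).indices
    return [pretrain_corpus[i] for i in idx.tolist()]

def joint_loss(mlm_loss, task_loss, alpha=0.5):
    """Joint objective: train the MLM pre-training task and the downstream
    task together to limit catastrophic forgetting of domain knowledge."""
    return alpha * mlm_loss + (1.0 - alpha) * task_loss
```

In a full training loop one would alternate between computing the MLM loss on the reconstructed corpus and the task loss on the labeled downstream data, back-propagating both through the shared encoder; consult the paper itself for the exact selection criterion and training schedule.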


Free full-text download link:

https://www.mdpi.com/2076-3417/13/7/4115