A New NLP Method Based on Data Enhancement [16th Published Paper in the Appl. Sci. Special Issue]
Zhou Tao  |  2023-03-25  |  ScienceNet (科学网)  |  318 reads

I have organized a Special Issue in Applied Sciences (a comprehensive, interdisciplinary journal; CiteScore = 3.70, IF = 2.84) under the deliberately broad title "Advances in Big Data Analysis". The Special Issue was launched in response to the major impact that the rapid growth of accessible data, and of data-analysis platforms and tools, is having on the natural and social sciences. We particularly welcome (but are not limited to) the following four types of submissions:

(1) Fundamental theoretical analyses in data analysis, e.g., the predictability of a system (such as the predictability of time series), minimum-error analysis for classification problems, and analyses of the stability and credibility of data-mining results;

(2) New methods for data analysis, e.g., new methods for mining causal relationships (related to Topic 1), for multimodal analysis, and for privacy-preserving computation;

(3) Introductions of new, high-value datasets, data-analysis platforms, and data-analysis tools;

(4) Applications of big-data analysis methods to the various branches of the natural and social sciences (yielding new insights); we are especially keen on applications to disciplines that have so far not been highly quantitative.

Submission link: https://www.mdpi.com/journal/applsci/special_issues/75Y7F7607U

The submission deadline is June 30, 2023. We handle manuscripts very quickly, and your submissions are warmly welcome.


The sixteenth paper in this Special Issue has now been formally published:


A Joint Domain-Specific Pre-Training Method Based on Data Enhancement

Abstract

State-of-the-art performance on natural language processing tasks is achieved by supervised learning, specifically by fine-tuning pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers). As models become increasingly accurate, the pre-training corpora behind the fine-tuned models grow larger and larger. However, very few studies have explored how to select the pre-training corpus. Therefore, this paper proposes a data enhancement-based domain pre-training method. First, a pre-training task and a downstream fine-tuning task are trained jointly to alleviate the catastrophic forgetting problem caused by existing classical pre-training methods. Then, guided by the hard-to-classify texts identified through the downstream task's feedback, the pre-training corpus is reconstructed by selecting the texts in it that are similar to those hard examples. Learning from the reconstructed pre-training corpus deepens the model's understanding of hard-to-determine text expressions, thereby strengthening its feature extraction for domain texts. Without any pre-processing of the pre-training corpus, experiments are conducted on two tasks, named entity recognition (NER) and text classification (CLS). The results show that learning the domain corpus selected by the proposed method supplements the model's understanding of domain-specific information and improves the performance of the basic pre-training model, achieving the best results among the compared benchmark methods.
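For readers who want a concrete picture, below is a minimal Python sketch, not the authors' released code, of the two ideas the abstract describes: reconstructing the pre-training corpus around hard-to-classify downstream examples, and a joint objective that mixes the pre-training (MLM) loss with the downstream task loss. It assumes Hugging Face transformers and PyTorch; the model name bert-base-uncased, the mean-pooling embedding, the confidence threshold, top_k, and the weight alpha are all illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (illustrative, not the authors' code), assuming Hugging Face
# transformers + PyTorch. Model name, pooling choice, threshold, top_k, and
# alpha are assumptions made for this example.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled BERT embeddings; batch this call for large corpora."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)      # (B, H)

def hard_examples(texts, probs, threshold=0.6):
    """Downstream texts the classifier is least confident about."""
    confidence = probs.max(dim=1).values
    return [t for t, c in zip(texts, confidence) if c < threshold]

def select_similar(pretrain_corpus, hard_texts, top_k=100):
    """Reconstruct the pre-training corpus: keep the texts most similar
    (by cosine similarity) to the hard-to-classify downstream examples."""
    corpus_emb = F.normalize(embed(pretrain_corpus), dim=-1)
    hard_emb = F.normalize(embed(hard_texts), dim=-1)
    # Best similarity of each corpus text to any hard example.
    sims = (corpus_emb @ hard_emb.T).max(dim=1).values
    idx = sims.topk(min(top_k, len(pretrain_corpus))).indices
    return [pretrain_corpus[i] for i in idx.tolist()]

def joint_loss(mlm_loss, task_loss, alpha=0.5):
    """Joint objective: train the MLM pre-training task and the downstream
    task together to limit catastrophic forgetting of domain knowledge."""
    return alpha * mlm_loss + (1.0 - alpha) * task_loss
```

In a full training loop one would alternate between computing the MLM loss on the reconstructed corpus and the task loss on the labeled downstream data, back-propagating both through the shared encoder; consult the paper itself for the exact selection criterion and training schedule.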


Free full-text download link:

https://www.mdpi.com/2076-3417/13/7/4115