Reading One Paper a Day (1): ANCIENT CHINESE WORD SEGMENTATION AND PART-OF-SPEECH TAGGING USING DISTANT SUPERVISION


Abstract:

We propose a novel method for augmenting ancient Chinese word segmentation (WSG) and part-of-speech (POS) tagging data using distant supervision over a parallel corpus.

We take advantage of the memorization effects of deep neural networks and a small amount of annotated data to get a model with much knowledge and little noise, and then use this model to relabel the ancient Chinese sentences in the parallel corpus.

Experiments show that the model trained over the relabeled data outperforms the model trained over the data generated from distant supervision and the annotated data.

Two fundamental lexical analysis problems:

  1. Word segmentation (WSG)
  2. Part-of-speech (POS) tagging

Reasons these tasks are hard for ancient Chinese:

  1. the scarcity of labeled ancient Chinese corpora
  2. the great difficulty of manual annotation

In a low-resource scenario, distant supervision can provide ample but inaccurately labeled samples at low cost. Although annotation projection is effective for data augmentation, it has several problems, including errors and omissions when tagging the low-resource language.

Inspired by the memorization effects of deep neural networks (DNNs), which tend to fit clean data first and then gradually fit noisy data, we use the large data obtained from word alignment and the small annotated data to get a model with much knowledge and little noise. Then we use the model to relabel the large data to reduce its errors and omissions.

INTRODUCTION:

We propose a novel data augmentation method for ancient Chinese WSG and POS tagging, completed in three steps. First, we use a Chinese NLP tool to perform WSG and POS tagging on modern Chinese sentences to get sentences with word boundaries and POS tags. Then, we project the word boundaries and POS tags from modern Chinese to ancient Chinese. After that, we train SIKU-RoBERTa over the large weakly labeled WSG and POS tagging data obtained from distant supervision to get the first-stage model. We continue to train the first-stage model over the small manually annotated WSG and POS tagging data to get the second-stage model. Finally, we use the second-stage model to relabel the large weakly labeled data generated from distant supervision.

Main contributions (addressing data scarcity in ancient Chinese WSG and POS tagging):

  1. We introduce distant supervision in ancient Chinese WSG and POS tagging.
  2. We use a parallel corpus to generate large ancient Chinese WSG and POS tagging data using distant supervision.
  3. We propose a novel method of denoising and completing the labels of the large weakly labeled data generated from distant supervision by relabeling it.
  4. Extensive experiments demonstrate the effectiveness of distant supervision and relabeling.

METHOD:

1. We use a POS tag set with 22 tags, including verb (v), noun (n), adjective (a), person (nr), etc., and a WSG tag set {B, M, E, S}, where 'B', 'M', and 'E' mark the beginning, middle, and end of a word, and 'S' marks a single-character word.
We treat ancient Chinese WSG and POS tagging as one sequence labeling task over a hybrid tag set of 88 tags (22 POS tags × 4 WSG tags).
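To make the hybrid tag set concrete, here is a minimal sketch of how the two tag sets can be crossed. The `B-v` string format and the function name `build_hybrid_tags` are my own assumptions; the notes do not say how the paper spells its hybrid tags, and only four of the 22 POS tags are named, so only those four are used here.

```python
from itertools import product

# WSG boundary tags as listed in the notes.
WSG_TAGS = ["B", "M", "E", "S"]


def build_hybrid_tags(pos_tags, wsg_tags=WSG_TAGS):
    """Cross every WSG tag with every POS tag (the 'W-P' format is an assumption)."""
    return [f"{w}-{p}" for w, p in product(wsg_tags, pos_tags)]


# Only the four POS tags named in the notes; the full set has 22.
named_pos = ["v", "n", "a", "nr"]
hybrid = build_hybrid_tags(named_pos)

print(hybrid[:4])           # ['B-v', 'B-n', 'B-a', 'B-nr']
print(len(hybrid))          # 16
print(len(WSG_TAGS) * 22)   # 88, the size of the full hybrid set
```

With one label per character, a single sequence labeler predicts the word boundary and the POS tag at the same time.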

2. Our model consists of a SIKU-RoBERTa backbone, one linear layer, and one Conditional Random Fields (CRF) layer for joint WSG and POS tagging as sequence labeling.

3. We use LTP to perform WSG and POS tagging on the modern Chinese side to get sentences with word boundaries and POS tags, and then split the ancient Chinese sentences into single characters.
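A small sketch of this preprocessing step, assuming the modern side has already been segmented and tagged. The LTP call itself is not shown: the `(word, pos)` pairs are hand-written stand-ins for its output, and the helper names are hypothetical. Whitespace-separated tokens are the usual plain-text form that word alignment tools consume.

```python
def split_ancient(sentence):
    """Split an ancient Chinese sentence into single characters, dropping whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


def to_parallel_lines(modern_tagged, ancient_sentence):
    """Build one whitespace-tokenized (source, target) line pair.

    modern_tagged: list of (word, pos) pairs, standing in for tagger output.
    """
    source_line = " ".join(word for word, _ in modern_tagged)
    target_line = " ".join(split_ancient(ancient_sentence))
    return source_line, target_line


# Hand-written example; the POS tags are illustrative only.
modern = [("天下", "n"), ("太平", "a")]
src, tgt = to_parallel_lines(modern, "天下太平矣")
print(src)  # 天下 太平
print(tgt)  # 天 下 太 平 矣
```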

4. After processing the parallel corpus, we set modern Chinese as the source language and ancient Chinese as the target language. Then we use GIZA++ to perform word alignment on the parallel corpus with IBM Model 4, an unsupervised generative model that finds candidate pairs of aligned words and computes their alignment probabilities.
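GIZA++ and IBM Model 4 are too large to reproduce in a note, so the sketch below swaps in IBM Model 1, the simplest model of the same family, trained with EM. It is not what the paper runs: it has no NULL source word and none of Model 4's fertility or distortion components. It only shows how alignment probabilities can emerge from co-occurrence without supervision. The toy sentence pairs and the function name are my own.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(target, source)] = P(target token | source token) with EM.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    """
    target_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target token over the source tokens.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalize the expected counts per source token.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


# Toy corpus: modern Chinese words as source, ancient characters as target.
pairs = [
    (["我", "知道"], ["吾", "知"]),
    (["我", "听说"], ["吾", "闻"]),
    (["你", "听说"], ["汝", "闻"]),
]
t = train_ibm_model1(pairs)

# Pick the most probable source word for each target character in the first pair.
src, tgt = pairs[0]
for f in tgt:
    best = max(src, key=lambda e: t[(f, e)])
    print(f, "->", best)
```

Because 吾 co-occurs with 我 in two sentences, EM shifts its probability mass toward 我, which in turn leaves 知 to be explained by 知道.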

5. For each ancient Chinese character, the POS tag is looked up in a dictionary based on the POS tag of its aligned modern Chinese word. If the character is paired with at least one modern Chinese word, we take the word with the highest alignment probability as its alignment target; if it is paired with none, we treat it as a single-character word and give it a null tag. After that, adjacent ancient Chinese characters that are aligned with the same modern Chinese word are merged into one word.
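The projection rule can be sketched as follows. The data layout is assumed: `candidates[i]` holds `(modern_index, probability)` pairs for the i-th ancient character, `pos_map` stands in for the dictionary that maps modern POS tags to the target tag set, and the null tag is represented as `None`. None of these names come from the paper.

```python
def project_tags(ancient_chars, modern_words, candidates, pos_map=None):
    """Project word boundaries and POS tags from modern words to ancient characters.

    ancient_chars: list of single characters.
    modern_words: list of (word, pos) pairs for the modern sentence.
    candidates: candidates[i] is a list of (modern_index, probability) pairs.
    Returns one (char, wsg_tag, pos_tag) triple per ancient character.
    """
    pos_map = pos_map or {}
    # Keep only the most probable modern word per character (None if unaligned).
    best = [max(c, key=lambda p: p[1])[0] if c else None for c in candidates]

    # Group adjacent characters that share the same aligned modern word.
    groups = []
    for i, j in enumerate(best):
        if j is not None and groups and groups[-1][0] == j:
            groups[-1][1].append(i)
        else:
            groups.append((j, [i]))

    # Emit BMES boundary tags plus the projected POS tag for each group.
    out = []
    for j, idxs in groups:
        if j is None:
            pos = None  # unaligned: single-character word with a null tag
        else:
            modern_pos = modern_words[j][1]
            pos = pos_map.get(modern_pos, modern_pos)
        if len(idxs) == 1:
            wsg = ["S"]
        else:
            wsg = ["B"] + ["M"] * (len(idxs) - 2) + ["E"]
        out.extend((ancient_chars[i], w, pos) for i, w in zip(idxs, wsg))
    return out


# Hand-written alignment candidates; probabilities and tags are illustrative.
chars = ["天", "下", "太", "平", "矣"]
modern = [("天下", "n"), ("太平", "a")]
cands = [[(0, 0.9)], [(0, 0.8), (1, 0.1)], [(1, 0.7)], [(1, 0.9)], []]
projected = project_tags(chars, modern, cands)
for row in projected:
    print(row)
```

The unaligned 矣 comes out as a single-character word with no POS tag, which is the kind of omission the later relabeling stage is meant to fill in.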

6. Denoising and completing by relabeling, in three training stages: (1) training over the large projected data; (2) training over the small annotated data; (3) training over the large relabeled data.
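The order of the three stages can be sketched as a small driver. Everything here is a stand-in: `train_fn` and `predict_fn` are hypothetical callables, the toy "model" is a character-to-tag lookup table rather than a neural tagger, and retraining the final model from the initial checkpoint is my assumption, since the notes do not say where stage (3) starts. The toy also cannot show the memorization effect; it only shows the data flow.

```python
def relabel_pipeline(init_model, train_fn, predict_fn, projected, annotated):
    """Run the three training stages and return (final_model, relabeled_data).

    projected: large, noisy data from annotation projection.
    annotated: small, manually labeled data.
    Both are lists of (chars, tags) pairs.
    """
    stage1 = train_fn(init_model, projected)   # (1) large projected data
    stage2 = train_fn(stage1, annotated)       # (2) small annotated data
    # Relabel the large data with the stage-2 model; old labels are discarded.
    relabeled = [(chars, predict_fn(stage2, chars)) for chars, _ in projected]
    final = train_fn(init_model, relabeled)    # (3) large relabeled data
    return final, relabeled


# Toy stand-ins: later training simply overwrites earlier labels.
def toy_train(model, data):
    new_model = dict(model)
    for chars, tags in data:
        for ch, tag in zip(chars, tags):
            new_model[ch] = tag
    return new_model


def toy_predict(model, chars):
    return [model.get(ch, "S") for ch in chars]


projected = [(["天", "下"], ["S", "S"]), (["太", "平"], ["B", "E"])]  # first pair is noisy
annotated = [(["天", "下"], ["B", "E"])]
final, relabeled = relabel_pipeline({}, toy_train, toy_predict, projected, annotated)
print(relabeled[0])  # (['天', '下'], ['B', 'E'])
```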

CONCLUSION:

In this paper, we propose a method for augmenting ancient Chinese WSG and POS tagging data using distant supervision. We also use relabeling to reduce the noise introduced by distant supervision. Experiments show the effectiveness of distant supervision and relabeling.




http://www.chinasem.cn/article/489435
