Published 2018-12-06 (Shanghai)

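A Study on the Segmentation and Recognition of Four-Character Idioms in Word-Segmented Corpora (Thesis, Linguistics and Applied Linguistics)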

Chinese Abstract

Four-character idioms (四字格) are highly productive and derivational, and the number of new words coined on the four-character pattern in the modern Chinese vocabulary is still rising, so research on four-character idioms cannot be confined to the literature and to theory. This thesis turns to the large number of four-character idioms found in word-segmented corpora. Because four-character idioms in such corpora tend to be split into fragments, it carries out a series of tasks: four-character idiom extraction, comparison of inconsistent segmentations, and four-character idiom recognition. The thesis first extracts and filters the four-character segmentation units in word-segmented corpora to obtain the extraction results, and then uses those results to compare inconsistent segmentations of four-character idioms both within a single corpus and across corpora. For recognition, it introduces the CRF statistical model, uses the segmentation-inconsistency results as training data, and studies the recognition of four-character idioms in POS-tagged segmented corpora. Using the four-character idioms recognized by the CRF model, it compiles statistics on their characters and POS information, examines their internal structure, and summarizes rules; on that basis it also studies recognition in segmented corpora without POS tags. The results show that recognition precision reaches about 90% in both POS-tagged and untagged segmented corpora.

Keywords: word-segmented corpus, four-character idiom extraction, segmentation inconsistency, CRF

Abstract

Four-character idioms are highly productive and derivational, and the number of new words created on the four-character pattern in the modern Chinese vocabulary is still rising, so work on four-character idioms cannot be limited to the literature and to theory. This thesis looks at the large number of four-character idioms in word-segmented corpora and, because such idioms are easily split apart there, carries out extraction, segmentation-inconsistency comparison, and recognition of four-character idioms. It first extracts and screens the four-character segmentation units in word-segmented corpora to obtain the extraction results, and then uses these results to compare the segmentation of four-character idioms both within a single segmented corpus and between different segmented corpora. For recognition, it introduces the CRF statistical model, takes the results of the segmentation comparison as the model's training corpus, and studies the recognition of four-character idioms in POS-tagged corpora. Using the four-character idioms recognized by the CRF model, and compiling statistics on their characters, POS information, and internal features to summarize rules, this thesis also carries out research on