Study of Mongolian Raw Text Modeling

Posted 2019-02-01, Tianjin

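Article ID: 1003-0077 (2011) 00-0000-00

Statistical Modeling of Mongolian Raw Corpora

Bai Shuangcheng 1,2
(1. Mongolian Language Information Technology Research and Development Center, Inner Mongolia Academy of Social Sciences, Hohhot, Inner Mongolia 010020; 2. Inner Mongolia Menksoft Co., Ltd., Hohhot, Inner Mongolia 010011)

Abstract: Corrected Mongolian corpora are scarce and hard to expand, while raw corpora contain severe spelling diversity and glyph-level spelling errors and therefore cannot be used directly. Building on an analysis of the characteristics of Mongolian character encoding, this paper collects and organizes a large-scale raw corpus, annotates part of it, and uses a Mongolian input method as both the implementation vehicle and the experimental platform. The work focuses on statistical modeling from raw corpora and on model optimization. Experimental results show that the method effectively improves input efficiency. It opens a new way of modeling and exploiting raw Mongolian text and offers a broad reference for research on Mongolian phoneme-to-word and grapheme-to-word conversion.

Keywords: Mongolian raw text; statistical modeling; pronunciation errors; glyph errors; intelligent input

CLC number: TP391    Document code: A

Study of Mongolian Raw Text Modeling

Bai Shuangcheng 1,2
(1. Inner Mongolia Academy of Social Science, Hohhot, Inner Mongolia 010020, China; 2. Inner Mongolia Menksoft Co., Ltd., Hohhot, Inner Mongolia 010011, China)

Abstract: Corrected Mongolian text is scarce and hard to build, and raw text cannot be used directly. Based on an analysis of the characteristics of Mongolian encoding, and through the collection and collation of a large-scale raw text corpus and the annotation of part of it, we address statistical modeling and model optimization for raw text, with a Mongolian input method as the application. Experimental results show that the method can effectively improve input efficiency. We develop a new approach to modeling raw Mongolian text; the results can be applied directly to phoneme-to-word and grapheme-to-word conversion problems.

Key words: Mongolian raw text; spelling diversity phenomena; intelligent input method; spelling error

1 Introduction

Natural language processing makes wide use of statistical language models (SLM). In particular, methods and theories such as naturally annotated big data, (deep) machine learning, and knowledge graphs have driven notable progress in fields that involve natural language applications, including information retrieval, machine translation, spell checking and correction, and question answering, and applications built on these results have been put into use. What these new techniques and methods share is a dependence on large amounts of data. However, precisely because "directly usable" Mongolian digital resources are scarce, and the readily available uncorrected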
