Analysis of Chinese Neologism Recognition Technology Based on a Large-Scale Corpus
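Abstract

New word identification (NWI) for Chinese is the process of extracting new words from unannotated text corpora and identifying their properties. It is a basic task in Chinese information processing: its results directly affect the performance of tasks such as Chinese word segmentation (CWS) and syntactic analysis, and it is widely applied in areas such as information extraction and machine translation. NWI therefore has both theoretical significance and practical value.

Chinese has a very strong word-formation capacity, and there is no explicit delimiter between words, so any sequence of two or more adjacent characters may form a word. This makes automatic identification of new words very difficult. At the same time, the rapid growth in demand for massive-data applications poses a new challenge for NWI research. To improve identification performance and practicality, this thesis takes large-scale corpora as its object of study and applies a strategy that combines rules and statistics to NWI and its related techniques. The main work and features are as follows.

First, the thesis designs and provides an initial implementation of FNWI, a domain-independent framework for new word identification. The framework plans the flexibility, extensibility, and maintainability of an NWI system in a unified way. FNWI is the overall design on which the research in this thesis is built, and it will also provide a well-defined base structure for subsequent work.

To process large-scale corpora effectively, the thesis proposes a repeated-pattern extraction algorithm based on level-by-level pruning. The algorithm uses low-frequency character pruning and hierarchical pruning to reduce the number of junk strings produced during repeated-pattern extraction, which effectively lowers the number of I/O reads and writes. It can quickly process corpora far larger than memory, the number of corpus reads and writes grows nearly linearly with corpus size, and it is flexible in use: repeated patterns of a specific frequency or a specific length can be extracted. To speed up the merging of candidate repeated patterns, the thesis also proposes an improved string sorting algorithm with time complexity O(dn).

In the new word detection stage, an efficient method for computing left (right) entropy is proposed to increase detection speed. It reduces the influence of irrelevant characters on the computation and significantly improves the efficiency of the entropy calculation. To analyze how the repeated-pattern extraction strategy (character-based versus based on prior word segmentation) affects detection results, the thesis proposes an evaluation method that combines comparison of experimental data with analysis by a quantitative model, and it gives a practical quantitative model of missed candidate new words to guide the implementation of new word detection.

Finally, for part-of-speech classification of new words, the thesis proposes a formal model of part-of-speech guessing for new words and solves the model with conditional random fields. Analysis of the model establishes the principles and approach for feature selection. The most distinctive feature of the method is that it relies mainly on internal part-of-speech features and does not use the parts of speech of the surrounding context, which makes it more practical.

Keywords: Chinese new word identification; repeated pattern; level-by-level pruning; string sorting; new word detection; conditional random fields; contextual features; part-of-speech guessing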
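The abstract names two building blocks without giving their details: level-by-level pruning of repeated patterns and left (right) entropy of a candidate. The following is a minimal in-memory sketch of the general ideas only, not the thesis's algorithms. The function names `repeated_patterns` and `branching_entropy`, the parameters `min_freq` and `max_len`, and the toy input are all assumptions made for illustration. The thesis targets corpora larger than memory and describes a faster entropy computation, and neither of those is reproduced here.

```python
import math
from collections import Counter


def repeated_patterns(text, min_freq=2, max_len=4):
    """Level-wise extraction of repeated character n-grams (illustrative).

    Level 1 drops low-frequency characters. Each later level counts an
    n-gram only if its (n-1)-character prefix and suffix both survived the
    previous level, because a string can never occur more often than any
    of its substrings.
    """
    survivors = {}
    # Low-frequency character pruning: keep characters seen often enough.
    prev = {ch for ch, c in Counter(text).items() if c >= min_freq}
    for n in range(2, max_len + 1):
        counts = Counter()
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            # Hierarchical pruning: skip n-grams that cannot be frequent.
            if gram[:-1] in prev and gram[1:] in prev:
                counts[gram] += 1
        level = {g: c for g, c in counts.items() if c >= min_freq}
        if not level:
            break
        survivors.update(level)
        prev = set(level)
    return survivors


def branching_entropy(text, pattern):
    """Return (left, right) entropy of the characters adjacent to pattern."""
    left, right = Counter(), Counter()
    start = text.find(pattern)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(pattern)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(pattern, start + 1)

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return sum(c / total * math.log2(total / c) for c in counter.values())

    return entropy(left), entropy(right)


if __name__ == "__main__":
    corpus = "abxabyabz"  # toy stand-in for a corpus
    for pattern, freq in repeated_patterns(corpus, max_len=3).items():
        print(pattern, freq, branching_entropy(corpus, pattern))
```

A candidate whose neighbouring characters vary widely on both sides (high left and right entropy) is more likely to be a self-contained word, which is why entropy is a common filter after repeated-pattern extraction.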