A Statistical Approach to Extract Chinese Chunk Candidates from Large Corpora

ZHANG Le, LÜ Xue-qiang, SHEN Yan-na, YAO Tian-shun
Institute of Computer Software Theory, School of Information Science and Engineering, Northeastern University, Shenyang 110004, China
Email: ejoy@, studystrong@, neu syn@, tsyao@

Abstract

The extraction of chunk candidates from real corpora is one of the fundamental tasks in building an example-based machine translation model. This paper presents a statistical approach to extracting Chinese chunk candidates from large monolingual corpora. The first step is to extract large N-grams (up to 20-grams) from the raw corpus. Two newly proposed Fast Statistical Substring Reduction (FSSR) algorithms are then applied to the initial N-gram set to remove unnecessary N-grams using their frequency information. The two algorithms are efficient (both have a time complexity of O(n)) and can reduce the size of the N-gram set by up to 50%. Finally, mutual information is used to obtain chunk candidates from the reduced N-gram set. Perhaps the biggest contribution of this paper is that it is, to our knowledge, the first application of Fast Statistical Substring Reduction to large corpora, demonstrating the effectiveness and efficiency of the algorithm, which we hope will shed new light on large-scale corpus-oriented research. Experiments on three corpora of different sizes show that this method can extract chunk candidates from corpora of gigabytes in size.
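The pipeline the abstract outlines (count N-grams, discard substrings absorbed by an equally frequent superstring, then score the survivors with mutual information) can be sketched as follows. This is a minimal illustration, not the paper's method: the reduction here is a naive scan rather than the authors' O(n) FSSR algorithms, the frequency threshold `min_freq` and the first-token/rest split used for the mutual-information score are assumptions of this sketch, and all function names are hypothetical.

```python
from collections import Counter
import math

def extract_ngrams(tokens, max_n=5, min_freq=2):
    """Count all n-grams up to max_n tokens; keep those at or above min_freq."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {g: c for g, c in counts.items() if c >= min_freq}

def substring_reduce(ngrams):
    """Statistical substring reduction criterion: drop any n-gram whose
    frequency equals that of some (n+1)-gram containing it, since it never
    occurs outside the longer string. Naive scan, not the paper's O(n) FSSR."""
    redundant = set()
    for g, c in ngrams.items():
        # The two length-(n-1) substrings of g are candidates for removal.
        for sub in (g[:-1], g[1:]):
            if len(sub) >= 1 and ngrams.get(sub) == c:
                redundant.add(sub)
    return {g: c for g, c in ngrams.items() if g not in redundant}

def mutual_information(ngrams, total):
    """Pointwise mutual information between the first token and the rest of
    each n-gram, as a rough chunk-candidate score (one possible split)."""
    scores = {}
    for g, c in ngrams.items():
        if len(g) < 2:
            continue
        left, right = (g[0],), g[1:]
        if left in ngrams and right in ngrams:
            p_xy = c / total
            p_x = ngrams[left] / total
            p_y = ngrams[right] / total
            scores[g] = math.log2(p_xy / (p_x * p_y))
    return scores
```

On the toy sequence `a b c a b c a b d`, the bigram `b c` occurs twice, exactly as often as the trigram `a b c`, so `substring_reduce` drops `b c` while keeping `a b` (which occurs three times and thus also appears outside `a b c`). High-PMI survivors are the chunk candidates.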
