A Statistical Approach to Extract Chinese Chunk Candidates from Large Corpora

ZHANG Le, LÜ Xue-qiang, SHEN Yan-na, YAO Tian-shun
Institute of Computer Software Theory, School of Information Science and Engineering, Northeastern University, Shenyang 110004, China
Email: ejoy@, studystrong@, neu syn@, tsyao@

Abstract

The extraction of chunk candidates from real corpora is one of the fundamental tasks in building an example-based machine translation model. This paper presents a statistical approach to extracting Chinese chunk candidates from large monolingual corpora. The first step is to extract large N-grams (up to 20-grams) from the raw corpus. Two newly proposed Fast Statistical Substring Reduction (FSSR) algorithms are then applied to the initial N-gram set to remove unnecessary N-grams using their frequency information. The two algorithms are efficient (both have a time complexity of O(n)) and can reduce the size of the N-gram set by up to 50%. Finally, mutual information is used to obtain chunk candidates from the reduced N-gram set. Perhaps the biggest contribution of this paper is that it is, to our knowledge, the first application of Fast Statistical Substring Reduction to large corpora, demonstrating the effectiveness and efficiency of the algorithm, which we hope will shed new light on large-scale corpus-oriented research. Experiments on three corpora of different sizes show that this method can extract chunk candidates from corpora of gigabytes in size.
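The pipeline the abstract outlines (count N-grams, discard substrings absorbed by an equally frequent superstring, then score the survivors with mutual information) can be sketched as follows. This is a minimal illustration, not the paper's method: the reduction here is a naive scan rather than the authors' O(n) FSSR algorithms, the frequency threshold `min_freq` and the first-token/rest split used for the mutual-information score are assumptions of this sketch, and all function names are hypothetical.

```python
from collections import Counter
import math

def extract_ngrams(tokens, max_n=5, min_freq=2):
    """Count all n-grams up to max_n tokens; keep those at or above min_freq."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {g: c for g, c in counts.items() if c >= min_freq}

def substring_reduce(ngrams):
    """Statistical substring reduction criterion: drop any n-gram whose
    frequency equals that of some (n+1)-gram containing it, since it never
    occurs outside the longer string. Naive scan, not the paper's O(n) FSSR."""
    redundant = set()
    for g, c in ngrams.items():
        # The two length-(n-1) substrings of g are candidates for removal.
        for sub in (g[:-1], g[1:]):
            if len(sub) >= 1 and ngrams.get(sub) == c:
                redundant.add(sub)
    return {g: c for g, c in ngrams.items() if g not in redundant}

def mutual_information(ngrams, total):
    """Pointwise mutual information between the first token and the rest of
    each n-gram, as a rough chunk-candidate score (one possible split)."""
    scores = {}
    for g, c in ngrams.items():
        if len(g) < 2:
            continue
        left, right = (g[0],), g[1:]
        if left in ngrams and right in ngrams:
            p_xy = c / total
            p_x = ngrams[left] / total
            p_y = ngrams[right] / total
            scores[g] = math.log2(p_xy / (p_x * p_y))
    return scores
```

On the toy sequence `a b c a b c a b d`, the bigram `b c` occurs twice, exactly as often as the trigram `a b c`, so `substring_reduce` drops `b c` while keeping `a b` (which occurs three times and thus also appears outside `a b c`). High-PMI survivors are the chunk candidates.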
