Posted 2018-11-10 from Fujian
A Statistical Approach to Extract Chinese Chunk
Candidates from Large Corpora
ZHANG Le, LÜ Xue-qiang, SHEN Yan-na, YAO Tian-shun
Institute of Computer Software Theory.
School of Information Science Engineering, Northeastern University
Shenyang, 110004 China
Email: ejoy@, studystrong@, neu syn@, tsyao@
Abstract
Extracting chunk candidates from real corpora is one of the fundamental tasks in building an example-based
machine translation model. This paper presents a statistical approach to extracting Chinese chunk candidates from
large monolingual corpora. The first step is to extract long N-grams (up to 20-grams) from the raw corpus. Two
newly proposed Fast Statistical Substring Reduction (FSSR) algorithms are then applied to the initial N-gram set to
remove unnecessary N-grams using their frequency information. Both algorithms are efficient, with a
time complexity of O(n), and can reduce the size of the N-gram set by up to 50%. Finally, mutual information
is used to obtain chunk candidates from the reduced N-gram set.
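The three-stage pipeline described above can be illustrated with a minimal Python sketch. It is not the FSSR algorithm itself, whose details the abstract does not give. The reduction rule used here (drop a shorter N-gram when a longer N-gram containing it has the same frequency) is an assumption about what "using their frequency information" means, and the mutual information score (the smallest pointwise mutual information over all two-way splits, normalized by corpus length) is likewise an assumed choice. All function names and the threshold parameter are hypothetical.

```python
from collections import Counter
from math import log2


def count_ngrams(text, max_n=20):
    """Count every character N-gram of length 1..max_n in the raw text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def reduce_substrings(counts):
    """Naive substring reduction (assumed rule, not the paper's FSSR).

    An N-gram is dropped when a one-character-longer N-gram that contains it
    has exactly the same frequency. Because frequency cannot increase as an
    N-gram grows, checking only the immediate prefix and suffix is enough to
    catch substrings of much longer N-grams as well.
    """
    removable = set()
    for gram, freq in counts.items():
        if len(gram) < 2:
            continue
        for sub in (gram[:-1], gram[1:]):
            if counts.get(sub) == freq:
                removable.add(sub)
    return {g: c for g, c in counts.items() if g not in removable}


def min_split_mi(gram, counts, total):
    """Smallest pointwise mutual information over all two-way splits of gram.

    `total` is a rough normalizer (here, the corpus length in characters);
    this scoring choice is an assumption made for illustration.
    """
    p_gram = counts[gram] / total
    return min(
        log2(p_gram / ((counts[gram[:i]] / total) * (counts[gram[i:]] / total)))
        for i in range(1, len(gram))
    )


def chunk_candidates(text, max_n=20, mi_threshold=1.0):
    """Run the three stages: count, reduce, then filter by mutual information."""
    counts = count_ngrams(text, max_n)
    reduced = reduce_substrings(counts)
    total = len(text)
    return {
        g: c
        for g, c in reduced.items()
        if len(g) >= 2 and min_split_mi(g, counts, total) >= mi_threshold
    }
```

Calling `chunk_candidates(corpus_text, max_n=20)` on a string returns a dictionary mapping each surviving N-gram to its frequency. The sketch keeps every N-gram in memory, so it is only suitable for small inputs. Processing corpora of the size the paper targets would need the more careful data handling that the paper's own algorithms presumably provide.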
Perhaps the biggest contribution of this paper is that it is the first to apply the Fast Statistical Substring
Reduction algorithm to large corpora and to demonstrate its effectiveness and efficiency, which we
hope will shed new light on large-scale corpus-oriented research. Experiments on three corpora of different
sizes show that this method can extract chunk candidates from corpora of gigabytes