
Computational Linguistics and Chinese Language Processing
Vol. 8, No. 1, February 2003, pp. 29-48
The Association for Computational Linguistics and Chinese Language Processing

Chinese Word Segmentation as Character Tagging *

Nianwen Xue

Abstract

In this paper we report results of a supervised machine-learning approach to Chinese word segmentation. A maximum entropy tagger is trained on manually annotated data to automatically assign to Chinese characters, or hanzi, tags that indicate the position of a hanzi within a word. The tagged output is then converted into segmented text for evaluation. Preliminary results show that this approach is competitive against other supervised machine-learning segmenters reported in previous studies, achieving precision and recall rates of 95.01% and 94.94% respectively, trained on a 237K-word training set.

Key Words: Chinese word segmentation, supervised machine-learning, maximum entropy, character tagging

1. Introduction

It is generally agreed among researchers that word segmentation is a necessary first step in Chinese language processing. However, unlike English text in which sentences are sequences of words delimited by white spaces, in Chinese text, sentences are represented as strings of Chinese characters or hanzi without similar natural delimiters. Therefore, the first step in a Chinese language processing task is to identify the sequence of words in a sentence and mark boundaries in appropriate places.
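The reformulation described in the abstract, where each hanzi receives a tag for its position within a word and the tagged output is converted back into segmented text, can be illustrated with a minimal Python sketch. The four tag names used here (B for word-initial, M for word-internal, E for word-final, S for a single-character word) and the function names are assumptions made for illustration; they are not taken from the text above. The maximum entropy tagger itself is not shown, only the conversion between segmented words and per-character tags.

```python
# Illustrative sketch only: the tag names B/M/E/S and both function names
# are assumptions, not the tag set or code of the paper.

def words_to_tags(words):
    """Turn a segmented sentence (list of words) into per-character tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any number of middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Turn a character sequence plus position tags back into words."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words = []
    current = ""
    for ch, tag in zip(chars, tags):
        # A tagger can emit inconsistent sequences (e.g. B followed by B);
        # a word-initial tag always closes any word still being built.
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # sentence ended in the middle of a word
        words.append(current)
    return words


if __name__ == "__main__":
    sentence = ["我", "喜欢", "北京"]
    tags = words_to_tags(sentence)
    print(tags)
    print(tags_to_words("".join(sentence), tags))
```

Training data for the tagger would be produced with something like `words_to_tags` applied to manually segmented sentences, and the tagger's predictions would be mapped back to words with something like `tags_to_words` before computing precision and recall.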
