TP391
The Basic Processing of Contemporary Chinese Corpus at Peking University
SPECIFICATION
YU Shi-wen DUAN Hui-ming ZHU Xue-feng Bing SWEN
(Institute of Computational Linguistics, Peking University, Beijing, 100871)
Abstract: The Institute of Computational Linguistics at Peking University has completed the basic
processing of a contemporary Chinese corpus of 27 million Chinese characters. In addition to
word segmentation and part-of-speech tagging, the processing covers the tagging of proper nouns
(person names, place names, organization names and so on), morpheme subcategories, and the
special usages of verbs and adjectives. The success of this large-scale language engineering project
is attributed to the SPECIFICATION, which was drawn up beforehand and refined continuously while
in use. This paper introduces the SPECIFICATION and invites comments from experts and
colleagues for its improvement.
Keywords: contemporary Chinese; corpus; word segmentation; part-of-speech tagging; specification
69483003; 973; 863; 985