ArnetMiner – Extraction and Mining of Academic Social Networks

  • About 52 pages
  • Published 2017-01-12 in Tianjin



* We identify tokens by using heuristics. There are five types of tokens: 'standard word', 'non-standard word', punctuation mark, space, and line break. Standard words are words in natural language. Non-standard words include several general 'special words', for example email addresses, IP addresses, URLs, dates, numbers, and so on; we identify non-standard words by using regular expressions. Punctuation marks include the period, question mark, and exclamation mark. Words and punctuation marks are separated into different tokens if they are joined together. Natural spaces and line breaks are each regarded as tokens as well.
* Addresses and affiliations usually contain many tokens, and the dependencies between these tokens can help improve accuracy; other approaches cannot exploit such dependencies.
* The simplified form is popular in bibliographic records.
* The distributions can typically be categorized into the following cases: (1) publications of different persons are clearly separated ("Hui Fang", Figure 5(a)); name disambiguation on this kind of data is solved well by our approach, and the number K can also be found accurately. (2) Publications are mixed together but with a dominant author who writes most of the papers (e.g., "Bing Liu", Figure 5(b)); our approach achieves an F1-score of 87.36% and finds a K close to the actual number. (3) Publications of different authors are mixed (e.g., "Jing Zhang" and "Yi Li", Figure 5(c) and (d)); our method obtains 91.25% and 82.11% in terms of F1-measure, but it is difficult to find K accurately: for "Jing Zhang", the number found by our approach is 14, while the correct number is 25.
* Two researchers set out to write a paper. Every word in the paper is generated by the following process: first an author is chosen to be responsible for generating the word; that author then generates a topic according to a probability distribution, and the topic in turn generates the word and the conference according to its own distributions. Repeating this process generates the whole paper. However, not all of the paper's content is original; part of it draws on methods from the referenced papers. To establish the relation between the references and the paper's content, when generating the references we choose a topic for each reference and then generate that reference according to a probability distribution.
* Modeling the Academic Network and Appl
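The tokenization heuristics above can be sketched with a single regular expression whose alternatives mirror the five token types. The specific patterns below (for emails, URLs, IPs, dates, and numbers) are illustrative assumptions, not the exact expressions used by ArnetMiner:

```python
import re

# Illustrative patterns for 'non-standard words'; the real system's
# regexes are not published here.
NON_STANDARD = [
    ("EMAIL",  r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    ("URL",    r"https?://\S+"),
    ("IP",     r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    ("DATE",   r"\b\d{4}-\d{2}-\d{2}\b"),
    ("NUMBER", r"\b\d+(?:\.\d+)?\b"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in NON_STANDARD)
    + r"|(?P<WORD>[A-Za-z]+)"   # standard word
    + r"|(?P<PUNCT>[.?!])"      # punctuation mark
    + r"|(?P<BREAK>\n)"         # line break
    + r"|(?P<SPACE>[ \t]+)"     # space
)

def tokenize(text):
    """Split text into (type, token) pairs; a word and a punctuation
    mark joined together come out as two separate tokens."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]
```

Because the non-standard alternatives appear first, `a@b.com` is emitted as one EMAIL token rather than being split into words and periods.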
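The F1-scores quoted for the name-disambiguation cases are commonly computed pairwise over publication pairs: a pair counts as a true positive when both the ground truth and the prediction assign the two papers to the same person. A minimal sketch of that evaluation, assuming this pairwise definition:

```python
from itertools import combinations

def pairwise_f1(true_labels, pred_labels):
    """Pairwise F1 for name disambiguation: each pair of publications
    is positive when both papers are assigned to the same author."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Note that the measure is invariant to how clusters are labeled, which is why a predicted clustering with swapped labels still scores 1.0 against the truth.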
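The generative story in the translated note (choose an author per word, the author draws a topic, the topic emits the word and the conference) can be sketched as forward sampling. The toy authors, topics, and distributions below are invented for illustration; in the real model these parameters are learned, not fixed:

```python
import random

random.seed(0)

# Toy parameters (assumptions for illustration only).
authors = ["A1", "A2"]
theta = {"A1": {"DM": 0.8, "IR": 0.2},        # author -> topic distribution
         "A2": {"DM": 0.3, "IR": 0.7}}
phi = {"DM": {"mining": 0.6, "graph": 0.4},   # topic -> word distribution
       "IR": {"query": 0.5, "rank": 0.5}}
psi = {"DM": {"KDD": 0.9, "SIGIR": 0.1},      # topic -> conference distribution
       "IR": {"KDD": 0.2, "SIGIR": 0.8}}

def draw(dist):
    """Sample one key of `dist` with probability proportional to its value."""
    return random.choices(list(dist), weights=dist.values())[0]

def generate_paper(n_words):
    """For each word: pick an author, the author draws a topic, and the
    topic emits both the word and a conference. References would be
    generated analogously, with one topic drawn per reference."""
    paper = []
    for _ in range(n_words):
        a = random.choice(authors)   # author responsible for this word
        z = draw(theta[a])           # topic drawn from the author's distribution
        w = draw(phi[z])             # word emitted by the topic
        c = draw(psi[z])             # conference emitted by the topic
        paper.append((a, z, w, c))
    return paper
```

Repeating the per-word step generates the whole paper, matching the note's description; fitting such a model in practice is typically done with Gibbs sampling rather than the fixed tables used here.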
