Automatically Constructing a Corpus of Sentential Paraphrases
William B. Dolan and Chris Brockett
Natural Language Processing Group
Microsoft Research
Redmond, WA, 98052, USA
{billdol,chrisbkt}@
Abstract
An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly-available labeled corpora of sentential paraphrases. This paper describes the creation of the recently-released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The corpus was created using heuristic extraction techniques in conjunction with an SVM-based classifier to select likely sentence-level paraphrases from a large corpus of topic-clustered news data. These pairs were then submitted to human judges, who confirmed that 67% were in fact semantically equivalent. In addition to describing the corpus itself, we explore a number of issues that arose in defining guidelines for the human raters.
1 Introduction
The Microsoft Research Paraphrase Corpus (MSRP), available for download at /research/nlp/msr_paraphrase.htm, consists of 5801 pairs of sentences, each accompanied by a binary judgment indicating whether human raters considered the two sentences similar enough in meaning to be close paraphrases. This data has been published to encourage research in areas relating to paraphrase, sentential synonymy, and inference, and to help establish a discourse on the proper construction of paraphrase corpora for training and evaluation. We hope that releasing this corpus will stimulate the publication of similar corpora by others and help move the field toward adoption of a shared dataset that permits useful comparisons of results across research efforts.
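As an illustration of the data described above, the following is a minimal sketch of how a corpus of sentence pairs with binary judgments might be represented, and how the proportion of pairs judged to be paraphrases could be computed. The class name `SentencePair`, its field names, and the function `paraphrase_rate` are assumptions introduced here for illustration; they do not reflect the corpus's distribution format. The example sentences are invented and are not drawn from the corpus.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SentencePair:
    """One corpus item: two sentences plus a binary human judgment.

    Hypothetical structure for illustration; not the corpus's file format.
    """
    sentence_1: str
    sentence_2: str
    is_paraphrase: bool


def paraphrase_rate(pairs: Sequence[SentencePair]) -> float:
    """Return the fraction of pairs that raters labeled as paraphrases."""
    if not pairs:
        raise ValueError("cannot compute a rate over an empty corpus")
    return sum(p.is_paraphrase for p in pairs) / len(pairs)


# Invented toy pairs, for illustration only.
toy_pairs = [
    SentencePair("The company reported higher profits.",
                 "Profits rose at the company.", True),
    SentencePair("The storm closed the airport.",
                 "The airport was shut by the storm.", True),
    SentencePair("The mayor opened the bridge.",
                 "The bridge cost ten million dollars.", False),
]

print(f"{paraphrase_rate(toy_pairs):.2f}")
```

Applied to the full corpus, a computation of this kind would yield the proportion of pairs that the judges confirmed as semantically equivalent.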
2 Motivation
The success of Statistical Machine Translation
(SMT) has