DEPARTMENT OF COMPUTER SCIENCE
Finding Similar Files in a Large File System
Udi Manber
TR 93-33
October 1993
To appear in the
1994 Winter USENIX Technical Conference
FINDING SIMILAR FILES IN A LARGE FILE SYSTEM
Udi Manber
Department of Computer Science
University of Arizona
Tucson, AZ 85721
udi@
ABSTRACT
We present a tool, called sif, for finding all similar files in a large file system. Files are considered similar if they
have a significant number of common pieces, even if they are very different otherwise. For example, one file may be
contained, possibly with some changes, in another file, or a file may be a reorganization of another file. The running
time for finding all groups of similar files, even for as little as 25% similarity, is on the order of 500MB to
1GB an hour. The amount of similarity and several other customized parameters can be determined by the user at
a post-processing stage, which is very fast. Sif can also be used to very quickly identify all files similar to a query
file using a preprocessed index. Applications of sif can be found in file management, information collecting (to
remove duplicates), program reuse, file synchronization, data compression, and maybe even plagiarism detection.
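
To make the notion of "common pieces" concrete, the following is a minimal sketch and not sif's actual algorithm, which the abstract does not describe. It assumes that a "piece" is a fixed-size byte window, that pieces are compared through short hashes, and that similarity is the fraction of one file's pieces that also occur in the other. The names chunk_fingerprints and containment and the window size of 8 bytes are hypothetical choices made for illustration.

```python
import hashlib


def chunk_fingerprints(data: bytes, size: int = 8) -> set:
    """Return short hashes of every `size`-byte window in `data`.

    Assumption for illustration only: a "piece" is a fixed-size window.
    """
    if len(data) < size:
        # Too short for a full window: fingerprint the whole input, if any.
        return {hashlib.blake2b(data, digest_size=8).hexdigest()} if data else set()
    return {
        hashlib.blake2b(data[i:i + size], digest_size=8).hexdigest()
        for i in range(len(data) - size + 1)
    }


def containment(a: bytes, b: bytes, size: int = 8) -> float:
    """Fraction (0.0 to 1.0) of a's fingerprints that also occur in b."""
    fa = chunk_fingerprints(a, size)
    fb = chunk_fingerprints(b, size)
    if not fa:
        return 0.0
    return len(fa & fb) / len(fa)


if __name__ == "__main__":
    original = b"the quick brown fox jumps over the lazy dog"
    embedded = b"preface " + original + b" appendix"
    # `original` is fully contained in `embedded`, but not the other way round.
    print(containment(original, embedded))
    print(containment(embedded, original))
```

Under these assumptions, a file that is embedded in a larger one scores full containment in one direction and partial containment in the other, which matches the abstract's example of one file being contained in another. A threshold such as 25% could then be applied to the score in a separate pass.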
1. Introduction
Our goal is to identify files that came from the same source or contain parts that came f