- Published 2018-05-18 in Shanghai
Analysis of String Similarity Join Based on Edit Distance
Abstract

String similarity join has become an essential operator in many applications where near-duplicate objects need to be found, such as coalition detection, fuzzy keyword matching, data integration, and data cleaning. This paper focuses on string similarity join with edit distance constraints, which measure the similarity of two strings by the minimum number of edit operations (insertion, deletion, and substitution of single characters) needed to transform one string into the other.

This paper takes the frequencies of single characters, along with other statistics, as global information about the strings. Specifically, a novel partition-based algorithm is developed that uses this information to enumerate a smaller candidate set more efficiently by partitioning the dataset into small chunks. A new filter is proposed that leverages the frequencies of single characters. It has low complexity and can prune many candidate pairs that pass the existing filters. Experiments on real datasets verify that our algorithm is more efficient than alternative methods.

Building on this algorithm, we also implement a disk-based algorithm framework. The disk scheduling problem is proved to be NP-complete, and several heuristics are proposed to solve it. Incremental computation using the proposed disk-based framework is also discussed. Experiments verify the key idea of this paper, which lays the foundation for future disk-based algorithms.

Keywords: string similarity join, edit distance, frequency vector, data partition

Contents

Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
1.1 Background
1.1.1 Source of the topic
1.1.2 Purpose and significance
1.2 Research status at home and abroad
All-Pairs and Ed-Join algorithms
Trie-Join algorithm
Pass-Join algorithm
1.3 Main research content
1.4 Organization of the thesis
Chapter 2 Frequency vectors and data partitioning
2.1 Preliminaries and frequency vectors
2.2 Frequency filtering
2.2.1 L1 frequency filtering
2.2.2 Skew difference distance
2.3 Combined characters and partitioning of frequency intervals
2.3.1 Overview of data partitioning
2.3.2 Interval partitioning and selection of combined characters
2.4 Chapter summary
Chapter 3 In-memory method based on data partitioning
3.1 Filtering methods and analysis
3.1.1 Filtering between strings
3.1.2 Filtering between data subsets
3.1.3 Filtering between strings and data subsets
3.2 Algorithm based on frequency-vector data partitioning
3.3 Experimental results and analysis