Analysis of String Similarity Join Based on Edit Distance

  • About 59,700 characters
  • About 64 pages
  • Published 2018-05-18 in Shanghai


Abstract

String similarity join has become an essential operator in many applications where near-duplicate objects need to be found, such as coalition detection, fuzzy keyword matching, data integration, and data cleaning. This paper focuses on string similarity join with edit distance constraints, which measure the similarity of two strings by the minimum number of edit operations (insertion, deletion, and substitution of single characters) needed to transform one string into the other.

This paper takes the frequencies of single characters, as well as other statistics, as global information about strings. Specifically, a novel partition-based algorithm is developed that uses this information to enumerate a smaller candidate set more efficiently by partitioning the dataset into small chunks. A new filter is proposed that leverages the frequencies of single characters. It has low complexity and can prune many candidate pairs that pass the existing filters. We experimentally verify on real datasets that our algorithm is more efficient than alternative methods.

For disk-based processing, we also implement a disk-algorithm framework. The disk scheduling problem is proved to be NP-complete, and several heuristics are proposed to solve it. Incremental computation via the proposed disk-algorithm framework is also discussed. Experiments verify the key idea of this paper, which lays the foundation for future disk algorithms.

Keywords: string similarity join, edit distance, frequency vector, data partition

Contents

Abstract (Chinese)
Abstract
Chapter 1  Introduction
  1.1 Background
    1.1.1 Source of the topic
    1.1.2 Purpose and significance
  1.2 Related work
    The All-Pairs and Ed-Join algorithms
    The Trie-Join algorithm
    The Pass-Join algorithm
  1.3 Main research content
  1.4 Organization of the thesis
Chapter 2  Frequency vectors and data partitioning
  2.1 Preliminaries and frequency vectors
  2.2 Frequency filtering
    2.2.1 L1 frequency filtering
    2.2.2 Skew difference distance
  2.3 Combined characters and frequency interval partitioning
    2.3.1 Overview of data partitioning
    2.3.2 Interval partitioning and selection of combined characters
  2.4 Chapter summary
Chapter 3  In-memory method based on data partitioning
  3.1 Filtering methods and analysis
    3.1.1 Filtering between strings
    3.1.2 Filtering between data subsets
    3.1.3 Filtering between a string and a data subset
  3.2 Algorithm based on frequency-vector data partitioning
  3.3 Experimental results and analysis
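
As a rough illustration of the two ideas named in the abstract, the sketch below computes edit distance with the textbook dynamic program and uses a character-frequency check to skip pairs that cannot be within a threshold. It is not the thesis's algorithm: the function names and the threshold parameter `tau` are assumptions made for this example, and the filter is only the standard bound that one insertion or deletion changes the L1 distance between character-frequency vectors by at most 1 and one substitution by at most 2, so an L1 distance above 2 * tau rules a pair out.

```python
from collections import Counter


def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def l1_frequency_distance(s: str, t: str) -> int:
    """L1 distance between the character-frequency vectors of s and t."""
    fs, ft = Counter(s), Counter(t)
    return sum(abs(fs[c] - ft[c]) for c in fs.keys() | ft.keys())


def may_be_similar(s: str, t: str, tau: int) -> bool:
    """Cheap necessary condition for edit_distance(s, t) <= tau (hypothetical helper)."""
    return l1_frequency_distance(s, t) <= 2 * tau


def similarity_join(strings: list[str], tau: int) -> list[tuple[str, str]]:
    """Naive self-join: filter each pair by frequency, then verify by edit distance."""
    result = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s, t = strings[i], strings[j]
            if may_be_similar(s, t, tau) and edit_distance(s, t) <= tau:
                result.append((s, t))
    return result
```

The join here still enumerates every pair, so it only shows where a frequency filter sits relative to verification. The partition-based candidate enumeration described in the abstract is not reproduced.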
