- About 7,090 characters
- About 52 pages
- Published 2017-10-02 in Guangdong
Text Mining Technology Fundamentals
Vector similarity algorithms: cosine-based similarity; correlation-based similarity (Pearson correlation coefficient); adjusted-cosine similarity

Document similarity, where:
D_i is document i
W_ij is the weight of the i-th feature term in the j-th document vector

Vector Space Model

Vector space model example (source: http://bit.ly/cbDyIK)

Inverted Files

Word-Level Inverted File

In Lucene, a TermFreqVector is a representation of all of the terms and term counts in a specific Field of a Document instance.
As a tuple: termFreq = <term, term count_D>; <fieldName, <…, termFreq_i, termFreq_i+1, …>>
As Java: public String getField(); public String[] getTerms(); public int[] getTermFrequencies();

Lucene Term Vectors (TV) Parallel Arr
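
The ideas named in the slides above (per-document term counts, cosine-based similarity between document vectors, and a word-level inverted file) can be illustrated with a minimal Python sketch. It is not Lucene's API and not the slides' own code: the function names `term_freq_vector`, `cosine_similarity`, and `build_word_level_index` are hypothetical, tokenization is plain whitespace splitting, and the weight W_ij is assumed to be the raw term count rather than any weighting the original slides may have used.

```python
import math
from collections import Counter, defaultdict


def term_freq_vector(text):
    """Map each term of a document to its count (raw count used as the weight)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * vec_b.get(term, 0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def build_word_level_index(docs):
    """Word-level inverted file: term -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Two toy documents, invented for illustration only.
docs = {
    1: "text mining finds patterns in text",
    2: "data mining finds patterns in data",
}
vectors = {doc_id: term_freq_vector(text) for doc_id, text in docs.items()}
index = build_word_level_index(docs)

print(cosine_similarity(vectors[1], vectors[2]))
print(index["text"])
print(index["mining"])
```

The two toy documents share four of their six tokens, so their cosine similarity comes out at roughly one half. The index entry for "text" records both the document and the word offsets where the term occurs, which is what distinguishes a word-level inverted file from one that stores document identifiers only.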