Overview of Information Retrieval Technology.ppt

Published 2017-03-10 in Tianjin

Overview of Information Retrieval Technology

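Perceptron function: o(X) = sgn(W * X)
All boolean functions can be represented by some network of perceptrons only two levels deep.
Learning a perceptron function means choosing a proper weight vector W: begin with a random W, then iteratively apply the perceptron to each training example, modifying W whenever it misclassifies an example. Repeat until the perceptron classifies all examples correctly (this terminates only if the training examples are linearly separable).
Training rule: wi = wi + Δwi, where Δwi = λ (t - o) xi; t is the target output, o is the actual output, and λ is the learning rate.
Gradient descent and the delta rule.

A multi-layered network learned by the BackPropagation algorithm can express a rich variety of non-linear decision surfaces.
The sigmoid unit is a differentiable threshold unit: o = σ(W * X), where σ(Y) = 1/(1 + e^(-Y)).
BackPropagation learns the weights for a multi-layered network.
It uses gradient descent to attempt to minimize the squared error between the target and actual outputs.
It searches the space of possible hypotheses, iteratively reducing the error in the network's fit to the training examples.
It can invent new features that are not explicit in the input to the network.
It is widely used and very successful, but be aware of over-fitting.

Overview of Information Retrieval Technology
Basic concepts
Metrics for evaluating information retrieval
Retrieval strategies
Vector space model
Techniques for improving retrieval efficiency
Cross-language retrieval

Problem definition / concepts
Before any user submits a query, an index is built over a static or nearly static collection of documents.
A user submits a query.
The documents relevant to the query are ranked by their similarity to it, and the ranked result is returned to the user.
Information retrieval (IR) does not simply look for matching patterns; it aims to find relevant documents.

Evaluation metrics
Effectiveness: how well documents are ranked by their relevance to the user's query.
Efficiency: how quickly the documents can be ranked.
Two measures of effectiveness:
Precision = relevant retrieved / retrieved
Recall = relevant retrieved / relevant

Retrieval strategies
Every strategy measures the similarity between a document and a query.
All strategies share one starting point: the more terms (words) a query and a document have in common, the more relevant the document is considered to be to the query.
A retrieval strategy is an algorithm that, given a query Q and a set of documents D1, D2, ..., Dn, computes the similarity coefficient SC(Q, Di) between each document Di and Q.
The most commonly used retrieval strategy is the vector space model.

Vector space model
The model rests on the premise that a document's content is expressed by the words it uses.
The more closely the content of a document matches the content of a query, the more similar the two are taken to be.
A vector is defined for each document, and likewise for the query.
The similarity coefficient is usually computed as the inner product of the two vectors.

Vector space model (2)
The tf/idf weighting scheme is commonly used. It is simple.
t: the number of distinct terms (words) that occur in the document collection
tfij: the number of times term tj occurs in document Di
dfj: the number of documents in the collection that contain term tj

Vector space model (3)
idf = l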
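
The vector space model described above can be illustrated with a minimal Python sketch. It builds a term-frequency vector for the query and for each document, computes the similarity coefficient SC(Q, Di) as the inner product of the two vectors, and ranks the documents by that score. The function names, the sample documents, and the whitespace tokenization are all assumptions made for illustration and do not come from the slides. Weights here are raw tf counts only; idf weighting is deliberately left out.

```python
from collections import Counter

# Hypothetical sample collection and query, chosen only for illustration.
DOCS = [
    "gold silver truck",
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
]
QUERY = "silver truck"


def term_frequencies(text):
    """Map each term to the number of times it occurs in the text (tf)."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    return Counter(text.lower().split())


def document_frequencies(doc_tfs):
    """Map each term to the number of documents that contain it (df)."""
    df = Counter()
    for tf in doc_tfs:
        df.update(tf.keys())  # each document counts at most once per term
    return df


def similarity_coefficient(query_tf, doc_tf):
    """Inner product of two sparse term-frequency vectors."""
    # Counter returns 0 for terms the document does not contain.
    return sum(count * doc_tf[term] for term, count in query_tf.items())


def rank(query, documents):
    """Return (score, document index) pairs, highest score first."""
    query_tf = term_frequencies(query)
    scored = [
        (similarity_coefficient(query_tf, term_frequencies(doc)), index)
        for index, doc in enumerate(documents)
    ]
    # Sort by descending score; break ties by document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored


if __name__ == "__main__":
    for score, index in rank(QUERY, DOCS):
        print(score, DOCS[index])
```

With this sample data the third document ranks first (score 3, because "silver" occurs twice and "truck" once), the first document second (score 2), and the second document last (score 0). A fuller version would likely multiply each tf count by an idf weight derived from `document_frequencies`, so that terms occurring in many documents contribute less to the score.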