Introduction to Using Jericho HTML Parser

Published 2016-12-03, Henan


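A First Look at Jericho HTML Parser
Author: SharpStill

Jericho is one of the most popular recent HTML parsing frameworks on SourceForge, and for good reason. It is not yet widely used in China, however, so Chinese-language tutorials and reference material are scarce, which makes it harder to learn. So let us judge the framework by the things a beginner cares about: how hard it is to learn and how much code it takes. It can be used to build an open-source crawler engine.

This example parses a shopping site like Taobao. Taobao's pages fall into several kinds: /go/chn/game, … (similar to album), … (similar to video), and many more pages of this sort. We use Jericho HTML Parser as the page parsing framework to see what it can do. The XML for this parsing framework is written as follows: [XML listing missing from the source]

The core class of Jericho HTML Parser is Source. A Source represents an HTML document and can be built from a URL or from a String. Its documentation contains this sentence:

In certain circumstances you may be able to improve performance by calling the fullSequentialParse() method before calling any tag search methods. See the documentation of the fullSequentialParse() method for details.

In other words, in certain situations fullSequentialParse() can speed up parsing. The documentation of that method says:

Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed.

If a class is going to parse most or all of the tags, for example when extracting every URL or image link from a page, calling this method speeds up the extraction. One point deserves attention: the call pays off only when it comes immediately after the Source object is constructed, before any tag search method. After that, call the Tag Search Methods (described in detail in the documentation) as usual.

We take extracting this page as the example [page URL missing from the source]. The page contains the following items:

price
shipping information
region
favorites (popularity)
item type (宝贝类型)

    package com.test.html;

    import java.util.List;

    import net.htmlparser.jericho.Element;
    import net.htmlparser.jericho.Source;

    import com.test.html.bean.ShoppingDetail;

    public class HtmlParseTest {
        public static ShoppingDetail extract(String inputHtml) {
            Source source = new Source(inputHtml);
            // The bid form carries price, region, and shipping data as <input> elements.
            Element form = source.getElementById("J_FrmBid");
            List<Element> inputArea = form.getAllElements("input");
            String price = "";
            String area = "";
            String transportInfo = "";
            for (Element input : inputArea) {
                // Compare constant-first so an <input> without a name attribute cannot throw.
                String name = input.getAttributeValue("name");
                if ("buy_now".equals(name)) price = input.getAttributeValue("value");
                if ("region".equals(name)) area = input.getAttributeValue("value");
                if ("who_pay_ship".equals(name)) transportInfo = input.getAttributeValue("value");
            }
            Element others = source.getAllElementsByClass("other clearfix").get(0);
            // Flatten the block to plain text, then search it for the item-type label.
            String otherInfo = others.getContent().getTextExtractor().toString().trim();
            int startBabyType = otherInfo.indexOf("宝贝类型：");
            ...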
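The listing above begins to locate the item-type label (宝贝类型：) in the extracted text with indexOf. How the original code continues is not shown, so the following is not the author's method. It is only a minimal, self-contained sketch of one way to slice the value that follows a label out of flattened text. The class name LabeledFieldSketch, the helper valueAfter, the ASCII labels, and the sample string are all hypothetical stand-ins; on the real page the label would be the Chinese text, and the sketch assumes that values contain no whitespace.

```java
import java.util.Optional;

public class LabeledFieldSketch {

    /**
     * Returns the token that follows the given label in the text: everything
     * after the label (skipping leading whitespace) up to the next whitespace
     * character or the end of the string. Returns an empty Optional when the
     * label does not occur. Hypothetical helper, not part of Jericho.
     */
    public static Optional<String> valueAfter(String text, String label) {
        int labelStart = text.indexOf(label);
        if (labelStart < 0) {
            return Optional.empty(); // label not present in the text
        }
        int valueStart = labelStart + label.length();
        // Skip any whitespace between the label and its value.
        while (valueStart < text.length()
                && Character.isWhitespace(text.charAt(valueStart))) {
            valueStart++;
        }
        // The value ends at the next whitespace character or at end of text.
        int valueEnd = valueStart;
        while (valueEnd < text.length()
                && !Character.isWhitespace(text.charAt(valueEnd))) {
            valueEnd++;
        }
        return Optional.of(text.substring(valueStart, valueEnd));
    }

    public static void main(String[] args) {
        // Assumed shape of the flattened text; the real page text may differ.
        String otherInfo = "Item type: new Region: Henan";
        System.out.println(valueAfter(otherInfo, "Item type:").orElse("(not found)"));
        System.out.println(valueAfter(otherInfo, "Seller:").orElse("(not found)"));
    }
}
```

Returning an Optional instead of using the raw indexOf result keeps the "label missing" case explicit, so a page layout change shows up as an empty value and not as a StringIndexOutOfBoundsException from substring.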