Data Collection Comprehensive Practical Training Requirements.docx

Published 2022-12-12 in Chongqing


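Data Collection Comprehensive Practical Training Report

Title: Data Collection Comprehensive Practical Training
Class: 20 Big Data, Class 4
Name: 张孝琪
Major: Big Data Technology and Application
Student ID: 2002381
重庆工商职业学院, June 2022

Contents

Project 1
    I. Project Description
    II. Project Analysis
    III. Crawling Implementation
Project 2
    I. Project Description
    II. Project Analysis
    III. Crawling Implementation
Project 3
    I. Project Description
    II. Project Analysis
    III. Crawling Implementation
Project 4
    I. Project Description
    II. Project Analysis
    III. Crawling Implementation
Summary

Project 1

I. Project Description

Crawl the novel Journey to the West (西游记) from a novel website: /book/xiyouji.html

Formatting: body text at 1.5 line spacing, font size 小四 (12 pt), first line indented by two characters.

II. Project Analysis

Environment setup

Basic environment setup is omitted (Python, pip, and PyCharm are assumed to be installed and working).

1) Install requests (sends requests and fetches pages):

    pip install requests

2) Install BeautifulSoup and lxml (parse pages and extract data). The code below calls BeautifulSoup with the lxml parser, so both packages are needed:

    pip install beautifulsoup4    (the package imported as bs4)
    pip install lxml

Fetching the page

1) Get the link:

    url = "/book/xiyouji.html"    # the host name is missing from the source

2) Determine how the site is accessed (it uses a GET request):

    # session is assumed to be a requests.Session(); headers is defined in section III
    html = session.get(url, headers=headers, verify=False).content.decode("utf-8")
    soup = BeautifulSoup(html, "lxml")

Extracting the data

(1) Get the set of chapter links for Journey to the West (inspect the page with F12):

    aLink = soup.find(id="content").find_all("a")

(2) Get the link address of each chapter:

    for i in aLink:
        dic = {}
        # the site root that was prepended here is missing from the source
        reLink = "" + str(i.attrs["href"])

(3) Fetch each chapter page (inside the loop above):

        link = requests.get(reLink).content.decode("utf-8")
        bs = BeautifulSoup(link, "lxml")

(4) Parse the page and extract the Journey to the West data (inspect the page with F12).

Saving the data

    with open("xiyouji-1001.txt", "a", encoding="utf-8") as xyj100:
        xyj100.write(mulu_list + "\n")
        xyj100.write(dic["num"] + "\n")
        xyj100.write(dic["analysis"] + "\n")

III. Crawling Implementation

Environment

Software: PyCharm
Hardware: mouse, keyboard, monitor

Implementation steps

    import requests
    from bs4 import BeautifulSoup

    url = "/book/xiyouji.html"    # the host name is missing from the source
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36"}
    html = requests.get(url, headers=headers).content.decode("utf-8")
    soup = BeautifulSoup(html, "lxml")
    print("Crawling started")

    # Book metadata lives in the "card bookmark-list" block
    card = soup.find(class_="card bookmark-list")
    title = card.h1.text
    time = card.find_all("p")[0].text
    author = card.find_all("p")[1].text
    content = card.find_all("p")[2].text
    soup.prettify()    # return value is discarded, so this line has no effect

    # Links in the metadata block and in the chapter list ("book-mulu")
    aLink = card.find_all("a")
    mulu = soup.find(class_="book-mulu").find_all("a")
    with open("xiyouji-1001.txt", "a", encodi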
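Step (2) of the project analysis above builds each chapter URL by prepending the site root to the link's href. A minimal sketch of that step follows, under several assumptions. It swaps the report's BeautifulSoup calls for the standard library's html.parser so that it runs without third-party packages or network access. BASE_URL is a hypothetical placeholder, because the real host name is missing from the source. SAMPLE_HTML is invented markup that only mimics a chapter list inside a book-mulu container. ChapterLinkParser and chapter_links are illustrative names, not part of the report.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical placeholder: the real host name is missing from the source.
BASE_URL = "https://example.com/book/xiyouji.html"

# Invented sample markup standing in for the downloaded contents page.
SAMPLE_HTML = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/xiyouji/1.html">Chapter 1</a></li>
    <li><a href="/book/xiyouji/2.html">Chapter 2</a></li>
  </ul>
</div>
<a href="/about.html">About</a>
"""

# Tags that never have a closing tag; skipped so depth tracking stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ChapterLinkParser(HTMLParser):
    """Collect (title, href) pairs for <a> tags inside one container class."""

    def __init__(self, container_class):
        super().__init__()
        self.container_class = container_class
        self.depth = 0            # 0 means we are outside the container
        self.current_href = None  # href of the <a> currently being read
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        if self.depth:
            self.depth += 1
            if tag == "a" and attr_map.get("href"):
                self.current_href = attr_map["href"]
                self.current_text = []
        elif self.container_class in (attr_map.get("class") or "").split():
            self.depth = 1  # entered the container element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.links.append((title, self.current_href))
            self.current_href = None
        self.depth -= 1

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)


def chapter_links(html, base_url, container_class="book-mulu"):
    """Return (title, absolute_url) pairs for links inside the container."""
    parser = ChapterLinkParser(container_class)
    parser.feed(html)
    return [(title, urljoin(base_url, href)) for title, href in parser.links]


links = chapter_links(SAMPLE_HTML, BASE_URL)
for title, chapter_url in links:
    print(title, chapter_url)
```

Here urljoin resolves both root-relative and page-relative hrefs against the contents-page URL, so the site root does not have to be concatenated by hand as in reLink. Links outside the container, such as the About link in the sample, are ignored.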
