Python 爬虫与网页解析

20 Jun 2018

ajax技术的核心是XMLHttpRequest对象(简称XHR)

创建一个XHR对象，也叫实例化一个XHR对象，因为XMLHTTPRequest()是一个构造函数。下面是创建XHR对象的兼容写法

	var xmlhttp;
	if (window.XMLHttpRequest)
  	{// code for IE7+, Firefox, Chrome, Opera, Safari
  		xmlhttp=new XMLHttpRequest();
  	}
	else
  	{// code for IE6, IE5
  		xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
  	}

[注意]如果要建立N个不同的请求，就要使用N个不同的XHR对象。当然可以重用已存在的XHR对象，但这会终止之前通过该对象挂起的任何请求

GET 请求

一个简单的 GET 请求：

xmlhttp.open("GET","demo_get.asp",true);	
xmlhttp.send();

GET 还是 POST？

与 POST 相比，GET 更简单也更快，并且在大部分情况下都能用。然而，在以下情况中，请使用 POST 请求：

无法使用缓存文件（更新服务器上的文件或数据库）
向服务器发送大量数据（POST 没有数据量限制）
发送包含未知字符的用户输入时，POST 比 GET 更稳定也更可靠

xmlhttp.open(“POST”,”ajax_test.asp”,true); xmlhttp.setRequestHeader(“Content-type”,”application/x-www-form-urlencoded”); xmlhttp.send(“fname=Bill&lname=Gates”);

Python-第三方库requests

Requests 是用Python语言编写，基于 urllib，采用 Apache2 Licensed 开源协议的 HTTP 库。它比 urllib 更加方便，可以节约我们大量的工作，完全满足 HTTP 测试需求。Requests 的哲学是以 PEP 20 的习语为中心开发的，所以它比 urllib 更加 Pythoner。更重要的一点是它支持 Python3 ！

Beautiful is better than ugly.(美丽优于丑陋) Explicit is better than implicit.(清楚优于含糊) Simple is better than complex.(简单优于复杂) Complex is better than complicated.(复杂优于繁琐) Readability counts.(重要的是可读性)

Python3 解析 html

HTMLParser

python 自带了一个类，叫 HTMLParser。我们用的时候需要自己定义一个类，继承自 HTMLParser, 然后重写一部分方法:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')

result:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

BeautifulSoup

BeautifulSoup 是一个用于解析 HTML 文档的 Python 库，通过 BeautifulSoup，你只需要用很少的代码就可以提取出 HTML 中任何感兴趣的内容，此外，它还有一定的 HTML 容错能力，对于一个格式不完整的HTML 文档，它也可以正确处理。

构建一个 BeautifulSoup 对象需要两个参数，第一个参数是将要解析的 HTML 文本字符串，第二个参数告诉 BeautifulSoup 使用哪个解析器来解析 HTML。”html.parser” 是Python内置的解析器，”lxml” 则是一个基于c语言开发的解析器，它的执行速度更快，不过它需要额外安装

soup = BeautifulSoup(text, "html.parser")

def bs4_paraser(html):
    all_value = []
    value = {}
    soup = BeautifulSoup(html, 'html.parser')
    # 获取影评的部分
    all_div = soup.find_all('div', attrs={'class': 'yingping-list-wrap'}, limit=1)
    for row in all_div:
        # 获取每一个影评，即影评的item
        all_div_item = row.find_all('div', attrs={'class': 'item'})
        for r in all_div_item:
            # 获取影评的标题部分
            title = r.find_all('div', attrs={'class': 'g-clear title-wrap'}, limit=1)
            if title is not None and len(title) > 0:
                value['title'] = title[0].a.string
                value['title_href'] = title[0].a['href']
                score_text = title[0].div.span.span['style']
                score_text = re.search(r'\d+', score_text).group()
                value['score'] = int(score_text) / 20
                # 时间
                value['time'] = title[0].div.find_all('span', attrs={'class': 'time'})[0].string
                # 多少人喜欢
                value['people'] = int(
                        re.search(r'\d+', title[0].find_all('div', attrs={'class': 'num'})[0].span.string).group())
            # print r
            all_value.append(value)
            value = {}
    return all_value

搜索文档树是通过指定标签名来搜索元素，另外还可以通过指定标签的属性值来精确定位某个节点元素，最常用的两个方法就是** find 和 find_all**。这两个方法在 BeatifulSoup 和 Tag 对象上都可以被调用。

找到所有class属性为big的p标签
>>> soup.find_all("p", "big")
[<p class="big"> \xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>]
等效于
>>> soup.find_all("p", class_="big")
[<p class="big"> \xb5\xda\xb6\xfe\xb8\xf6p\xb1\xea\xc7\xa9</p>] 上面的例子中的"class_"是因为不能和关键字"class"重名，获取属性可以用字典，保证不出错。

soup.find_all(attrs={'name':"big"})

requests-html

使用Python开发的同学一定听说过Requsts库，它是一个用于发送HTTP请求的测试。如比我们用Python做基于HTTP协议的接口测试，那么一定会首选Requsts，因为它即简单又强大。现在作者Kenneth Reitz 又开发了requests-html 用于做爬虫。

GiHub项目地址 Only Python 3.6 is supported.

requests-html 是基于现有的框架 PyQuery、Requests、lxml、beautifulsoup4等库进行了二次封装，作者将Requests设计的简单强大的优点带到了该项目中。

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('https://python.org/')

# 获取页面上的所有链接。
all_links =  r.html.links
print(all_links)

# 获取页面上的所有链接，以绝对路径的方式。
all_absolute_links = r.html.absolute_links
print(all_absolute_links)

取图片

from requests_html import HTMLSession
import requests


# 保存图片到bg/目录
def save_image(url, title):
    img_response = requests.get(url)
    with open('./bg/'+title+'.jpg', 'wb') as file:
        file.write(img_response.content)

# 背景图片地址，这里选择1920*1080的背景图片
url = "http://www.win4000.com/wallpaper_2358_0_10_1.html"

session = HTMLSession()
r = session.get(url)

# 查找页面中背景图，找到链接，访问查看大图，并获取大图地址
items_img = r.html.find('ul.clearfix > li > a')
for img in items_img:
    img_url = img.attrs['href']
    if "/wallpaper_detail" in img_url:
        r = session.get(img_url)
        item_img = r.html.find('img.pic-large', first=True)
        url = item_img.attrs['src']
        title = item_img.attrs['title']
        print(url+title)
        save_image(url, title)

参考：