Getting Started with Scrapy

1. A first spider

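Below is the first spider example from the official documentation. Unlike fetching pages by hand with the requests library and parsing them with BeautifulSoup, Scrapy provides a spider base class: we only need to override a few of its methods to get a spider that keeps crawling.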

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

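A few points about the spider above need explaining:

The name attribute identifies the spider. It must be unique within a project.
The start_requests() method must return an iterable of Requests (a list or a generator). Scrapy starts crawling from these requests.
The parse() method extracts content from the downloaded page. We override it to suit our own needs.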
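To make the file naming in parse() concrete, here is a small sketch that applies the same string logic to plain URLs instead of a real response. The helper name page_filename is hypothetical and not part of Scrapy.

```python
def page_filename(url):
    """Mirror the naming logic in parse(): use the second-to-last URL segment."""
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with [..., 'page', '1', ''];
    # the trailing slash leaves an empty last item, so index -2 is the page number.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/2/"))  # quotes-2.html
```

Note that this relies on the URL ending in a slash; without it, index -2 would pick the segment before the page number.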

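The example above uses start_requests() to set the starting URLs. If you only need to list the URLs, there is a shortcut: set the class attribute start_urls, and Scrapy reads it to build the initial requests, using parse() as the default callback.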

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

Running the spider

scrapy list          # list the spiders available in the project
scrapy crawl quotes  # run the spider; with the code above, each page's HTML source is saved to a file

Modify the parse method to extract the useful information from the page

def parse(self, response):
    for quote in response.css('div.quote'):
        yield {
            'text': quote.css('span.text::text').extract_first(),
            'author': quote.css('small.author::text').extract_first(),
            # extract() returns every tag; extract_first() would keep only the first one
            'tags': quote.css('div.tags a.tag::text').extract(),
        }

Storing the scraped data

scrapy crawl quotes -o quotes.json  # write the scraped items to quotes.json, serialized as JSON

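However, running this command more than once does not overwrite the existing data. The new items are appended to the file, which leaves a broken JSON file.
To avoid this, use another format such as JSON Lines: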

scrapy crawl quotes -o quote1.jl

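The JSON Lines format is useful because it is stream-like: new records can simply be appended, so running the command twice does not cause the problem that JSON has. Because each record is on its own line, you can also process large files without loading everything into memory, and tools such as jq can help with that on the command line.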
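The difference can be reproduced with the standard json module alone. The sketch below simulates two runs appending to the same output, first as whole JSON arrays and then as one object per line. The sample records are made up for illustration, and Scrapy's exporters are not involved.

```python
import json

# Made-up sample items standing in for scraped quotes.
records = [{"author": "Albert Einstein"}, {"author": "Jane Austen"}]

# Two runs that each append a complete JSON array to the same file.
json_runs = json.dumps(records) + json.dumps(records)
try:
    json.loads(json_runs)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False  # "[...][...]" is not a single valid JSON document

# Two runs that each append one JSON object per line.
one_run = "".join(json.dumps(r) + "\n" for r in records)
jl_runs = one_run + one_run
parsed = [json.loads(line) for line in jl_runs.splitlines()]

print(json_ok)      # False
print(len(parsed))  # 4
```

Each line of the JSON Lines output is parsed on its own, which is also why a large file can be read one record at a time.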

Setting the export encoding

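If no encoding is configured, every Chinese character in the exported data shows up as a Unicode escape sequence (something like \uA83B). Scrapy 1.2 added the FEED_EXPORT_ENCODING setting to control the output encoding. Add the following line to settings.py: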

FEED_EXPORT_ENCODING = 'utf-8'
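The same escaping can be seen with Python's standard json module, used here only as an illustration of the two output styles. The sample item is an assumption, and this sketch does not show how Scrapy applies the setting internally.

```python
import json

# Made-up item containing Chinese text.
item = {"text": "平常心"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)
# ensure_ascii=False keeps the characters readable in the output.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)   # {"text": "\u5e73\u5e38\u5fc3"}
print(readable)  # {"text": "平常心"}
```

Both strings decode to the same data; only the on-disk representation differs.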

2. Following links to the next page

First, get the link to the next page.

response.css('ul.pager li.next a::attr(href)').extract_first()
# result: '/page/2/'

A spider that follows the next-page link:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                # extract() returns every tag; extract_first() would keep only the first one
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
        next_page = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            # next_page is now an absolute URL; yield a new Request for it
            yield scrapy.Request(next_page, callback=self.parse)

Using a yield expression in a function body turns the function into a generator function. Calling it returns an iterator known as a generator.
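As a minimal illustration of that behaviour outside Scrapy, the sketch below defines a small generator function. The name count_up is hypothetical and exists only for this example.

```python
def count_up(limit):
    """A generator function: the body runs lazily, one yield at a time."""
    n = 1
    while n <= limit:
        yield n
        n += 1


gen = count_up(3)  # nothing in the body has run yet
print(next(gen))   # 1
print(list(gen))   # [2, 3]
```

A parse() method that yields items and requests works the same way: the caller pulls results from it one at a time.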
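Now, after extracting the data, parse() looks for the link to the next page, builds a full absolute URL with urljoin() (because the link may be relative), and yields a new request for that page. It registers itself as the callback, so the next page's data is extracted in the same way and the crawl continues through all pages.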
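To see what resolving a relative link involves, the sketch below uses the standard library's urllib.parse.urljoin as a stand-in, with the page URL as the base. Treat it as an approximation of what response.urljoin() does for this site, not as Scrapy's own code.

```python
from urllib.parse import urljoin

# The URL of the page currently being parsed acts as the base.
base = "http://quotes.toscrape.com/page/1/"

print(urljoin(base, "/page/2/"))              # http://quotes.toscrape.com/page/2/
print(urljoin(base, "../2/"))                 # http://quotes.toscrape.com/page/2/
print(urljoin(base, "http://example.com/x"))  # http://example.com/x
```

A link that is already absolute passes through unchanged, so the join is safe to apply to every link.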

You can also follow links with response.follow(). The argument can be a str or a Selector, but not a SelectorList.

# Option 1: no need to build the absolute URL; follow() resolves relative URLs for us
next_page = response.css('li.next a::attr(href)').extract_first()
# next_page = '/page/2/'
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)

# Option 2: pass the Selector for the href attribute itself
links = response.css('li.next a::attr(href)')
if links:  # indexing an empty SelectorList with [0] would raise IndexError on the last page
    next_page = links[0]
    # <Selector xpath="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' next ')]/descendant-or-self::*/a/@href" data='/page/2/'>
    yield response.follow(next_page, callback=self.parse)

# Option 3: pass the Selector for the <a> element; follow() uses its href attribute automatically
links = response.css('li.next a')
if links:  # indexing an empty SelectorList with [0] would raise IndexError on the last page
    next_page = links[0]
    # <Selector xpath="descendant-or-self::li[@class and contains(concat(' ', normalize-space(@class), ' '), ' next ')]/descendant-or-self::*/a" data='<a href="/page/2/">Next <span aria-hidde'>
    yield response.follow(next_page, callback=self.parse)

Using spider arguments

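Pass the -a option when running a spider to give it command-line arguments, for example scrapy crawl quotes -a tag=humor.
These arguments are passed to the spider's __init__ method and become spider attributes by default.
In this example, the value given for the tag argument is available as self.tag. You can use it to make the spider fetch only quotes with a specific tag, building the URL from the argument: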

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
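The URL-building branch in start_requests() can be tried without running a crawl. In the sketch below, FakeSpider is a hypothetical stand-in that copies keyword arguments onto the instance, which is a simplified model of how -a values end up as spider attributes. It is not Scrapy's Spider class.

```python
class FakeSpider:
    """Hypothetical stand-in for a spider: keyword arguments become attributes."""

    def __init__(self, **kwargs):
        # Simplified model of how "-a key=value" pairs reach the instance.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = "http://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)  # None when no tag argument was given
        if tag is not None:
            url = url + "tag/" + tag
        return url


print(FakeSpider().start_url())             # http://quotes.toscrape.com/
print(FakeSpider(tag="humor").start_url())  # http://quotes.toscrape.com/tag/humor
```

Using getattr() with a default keeps the spider usable both with and without the argument.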