Simple learning notes on Python's Scrapy crawler framework

1. Basic setup: grabbing the content of a single page
(1) Create a Scrapy project

scrapy startproject getblog

(2) Edit items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field


class BlogItem(Item):
    title = Field()
    desc = Field()
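For reference, an Item instance behaves much like a dict (a small interactive sketch; the field values here are made up):

>>> from getblog.items import BlogItem
>>> item = BlogItem(title='some title', desc='some description')
>>> item['title']
'some title'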

(3) Create blog_spider.py under the spiders folder

You need to get familiar with XPath selection. It feels a lot like jQuery selectors, though not quite as comfortable to use (w3school tutorial: http://www.w3school.com.cn/xpath/ ).
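To get a feel for how the two map onto each other, here is a tiny standalone comparison (the HTML snippet is made up purely for illustration):

from scrapy.selector import Selector

body = '<div class="post_item"><h3><a href="/p/1">Hello</a></h3></div>'
sel = Selector(text=body)
# the XPath expression and the jQuery-like CSS selector pick the same node
print(sel.xpath('//div[@class="post_item"]/h3/a/text()').extract())
print(sel.css('div.post_item h3 a::text').extract())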

# coding=utf-8
from scrapy.spider import Spider
from getblog.items import BlogItem
from scrapy.selector import Selector


class BlogSpider(Spider):
    # identifying name of the spider
    name = 'blog'
    # start URLs
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        sel = Selector(response)  # XPath selector
        # select, inside every div whose class attribute is 'post_item',
        # the second child div (the body of each post on the cnblogs front page)
        sites = sel.xpath('//div[@class="post_item"]/div[2]')
        items = []
        for site in sites:
            item = BlogItem()
            # the text of the <a> inside the <h3>: 'text()'
            item['title'] = site.xpath('h3/a/text()').extract()
            # likewise, the text of the summary <p>
            item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
            items.append(item)
        return items

(4) Run it

scrapy crawl blog  # that is all it takes

(5) Output to a file

Configure the feed output in settings.py.

# output file location
FEED_URI = 'blog.xml'
# output format: can be json, xml, or csv
FEED_FORMAT = 'xml'

The output file is written under the project root folder.

2. The basics: scrapy.spider.Spider

(1) Using the interactive shell

dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"

2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   response
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>
# response.body                - the entire body of the fetched page
# response.xpath('//ul/li')    - try out any XPath expression right here
More important, if you type response.selector you will access a selector object you can use to
query the response, and convenient shortcuts like response.xpath() and response.css() mapping to
response.selector.xpath() and response.selector.css().

In other words, you can conveniently and interactively check whether an XPath expression selects what you expect. I used to pick selectors with Firefox's F12 developer tools, but that did not always produce an expression that actually matched.
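For example, with the baidu page still loaded in the shell session above, you can poke at it right away (results omitted here, since they depend on the live page):

>>> response.xpath('//title/text()').extract()
>>> response.xpath('//a/@href').extract()[:5]
>>> fetch('http://www.cnblogs.com/')   # load a different page and keep experimenting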

You can also run:

scrapy shell 'http://scrapy.org' --nolog
# the --nolog flag suppresses the log output

(2) Example

from scrapy import Spider
from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
                  'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/']

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

(3) Saving to a file

You can export the scraped items to a file; the format can be json, xml, or csv:

scrapy crawl dmoz -o a.json -t json  # the spider name ('dmoz' here) is required

(4) Creating a spider from a template

scrapy genspider baidu baidu.com

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass
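genspider has a couple of other options worth knowing: -l lists the available templates and -t picks one (the spider name and domain below are just placeholders):

scrapy genspider -l                               # list templates: basic, crawl, csvfeed, xmlfeed
scrapy genspider -t crawl somespider example.com  # generate a spider from the 'crawl' template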

That is it for this part. I remember there were five points earlier, but now I can only recall four. 🙁

And always remember to click the save button as you go, otherwise it really spoils your mood (⊙o⊙)!

3. Advanced: scrapy.contrib.spiders.CrawlSpider

Example:

# coding=utf-8
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class TestItem(scrapy.Item):
    # fields filled in by parse_item below
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']
    rules = (
        # a tuple of Rule objects:
        # follow 'category.php' links, but skip 'subsection.php' links
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),
        # hand 'item.php' pages to the parse_item callback
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Item page: %s' % response.url)
        item = TestItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

There are also other spider base classes, such as XMLFeedSpider:

class scrapy.contrib.spiders.XMLFeedSpider
class scrapy.contrib.spiders.CSVFeedSpider
class scrapy.contrib.spiders.SitemapSpider
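As a quick illustration of the first of these, an XMLFeedSpider iterates over one tag of an XML feed and hands each node to parse_node. A minimal sketch, assuming a hypothetical feed URL and reusing the DmozItem fields from part 2:

from scrapy.contrib.spiders import XMLFeedSpider
from scrapy_test.items import DmozItem   # stand-in item class from the earlier example


class FeedSpider(XMLFeedSpider):
    name = 'feed'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']   # hypothetical feed URL
    iterator = 'iternodes'   # the default, streaming node iterator
    itertag = 'item'         # iterate over every <item> node in the feed

    def parse_node(self, response, node):
        item = DmozItem()
        item['title'] = node.xpath('title/text()').extract()
        item['link'] = node.xpath('link/text()').extract()
        item['desc'] = node.xpath('description/text()').extract()
        return item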

4. Selectors

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

You can use .css() and .xpath() flexibly to pick out the target data quickly.

Selectors are worth studying properly: xpath() and css(), plus regular expressions, which I still need to get more familiar with.

When selecting by class, prefer css() to locate the elements first, and then use xpath() to pick out the elements' attributes, as sketched below.
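For example (a sketch against the cnblogs markup from part 1; the class name is an assumption about that page), css() narrows things down by class and xpath() then extracts text and attributes:

# select the post containers by class with css(), then pull out
# text and attributes with xpath() on each resulting selector
for post in response.css('div.post_item'):
    title = post.xpath('.//h3/a/text()').extract()
    link = post.xpath('.//h3/a/@href').extract()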

5. Item pipelines

Typical uses of item pipelines are:
• cleansing HTML data
• validating scraped data (checking that the items contain certain fields)
• checking for duplicates (and dropping them)
• storing the scraped item in a database
(1) Validating data

from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        # adjust the price if it excludes VAT; drop the item if it has no price at all
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] *= self.vat_factor
            return item
        else:
            raise DropItem('Missing price in %s' % item)

(2) Writing items to a JSON file

import json


class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('json.jl', 'wb')

    def process_item(self, item, spider):
        # one JSON object per line (JSON Lines format)
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item

(3) Checking for duplicates

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        # drop any item whose id has already been seen
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found: %s' % item)
        else:
            self.ids_seen.add(item['id'])
            return item
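For any of these pipelines to actually run, they have to be enabled in settings.py. A minimal sketch, assuming the classes above live in a getblog/pipelines.py module (the module path and the order numbers are illustrative):

# settings.py
ITEM_PIPELINES = {
    'getblog.pipelines.DuplicatesPipeline': 100,   # lower number = runs earlier
    'getblog.pipelines.PricePipeline': 300,
    'getblog.pipelines.JsonWriterPipeline': 800,
}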

As for writing the data into a database, that should be simple as well: just store the item from within the process_item method.
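For example, a pipeline along these lines could write every item into a local SQLite database (a sketch only; the database file, table, and columns are made up, and it assumes the BlogItem fields from part 1):

import sqlite3


class SQLitePipeline(object):
    def open_spider(self, spider):
        # hypothetical database file and table
        self.conn = sqlite3.connect('items.db')
        self.conn.execute('CREATE TABLE IF NOT EXISTS blog (title TEXT, desc TEXT)')

    def process_item(self, item, spider):
        # extract() returns lists, so join them into plain strings before inserting
        self.conn.execute('INSERT INTO blog VALUES (?, ?)',
                          (u''.join(item.get('title', [])), u''.join(item.get('desc', []))))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()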
