Pagination can request data either synchronously or asynchronously: synchronous pagination reloads the whole page, while asynchronous pagination refreshes only part of it. The two kinds have to be handled differently when scraping. The demos below are for learning only; all domain names have been anonymized to test.
Synchronous pagination
With synchronous pagination, the whole page reloads and the URL in the address bar changes.
The data the spider parses is HTML.
Test scenario: scrape Java job postings for the Beijing area from a recruitment site.
# coding=utf-8
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError, TCPTimedOutError

class TestSpider(scrapy.Spider):
    name = 'test'
    download_delay = 3
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
    page_url = 'http://www.test.com/zhaopin/java/{0}/?filteroption=2'
    page = 1

    # entry point
    def start_requests(self):
        # first page
        yield scrapy.Request(self.page_url.format('1'),
                             headers={'User-Agent': self.user_agent},
                             callback=self.parse,
                             errback=self.errback_httpbin)

    # parse the returned HTML
    def parse(self, response):
        # the id in the original selector was lost during extraction;
        # "job_list" below is a placeholder for the id of the job-list container
        for li in response.xpath('//*[@id="job_list"]/ul/li'):
            yield {
                'company': li.xpath('@data-company').extract(),
                'salary': li.xpath('@data-salary').extract()
            }
        # is this the last page? decided by the CSS class of the "next page" button
        if response.css('a.page_no.pager_next_disabled'):
            print('---is the last page, stop!---')
        else:
            self.page = self.page + 1
            # fetch the next page
            yield scrapy.Request(self.page_url.format(str(self.page)),
                                 headers={'User-Agent': self.user_agent},
                                 callback=self.parse,
                                 errback=self.errback_httpbin)

    # error handling
    def errback_httpbin(self, failure):
        if failure.check(HttpError):
            response = failure.value.response
            print('HttpError on {0}'.format(response.url))
        elif failure.check(DNSLookupError):
            request = failure.request
            print('DNSLookupError on {0}'.format(request.url))
        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            print('TimeoutError on {0}'.format(request.url))
Start the spider with: scrapy runspider //spiders//test_spider.py -o test.csv. When it finishes, a file in CSV format is generated.
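The same crawl can also be driven from a script instead of the command line. Below is a minimal sketch, assuming Scrapy 2.1+ (which provides the FEEDS setting) and that the spider above is importable as spiders.test_spider (a hypothetical module path):

from scrapy.crawler import CrawlerProcess
from spiders.test_spider import TestSpider  # hypothetical import path

process = CrawlerProcess(settings={
    # equivalent to the -o test.csv option of scrapy runspider
    'FEEDS': {'test.csv': {'format': 'csv'}},
})
process.crawl(TestSpider)
process.start()  # blocks until the crawl finishes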
Asynchronous pagination
With asynchronous pagination, only part of the page refreshes and the URL in the address bar does not change.
The data the spider parses is usually JSON.
Test scenario: scrape the top 100 classic movies from a movie site.
# coding=utf-8
import scrapy
import json

class TestSpider(scrapy.Spider):
    name = 'test'
    download_delay = 3
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'
    pre_url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%e7%bb%8f%e5%85%b8&sort=recommend&page_limit=20&page_start='
    page = 0
    cnt = 0

    def start_requests(self):
        url = self.pre_url + str(self.page * 20)  # first page: page_start=0
        yield scrapy.Request(url, headers={'User-Agent': self.user_agent}, callback=self.parse)

    def parse(self, response):
        if response.body:
            # convert the JSON string into a Python object
            python_obj = json.loads(response.body)
            subjects = python_obj['subjects']
            if len(subjects) > 0:
                for sub in subjects:
                    self.cnt = self.cnt + 1
                    yield {
                        'title': sub["title"],
                        'rate': sub["rate"]
                    }
                # the original post is cut off after "if self.cnt"; what follows is a
                # reconstruction of the likely logic: stop once 100 movies have been
                # collected, otherwise request the next page (page_start advances by 20)
                if self.cnt < 100:
                    self.page = self.page + 1
                    yield scrapy.Request(self.pre_url + str(self.page * 20),
                                         headers={'User-Agent': self.user_agent},
                                         callback=self.parse)
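Before writing the spider, it is worth confirming what the XHR endpoint actually returns. The following sketch (not part of the original post) uses the requests library to fetch one page and print a few fields; the subjects/title/rate field names are the same ones the spider above reads:

import requests

pre_url = ('https://movie.douban.com/j/search_subjects?type=movie'
           '&tag=%e7%bb%8f%e5%85%b8&sort=recommend&page_limit=20&page_start=')

resp = requests.get(pre_url + '0', headers={'User-Agent': 'Mozilla/5.0'})  # first page
data = resp.json()
for sub in data['subjects'][:3]:  # print a few entries to verify the field names
    print(sub['title'], sub['rate'])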