Python Network Programming Study Notes (7): Parsing HTML and XHTML (HTMLParser, BeautifulSoup)

I. Parsing web pages with HTMLParser

For details, see the official HTMLParser documentation: http://docs.python.org/library/htmlparser.html#htmlparser.htmlparser

1. Starting from a simple parsing example

Example 1: the contents of test1.html are as follows:

<html>
<head>
<title>xhtml 与 html 4.01 标准没有太多的不同</title>
</head>
<body>i love you</body>
</html>

The following program lists the title and the body:

# -*- coding: utf-8 -*-
## @小五义
## HTMLParser example
import HTMLParser

class TitleParser(HTMLParser.HTMLParser):
    def __init__(self):
        self.taglevels = []
        self.handledtags = ['title', 'body']  # tags to extract
        self.processing = None
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag in self.handledtags:
            self.data = ''
            self.processing = tag

    def handle_data(self, data):
        if self.processing:
            self.data += data

    def handle_endtag(self, tag):
        if tag == self.processing:
            print str(tag) + ':' + str(self.gettitle())
            self.processing = None

    def gettitle(self):
        return self.data

fd = open('test1.html')
tp = TitleParser()
tp.feed(fd.read())

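The output is:
title: xhtml 与 html 4.01 标准没有太多的不同
body: i love you

The program defines a TitleParser class, which is a subclass of HTMLParser. The feed method of HTMLParser receives the data and parses it through the parser object that has been defined. handle_starttag and handle_endtag detect the start and end tags, while handle_data collects the text: if self.processing is not None, the incoming data is appended.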
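The callback flow described above can also be sketched for Python 3, where the module is named html.parser instead of HTMLParser. This is only a minimal illustration, not the code from these notes: the class name TagTextParser, the results dictionary and the inline sample string are assumptions made for the example.

```python
from html.parser import HTMLParser


class TagTextParser(HTMLParser):
    """Collects the text found inside a chosen set of tags."""

    def __init__(self, handled_tags=("title", "body")):
        super().__init__()
        self.handled_tags = handled_tags
        self.processing = None   # tag currently being collected, if any
        self.data = ""
        self.results = {}        # tag name -> collected text

    def handle_starttag(self, tag, attrs):
        if tag in self.handled_tags:
            self.data = ""
            self.processing = tag

    def handle_data(self, data):
        if self.processing:
            self.data += data

    def handle_endtag(self, tag):
        if tag == self.processing:
            self.results[tag] = self.data.strip()
            self.processing = None


# Hypothetical sample document, fed as a string instead of a file.
sample = (
    "<html><head><title>xhtml and html 4.01</title></head>"
    "<body>i love you</body></html>"
)
parser = TagTextParser()
parser.feed(sample)
for tag, text in parser.results.items():
    print(tag + ":" + text)
```

Storing the text in a dictionary rather than printing inside handle_endtag keeps the parser reusable; the printing is left to the caller.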

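2. Handling HTML entities (useful character entities in HTML)

(1) Entity names

When the HTML contains entities, the example above no longer works. For instance, change test1.html to the following:

Example 2: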

<html>
<head>
<title>xhtml 与&quot; html 4.01 &quot;标准没有太多的不同</title>
</head>
<body>i love you&times;</body>
</html>

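Parsing this file with the program from Example 1 gives:
title: xhtml 与 html 4.01 标准没有太多的不同
body: i love you

The entities have disappeared completely. This is because HTMLParser calls the handle_entityref() method whenever it meets an entity, and since the code does not define that method, nothing is done with it. The modified code is as follows: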

# -*- coding: utf-8 -*-
## @小五义
## HTMLParser example: handling entities
from htmlentitydefs import entitydefs
import HTMLParser

class TitleParser(HTMLParser.HTMLParser):
    def __init__(self):
        self.taglevels = []
        self.handledtags = ['title', 'body']
        self.processing = None
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag in self.handledtags:
            self.data = ''
            self.processing = tag

    def handle_data(self, data):
        if self.processing:
            self.data += data

    def handle_endtag(self, tag):
        if tag == self.processing:
            print str(tag) + ':' + str(self.gettitle())
            self.processing = None

    def handle_entityref(self, name):
        # known entity: substitute its character; unknown: keep it verbatim
        if name in entitydefs:
            self.handle_data(entitydefs[name])
        else:
            self.handle_data('&' + name + ';')

    def gettitle(self):
        return self.data

fd = open('test1.html')
tp = TitleParser()
tp.feed(fd.read())

The output is:
title: xhtml 与" html 4.01 "标准没有太多的不同
body: i love you×

Now all the entities are shown.
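For comparison, here is a hedged Python 3 sketch of the same idea. In Python 3 the parser decodes entities automatically by default, so handle_entityref() is only called when convert_charrefs=False is passed to the constructor. The class name EntityTextParser and the sample string (including the made-up entity &unknownname;) are assumptions for illustration.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityTextParser(HTMLParser):
    """Accumulates text, resolving named entities by hand."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_entityref()
        # instead of decoding entities on its own.
        super().__init__(convert_charrefs=False)
        self.text = ""

    def handle_data(self, data):
        self.text += data

    def handle_entityref(self, name):
        if name in name2codepoint:
            self.handle_data(chr(name2codepoint[name]))
        else:
            # Unknown entity: keep it verbatim, as in the example above.
            self.handle_data("&" + name + ";")


parser = EntityTextParser()
parser.feed("<body>i love you&times; &unknownname;</body>")
parser.close()
print(parser.text)
```

name2codepoint maps an entity name to a Unicode code point, so chr() yields the character directly instead of the Latin-1 byte string that Python 2's entitydefs returns.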

(2) Numeric character references

Example 3:

<html>
<head>
<title>xhtml 与&quot; html 4.01 &quot;标准没有太多的不同</title>
</head>
<body>i love&#247; you&times;</body>
</html>

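Running the code from Example 2 on this file gives:
title: xhtml 与" html 4.01 "标准没有太多的不同
body: i love you×

The ÷ that corresponds to &#247; does not appear in the result. Adding handle_charref() takes care of it. The code is as follows: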

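    # added to the TitleParser class from Example 2
    def handle_charref(self, name):
        try:
            charnum = int(name)
        except ValueError:
            return
        # chr() in Python 2 only accepts values below 256
        if charnum < 1 or charnum > 255:
            return
        self.handle_data(chr(charnum))

    def gettitle(self):
        return self.data

fd = open('test1.html')
tp = TitleParser()
tp.feed(fd.read())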

The output is:
title: xhtml 与" html 4.01 "标准没有太多的不同
body: i love÷ you×
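The handler above only accepts decimal references up to 255. As a hedged Python 3 sketch, the same hook can also accept hexadecimal references such as &#xD7;, because the parser passes the reference body (for example 247 or xD7) to handle_charref(). The helper decode_charref and the class CharrefTextParser are names invented for this illustration.

```python
from html.parser import HTMLParser


def decode_charref(name):
    """Return the character for a reference body such as '247' or 'xD7'.

    Returns None when the body is not a valid code point.
    """
    try:
        if name[:1] in ("x", "X"):
            codepoint = int(name[1:], 16)
        else:
            codepoint = int(name, 10)
        return chr(codepoint)
    except (ValueError, OverflowError):
        return None


class CharrefTextParser(HTMLParser):
    def __init__(self):
        # Without convert_charrefs=False, Python 3 decodes the references
        # itself and never calls handle_charref().
        super().__init__(convert_charrefs=False)
        self.text = ""

    def handle_data(self, data):
        self.text += data

    def handle_charref(self, name):
        char = decode_charref(name)
        if char is not None:
            self.handle_data(char)


parser = CharrefTextParser()
parser.feed("<body>i love&#247; you&#xD7;</body>")
parser.close()
print(parser.text)
```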

3. Extracting links

Example 4:

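<html>
<head>
<title>xhtml 与&quot; html 4.01 &quot;标准没有太多的不同</title>
</head>
<body>

<a href="http://pypi.python.org/pypi">i love&#247; you&times;</a>
</body>
</html>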

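In handle_starttag(self, tag, attrs), when tag is a, attrs holds the tag's attributes as (name, value) pairs, so it is enough to pick out the value whose name is href. The details are as follows: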

# -*- coding: cp936 -*-
## @小五义
## HTMLParser example: extracting links
from htmlentitydefs import entitydefs
import HTMLParser

class TitleParser(HTMLParser.HTMLParser):
    def __init__(self):
        self.taglevels = []
        self.handledtags = ['title', 'body']
        self.processing = None
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag in self.handledtags:
            self.data = ''
            self.processing = tag
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            for name, value in attrs:
                if name == 'href':
                    print '连接地址：' + value

    def handle_data(self, data):
        if self.processing:
            self.data += data

    def handle_endtag(self, tag):
        if tag == self.processing:
            print str(tag) + ':' + str(self.gettitle())
            self.processing = None

    def handle_entityref(self, name):
        if name in entitydefs:
            self.handle_data(entitydefs[name])
        else:
            self.handle_data('&' + name + ';')

    def handle_charref(self, name):
        try:
            charnum = int(name)
        except ValueError:
            return
        if charnum < 1 or charnum > 255:
            return
        self.handle_data(chr(charnum))

    def gettitle(self):
        return self.data

fd = open('test1.html')
tp = TitleParser()
tp.feed(fd.read())

The output is:
title: xhtml 与" html 4.01 "标准没有太多的不同
连接地址：http://pypi.python.org/pypi
body:

i love÷ you×
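As a small Python 3 sketch of the same attribute lookup, the hrefs can be collected into a list instead of being printed. The class name LinkCollector and the sample markup are assumptions; only the (name, value) shape of attrs comes from the parser itself.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        for name, value in attrs:
            if name == "href" and value is not None:
                self.links.append(value)


collector = LinkCollector()
collector.feed(
    '<body><a href="http://pypi.python.org/pypi">pypi</a>'
    '<a name="top">no link here</a></body>'
)
print(collector.links)
```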

4. Extracting images

If the page contains an image, extract it and save it as a separate file.

Example 5:

xhtml 与” html 4.01 “标准没有太多的不同 i love÷ you× 我想你

To fetch and save baidu_sylogo1.gif, the code is as follows:

# -*- coding: cp936 -*-
## @小五义
## HTMLParser example: extracting images
from htmlentitydefs import entitydefs
import HTMLParser, urllib

def getimage(addr):
    # download the image and save it in the current directory
    u = urllib.urlopen(addr)
    data = u.read()
    filename = addr.split('/')[-1]
    f = open(filename, 'wb')
    f.write(data)
    f.close()
    print filename + '已经生成！'

class TitleParser(HTMLParser.HTMLParser):
    def __init__(self):
        self.taglevels = []
        self.handledtags = ['title', 'body']
        self.processing = None
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag in self.handledtags:
            self.data = ''
            self.processing = tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print '连接地址：' + value
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    getimage(value)

    def handle_data(self, data):
        if self.processing:
            self.data += data

    def handle_endtag(self, tag):
        if tag == self.processing:
            print str(tag) + ':' + str(self.gettitle())
            self.processing = None

    def handle_entityref(self, name):
        if name in entitydefs:
            self.handle_data(entitydefs[name])
        else:
            self.handle_data('&' + name + ';')

    def handle_charref(self, name):
        try:
            charnum = int(name)
        except ValueError:
            return
        if charnum < 1 or charnum > 255:
            return
        self.handle_data(chr(charnum))

    def gettitle(self):
        return self.data

fd = open('test1.html')
tp = TitleParser()
tp.feed(fd.read())

The output is:
title: xhtml 与" html 4.01 "标准没有太多的不同
连接地址：http://pypi.python.org/pypi
baidu_sylogo1.gif已经生成！
body: i love÷ you× ?ò????
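getimage() takes the file name from addr.split('/')[-1], which assumes an absolute URL with no query string. The following Python 3 sketch shows one possible way to resolve a relative src and derive a cleaner file name before downloading. It does not download anything, and the class name ImageCollector, the base URL and the sample markup (all on example.com) are hypothetical.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class ImageCollector(HTMLParser):
    """Collects (absolute_url, local_filename) for every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if not src:
            return
        # Resolve relative paths against the page address.
        absolute = urljoin(self.base_url, src)
        # Use only the path component, so a query string such as ?v=2
        # does not end up in the file name.
        filename = posixpath.basename(urlsplit(absolute).path)
        self.images.append((absolute, filename))


collector = ImageCollector("http://example.com/pages/index.html")
collector.feed('<img src="/img/logo.gif?v=2"><img src="pics/photo.png" />')
for url, filename in collector.images:
    print(filename, "<-", url)
```

Each collected URL could then be passed to a download function like getimage(); that step is left out here so the sketch runs without network access.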

5. A practical example

Example 6: get all the link addresses on the Renren (人人网) home page. The code is as follows:

#coding: utf-8
## @小五义
## HTMLParser example: get all the link addresses on the Renren home page
from htmlentitydefs import entitydefs
import HTMLParser, urllib

def getimage(addr):
    u = urllib.urlopen(addr)
    data = u.read()
    filename = addr.split('/')[-1]
    f = open(filename, 'wb')
    f.write(data)
    f.close()
    print filename + '已经生成！'

class TitleParser(HTMLParser.HTMLParser):
    def __init__(self):
        self.taglevels = []
        self.handledtags = ['a']
        self.processing = None
        self.linkstring = ''
        self.linkaddr = ''
        HTMLParser.HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag in self.handledtags:
            for name, value in attrs:
                if name == 'href':
                    self.linkaddr = value
            self.processing = tag

    def handle_data(self, data):
        if self.processing:
            self.linkstring += data
            #print data.decode('utf-8') + ':' + self.linkaddr

    def handle_endtag(self, tag):
        if tag == self.processing:
            print self.linkstring.decode('utf-8') + ':' + self.linkaddr
            self.processing = None
            self.linkstring = ''

    def handle_entityref(self, name):
        if name in entitydefs:
            self.handle_data(entitydefs[name])
        else:
            self.handle_data('&' + name + ';')

    def handle_charref(self, name):
        try:
            charnum = int(name)
        except ValueError:
            return
        if charnum < 1 or charnum > 255:
            return
        self.handle_data(chr(charnum))

    def gettitle(self):
        return self.linkaddr

tp = TitleParser()
tp.feed(urllib.urlopen('http://www.renren.com/').read())

The output is:
分享:http://share.renren.com
应用程序:http://app.renren.com
公共主页:http://page.renren.com
人人生活:http://life.renren.com
人人小组:http://xiaozu.renren.com/
同名同姓:http://name.renren.com
人人中学:http://school.renren.com/allpages.html
大学百科:http://school.renren.com/daxue/
人人热点:http://life.renren.com/hot
人人小站:http://zhan.renren.com/
人人逛街:http://j.renren.com/
人人校招:http://xiaozhao.renren.com/
:http://www.renren.com
注册:http://wwv.renren.com/xn.do?ss=10113&rt=27
登录:http://www.renren.com/
帮助:http://support.renren.com/helpcenter
给我们提建议:http://support.renren.com/link/suggest
更多:#
:javascript:closeerror();
打开邮箱查收确认信:#
重新输入:javascript:closeerror();
:javascript:closestop();
客服:http://help.renren.com/#http://help.renren.com/support/contomvice?pid=2&selection={couid:193,proid:342,cityid:1000375}
:javascript:closelock();
立即解锁:http://safe.renren.com/relive.do
忘记密码？:http://safe.renren.com/findpass.do
忘记密码？:http://safe.renren.com/findpass.do
换一张:javascript:refreshcode_login();
msn:#
360:https://openapi.360.cn/oauth2/authorize?client_id=5ddda4458747126a583c5d58716bab4c&response_type=code&redirect_uri=http://www.renren.com/bind/tsz/tszlogincallback&scope=basic&display=default
天翼:https://oauth.api.189.cn/emp/oauth2/authorize?app_id=296961050000000294&response_type=code&redirect_uri=http://www.renren.com/bind/ty/tylogincallback
为什么要填写我的生日？:#birthday
看不清换一张?:javascript:refreshcode();
想了解更多人人网功能？点击此处:javascript:;
:javascript:;
:javascript:;
立刻注册:http://reg.renren.com/xn6245.do?ss=10113&rt=27
关于:http://www.renren.com/siteinfo/about
开放平台:http://dev.renren.com
人人游戏:http://wan.renren.com
公共主页:http://page.renren.com/register/regguide/
手机人人:http://mobile.renren.com/mobilelink.do?psf=40002
团购:http://www.nuomi.com
皆喜网:http://www.jiexi.com
营销服务:http://ads.renren.com
招聘:http://job.renren-inc.com/
客服帮助:http://support.renren.com/helpcenter
隐私:http://www.renren.com/siteinfo/privacy
京icp证090254号:http://www.miibeian.gov.cn/
互联网药品信息服务资格证:http://a.xnimg.cn/n/core/res/certificate.jpg
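The pairing of link text with link address used in this example can be tried offline. The Python 3 sketch below follows the same state-tracking idea but stores the pairs in a list. The class name LinkTextCollector is an assumption, and the sample markup is a hand-written fragment, not the real Renren page.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collects (link_text, href) pairs for <a> tags that have an href."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.link_text = ""
        self.link_addr = ""
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.in_link = True
                self.link_text = ""
                self.link_addr = href

    def handle_data(self, data):
        # Text inside nested tags such as <b> is collected as well.
        if self.in_link:
            self.link_text += data

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.pairs.append((self.link_text.strip(), self.link_addr))
            self.in_link = False


collector = LinkTextCollector()
collector.feed(
    '<a href="http://share.renren.com">分享</a>'
    '<a href="#"></a>'
    '<a href="http://app.renren.com"><b>应用程序</b></a>'
)
for text, addr in collector.pairs:
    print(text + ":" + addr)
```

A link with no text produces an empty string before the colon, which matches the lines starting with a bare colon in the output above.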

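II. Parsing web pages with BeautifulSoup

1. Downloading and installing BeautifulSoup

Download: http://www.crummy.com/software/beautifulsoup/download/3.x/
Chinese documentation: http://www.crummy.com/software/beautifulsoup/bs3/documentation.zh.html#entity%20conversion
Installation: after unpacking the downloaded archive, the folder contains a setup.py file. In cmd, run python setup.py install to install it, making sure the path to setup.py is correct. Once the installation succeeds, you can import BeautifulSoup directly in Python.

2. Starting from a simple parsing example

Example 7: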

xhtml 与” html 4.01 “标准没有太多的不同 i love÷ you× 我想你

Code to get the title:

#coding: utf8
## @小五义
## BeautifulSoup example: title
import BeautifulSoup

a = open('test1.html', 'r')
htmlline = a.read()
soup = BeautifulSoup.BeautifulSoup(htmlline.decode('gb2312'))
#print soup.prettify()  # pretty-print the HTML
titletag = soup.html.head.title
print titletag.string

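The output is:
xhtml 与&quot; html 4.01 &quot;标准没有太多的不同

Two points are worth noting from the code and the result. First, when initializing with BeautifulSoup.BeautifulSoup(htmlline.decode('gb2312')), pay attention to the character encoding. After searching online I first tried utf-8, which did not display correctly; switching to gb2312 fixed it. The encoding of the original file can in fact be checked with the soup.originalEncoding attribute. Second, the character entities in the result have not been processed. The BeautifulSoup Chinese documentation has a dedicated section on entity conversion. After changing the code above to the following, the result displays correctly: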

#coding: utf8
## @小五义
## BeautifulSoup example: title
import BeautifulSoup

a = open('test1.html', 'r')
htmlline = a.read()
soup = BeautifulSoup.BeautifulStoneSoup(
    htmlline.decode('gb2312'),
    convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES)
#print soup.prettify()  # pretty-print the HTML
titletag = soup.html.head.title
print titletag.string
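The decode('gb2312') step in these examples can be explored on its own. The Python 3 sketch below tries a list of candidate encodings in order and reports the first one that decodes without error. The helper name try_decode and the candidate list are assumptions, and trial decoding is only a heuristic, since some byte sequences are valid in more than one encoding.

```python
def try_decode(data, encodings=("utf-8", "gb2312")):
    """Return (encoding, text) for the first encoding that decodes data.

    Returns (None, None) when none of the candidates work.
    """
    for enc in encodings:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, None


# Bytes as they would be stored in a GB2312-encoded file.
raw = "我想你".encode("gb2312")
encoding, text = try_decode(raw)
print(encoding, text)
```

Here the utf-8 attempt fails because the GB2312 bytes are not a valid UTF-8 sequence, so the helper falls through to gb2312.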

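In convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES, the constant ALL_ENTITIES covers both the XML and the HTML entities. You can of course also use XML_ENTITIES or HTML_ENTITIES directly. The output is:
xhtml 与" html 4.01 "标准没有太多的不同

3. Extracting links

Using the same example again, the code becomes: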

#coding: utf8
## @小五义
## BeautifulSoup example: extracting links
import BeautifulSoup

a = open('test1.html', 'r')
htmlline = a.read()
a.close()
soup = BeautifulSoup.BeautifulStoneSoup(
    htmlline.decode('gb2312'),
    convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES)
name = soup.find('a').string
links = soup.find('a')['href']
print name + ':' + links

The output is:
我想你:http://pypi.python.org/pypi

4. Extracting images

Still using the example above, extract the Baidu image. The code is:

#coding: utf8
## @小五义：http://www.cnblogs.com/xiaowuyi
import BeautifulSoup, urllib

def getimage(addr):
    # download the image and save it in the current directory
    u = urllib.urlopen(addr)
    data = u.read()
    filename = addr.split('/')[-1]
    f = open(filename, 'wb')
    f.write(data)
    f.close()
    print filename + ' finished!'

a = open('test1.html', 'r')
htmlline = a.read()
soup = BeautifulSoup.BeautifulStoneSoup(
    htmlline.decode('gb2312'),
    convertEntities=BeautifulSoup.BeautifulStoneSoup.ALL_ENTITIES)
links = soup.find('img')['src']
getimage(links)

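Both the link extraction and the image extraction rely mainly on the find method, whose signature is:
find(name, attrs, recursive, text, **kwargs)
findAll returns every match, while find returns only the first one. Note that findAll returns a list.

5. A practical example

Example 8: get all the link addresses on the Renren home page. The code is as follows: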

#coding: utf8
## @小五义
## BeautifulSoup example: get all the link addresses on the Renren home page
import BeautifulSoup, urllib

linkname = ''
htmlline = urllib.urlopen('http://www.renren.com/').read()
soup = BeautifulSoup.BeautifulStoneSoup(htmlline.decode('utf-8'))
links = soup.findAll('a')
for i in links:
    ## check whether the a tag contains an href
    if 'href' in str(i):
        linkname = i.string
        linkaddr = i['href']
        if linkname is None:  # i.string is None when the tag has no single text content
            print linkaddr
        else:
            print linkname + ':' + linkaddr

The output is:
分享:http://share.renren.com
应用程序:http://app.renren.com
公共主页:http://page.renren.com
人人生活:http://life.renren.com
人人小组:http://xiaozu.renren.com/
同名同姓:http://name.renren.com
人人中学:http://school.renren.com/allpages.html
大学百科:http://school.renren.com/daxue/
人人热点:http://life.renren.com/hot
人人小站:http://zhan.renren.com/
人人逛街:http://j.renren.com/
人人校招:http://xiaozhao.renren.com/
http://www.renren.com
注册:http://wwv.renren.com/xn.do?ss=10113&rt=27
登录:http://www.renren.com/
帮助:http://support.renren.com/helpcenter
给我们提建议:http://support.renren.com/link/suggest
更多:#
javascript:closeerror();
打开邮箱查收确认信:#
重新输入:javascript:closeerror();
javascript:closestop();
客服:http://help.renren.com/#http://help.renren.com/support/contomvice?pid=2&selection={couid:193,proid:342,cityid:1000375}
javascript:closelock();
立即解锁:http://safe.renren.com/relive.do
忘记密码？:http://safe.renren.com/findpass.do
忘记密码？:http://safe.renren.com/findpass.do
换一张:javascript:refreshcode_login();
msn:#
360:https://openapi.360.cn/oauth2/authorize?client_id=5ddda4458747126a583c5d58716bab4c&response_type=code&redirect_uri=http://www.renren.com/bind/tsz/tszlogincallback&scope=basic&display=default
天翼:https://oauth.api.189.cn/emp/oauth2/authorize?app_id=296961050000000294&response_type=code&redirect_uri=http://www.renren.com/bind/ty/tylogincallback
#birthday
看不清换一张?:javascript:refreshcode();
javascript:;
javascript:;
立刻注册:http://reg.renren.com/xn6245.do?ss=10113&rt=27
关于:http://www.renren.com/siteinfo/about
开放平台:http://dev.renren.com
人人游戏:http://wan.renren.com
公共主页:http://page.renren.com/register/regguide/
手机人人:http://mobile.renren.com/mobilelink.do?psf=40002
团购:http://www.nuomi.com
皆喜网:http://www.jiexi.com
营销服务:http://ads.renren.com
招聘:http://job.renren-inc.com/
客服帮助:http://support.renren.com/helpcenter
隐私:http://www.renren.com/siteinfo/privacy
京icp证090254号:http://www.miibeian.gov.cn/
互联网药品信息服务资格证:http://a.xnimg.cn/n/core/res/certificate.jpg
