A small Python script for URL decoding and Chinese text parsing (pythonurldecoder)

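The code is as follows:

#! python
# -*- coding: utf8 -*-
# Python 2: decode the UTF-8 source literal to unicode, re-encode it as GBK,
# then rewrite the "\x" escapes in the repr as "%" to get URL-style escapes.
print(repr("测试报警，xxxx是大猪头".decode("utf8").encode("gbk")).replace("\\x", "%"))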

Note that the codec in the first decode("utf8") must be the same as the encoding declared at the top of the file.
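On Python 3, where str no longer has a decode method, roughly the same percent-encoded form can be produced with urllib.parse.quote and its encoding parameter. The following is only a minimal sketch, not the script above; the helper name percent_encode and the sample character are chosen here for illustration.

```python
from urllib.parse import quote, unquote

def percent_encode(text, encoding="gbk"):
    """Percent-encode text using the bytes of the given encoding (hypothetical helper)."""
    # safe="" also escapes "/", which quote leaves alone by default.
    return quote(text, safe="", encoding=encoding)

sample = "门"
utf8_form = percent_encode(sample, "utf-8")  # UTF-8 bytes of the character
gbk_form = percent_encode(sample, "gbk")     # GBK bytes of the same character
print(utf8_form, gbk_form)

# Round trip: decoding with the same encoding restores the original text.
print(unquote(gbk_form, encoding="gbk") == sample)
```

The two forms differ because the same character maps to different bytes in each encoding, which is why the codec has to be known when decoding.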

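I first ran into this problem in a JavaScript puzzle game, where the hint for one level read:

The first few levels are all very, very easy~~ This level is just a simple string transformation...

The hint was followed by a very long string beginning with %5cu4e0b%5cu4e00%5cu5173%5cu7684. I had often seen this kind of thing in the browser address bar but never knew how to turn it into something readable. After some googling, I solved it by combining Python's URL decoding with its unicode-escape decoding: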

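The code is as follows:

import urllib

escaped_str = "%5cu4e0b%5cu4e00%5cu5173%5cu7684%5cu9875%5cu9762%5cu540d%5cu5b57%5cu662f%5cx20%5cx69%5cx32%5cx6a%5cx62%5cx6a%5cx33%5cx69%5cx34%5cx62%5cx62%5cx35%5cx34%5cx62%5cx35%5cx32%5cx69%5cx62%5cx33%5cx2e%5cx68%5cx74%5cx6d"
# unquote turns each %5c into a backslash, leaving \uXXXX and \xXX escapes;
# the unicode-escape codec then interprets those escapes (Python 2).
print urllib.unquote(escaped_str).decode('unicode-escape')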
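The same two-step idea (undo the percent-encoding, then interpret the backslash escapes) can be sketched for Python 3. This is an illustrative sketch under the assumption that the intermediate text is pure ASCII, as it is for the string above; the function name decode_escaped is made up here.

```python
from urllib.parse import unquote

def decode_escaped(escaped):
    """Undo percent-encoding, then interpret \\uXXXX and \\xXX escapes."""
    literal = unquote(escaped)  # each "%5c" becomes a literal backslash
    # unicode_escape decodes bytes, so go through ASCII bytes first.
    return literal.encode("ascii").decode("unicode_escape")

prefix = "%5cu4e0b%5cu4e00%5cu5173%5cu7684"
print(decode_escaped(prefix))
```

If the intermediate text could contain non-ASCII characters, the encode("ascii") step would raise, so this sketch only fits input of the shape shown in the puzzle.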

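Recently I became curious about the Chinese terms in the gfwlist used by Firefox's AutoProxy add-on (those of you who have used a proxy will know what I mean). These URLs are URL-encoded, for example http://zh.wikipedia.org/wiki/%e9%97%a8, so the percent-encoded Chinese characters have to be extracted with a regular expression. I wrote the following small script: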

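The code is as follows:

import urllib
import re

# A URL-encoded Chinese character looks like a percent sign plus
# two hex digits, repeated three times.
pattern = re.compile(r"((%\w{2}){3,})")

with open("listfile", "r") as f:
    for url_str in f:
        match = pattern.findall(url_str)
        # findall returns an empty list (never None) when nothing matches
        if match:
            # convert each extracted run back into Chinese text
            for trans in match:
                print urllib.unquote(trans[0]),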
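A Python 3 variant of this extraction might look like the sketch below. It is not the script above: it uses a non-capturing group so that findall returns whole runs, restricts the pattern to hex digits, and works on a single URL string instead of reading listfile. The names ENCODED_RUN and extract_encoded_text are assumptions made for the example.

```python
import re
from urllib.parse import unquote

# Runs of three or more %XX groups, hex digits only.
ENCODED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2}){3,}")

def extract_encoded_text(url):
    """Return the decoded text of every percent-encoded run found in url."""
    # unquote assumes UTF-8 and, by default, replaces bytes it cannot decode.
    return [unquote(run) for run in ENCODED_RUN.findall(url)]

print(extract_encoded_text("http://zh.wikipedia.org/wiki/%e9%97%a8"))
```

Because unquote assumes UTF-8 here, runs encoded in another charset such as GBK would come out as replacement characters rather than readable text.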

However, this script still has a shortcoming: some Chinese characters in the list file are not decoded correctly, as in the following test lines:

The code is as follows:

import urllib

a = "http://zh.wikipedia.org/wiki/%bd%f0%b6"
b = "http://zh.wikipedia.org/wiki/%e9%97%a8"
de = urllib.unquote
print de(a), de(b)

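In the output, the former is decoded correctly while the latter is not. I suspect the cause may have something to do with Big5 encoding; if anyone knows a fix, please let me know~ The following is a supplement:

de(a).decode("gbk", "ignore")
de(b).decode("utf8", "ignore")

This way you get the unicode form of these strings.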
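In Python 3 the split between "undo the percent-encoding" and "decode the bytes" is explicit in urllib.parse. The sketch below reuses the two escaped paths from the test lines above and assumes, as the supplement does, that the first is GBK and the second UTF-8; the variable names are chosen for illustration.

```python
from urllib.parse import unquote, unquote_to_bytes

path_a = "%bd%f0%b6"   # assumed to be GBK bytes
path_b = "%e9%97%a8"   # assumed to be UTF-8 bytes

# unquote_to_bytes only undoes the percent-encoding; no charset is assumed.
raw_a = unquote_to_bytes(path_a)
print(raw_a)

# Decoding is a separate step; "ignore" drops bytes that do not fit the codec.
text_a = raw_a.decode("gbk", "ignore")
# unquote can do both steps at once when told the encoding.
text_b = unquote(path_b, encoding="utf-8", errors="ignore")
print(text_a, text_b)
```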

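The unquote you are using is not a decoder; you need to do the necessary decode and encode yourself. I have always used utf8 as my default environment, and I guess you are using gbk, which is why decoding the latter failed on your side. Guessing encodings is tiring work. It would be fine if everyone used utf8, but some people are used to gb. http://yac163.svn.sourceforge.net/viewvc/yac163/trunk/yac163-nox/pic.py?revision=198&view=markup

See lines #102-147 of this very old code of mine, and add (…, "ignore") to every decode and encode call.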

The code is as follows:

def strdecode(string, charset=None):
    # Already unicode: nothing to decode.
    if isinstance(string, unicode):
        return string
    if charset:
        try:
            return string.decode(charset)
        except UnicodeDecodeError:
            return _strdecode(string)
    else:
        return _strdecode(string)

def _strdecode(string):
    # Try the candidate encodings in order: utf8, gb2312, gbk, gb18030.
    try:
        return string.decode('utf8')
    except UnicodeDecodeError:
        try:
            return string.decode('gb2312')
        except UnicodeDecodeError:
            try:
                return string.decode('gbk')
            except UnicodeDecodeError:
                return string.decode('gb18030')

def strencode(string, charset=None):
    # Already a byte string: nothing to encode.
    if isinstance(string, str):
        return string
    if charset:
        try:
            return string.encode(charset)
        except UnicodeEncodeError:
            return _strencode(string)
    else:
        return _strencode(string)

def _strencode(string):
    # Same fallback order as _strdecode.
    try:
        return string.encode('utf8')
    except UnicodeEncodeError:
        try:
            return string.encode('gb2312')
        except UnicodeEncodeError:
            try:
                return string.encode('gbk')
            except UnicodeEncodeError:
                return string.encode('gb18030')
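The nested try blocks above amount to "try each encoding in a fixed order and return the first that works". For Python 3 that idea can be sketched as a loop. This is a hedged illustration rather than a port of pic.py: the names CANDIDATE_ENCODINGS and guess_decode are invented here, and the final "ignore" fallback is an addition based on the advice to pass "ignore" to decode calls.

```python
CANDIDATE_ENCODINGS = ("utf-8", "gb2312", "gbk", "gb18030")

def guess_decode(data, charset=None, candidates=CANDIDATE_ENCODINGS):
    """Decode bytes with charset if given, else with the first candidate that fits."""
    if isinstance(data, str):
        return data  # already text, nothing to do
    order = ((charset,) if charset else ()) + tuple(candidates)
    for name in order:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: drop the undecodable bytes instead of raising.
    return data.decode(order[-1], "ignore")

print(guess_decode(b"\xe9\x97\xa8"))   # valid UTF-8, so the first candidate wins
print(guess_decode(b"\xbd\xf0"))       # not valid UTF-8, falls through to a GB codec
```

The order matters: a byte string that happens to be valid in an earlier candidate is returned from that candidate even if it was written in a later one, so this remains a guess rather than a detection.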
