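Encoding issues

Because the pages involve Chinese text, encoding problems were unavoidable, and this project was a chance to finally sort them out. The story starts with character encodings. The original English encodings fit in a single byte of 8 bits (ASCII itself uses 0~127, and single-byte extensions use 0~255). To represent other languages, the encoding naturally had to be extended. For Chinese there is the GB family. You have probably also heard of Unicode and UTF-8, so how do they relate? Unicode is an encoding scheme, also known as the "universal code" (万国码), which tells you how broad its coverage is. It is not, however, the form in which text is actually stored on a computer; it acts more like an intermediary. You can encode Unicode into UTF-8 or GB and then store it, and UTF-8 or GB bytes can be decoded back into Unicode. In Python, unicode is one type of object, written with a leading u, such as u'中文', while string is another type: the string as it is actually stored under a specific encoding. For example, '中文' under UTF-8 and '中文' under GBK are not the same bytes, as the following session shows.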
Code:
>>> s = u'中文'
>>> s1 = s.encode('utf8')
>>> s2 = s.encode('gbk')
>>> print repr(s)
u'\u4e2d\u6587'
>>> print repr(s1)
'\xe4\xb8\xad\xe6\x96\x87'
>>> print repr(s2)
'\xd6\xd0\xce\xc4'
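The same round trip can be sketched in Python 3 syntax, where the unicode type is called str and the byte string type is called bytes. This is only an illustration under that assumption, not the code used for the project; the variable names are made up for the example.

```python
# Python 3 sketch: str here corresponds to Python 2's unicode,
# and bytes corresponds to Python 2's str.
text = "中文"

utf8_bytes = text.encode("utf-8")  # unicode -> bytes, UTF-8
gbk_bytes = text.encode("gbk")     # unicode -> bytes, GBK

# Same text, different bytes once encoded.
print(repr(utf8_bytes))
print(repr(gbk_bytes))

# Decoding with the matching codec gets the text back.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_gbk = gbk_bytes.decode("gbk")

# Decoding with the wrong codec fails for these particular bytes.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(roundtrip_utf8 == text, roundtrip_gbk == text, wrong_codec_failed)
```

The byte values printed are the same ones shown in the session above; only the spelling of the types differs between Python 2 and Python 3.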
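As you can see, what is actually stored on the computer is just these encoded bytes, not the Chinese characters themselves, so print has to know which encoding was used in order to display them correctly. One way of putting it sums this up well: in Python, unicode is the real character string, while string is a byte string.

File encoding

Since there are different encodings, which one does a string literal written directly in a source file use? That is determined by the encoding of the file itself. A file is always saved in some encoding, and a Python file can carry a coding declaration stating which encoding it was saved in. If the declared encoding does not match the encoding the file was actually saved in, an error occurs. See the example below: a file saved as UTF-8 but declared as gbk.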
Code:
#coding:gbk
s = u'汉'
s1 = s.encode('utf8')
s2 = s.encode('gbk')
s3 = '汉'
print repr(s)
print repr(s1)
print repr(s2)
print repr(s3)
This produces an error:
File "test.py", line 1
SyntaxError: Non-ASCII character '\xe6' in file test.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Change the declaration to utf8:
Code:
#coding:utf8
s = u'汉'
s1 = s.encode('utf8')
s2 = s.encode('gbk')
s3 = '汉'
print repr(s)
print repr(s1)
print repr(s2)
print repr(s3)
The output is now as expected:
u'\u6c49'
'\xe6\xb1\x89'
'\xba\xba'
'\xe6\xb1\x89'
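Why the declaration matters can be imitated without touching a real file. The sketch below is in Python 3 syntax and uses a made-up helper, read_source, that simply decodes bytes with whatever encoding was declared. It is an assumption-laden illustration of the idea, not a description of how the interpreter is implemented.

```python
# Python 3 sketch of a coding declaration that does not match the saved bytes.
source_text = 's = "汉"\n'
saved_bytes = source_text.encode("utf-8")  # the editor saved the file as UTF-8


def read_source(raw, declared):
    """Decode file bytes with the declared encoding (hypothetical helper)."""
    return raw.decode(declared)


# Declaration matches the saved encoding: the text comes back intact.
matching = read_source(saved_bytes, "utf-8")

# Declaration says gbk but the bytes are UTF-8: decoding fails here.
try:
    read_source(saved_bytes, "gbk")
    mismatch_raised = False
except UnicodeDecodeError:
    mismatch_raised = True

print(matching == source_text, mismatch_raised)
```

A mismatch does not always raise. Some byte sequences decode without error into the wrong characters, which is harder to notice than an exception.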
Basic method

Fetching a web page with Python is actually very simple; it only takes a couple of lines.
Code:
import urllib2
# 'url' stands for the address of the page to fetch
page = urllib2.urlopen('url').read()
This gets you the content of the page. After that, regular expressions can pull out the parts you need. In practice, though, all kinds of details come up.

Login

The site here requires a login. That is not too hard either: just import cookielib together with urllib and urllib2.
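Code:
import urllib, urllib2, cookielib
# The cookie jar stores whatever cookies the server sets.
cookiejar = cookielib.CookieJar()
# Every request made through this opener sends and receives those cookies.
urlopener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookiejar))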
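The login request itself is not shown in this write-up. A minimal sketch of how the form might be prepared is given below in Python 3 syntax, where cookielib is named http.cookiejar and urllib2 is named urllib.request. The URL and the form field names are hypothetical; a real site defines its own.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical values: the real login URL and field names depend on the site.
LOGIN_URL = "http://example.com/login"
FORM_FIELDS = {"username": "alice", "password": "secret"}


def build_cookie_opener():
    """Return (opener, jar); cookies set by responses end up in jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(url, fields):
    """Encode the form fields as a request body; supplying data makes it a POST."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body)


opener, jar = build_cookie_opener()
request = build_login_request(LOGIN_URL, FORM_FIELDS)
# opener.open(request) would submit the form, and later opener.open(...) calls
# would reuse the cookies the server set. It is not called here because
# example.com is only a placeholder.
print(request.get_method(), request.data, len(jar))
```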
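This loads a cookie handler; once you open the login page with urlopener, the session information is remembered.

Reconnecting after dropped connections

If you stop at the level above and do not wrap open, any fluctuation in the network raises an exception and terminates the whole program, which makes for a very fragile program. All it takes is to handle the exception and retry a few times: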
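Code:
def multi_open(opener, *arg):
    # The outer loop restarts the countdown, so this keeps retrying
    # until a request finally succeeds.
    while True:
        retrytimes = 20
        while retrytimes > 0:
            try:
                return opener.open(*arg)
            except Exception:
                # Print a dot for each failed attempt, then try again.
                print '.',
                retrytimes -= 1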
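A variant that gives up after a fixed number of attempts and waits a little longer after each failure might look like the sketch below. It is written in Python 3 syntax; open_with_retry, FlakyOpener and the delay values are invented for the illustration, and the stand-in opener is there only so the example runs without a network.

```python
import time


def open_with_retry(opener, url, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Return opener.open(url), retrying on OSError at most `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return opener.open(url)
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)  # wait longer after each failure


class FlakyOpener:
    """Stand-in opener that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def open(self, url):
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError("simulated network error")
        return "content of " + url


delays = []  # record the requested waits instead of actually sleeping
flaky = FlakyOpener(failures=2)
result = open_with_retry(flaky, "http://example.com/", sleep=delays.append)
print(result, flaky.calls, delays)
```

Passing the sleep function as a parameter is a design choice made here so the waits can be recorded rather than actually slept through when trying the function out.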
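Regular expression matching

Regular expressions are actually not a particularly good approach, because they tolerate variation poorly: the pages have to be completely uniform, and any slight inconsistency makes the match fail. I later read that elements can be selected with XPath, which is worth trying next time. There are a few techniques to writing the patterns:

Non-greedy matching. Take a tag wrapped around the text hello, from which the value a has to be extracted. A pattern that uses .* in front of hello does not work, because * matches greedily. Use .*? instead.

Cross-line matching. One way to match across lines is the DOTALL flag, which makes . match newlines as well. But then the whole matching process becomes very slow. Matching is normally done line by line, so the whole process is at most O(nc^2), where n is the number of lines and c is the average number of columns. With DOTALL it may well become O((nc)^2). My approach is to match line breaks explicitly with \n, which states exactly how many lines a match may span. For example, abc\s*\n\s*def says that def is expected on the line after abc. (.*\n)*? specifies matching as few lines as possible. There is one more thing to watch here: some lines end with \r, that is, the line ends with \r\n. Not knowing this at first cost me a long time debugging patterns. Now I simply use \s, which covers trailing spaces as well as \r.

Non-capturing groups. So that the captured groups are not affected, the (.*\n) above can be written as (?:.*\n); it is then ignored when groups are captured.

Escaping parentheses. Parentheses denote groups in a regular expression, so a literal parenthesis has to be escaped. Patterns are best written as strings with the r prefix; otherwise every \ has to be escaped a second time.

Writing patterns quickly. After writing this many patterns, a routine emerged. First copy out the passage around the text to be matched. Replace the part to be captured with (.*?). Replace each line break \n with the string \s*\n\s*, then strip the spaces at the start and end of each line. In vim the whole process goes quickly.

Excel operations

This time the data went into Excel. Only later did I realize that putting it into a database would probably have avoided a lot of trouble, but by then I was halfway through and it was hard to turn back. Searching for Excel turns up a few options. One is the xlrd/xlwt libraries, which run whether or not Excel is installed on the machine, but only handle the xls format. Another wraps COM directly and requires the software to be installed. I went with the former. Basic reading and writing worked fine, but problems appeared once the data volume grew.

Running out of memory. As soon as the program ran, memory usage crept up bit by bit. After some searching I learned that flush_row_data should be used, but it still failed. Memory usage looked fine and stayed flat the whole time, yet a memory error still occurred at the end. It was baffling. I checked again and again and ran it again and again, without any result. Worse, the bug only appeared once the data volume was large, and getting there often took several hours, so the cost of debugging was very high. By chance I noticed that although memory usage was flat overall, it showed small spikes at regular intervals, and I wondered whether this regularity was related to flush_row_data. What had puzzled me all along was where the data was flushed to. It turns out xlwt handles this in a rather painful way: it keeps the data in memory, or flushes it to a temp, and only writes everything out in one go at save time. That one-shot write is exactly where memory shoots up. So what use is flush_row_data to me? Why not flush to the final destination from the start?

Row limit. This is dictated by the xls format itself: a sheet can hold at most 65536 rows. Besides, once the data gets large the file is inconvenient to open.

Combining the two points above, I ended up with this policy: flush once whenever the row number is a multiple of 1000; open a new sheet when the row count exceeds 65536; and create a new file when there are more than 3 sheets. For convenience, I wrapped xlwt: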
Code:
#coding:utf-8
import xlwt

class xls:
    '''a class wrap the xlwt'''
    max_row = 65536      # row limit of one xls sheet
    max_sheet_num = 3    # sheets per workbook before starting a new file

    def __init__(self, name, captionlist, typelist, encoding='utf8', flushbound=1000):
        self.name = name
        self.captionlist = captionlist[:]
        self.typelist = typelist[:]
        self.workbookindex = 1
        self.encoding = encoding
        self.wb = xlwt.Workbook(encoding=self.encoding)
        self.sheetindex = 1
        self.__addsheet()
        self.flushbound = flushbound

    def __addsheet(self):
        # Save the current workbook before adding another sheet.
        if self.sheetindex != 1:
            self.wb.save(self.name + str(self.workbookindex) + '.xls')
        # Too many sheets: start a new workbook (a new file).
        if self.sheetindex > xls.max_sheet_num:
            self.workbookindex += 1
            self.wb = xlwt.Workbook(encoding=self.encoding)
            self.sheetindex = 1
        self.sheet = self.wb.add_sheet(self.name.encode(self.encoding) + str(self.sheetindex))
        # Row 0 holds the column captions.
        for i in range(len(self.captionlist)):
            self.sheet.write(0, i, self.captionlist[i])
        self.row = 1

    def write(self, data):
        # Sheet is full: move on to the next one.
        if self.row >= xls.max_row:
            self.sheetindex += 1
            self.__addsheet()
        for i in range(len(data)):
            if self.typelist[i] == "num":
                # Cells that cannot be parsed as numbers are left empty.
                try:
                    self.sheet.write(self.row, i, float(data[i]))
                except ValueError:
                    pass
            else:
                self.sheet.write(self.row, i, data[i])
        # Flush buffered rows every flushbound rows.
        if self.row % self.flushbound == 0:
            self.sheet.flush_row_data()
        self.row += 1

    def save(self):
        self.wb.save(self.name + str(self.workbookindex) + '.xls')
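Going back to the regular-expression tips from earlier, the short sketch below tries each of them on made-up input. It is in Python 3 syntax, and the HTML fragment, field names and values are all hypothetical, chosen only so that each pattern has something to match.

```python
import re

# Greedy vs non-greedy on a hypothetical fragment with two tags on one line.
row = "<b>a</b> and <b>hello</b>"
greedy = re.search(r"<b>(.*)</b>", row).group(1)   # runs to the last </b>
lazy = re.search(r"<b>(.*?)</b>", row).group(1)    # stops at the first </b>

# Explicit line breaks: \s* absorbs trailing spaces and the \r of \r\n.
page = "name:  \r\n   Alice\r\nage:\r\n 30\r\n"
name = re.search(r"name:\s*\n\s*(.*?)\s*\n", page).group(1)

# Non-capturing group: skip as few whole lines as possible
# without adding an entry to the captured groups.
block = "start\r\nfoo\r\nbar\r\nend:42\r\n"
groups = re.search(r"start\s*\n(?:.*\n)*?\s*end:(\d+)", block).groups()

# Literal parentheses must be escaped; the r prefix keeps the backslashes intact.
inside = re.search(r"\((.*?)\)", "price (USD)").group(1)

print(greedy, lazy, name, groups, inside)
```

One caveat with this style: \s also matches \n itself, so \s*\n\s* can swallow blank lines as well as the single line break it is meant to cross.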
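Converting HTML special characters

Web pages have their own escape sequences, which makes regex matching somewhat troublesome. I found a dictionary-based replacement approach in the official documentation that seemed good, and extended it a little. Some of the additions are there to keep the regular expressions correct.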
Code:
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "