Example: a Python script for processing web page files

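An embedded web server differs from a conventional one: the web pages have to be converted into arrays and stored in flash before the lwIP network interface can serve them. Recent work required frequent changes to the pages, and compressing and converting them by hand each time was tedious. So I used what I know to write a Python script that batch-processes the web files, compresses them, and converts them into arrays.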

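Environment the script was run in (later versions should also work):

python 3.5.1 (for download, installation, and configuration, see the tutorials available online)

node.js v4.4.7, used to install the uglifyjs package, which minifies JS files rather than just stripping text

uglifyjs, the engine that minifies the JS files

The implementation is as follows: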

#!/usr/bin/python
import os
import binascii
import shutil
from functools import partial
import re
import gzip
# Create a folder if it does not exist yet
def mkdir(path):
    path = path.strip()
    isexists = os.path.exists(path)
    # Create the folder only when it is missing
    if not isexists:
        os.makedirs(path)
        print(path + ' created')
    else:
        pass
    return path
# Delete a folder together with all the files inside it
def deldir(path):
    path = path.strip()
    isexists = os.path.exists(path)
    # Delete the folder only when it exists
    if isexists:
        shutil.rmtree(path)
        print(path + " deleted")
    else:
        pass
# First-pass compression of a web page
def filereduce(inpath, outpath):
    infp = open(inpath, "r", encoding="utf-8")
    outfp = open(outpath, "w", encoding="utf-8")
    for li in infp.readlines():
        if li.split():
            # Remove newlines and tabs
            li = li.replace('\n', '').replace('\t', '')
            # Keep only one space between words
            li = ' '.join(li.split())
            # Lines are written back to back, with no separator between them
            outfp.write(li)
    infp.close()
    outfp.close()
    print(outpath + " compressed")
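# Shell command call (uses uglifyjs to minify a js file)
def shellreduce(inpath, outpath):
    # The paths are quoted so that folder names containing spaces still work
    command = 'uglifyjs "' + inpath + '" -m -o "' + outpath + '"'
    print(command)
    os.system(command)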
# gzip compression
def filegzip(inpath, outpath):
    with open(inpath, 'rb') as plain_file:
        with gzip.open(outpath, 'wb') as zip_file:
            zip_file.writelines(plain_file)
    print(outpath + " gzip compressed")
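# Read a file in binary mode and save it as an array
def filehex(inpath, outpath):
    i = 0
    count = 0
    a = ''
    inf = open(inpath, 'rb')
    outf = open(outpath, 'w')
    # Read one byte at a time until the end of the file
    records = iter(partial(inf.read, 1), b'')
    for r in records:
        r_int = int.from_bytes(r, byteorder='big')
        a += strzfill(hex(r_int), 2, 2) + ', '
        i += 1
        count += 1
        # Start a new line after every 16 bytes
        if i == 16:
            a += '\n'
            i = 0
    # The array name is the output file name without folder and extension;
    # the closing "};" makes the result a complete C definition
    a = "const static char " + outpath.split('.')[-2].split('/')[-1] + "[" + str(count) + "]={\n" + a + "\n};\n\n"
    outf.write(a)
    inf.close()
    outf.close()
    print(outpath + " converted to array")
# Zero-pad the part of the string after position index to width n
def strzfill(istr, index, n):
    return istr[:index] + istr[index:].zfill(n)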
# Remove css comments /*.....*/
def uncommentreduce(inpath, outpath):
    infp = open(inpath, "r", encoding="utf-8")
    outfp = open(outpath, "w", encoding="utf-8")
    filebyte = infp.read()
    # [\s\S] matches any character, newlines included, so multi-line comments are removed too
    replace_reg = re.compile(r'/\*[\s\S]*?\*/')
    filebyte = replace_reg.sub('', filebyte)
    filebyte = filebyte.replace('\n', '').replace('\t', '')
    filebyte = ' '.join(filebyte.split())
    outfp.write(filebyte)
    infp.close()
    outfp.close()
    print(outpath + " comments removed and compressed")
# Main processing function
def webprocess(path):
    # original pages               ..\basic
    # minified pages               ..\reduce
    # gzip second-pass compression ..\gzip
    # generated .c pages           ..\program
    basicpath = path + "\\basic"
    reducepath = path + "\\reduce"
    gzippath = path + "\\gzip"
    programpath = path + "\\program"
    # Delete the old output folders, then create a new one
    deldir(programpath)
    deldir(reducepath)
    deldir(gzippath)
    mkdir(programpath)
    for root, dirs, files in os.walk(basicpath):
        for item in files:
            ext = item.split('.')
            infilepath = root + "/" + item
            outreducepath = mkdir(root.replace("basic", "reduce")) + "/" + item
            outgzippath = mkdir(root.replace("basic", "gzip")) + "/" + item + '.gz'
            outprogrampath = programpath + "/" + item.replace('.', '_') + '.c'
            # Process each file according to its extension:
            # html: remove '\n' and '\t', keep a single space
            # css: remove /*......*/ comments, '\n' and '\t', keep a single space
            # js: minify with uglifyjs
            # gif jpg ico: copy unchanged
            # anything else: copy unchanged
            # Once that is done, compress the result into a .gz file
            # Every type except "anything else" is also converted into a hex array and saved as a .c file
            if ext[-1] == 'html':
                filereduce(infilepath, outreducepath)
                filegzip(outreducepath, outgzippath)
                filehex(outgzippath, outprogrampath)
            elif ext[-1] == 'css':
                uncommentreduce(infilepath, outreducepath)
                filegzip(outreducepath, outgzippath)
                filehex(outgzippath, outprogrampath)
            elif ext[-1] == 'js':
                shellreduce(infilepath, outreducepath)
                filegzip(outreducepath, outgzippath)
                filehex(outgzippath, outprogrampath)
            elif ext[-1] in ["gif", "jpg", "ico"]:
                shutil.copy(infilepath, outreducepath)
                filegzip(outreducepath, outgzippath)
                filehex(outgzippath, outprogrampath)
            else:
                shutil.copy(infilepath, outreducepath)
# Get the folder that contains this script
path = os.path.split(os.path.realpath(__file__))[0]
webprocess(path)
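
The text handling in the script can be tried out on plain strings, without touching any files. The following is a minimal sketch of the same two ideas: collapsing whitespace for html and removing /* ... */ comments for css. The names `squeeze_whitespace` and `strip_css_comments` are hypothetical and not part of the script above. One deliberate difference from `filereduce`: lines are joined with a single space rather than nothing, so a tag whose attributes are split across lines keeps its separators.

```python
import re

# Matches /* ... */ comments, including ones that span several lines.
CSS_COMMENT = re.compile(r"/\*[\s\S]*?\*/")


def squeeze_whitespace(text):
    """Collapse every run of spaces, tabs, and newlines into one space."""
    return " ".join(text.split())


def strip_css_comments(css):
    """Remove /* ... */ comments, then collapse the remaining whitespace."""
    return squeeze_whitespace(CSS_COMMENT.sub("", css))


if __name__ == "__main__":
    html = "<div\n    class=\"box\">\n\tHello   world\n</div>\n"
    css = "/* header\n   styles */\nh1 {\n\tcolor: red;  /* brand */\n}\n"
    print(squeeze_whitespace(html))
    print(strip_css_comments(css))
```

Keeping these as pure functions makes it easy to check the output on a few sample pages before running the whole batch.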

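The script works in the following steps:

1. Walk the folder to be processed (the path is ..\basic; the user has to create it, copy the files to be processed into it, and place the script one level above it). This is done by webprocess.

2. Create the folder for minified pages (..\reduce, which stores the files after the first compression pass). The script does this itself and handles each file type as follows:

html: remove extra spaces and line breaks

css: remove extra spaces, line breaks, and /*......*/ comments

js: minify by calling uglifyjs

gif, jpg, ico, and anything else: copy unchanged

3. Create the gzip folder (..\gzip, which stores the files after the second compression pass). The script does this by calling the gzip module.

4. Create the output folder (..\program, which stores the generated .c files). The script does this itself and handles each file as follows:

read the file in binary mode, convert it into hexadecimal text, and write that to the output file.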
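
Steps 3 and 4 can be sketched in memory as follows. This is an illustration under stated assumptions, not the script's own code: the helper names `to_c_array` and `page_to_c_array` are hypothetical, and the array is declared `unsigned char` (a choice made here) because gzip output contains byte values above 127.

```python
import gzip


def to_c_array(name, data, per_line=16):
    """Render bytes as the source text of a C array definition."""
    rows = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        rows.append(", ".join("0x%02x" % b for b in chunk))
    body = ",\n".join(rows)
    return "static const unsigned char %s[%d] = {\n%s\n};\n" % (name, len(data), body)


def page_to_c_array(name, text):
    """gzip a page's text and return the C array source for the result."""
    packed = gzip.compress(text.encode("utf-8"))
    return to_c_array(name, packed)


if __name__ == "__main__":
    print(to_c_array("demo", b"\x00\x1f\x8b\xff"))
    source = page_to_c_array("index_html", "<html><body>hi</body></html>")
    print(source.splitlines()[0])
```

Formatting each byte with `%02x` does the zero padding directly, so no separate padding helper is needed.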

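Open a Windows command prompt in the folder (Shift + right-click) and run python web.py. The script then loops over every file, repeating the three processing passes (minify, gzip, convert to array) until all files are done.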

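Important: every text file to be processed must be saved as UTF-8; otherwise reading it raises a decoding error (reported as a "gbk" read error).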
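
Because a single wrongly encoded file stops the whole run, it can help to check encodings before processing. Below is a minimal sketch that tries to decode each text file as UTF-8 and lists the ones that fail. The helpers `is_utf8` and `find_non_utf8` are hypothetical, and the extension list is an assumption based on the text types the script handles.

```python
import os
import tempfile

TEXT_EXTENSIONS = (".html", ".css", ".js")  # assumed set of text file types


def is_utf8(path):
    """Return True when the whole file decodes as UTF-8."""
    with open(path, "rb") as fp:
        data = fp.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8(folder):
    """List the text files under folder that are not valid UTF-8."""
    bad = []
    for root, dirs, files in os.walk(folder):
        for item in files:
            if item.lower().endswith(TEXT_EXTENSIONS):
                full = os.path.join(root, item)
                if not is_utf8(full):
                    bad.append(full)
    return sorted(bad)


if __name__ == "__main__":
    # Demo: one UTF-8 file and one GBK file in a temporary folder.
    with tempfile.TemporaryDirectory() as folder:
        with open(os.path.join(folder, "ok.html"), "w", encoding="utf-8") as fp:
            fp.write("<p>中文</p>")
        with open(os.path.join(folder, "bad.css"), "wb") as fp:
            fp.write("/* 中文 */".encode("gbk"))
        for path in find_non_utf8(folder):
            print("not UTF-8:", os.path.basename(path))
```

Pointing `find_non_utf8` at the ..\basic folder before a run would show which files need to be re-saved.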

[Figure: result of processing an html file]

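As a bonus, here is a small script that counts the total lines and blank lines of selected source files in the current directory and its subfolders (a by-product of testing the script above):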

#!/usr/bin/python
import os
total_count = 0
empty_count = 0
# Add one file's total and blank line counts to the global counters
def countline(path):
    global total_count
    global empty_count
    # UTF-8 with errors="ignore" is a safeguard so a file in another encoding does not abort the count
    with open(path, encoding="utf-8", errors="ignore") as fp:
        for lines in fp:
            total_count += 1
            if len(lines.strip()) == 0:
                empty_count += 1
# Walk the folder and count every file with a selected extension
def totalline(path):
    for root, dirs, files in os.walk(path):
        for item in files:
            ext = item.split('.')
            ext = ext[-1]
            if ext in ["cpp", "c", "h", "java", "php"]:
                subpath = root + "/" + item
                countline(subpath)
# Start from the folder that contains this script
path = os.path.split(os.path.realpath(__file__))[0]
totalline(path)
print("input path:", path)
print("total lines: ", total_count)
print("empty lines: ", empty_count)
print("code lines: ", (total_count - empty_count))
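
The same count can also be written without module-level globals, which makes it easier to reuse and to test. This is a minimal sketch; the function name `count_lines` and its `extensions` parameter are hypothetical, and ignoring undecodable bytes is an assumption made here.

```python
import os

SOURCE_EXTENSIONS = (".cpp", ".c", ".h", ".java", ".php")


def count_lines(folder, extensions=SOURCE_EXTENSIONS):
    """Return (total, empty) line counts for matching files under folder."""
    total = 0
    empty = 0
    for root, dirs, files in os.walk(folder):
        for item in files:
            if not item.endswith(extensions):
                continue
            path = os.path.join(root, item)
            # Assumption: skip undecodable bytes rather than stop the count.
            with open(path, encoding="utf-8", errors="ignore") as fp:
                for line in fp:
                    total += 1
                    if not line.strip():
                        empty += 1
    return total, empty


if __name__ == "__main__":
    total, empty = count_lines(".")
    print("total lines:", total)
    print("empty lines:", empty)
    print("code lines:", total - empty)
```

Returning a tuple keeps each call independent, so counting two folders in one run does not mix their totals.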
