Analyzing CDN Logs with Python's pandas Library: A Detailed Walkthrough

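Preface

A recent task at work required pulling statistics out of CDN logs: total traffic, status code counts, and the top IPs, URLs, user agents (UA), referers and so on. I used to do this with bash shell scripts, but once a log file reaches several GB and tens or even hundreds of millions of lines, shell processing struggles and takes too long. So I looked into pandas, Python's data-processing library. With it, a log of ten million lines is processed in about 40 seconds.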

Code

#!/usr/bin/python
# -*- coding: utf-8 -*-
# sudo pip install pandas
__author__ = 'loya chen'
import sys
import pandas as pd
from collections import OrderedDict
"""
Description: This script is used to analyse Qiniu CDN log.
================================================================================
Log format
IP - ResponseTime [time +0800] "Method URL HTTP/1.1" code size "referer" "UA"
================================================================================
Log sample
[0]            [1][2] [3]                 [4]    [5]
101.226.66.179 - 68 [16/Nov/2016:04:36:40 +0800] "GET # -"
[6] [7] [8] [9]
200 502 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
================================================================================
"""
if len(sys.argv) != 2:
    print('Usage:', sys.argv[0], 'file_of_log')
    sys.exit(1)
else:
    log_file = sys.argv[1]
# Column positions of the fields to be analysed
ip = 0
url = 5
status_code = 6
size = 7
referer = 8
ua = 9
# Read the log into a DataFrame, chunk by chunk
reader = pd.read_table(log_file, sep=' ', names=[i for i in range(10)], iterator=True)
loop = True
chunksize = 10000000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunksize)
        chunks.append(chunk)
    except StopIteration:
        # Iteration is stopped.
        loop = False
df = pd.concat(chunks, ignore_index=True)
byte_sum = df[size].sum()  # total traffic in bytes
top_status_code = pd.DataFrame(df[status_code].value_counts())  # status code counts
top_ip = df[ip].value_counts().head(10)  # top IPs
top_referer = df[referer].value_counts().head(10)  # top referers
top_ua = df[ua].value_counts().head(10)  # top user agents
# Share of each status code, computed from the single count column
top_status_code['percent'] = top_status_code.iloc[:, 0] / top_status_code.iloc[:, 0].sum() * 100
top_url = df[url].value_counts().head(10)  # top URLs
top_url_byte = df[[url, size]].groupby(url).sum().apply(lambda x: x.astype(float)/1024/1024) \
    .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)  # URLs with the most traffic
top_ip_byte = df[[ip, size]].groupby(ip).sum().apply(lambda x: x.astype(float)/1024/1024) \
    .round(decimals=3).sort_values(by=[size], ascending=False)[size].head(10)  # IPs with the most traffic
# Store the results in an ordered dict
result = OrderedDict([("Total traffic [GB]:", byte_sum/1024/1024/1024),
                      ("Status codes [count | percent]:", top_status_code),
                      ("IP top 10:", top_ip),
                      ("Referer top 10:", top_referer),
                      ("UA top 10:", top_ua),
                      ("URL top 10:", top_url),
                      ("Top 10 URLs by traffic [MB]:", top_url_byte),
                      ("Top 10 IPs by traffic [MB]:", top_ip_byte)
                      ])
# Print the results
for k, v in result.items():
    print(k)
    print(v)
    print('='*80)
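
To check the column positions used in the script without a real log file, the sketch below parses three made-up log lines from an in-memory string. The sample lines, the request paths and the upper-case constant names are assumptions for illustration; only the field layout follows the log format described in the script's docstring.

```python
import io

import pandas as pd

# Three made-up log lines in the documented layout (hypothetical data).
SAMPLE = (
    '1.1.1.1 - 68 [16/Nov/2016:04:36:40 +0800] "GET /a.jpg HTTP/1.1" 200 502 "-" "Mozilla/5.0 (compatible)"\n'
    '2.2.2.2 - 12 [16/Nov/2016:04:36:41 +0800] "GET /b.jpg HTTP/1.1" 404 100 "-" "curl/7.0"\n'
    '1.1.1.1 - 30 [16/Nov/2016:04:36:42 +0800] "GET /a.jpg HTTP/1.1" 200 498 "-" "Mozilla/5.0 (compatible)"\n'
)

# Split on single spaces; double-quoted fields stay in one column because
# '"' is the parser's default quote character.
df = pd.read_csv(io.StringIO(SAMPLE), sep=' ', names=list(range(10)))

# Same column positions as in the script.
IP, URL, STATUS_CODE, SIZE = 0, 5, 6, 7

total_bytes = int(df[SIZE].sum())
status_counts = df[STATUS_CODE].value_counts()
bytes_per_ip = df.groupby(IP)[SIZE].sum().sort_values(ascending=False)

print(total_bytes)
print(status_counts)
print(bytes_per_ip)
```

Note that column 5 holds the whole quoted request string (method, URL and protocol), so the "top URL" statistics are really counted per request line.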

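pandas Study Notes

pandas has two basic data structures, Series and DataFrame. A Series is an object similar to a one-dimensional array, made up of a set of data and an index. A DataFrame is a tabular data structure with both a row index and a column index.

from pandas import Series, DataFrame
import pandas as pd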
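
As a quick illustration of how the two structures relate, here is a minimal sketch with made-up names and values (all data and variable names are assumptions for the example):

```python
import pandas as pd

# A Series: one-dimensional values plus an index.
ages = pd.Series([26, 22], index=['bob', 'loya'])

# A DataFrame: a table with a row index and a column index.
people = pd.DataFrame({'age': [26, 22], 'city': ['beijing', 'shanghai']},
                      index=['bob', 'loya'])

# Each column of a DataFrame is itself a Series that shares the row index.
age_column = people['age']

print(list(ages.index))
print(list(people.columns))
print(type(age_column).__name__)
```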

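Series

In [1]: obj = Series([4, 7, -5, 3])
In [2]: obj
Out[2]:
0    4
1    7
2   -5
3    3

The string representation of a Series shows the index on the left and the values on the right. When no index is specified, an integer index from 0 to N-1 (N being the length of the data) is created automatically. The array representation and the index object are available through the values and index attributes of the Series:

In [3]: obj.values
Out[3]: array([ 4,  7, -5,  3])
In [4]: obj.index
Out[4]: RangeIndex(start=0, stop=4, step=1)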

A Series is usually created with an explicit index:

In [5]: obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
In [6]: obj2
Out[6]:
d    4
b    7
a   -5
c    3

Use the index to fetch a single value or a set of values from a Series:

In [7]: obj2['a']
Out[7]: -5
In [8]: obj2[['c','d']]
Out[8]:
c    3
d    4

Sorting

In [9]: obj2.sort_index()
Out[9]:
a   -5
b    7
c    3
d    4
In [10]: obj2.sort_values()
Out[10]:
a   -5
c    3
d    4
b    7

Filtering and arithmetic

In [11]: obj2[obj2 > 0]
Out[11]:
d    4
b    7
c    3
In [12]: obj2 * 2
Out[12]:
d     8
b    14
a   -10
c     6

Membership

In [13]: 'b' in obj2
Out[13]: True
In [14]: 'e' in obj2
Out[14]: False

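Creating a Series from a dict

In [15]: sdata = {'shanghai':35000, 'beijing':40000, 'nanjing':26000, 'hangzhou':30000}
In [16]: obj3 = Series(sdata)
In [17]: obj3
Out[17]:
beijing     40000
hangzhou    30000
nanjing     26000
shanghai    35000

If only a dict is passed, the index of the resulting Series is the keys of the original dict (sorted in the output shown here; pandas 0.23 and later on Python 3.6+ keep the dict's insertion order instead)

In [18]: states = ['beijing', 'hangzhou', 'shanghai', 'suzhou']
In [19]: obj4 = Series(sdata, index=states)
In [20]: obj4
Out[20]:
beijing     40000.0
hangzhou    30000.0
shanghai    35000.0
suzhou          NaN

When an index is specified, the three values in sdata whose keys match the states index are found and placed in the corresponding positions. No value exists in sdata for 'suzhou', so its result is NaN (not a number), which pandas uses to represent missing or NA values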

The pandas functions isnull and notnull can be used to detect missing data:

In [21]: pd.isnull(obj4)
Out[21]:
beijing     False
hangzhou    False
shanghai    False
suzhou       True
In [22]: pd.notnull(obj4)
Out[22]:
beijing      True
hangzhou     True
shanghai     True
suzhou      False

Series has equivalent instance methods

In [23]: obj4.isnull()
Out[23]:
beijing     False
hangzhou    False
shanghai    False
suzhou       True

An important feature of Series is that data with different indexes is aligned automatically in arithmetic operations

In [24]: obj3
Out[24]:
beijing     40000
hangzhou    30000
nanjing     26000
shanghai    35000
In [25]: obj4
Out[25]:
beijing     40000.0
hangzhou    30000.0
shanghai    35000.0
suzhou          NaN
In [26]: obj3 + obj4
Out[26]:
beijing     80000.0
hangzhou    60000.0
nanjing         NaN
shanghai    70000.0
suzhou          NaN
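
When NaN for unmatched labels is not what you want, Series.add accepts a fill_value. The sketch below uses two small made-up Series (the 'suzhou' figure is an assumption for the example) to contrast plain addition with add(fill_value=0):

```python
import pandas as pd

# Hypothetical data: the two Series share only the 'beijing' label.
a = pd.Series({'beijing': 40000, 'nanjing': 26000})
b = pd.Series({'beijing': 40000, 'suzhou': 10000})

# Plain addition aligns on the index; a label present on one side only gives NaN.
aligned = a + b

# With fill_value=0, a label missing on one side is treated as 0 instead.
filled = a.add(b, fill_value=0)

print(aligned)
print(filled)
```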

The index of a Series can be modified in place by assignment

In [27]: obj.index = ['bob', 'steve', 'jeff', 'ryan']
In [28]: obj
Out[28]:
bob      4
steve    7
jeff    -5
ryan     3

DataFrame

Reading a file with pandas

In [29]: df = pd.read_table('pandas_test.txt', sep=' ', names=['name', 'age'])
In [30]: df
Out[30]:
    name  age
0    bob   26
1   loya   22
2  denny   20
3   mars   25

Selecting DataFrame columns

df[name]
In [31]: df['name']
Out[31]:
0      bob
1     loya
2    denny
3     mars
Name: name, dtype: object

Selecting DataFrame rows

df.iloc[0,:] # the first argument selects the row, the second the column; here, all columns of row 0
df.iloc[:,0] # all rows, column 0
In [32]: df.iloc[0,:]
Out[32]:
name    bob
age      26
Name: 0, dtype: object
In [33]: df.iloc[:,0]
Out[33]:
0      bob
1     loya
2    denny
3     mars
Name: name, dtype: object

A single element can be fetched with iloc; iat is a faster way

In [34]: df.iloc[1,1]
Out[34]: 22
In [35]: df.iat[1,1]
Out[35]: 22
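
For reference, the scalar accessors come in a positional and a label-based flavour. The sketch below rebuilds the small name/age table in memory (instead of reading pandas_test.txt, which is an assumption made so the example is self-contained) and fetches the same cell three ways:

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob', 'loya', 'denny', 'mars'],
                   'age': [26, 22, 20, 25]})

by_iloc = df.iloc[1, 1]    # general positional indexer (also accepts slices and lists)
by_iat = df.iat[1, 1]      # positional accessor for a single scalar only
by_at = df.at[1, 'age']    # label-based accessor for a single scalar only

print(by_iloc, by_iat, by_at)
```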

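Selecting DataFrame blocks

In [36]: df.loc[1:2,['name','age']]
Out[36]:
    name  age
1   loya   22
2  denny   20

Filtering rows by condition

Put a condition inside the square brackets to filter rows; the condition must evaluate to True or False for each row

In [37]: df[(df.index >= 1) & (df.index ...
In [38]: df[df['age'] > 22]
Out[38]:
   name  age     city
0   bob   26  beijing
3  mars   25  nanjing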
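
The following sketch shows what such a condition looks like as an object: a boolean Series with one entry per row, which can be combined with other conditions. The table is rebuilt in memory and the second condition is an assumption added for illustration:

```python
import pandas as pd

df = pd.DataFrame({'name': ['bob', 'loya', 'denny', 'mars'],
                   'age': [26, 22, 20, 25],
                   'city': ['beijing', 'shanghai', 'hangzhou', 'nanjing']})

# The condition is a boolean Series aligned with the rows of df.
mask = df['age'] > 22
older = df[mask]

# Conditions combine with & (and) and | (or); each one needs its own parentheses.
combined = df[(df['age'] > 20) & (df['city'] != 'nanjing')]

print(mask.tolist())
print(older['name'].tolist())
print(combined['name'].tolist())
```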

Adding a column

In [39]: df['city'] = ['beijing', 'shanghai', 'hangzhou', 'nanjing']
In [40]: df
Out[40]:
    name  age      city
0    bob   26   beijing
1   loya   22  shanghai
2  denny   20  hangzhou
3   mars   25   nanjing

Sorting

Sort by a given column

In [41]: df.sort_values(by='age')
Out[41]:
    name  age      city
2  denny   20  hangzhou
1   loya   22  shanghai
3   mars   25   nanjing
0    bob   26   beijing
# Import numpy and build a DataFrame
import numpy as np
In [42]: df = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
In [43]: df
Out[43]:
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
# Sort by index
In [44]: df.sort_index()
Out[44]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
In [45]: df.sort_index(axis=1)
Out[45]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
# Descending order
In [46]: df.sort_index(axis=1, ascending=False)
Out[46]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

Inspecting

# View the first 5 rows
df.head(5)
# View the last 5 rows
df.tail(5)
# View the column names
In [47]: df.columns
Out[47]: Index(['name', 'age', 'city'], dtype='object')
# View the current values of the table
In [48]: df.values
Out[48]:
array([['bob', 26, 'beijing'],
       ['loya', 22, 'shanghai'],
       ['denny', 20, 'hangzhou'],
       ['mars', 25, 'nanjing']], dtype=object)

Transposing

In [49]: df.T
Out[49]:
            0         1         2        3
name      bob      loya     denny     mars
age        26        22        20       25
city  beijing  shanghai  hangzhou  nanjing

Using isin

In [50]: df2 = df.copy()
In [51]: df2[df2['city'].isin(['shanghai','nanjing'])]
Out[51]:
   name  age      city
1  loya   22  shanghai
3  mars   25   nanjing

Arithmetic operations:

In [53]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    ...: index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
In [54]: df
Out[54]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
# Sum by column
In [55]: df.sum()
Out[55]:
one    9.25
two   -5.80
# Sum by row
In [56]: df.sum(axis=1)
Out[56]:
a    1.40
b    2.60
c     NaN
d   -0.55
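
How missing values are treated in sum depends on its arguments and, for an all-NaN row such as 'c', on the pandas version: as far as I know, pandas 0.22 and later return 0 for it by default rather than NaN. The sketch below, using the same data, shows two parameters that make the behaviour explicit; treat it as an illustration rather than a description of the version used for the output above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# By default NaN values are skipped when summing.
col_sum = df.sum()

# min_count=1 requires at least one non-NaN value, so the all-NaN row 'c' stays NaN.
row_sum = df.sum(axis=1, min_count=1)

# skipna=False propagates NaN: any NaN in a column makes its sum NaN.
col_sum_strict = df.sum(skipna=False)

print(col_sum)
print(row_sum)
print(col_sum_strict)
```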

Group

"Group by" refers to the following steps:

Splitting the data into groups based on some criteria

Applying a function to each group independently

Combining the results into a data structure

See the Grouping section

In [57]: df = pd.DataFrame({'a' : ['foo', 'bar', 'foo', 'bar',
   ....: 'foo', 'bar', 'foo', 'foo'],
   ....: 'b' : ['one', 'one', 'two', 'three',
   ....: 'two', 'two', 'one', 'three'],
   ....: 'c' : np.random.randn(8),
   ....: 'd' : np.random.randn(8)})
   ....:
In [58]: df
Out[58]:
     a      b         c         d
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033

Group the data, then apply the sum function

In [59]: df.groupby('a').sum()
Out[59]:
            c        d
a
bar -2.802588  2.42611
foo  3.146492 -0.63958
In [60]: df.groupby(['a','b']).sum()
Out[60]:
                  c         d
a   b
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434
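
Because the example above uses random numbers, its sums cannot be reproduced exactly. The sketch below walks through the same split, apply and combine steps on a small fixed table (the values are assumptions chosen so the sums are easy to verify by hand):

```python
import pandas as pd

df = pd.DataFrame({'a': ['foo', 'bar', 'foo', 'bar'],
                   'b': ['one', 'one', 'two', 'two'],
                   'c': [1, 2, 3, 4]})

# Split and apply by hand: iterate over the groups and sum column 'c' in each.
manual = {key: int(group['c'].sum()) for key, group in df.groupby('a')}

# groupby(...).sum() performs the split, apply and combine steps in one call.
grouped = df.groupby('a')['c'].sum()

# Grouping by two keys yields a result indexed by (a, b) pairs.
by_two_keys = df.groupby(['a', 'b'])['c'].sum()

print(manual)
print(grouped)
print(by_two_keys)
```

Selecting the numeric column before calling sum keeps the string column 'b' out of the aggregation.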
