1. Overview
Pandas is a third-party Python library, together with Numpy and Matplotlib, is known as the three musketeers of data analysing.
In the past, Python was mainly used for data collection and data pre-processing, but has very limited capabilities on data analysing until Pandas emerged.
Pandas enhanced Python capabilities for data analysing, which has five steps:
- Load Data
- Organize data
- Operate data
- Build Data Model
- Analysing Data
2. Data Structure
- Series, is like an array, can hold int/float/object in it;
- DataFrame, is like a table in database, has rows and columns;
2.1 Series
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import pandas as pd
import numpy as np
data=np.array([10,20,30,40])
s=pd.Series(data)
'''
0 10
1 20
2 30
3 40
dtype: int64
'''
s1=pd.Series(data, index=['a', 'b', 'c','d'])
'''
a 10
b 20
c 30
d 40
dtype: int64
'''
2.2 DataFrame / Import Data
2.2.1 read_table
1
2
3
4
5
6
7
8
9
10
11
12
import pandas as pd
df = pd.read_table(r'./examples/score.txt',sep="\t")
df
'''
学号 班级 姓名 性别 英语 体育 军训 数分 高代 解几
0 2308024241 23080242 成龙 男 76 78 77 40 23 60
1 2308024244 23080242 周怡 女 66 91 75 47 47 44
2 2308024251 23080242 张波 男 85 81 75 45 45 60
3 2308024249 23080242 朱浩 男 65 50 80 72 62 71
4 2308024219 23080242 封印 女 73 88 92 61 47 46
5 2308024201 23080242 迟培 男 60 50 89 71 76 71
'''
2.2.2 read_csv
1
2
3
4
5
6
7
8
9
df1=pd.read_csv('./examples/score.csv')
df1
'''
Unnamed: 0 学号 班级 姓名 性别 英语 体育 军训 数分 高代 解几
0 0 2308024241 23080242 成龙 男 76 78 77 40 23 60
1 1 2308024244 23080242 周怡 女 66 91 75 47 47 44
2 2 2308024251 23080242 张波 男 85 81 75 45 45 60
3 3 2308024249 23080242 朱浩 男 65 50 80 72 62 71
'''
2.2.3 read_excel
1
2
3
4
5
6
7
8
9
10
df2=pd.read_excel('./examples/score.xlsx')
df2
'''
学号 姓名 英语 数学 语文
0 203537333 成龙 70.6 46.8 89.4
1 203537334 刘强东 59.2 49.8 76.9
2 203537335 马云 47.8 52.8 64.4
3 203537336 雷军 36.4 55.8 51.9
4 203537336 雷军 36.4 55.8 51.9
'''
2.3 DataFrame / Export Data
2.3.1 to_csv
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# default seperator is ','
df.to_csv('./output/score.csv')
'''
,学号,班级,姓名,性别,英语,体育,军训,数分,高代,解几
0,2308024241,23080242,成龙,男,76,78,77,40,23,60
1,2308024244,23080242,周怡,女,66,91,75,47,47,44
2,2308024251,23080242,张波,男,85,81,75,45,45,60
3,2308024249,23080242,朱浩,男,65,50,80,72,62,71
'''
df.to_csv('./output/score1.csv', sep=";")
'''
;学号;班级;姓名;性别;英语;体育;军训;数分;高代;解几
0;2308024241;23080242;成龙;男;76;78;77;40;23;60
1;2308024244;23080242;周怡;女;66;91;75;47;47;44
2;2308024251;23080242;张波;男;85;81;75;45;45;60
3;2308024249;23080242;朱浩;男;65;50;80;72;62;71
'''
2.3.2 to_excel
1
df.to_excel('./output/score.excel')
3. Data Operate
3.1 Data Cleaning
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
df
'''
学号 班级 姓名 性别 英语 体育 军训 数分 高代 解几
0 2308024241 23080242 成龙 男 76 78 77 40 23 60
1 2308024244 23080242 周怡 女 66 91 75 47 47 44
2 2308024251 23080242 张波 男 85 81 75 45 45 60
3 2308024249 23080242 朱浩 男 65 50 80 72 62 71
4 2308024219 23080242 封印 女 73 88 92 61 47 46
5 2308024201 23080242 迟培 男 60 50 89 71 76 71
...
15 2308024421 23080244 林建祥 男 72 72 81 63 90 75
16 2308024433 23080244 李大强 男 79 76 77 78 70 70
17 2308024428 23080244 李侧通 男 64 96 91 69 60 77
18 2308024402 23080244 王慧 女 73 74 93 70 71 75
19 2308024422 23080244 李晓亮 男 85 60 85 72 72 83
20 2308024201 23080242 迟培 男 60 50 89 71 76 71
'''
df.duplicated()
'''
0 False
1 False
2 False
...
16 False
17 False
18 False
19 False
20 True
dtype: bool
we can see the last line was duplicated.
'''
df.drop_duplicates()
'''
学号 班级 姓名 性别 英语 体育 军训 数分 高代 解几
0 2308024241 23080242 成龙 男 76 78 77 40 23 60
1 2308024244 23080242 周怡 女 66 91 75 47 47 44
2 2308024251 23080242 张波 男 85 81 75 45 45 60
...
15 2308024421 23080244 林建祥 男 72 72 81 63 90 75
16 2308024433 23080244 李大强 男 79 76 77 78 70 70
17 2308024428 23080244 李侧通 男 64 96 91 69 60 77
18 2308024402 23080244 王慧 女 73 74 93 70 71 75
19 2308024422 23080244 李晓亮 男 85 60 85 72 72 83
the duplicated row was removed from the result.
'''
3.2 Data Extract
extract a given position data from a certain field to create a new field slice(start, stop)
3.3 Alter Record
1
2
# score_2 is the new column which will be generated by other operation
df['daily_score'] = score_2