Machine Learning - Pandas

Posted by Bourne's Blog - A Full-stack & Web3 Developer on April 30, 2022

1. Overview

Pandas is a third-party Python library, together with Numpy and Matplotlib, is known as the three musketeers of data analysing.

In the past, Python was mainly used for data collection and data pre-processing, but has very limited capabilities on data analysing until Pandas emerged.

Pandas enhanced Python capabilities for data analysing, which has five steps:

  • Load Data
  • Organize data
  • Operate data
  • Build Data Model
  • Analysing Data

2. Data Structure

  • Series, is like an array, can hold int/float/object in it;
  • DataFrame, is like a table in database, has rows and columns;

2.1 Series

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
import pandas as pd
import numpy as np
data=np.array([10,20,30,40])
s=pd.Series(data)
'''
0    10
1    20
2    30
3    40
dtype: int64
'''
s1=pd.Series(data, index=['a', 'b', 'c','d'])
'''
a    10
b    20
c    30
d    40
dtype: int64
'''

2.2 DataFrame / Import Data

2.2.1 read_table

1
2
3
4
5
6
7
8
9
10
11
12
import pandas as pd
df = pd.read_table(r'./examples/score.txt',sep="\t")
df
'''
    学号	班级	姓名	性别	英语	体育	军训	数分	高代	解几
0	2308024241	23080242	成龙	男	76	78	77	40	23	60
1	2308024244	23080242	周怡	女	66	91	75	47	47	44
2	2308024251	23080242	张波	男	85	81	75	45	45	60
3	2308024249	23080242	朱浩	男	65	50	80	72	62	71
4	2308024219	23080242	封印	女	73	88	92	61	47	46
5	2308024201	23080242	迟培	男	60	50	89	71	76	71
'''

2.2.2 read_csv

1
2
3
4
5
6
7
8
9
df1=pd.read_csv('./examples/score.csv')
df1
'''
Unnamed: 0	学号	班级	姓名	性别	英语	体育	军训	数分	高代	解几
0	0	2308024241	23080242	成龙	男	76	78	77	40	23	60
1	1	2308024244	23080242	周怡	女	66	91	75	47	47	44
2	2	2308024251	23080242	张波	男	85	81	75	45	45	60
3	3	2308024249	23080242	朱浩	男	65	50	80	72	62	71
'''

2.2.3 read_excel

1
2
3
4
5
6
7
8
9
10
df2=pd.read_excel('./examples/score.xlsx')
df2
'''
	学号	姓名	英语	数学	语文
0	203537333	成龙	70.6	46.8	89.4
1	203537334	刘强东	59.2	49.8	76.9
2	203537335	马云	47.8	52.8	64.4
3	203537336	雷军	36.4	55.8	51.9
4	203537336	雷军	36.4	55.8	51.9
'''

2.3 DataFrame / Export Data

2.3.1 to_csv

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# default seperator is ','
df.to_csv('./output/score.csv')
'''
,学号,班级,姓名,性别,英语,体育,军训,数分,高代,解几
0,2308024241,23080242,成龙,男,76,78,77,40,23,60
1,2308024244,23080242,周怡,女,66,91,75,47,47,44
2,2308024251,23080242,张波,男,85,81,75,45,45,60
3,2308024249,23080242,朱浩,男,65,50,80,72,62,71
''' 
df.to_csv('./output/score1.csv', sep=";")
'''
;学号;班级;姓名;性别;英语;体育;军训;数分;高代;解几
0;2308024241;23080242;成龙;男;76;78;77;40;23;60
1;2308024244;23080242;周怡;女;66;91;75;47;47;44
2;2308024251;23080242;张波;男;85;81;75;45;45;60
3;2308024249;23080242;朱浩;男;65;50;80;72;62;71
'''

2.3.2 to_excel

1
df.to_excel('./output/score.excel')

3. Data Operate

3.1 Data Cleaning

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
df
'''
学号	班级	姓名	性别	英语	体育	军训	数分	高代	解几
0	2308024241	23080242	成龙	男	76	78	77	40	23	60
1	2308024244	23080242	周怡	女	66	91	75	47	47	44
2	2308024251	23080242	张波	男	85	81	75	45	45	60
3	2308024249	23080242	朱浩	男	65	50	80	72	62	71
4	2308024219	23080242	封印	女	73	88	92	61	47	46
5	2308024201	23080242	迟培	男	60	50	89	71	76	71
...
15	2308024421	23080244	林建祥	男	72	72	81	63	90	75
16	2308024433	23080244	李大强	男	79	76	77	78	70	70
17	2308024428	23080244	李侧通	男	64	96	91	69	60	77
18	2308024402	23080244	王慧	女	73	74	93	70	71	75
19	2308024422	23080244	李晓亮	男	85	60	85	72	72	83
20	2308024201	23080242	迟培	男	60	50	89	71	76	71
'''
df.duplicated()
'''
0     False
1     False
2     False
...
16    False
17    False
18    False
19    False
20     True 
dtype: bool

we can see the last line was duplicated.
'''

df.drop_duplicates()
''' 
学号	班级	姓名	性别	英语	体育	军训	数分	高代	解几
0	2308024241	23080242	成龙	男	76	78	77	40	23	60
1	2308024244	23080242	周怡	女	66	91	75	47	47	44
2	2308024251	23080242	张波	男	85	81	75	45	45	60
...
15	2308024421	23080244	林建祥	男	72	72	81	63	90	75
16	2308024433	23080244	李大强	男	79	76	77	78	70	70
17	2308024428	23080244	李侧通	男	64	96	91	69	60	77
18	2308024402	23080244	王慧	女	73	74	93	70	71	75
19	2308024422	23080244	李晓亮	男	85	60	85	72	72	83

the duplicated row was removed from the result.
'''

3.2 Data Extract

extract a given position data from a certain field to create a new field slice(start, stop)

3.3 Alter Record

1
2
# score_2 is the new column which will be generated by other operation 
df['daily_score'] = score_2

4. Data Analysing