3. Data Structure

Yi-Ju Tseng

Common Data Structures

A data structure is a particular way of organizing data in a computer

Sequence
- List (list)
- Tuple (tuple)
- Range (range)
Mapping
- Dictionary (dict)

Matrix (matrix, array)
- Plain matrix (nested list)
- numpy array
Data frame
- pandas DataFrame

Sequence

List (list)

A list is a collection that lets us put many values in a single “variable”

Surrounded by square brackets [ ]
Elements are separated by commas
An element can be any Python object
A list can be empty
Supports computations such as sum() and max()

list1 = [3.5, 4.6, 5.7]
print(list1)

[3.5, 4.6, 5.7]

list and for loop

Take each element of the list in order, assign it to x, and print it

friends = ['Joseph', 'Glenn', 'Sally']
for x in friends:
    print('Happy New Year:',  x)
print('Done!')

Happy New Year: Joseph
Happy New Year: Glenn
Happy New Year: Sally
Done!

Accessing list elements

Sequences include list, tuple, and range
Use square brackets with an index to get a value: variable_name[index]
The index starts at 0

Position	1	2	3	4	…
index	0	1	2	3	…

print(friends)

['Joseph', 'Glenn', 'Sally']

print(friends[0])

Joseph

print(friends[1])

Glenn

print(friends[2])

Sally

Accessing list elements - slice

variable_name[start:stop] runs from start up to but not including stop

print(friends)

['Joseph', 'Glenn', 'Sally']

print(friends[1:2])

['Glenn']

print(friends[:2])

['Joseph', 'Glenn']

print(friends[:1])

['Joseph']
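
A few more slice forms may help make the "up to but not including" rule concrete. The sketch below reuses the friends list from the slides; the variable names from_second, last_friend, and first_two are only illustrative.

```python
friends = ['Joseph', 'Glenn', 'Sally']

# Omitting the stop value runs the slice to the end of the list.
from_second = friends[1:]

# A negative index counts from the end of the list.
last_friend = friends[-1]

# A slice returns a new list, so changing it leaves the original untouched.
first_two = friends[:2]
first_two[0] = 'Changed'

print(from_second)   # ['Glenn', 'Sally']
print(last_friend)   # Sally
print(first_two)     # ['Changed', 'Glenn']
print(friends)       # ['Joseph', 'Glenn', 'Sally']
```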

Range (range)

Declare with range(start, stop [exclusive], step)
With a single argument, it is treated as the stop value and start defaults to 0
With two arguments, step defaults to 1

r1 = range(10)
print(r1)

range(0, 10)

r2 = range(10,20)
print(r2)

range(10, 20)

r3 = range(10,20,2)
print(r3)

range(10, 20, 2)

Range (range) in a for loop

r1 is range(0, 10)

for i in r1:
  print(i)

r3 is range(10, 20, 2)

for i in r3:
  print(i)
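
Printing a range only shows how it was declared, not the numbers it produces. One way to inspect the values without a loop is to convert the range to a list, as in this small sketch (values_r1 and values_r3 are illustrative names).

```python
r1 = range(10)
r3 = range(10, 20, 2)

# list() walks the range and collects every value it produces.
values_r1 = list(r1)
values_r3 = list(r3)

print(values_r1)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(values_r3)   # [10, 12, 14, 16, 18]
print(len(r3))     # 5
```

The stop value (10 and 20 here) never appears in the result.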

list and for loop - range

friends = ['Joseph', 'Glenn', 'Sally']
print(len(friends))

3

print(range(len(friends)))

range(0, 3)

for x in friends:
    print('Happy New Year:',  x)
print('Done!')

Happy New Year: Joseph
Happy New Year: Glenn
Happy New Year: Sally
Done!

for i in range(len(friends)) :
    friend = friends[i]
    print(friend)
print('Done!')

Joseph
Glenn
Sally
Done!
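
When both the index and the element are needed, the index-based loop can also be written with the built-in enumerate(). This is an alternative sketch, not something the slides require; the name pairs is illustrative.

```python
friends = ['Joseph', 'Glenn', 'Sally']

# enumerate() yields (index, element) pairs, so no manual indexing is needed.
for i, friend in enumerate(friends):
    print(i, friend)

# Collect the pairs to see exactly what the loop receives.
pairs = list(enumerate(friends))
print(pairs)   # [(0, 'Joseph'), (1, 'Glenn'), (2, 'Sally')]
```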

Adding to and modifying a list

Lists are “mutable”
- we can change an element of a list using the index operator

print(friends)

['Joseph', 'Glenn', 'Sally']

friends[0]='new friend'
print(friends)

['new friend', 'Glenn', 'Sally']

Adding to and modifying a list: +

We can create a new list by concatenating two existing lists with +

friend_cs = ['Joseph', 'Glenn', 'Sally']
friend_md = ['Alex', 'John', 'Ray']
print(friend_cs+friend_md)

['Joseph', 'Glenn', 'Sally', 'Alex', 'John', 'Ray']

Adding to and modifying a list: append

We can add elements to a list, including an empty one, using the append method

print(friend_cs)

['Joseph', 'Glenn', 'Sally']

friend_cs.append('Megan')
print(friend_cs)

['Joseph', 'Glenn', 'Sally', 'Megan']
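
The empty-list pattern mentioned above can be sketched as follows. The example also contrasts append, which changes the list in place, with +, which builds a new list. The names squares and combined are illustrative.

```python
# Start from an empty list and grow it inside a loop.
squares = []
for n in range(1, 6):
    squares.append(n * n)   # append modifies squares in place

# + creates a new list and leaves squares unchanged.
combined = squares + [36]

print(squares)    # [1, 4, 9, 16, 25]
print(combined)   # [1, 4, 9, 16, 25, 36]
```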

Computing with lists

list1 = [3.5, 4.6, 5.7]
print(list1)

[3.5, 4.6, 5.7]

sum(list_name): sum of the elements

sum(list1)

13.8

max(list_name): largest element

list1 = [3.5, 4.6, 5.7]
max(list1)

5.7
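
Other built-in functions work on lists in the same way as sum() and max(). The sketch below shows a few of them on list1; the result names are illustrative.

```python
list1 = [3.5, 4.6, 5.7]

smallest = min(list1)                      # smallest element
count = len(list1)                         # number of elements
average = sum(list1) / len(list1)          # mean, built from sum() and len()
descending = sorted(list1, reverse=True)   # new sorted list; list1 is unchanged

print(smallest, count, descending)
```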

See the documentation for more

Tuple (tuple)

Like a list, but…
Surrounded by parentheses ( )
Elements are separated by commas
Tuples are “immutable”, which makes them more efficient than lists
- No .append(), .sort(), etc.

tuple1 = (3.5, 4.6, 5.7, "111", "222")
print(tuple1)

(3.5, 4.6, 5.7, '111', '222')

Tuple assignment

We can also put a tuple on the left-hand side of an assignment statement

(x, y) = (4, 'fred')
print(y)

fred
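
To see what "immutable" means in practice, the sketch below tries to change a tuple element and catches the resulting error. It also uses tuple assignment to swap two variables. The name message is illustrative.

```python
tuple1 = (3.5, 4.6, 5.7)

# Assigning to an element of a tuple raises a TypeError.
try:
    tuple1[0] = 9.9
except TypeError as err:
    message = str(err)
    print('Cannot modify a tuple:', message)

# Tuple assignment can swap two values in one statement.
x, y = 4, 'fred'
x, y = y, x
print(x, y)   # fred 4
```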

Hands-on

Create a sequence a containing the numbers 1 to 10
Create a sequence b containing all even numbers from 1 to 20
Get the 4th value of sequence a

Mapping

Dictionaries (dict)

Dictionaries are Python’s most powerful data collection
Declared with curly braces { }
The content is a set of key: value pairs, separated by commas
Keys must be unique and are usually strings

dist1 = {"id":1, "name":"Ryan","age":20, "School":"NYCU"}
print(dist1)

{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}

Accessing mapping values: [ ]

Mappings include dict
Use square brackets with a key to get a value: variable_name[key]

print(dist1)

{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}

print(dist1['age'])

20

Accessing mapping values: .keys()

[key1, key2, …]

print(dist1)

{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}

print(dist1.keys())

dict_keys(['id', 'name', 'age', 'School'])

Accessing mapping values: .values()

[value1, value2, …]

print(dist1)

{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}

print(dist1.values())

dict_values([1, 'Ryan', 20, 'NYCU'])

Accessing mapping values: .items()

[(key1, value1), (key2, value2),…]

print(dist1)

{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}

print(dist1.items())

dict_items([('id', 1), ('name', 'Ryan'), ('age', 20), ('School', 'NYCU')])

Adding to and modifying a mapping: .update()

Mappings include dict
mapping_object.update() adds or modifies content (key: value pairs) in the mapping

print(dist1)

{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}

dist1.update({"age":25, "dept":"CS"})
print(dist1)

{'id': 1, 'name': 'Ryan', 'age': 25, 'School': 'NYCU', 'dept': 'CS'}

Adding to and modifying a mapping: [ ]

Read the value with [ ], then write it back

print(dist1)

{'id': 1, 'name': 'Ryan', 'age': 25, 'School': 'NYCU', 'dept': 'CS'}

dist1['age']=dist1['age']+10
print(dist1)

{'id': 1, 'name': 'Ryan', 'age': 35, 'School': 'NYCU', 'dept': 'CS'}

Adding to and modifying a mapping: .get()

.get(key, 0) reads a value
- key: the key to look up
- 0: the default returned when the key is missing (any value may be used)

print(dist1)

{'id': 1, 'name': 'Ryan', 'age': 35, 'School': 'NYCU', 'dept': 'CS'}

dist1['age']=dist1.get('age', 0)+10
print(dist1)

{'id': 1, 'name': 'Ryan', 'age': 45, 'School': 'NYCU', 'dept': 'CS'}
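
The difference between [ ] and .get() shows up only when the key is missing. The sketch below uses a hypothetical key 'phone' that is not in the dictionary; the names phone, missing_key, and has_age are illustrative.

```python
dist1 = {"id": 1, "name": "Ryan", "age": 20, "School": "NYCU"}

# .get() returns the default instead of failing when the key is absent.
phone = dist1.get('phone', 'unknown')

# [ ] raises a KeyError for a missing key.
try:
    dist1['phone']
except KeyError as err:
    missing_key = err.args[0]
    print('No such key:', missing_key)

# The in operator checks whether a key exists.
has_age = 'age' in dist1

print(phone, has_age)   # unknown True
```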

Hands-on

Create a mapping dist1 with the following content

dist1 = {"id":[1,2,3,4], 
        "name":["Ryan","Tom","Emma","Amy"], 
        "School":"NYCU"}

Get the 3rd value in id
Get the 1st character (letter) of the 2nd value in name

mapping + for

Iterating over a mapping (dict) yields its keys by default
Use .values() to iterate over the values

print(dist1)

{'id': [1, 2, 3, 4], 'name': ['Ryan', 'Tom', 'Emma', 'Amy'], 'School': 'NYCU'}

Key

for i in dist1:
  print(i)

id
name
School

Value

for i in dist1.values():
  print(i)

[1, 2, 3, 4]
['Ryan', 'Tom', 'Emma', 'Amy']
NYCU

mapping + for - key and value

print(dist1)

{'id': [1, 2, 3, 4], 'name': ['Ryan', 'Tom', 'Emma', 'Amy'], 'School': 'NYCU'}

Get the key first, then look up the value

for key in dist1:
  print(key, dist1[key])

id [1, 2, 3, 4]
name ['Ryan', 'Tom', 'Emma', 'Amy']
School NYCU

Or use .items() directly

for key,value in dist1.items():
  print(key, value)

id [1, 2, 3, 4]
name ['Ryan', 'Tom', 'Emma', 'Amy']
School NYCU

Hands-on

Most common name? Edit the ??? parts

names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']

Use a dict and a for loop to get the most common name and its count

names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']
counts = dict()
for name in ??? : # edit here
    if name not in counts: 
      counts[name] = ??? # edit here
    else :
      ??? # edit here
print(counts)

names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']
counts = dict()
for name in ??? : # edit here
    counts[name] = ??? + 1 # edit here
print(counts)

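Matrix (matrix, array)

Plain matrix (matrix, array)

Can be thought of as a stack of sequences (1D arrays)
Each sequence forms one row
Nested square brackets delimit the rows, and commas separate the column values within each row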

Col1	Col2	Col3
1	2	3
4	5	6

matrix1 = [[1, 2, 3], 
          [4, 5, 6]]
print(matrix1)

[[1, 2, 3], [4, 5, 6]]

row1 = [1, 2, 3]
row2 = [4, 5, 6]
matrix2 = [row1, row2]
print(matrix2)

[[1, 2, 3], [4, 5, 6]]
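
Because a plain matrix is a list of rows, a value is reached with two indexes: first the row, then the column. The sketch below assumes the same 2 x 3 matrix; the names second_row, value, n_rows, n_cols, and col2 are illustrative.

```python
matrix1 = [[1, 2, 3],
           [4, 5, 6]]

second_row = matrix1[1]    # the whole second row
value = matrix1[1][2]      # row index 1, column index 2

n_rows = len(matrix1)      # number of rows
n_cols = len(matrix1[0])   # number of columns, taken from the first row

# A column has to be collected row by row.
col2 = [row[1] for row in matrix1]

print(second_row, value, n_rows, n_cols, col2)
# [4, 5, 6] 6 2 3 [2, 5]
```

Pulling out a column needs a loop here, which is one reason numpy arrays are more convenient for matrix work.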

numpy array - 1D

From the numpy library (Numerical Python)
Like a list, but arrays are fixed-type, which provides much more efficient storage and data operations
Use np.array(list) to create a numpy array from a list

import numpy as np
a = np.array([0, 0.5, 1.0, 1.5, 2.0]) 
list_string = ['a', 'b', 'c']
b = np.array(list_string) 
print([a,b])

[array([0. , 0.5, 1. , 1.5, 2. ]), array(['a', 'b', 'c'], dtype='<U1')]
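
The fixed-type design also changes how operators behave. As a rough illustration, multiplying a list by 2 repeats it, while multiplying an array by 2 doubles every element. The names values, doubled_list, doubled_arr, and mixed are illustrative.

```python
import numpy as np

values = [0, 0.5, 1.0]

# On a list, * repeats the elements.
doubled_list = values * 2

# On a numpy array, * is applied to every element.
arr = np.array(values)
doubled_arr = arr * 2

# All elements share one type: the integer 1 is stored as a float here.
mixed = np.array([1, 2.5])

print(doubled_list)   # [0, 0.5, 1.0, 0, 0.5, 1.0]
print(doubled_arr)    # [0. 1. 2.]
print(mixed.dtype)    # float64
```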

numpy array - 1D

Or use np.arange(start, stop [exclusive], step) to generate a numpy array, in the same way as range()

c = np.arange(0, 10, 2) 
print(c)

[0 2 4 6 8]

numpy array - 2D

A numpy array can be created from a combination of sequences (a plain 2D matrix)
By default, the sequences are stacked as rows

two_d = np.array([row1,row2]) 
print(two_d)

[[1 2 3]
 [4 5 6]]
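
To confirm how the rows were stacked, the array's shape, ndim, and size attributes can be inspected. A minimal sketch using the same two rows:

```python
import numpy as np

row1 = [1, 2, 3]
row2 = [4, 5, 6]
two_d = np.array([row1, row2])

print(two_d.shape)   # (2, 3): 2 rows, 3 columns
print(two_d.ndim)    # 2: number of dimensions
print(two_d.size)    # 6: total number of elements
```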

numpy array - Creating Arrays from Scratch

np.full(dim, value): repeated values
2d dimension: (row, column)

np.full((2, 4), 3.14)

array([[3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14]])

np.random.random(dim)
np.random.normal(mean,sd,dim)

np.random.random((2,3))

array([[0.24073359, 0.27632724, 0.82718012],
       [0.92877835, 0.90098544, 0.45848069]])

np.random.normal(0,1,(2,3))

array([[ 0.67648559, -0.28466734, -0.04841429],
       [ 1.30753465, -0.5097181 ,  0.81616902]])
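
Random arrays differ on every run, so the numbers shown above will not be reproduced exactly. One way to get repeatable values is a seeded generator from np.random.default_rng, sketched below; the seed 42 and the variable names are arbitrary choices, and this is an alternative to the np.random.random calls used in the slides.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

uniform_a = rng_a.random((2, 3))   # uniform values in [0, 1), shape (2, 3)
uniform_b = rng_b.random((2, 3))

# Normal distribution with mean 0 and standard deviation 1.
normal_a = rng_a.normal(0, 1, (2, 3))

print(np.array_equal(uniform_a, uniform_b))   # True
print(normal_a.shape)                         # (2, 3)
```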

numpy array - 取值

Will be covered in Ch 5

numpy array - 新增修改

Will be covered in Ch 5

numpy array - Computation

sum(), mean(), std(), cumsum(), max(), min()

print(a)

[0.  0.5 1.  1.5 2. ]

print(a.sum())

5.0

print(a.mean())

1.0

print(a.std())

0.7071067811865476

print(a.cumsum())

[0.  0.5 1.5 3.  5. ]

print(two_d)

[[1 2 3]
 [4 5 6]]

print(two_d.sum())

21

print(two_d.mean())

3.5

print(two_d.std())

1.707825127659933

print(two_d.cumsum())

[ 1  3  6 10 15 21]

numpy array - Computation

sum(), mean(), std(), cumsum(), max(), min()

axis=0 computes down the rows, giving one result per column; axis=1 computes across the columns, giving one result per row

print(two_d)

[[1 2 3]
 [4 5 6]]

print(two_d.sum(axis=0))

[5 7 9]

print(two_d.sum(axis=1))

[ 6 15]

print(two_d.mean(axis=0))

[2.5 3.5 4.5]

print(two_d.mean(axis=1))

[2. 5.]
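
The axis argument works the same way for the other computation methods. The sketch below applies max() and min() to the same array; col_max and row_min are illustrative names.

```python
import numpy as np

two_d = np.array([[1, 2, 3],
                  [4, 5, 6]])

# axis=0: one result per column (3 values for a 2 x 3 array).
col_max = two_d.max(axis=0)

# axis=1: one result per row (2 values for a 2 x 3 array).
row_min = two_d.min(axis=1)

print(col_max)   # [4 5 6]
print(row_min)   # [1 4]
```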

Hands-on

The first student (index 0) has the wrong score; the correct score is 73. Please correct it
Please compute the average score and determine which students passed (scores above a certain threshold)

import numpy as np
scores = np.array([75, 82, 90, 65, 88, 55, 66, 77, 44, 100])

pandas

Series, DataFrame, and Index

pandas Series

Load the pandas library before use
A Series is a one-dimensional array of indexed data
Create one with pd.Series(list)

import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

print(data[1])

0.5

pandas Series vs. numpy 1d array

index

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

print(data['b'])

0.5
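
A Series keeps its data and its index as two separate parts, which is the main difference from a numpy 1D array. The sketch below looks at both parts and slices by label; note that a label slice includes the end label. The name subset is illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

print(data.values)   # the underlying numpy array
print(data.index)    # the labels

# Slicing by label includes both the start and the end label.
subset = data['b':'c']
print(subset)
```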

pandas Series from dictionary

pd.Series(dictionary)

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
print(population)

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

pandas DataFrame

a two-dimensional array with both flexible row indices and flexible column names
a sequence of aligned Series objects
heterogeneous types and/or missing data

df1 = pd.DataFrame({"ID":[1, 2, 3, 4],
                    "Name":["Tom","Emma","Ryan","Amy"]})
df1

	ID	Name
0	1	Tom
1	2	Emma
2	3	Ryan
3	4	Amy

pandas DataFrame

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
area_dict = {'California': 423967, 
              'Texas': 695662, 
              'New York': 141297,
              'Florida': 170312,
              'Illinois': 149995}
area = pd.Series(area_dict)
states = pd.DataFrame({'population': population,
                       'area': area})
states

	population	area
California	38332521	423967
Texas	26448193	695662
New York	19651127	141297
Florida	19552860	170312
Illinois	12882135	149995
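
"Aligned" means the Series are matched by index label, not by position. The sketch below uses a shortened two-state version of the data and deliberately lists the states in a different order in each Series to show this.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
# Same labels, but listed in the opposite order.
area = pd.Series({'Texas': 695662, 'California': 423967})

states = pd.DataFrame({'population': population,
                       'area': area})

# Each value ends up in the row that carries its own label.
print(states['area']['Texas'])              # 695662
print(states['population']['California'])   # 38332521
```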

pandas DataFrame from numpy array

pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

	foo	bar
a	0.637051	0.920335
b	0.691742	0.210806
c	0.303617	0.441685

pandas DataFrame - Accessing values

DataFrame_name.head(n): get the first n rows

df1.head(2)

	ID	Name
0	1	Tom
1	2	Emma

DataFrame_name.tail(n): get the last n rows

df1.tail(2)

	ID	Name
2	3	Ryan
3	4	Amy

pandas DataFrame - Accessing values (column)

DataFrame_name[column_name]: returns a Series

df1["Name"]

0     Tom
1    Emma
2    Ryan
3     Amy
Name: Name, dtype: object

DataFrame_name[[column_name]]: returns a DataFrame

df1[["Name"]]

	Name
0	Tom
1	Emma
2	Ryan
3	Amy

pandas DataFrame - Accessing values (row)

DataFrame_name[row slice]

df1[0:1]

	ID	Name
0	1	Tom

Hands-on

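Create a pandas DataFrame that stores student IDs, names, and scores, with 5 rows of data
Extract the score column
Extract both the name and score columns
Extract the score of the 3rd student (hint: sequence)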

pandas DataFrame - Setting the index

The default index is the sequence 0 to n-1; use the index parameter to change it

df_index = pd.DataFrame({"ID":[1, 2, 3, 4],
                          "Name":["Tom","Emma","Ryan","Amy"]},
                          index=["a","b","c","d"])
df_index

	ID	Name
a	1	Tom
b	2	Emma
c	3	Ryan
d	4	Amy

pandas DataFrame - Setting the index

An index can also be set on an existing data frame with .set_index(column_name)

df_index2 = df1.set_index("ID")
df_index2

	Name
ID
1	Tom
2	Emma
3	Ryan
4	Amy

pandas DataFrame - Adding and modifying

pd.concat([pandas_object1, pandas_object2]) concatenates along rows by default

df1 = pd.DataFrame({"ID":[1, 2, 3, 4],
                    "Name":["Tom","Emma","Ryan","Amy"]})
df2 = pd.DataFrame({"ID":[5,6,7],
                    "Name":["A","B","C"]})                 
newdf = pd.concat([df1,df2])
print(newdf)

   ID  Name
0   1   Tom
1   2  Emma
2   3  Ryan
3   4   Amy
0   5     A
1   6     B
2   7     C
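
In the output above, the index labels 0, 1, 2 appear twice because each data frame keeps its own index. If a fresh 0 to n-1 index is preferred, pd.concat accepts ignore_index=True, as in this sketch (newdf_reindexed is an illustrative name).

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "Name": ["Tom", "Emma", "Ryan", "Amy"]})
df2 = pd.DataFrame({"ID": [5, 6, 7],
                    "Name": ["A", "B", "C"]})

# ignore_index=True discards the original labels and renumbers the rows.
newdf_reindexed = pd.concat([df1, df2], ignore_index=True)
print(newdf_reindexed)
```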

Check Data Type

Use the type() function to check the data structure of an object

type(dist1)

dict
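
The same check works for every structure covered in this chapter. The sketch below builds one small example of each and records its type name; the examples dictionary and its labels are illustrative.

```python
import numpy as np
import pandas as pd

examples = {
    'list': [1, 2, 3],
    'tuple': (1, 2, 3),
    'range': range(3),
    'dict': {'id': 1},
    'numpy array': np.array([1, 2, 3]),
    'Series': pd.Series([1, 2, 3]),
    'DataFrame': pd.DataFrame({'ID': [1, 2, 3]}),
}

# type(obj).__name__ gives the short name of each object's type.
type_names = {label: type(obj).__name__ for label, obj in examples.items()}

for label, name in type_names.items():
    print(label, '->', name)
```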

Common Data Structures - Recap

Sequence
- List (list) [,]
- Tuple (tuple) (,)
- Range (range) range(s,e,i)
Mapping
- Dictionary (dict) {key:value,key:value}

Matrix (matrix, array)
- Plain matrix (nested list)
- numpy array
Data frame
- pandas DataFrame

References

Python for Everybody
- Some contents are from Python for Everybody, and these contents are Copyright 2010- Charles R. Severance (www.dr-chuck.com) of the University of Michigan School of Information and open.umich.edu and made available under a Creative Commons Attribution 4.0 License.
Python Data Science Handbook
