3. Data Structure

Yi-Ju Tseng

Common Data Structures

A particular way of organizing data in a computer

  • Sequence
    • list
    • tuple
    • range
  • Mapping
    • dictionary (dict)
  • Matrix / array
    • plain matrix (nested lists)
    • numpy array
  • Data frame
    • pandas DataFrame

Sequence

List

A collection allows us to put many values in a single “variable”

  • surrounded by square brackets [ ]
  • elements separated by commas
  • an element can be any Python object
  • can be empty
  • supports computations such as sum() and max()
list1 = [3.5, 4.6, 5.7]
print(list1)
[3.5, 4.6, 5.7]

list and for loop

Take each element of the list in order, assign it to x, and print it

friends = ['Joseph', 'Glenn', 'Sally']
for x in friends:
    print('Happy New Year:',  x)
print('Done!')
Happy New Year: Joseph
Happy New Year: Glenn
Happy New Year: Sally
Done!

list: accessing values

  • Sequences include: list, tuple, range
  • Use square brackets with an index to get a value: variable_name[index]
  • The index starts at 0
Position  1  2  3  4
Index     0  1  2  3
print(friends)
['Joseph', 'Glenn', 'Sally']
print(friends[0])
Joseph
print(friends[1])
Glenn
print(friends[2])
Sally

list: accessing values - slice

variable_name[start:stop] returns the elements from start up to but not including stop

print(friends)
['Joseph', 'Glenn', 'Sally']
print(friends[1:2])
['Glenn']
print(friends[:2])
['Joseph', 'Glenn']
print(friends[:1])
['Joseph']
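
As a small sketch, here are a few more slice forms on the same friends list. Omitting the stop and using a negative index are standard Python behaviour, but they are not shown on the slide itself.

```python
friends = ['Joseph', 'Glenn', 'Sally']

# Omitting the stop runs to the end of the list
from_second = friends[1:]      # ['Glenn', 'Sally']

# A negative index counts from the end
last_friend = friends[-1]      # 'Sally'

# A slice always returns a new list, even for a single element
one_item = friends[0:1]        # ['Joseph']

print(from_second, last_friend, one_item)
```

Note that friends[0] gives the string 'Joseph', while friends[0:1] gives a one-element list.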

Range

  • Declared with range(start, stop [not included], step)
  • With a single argument, it is treated as the stop and the start defaults to 0
  • With two arguments, the step defaults to 1
r1 = range(10)
print(r1)
range(0, 10)
r2 = range(10,20)
print(r2)
range(10, 20)
r3 = range(10,20,2)
print(r3)
range(10, 20, 2)

Range - iterating with for

range(0, 10)
for i in r1:
  print(i)
0
1
2
3
4
5
6
7
8
9
range(10, 20, 2)
for i in r3:
  print(i)
10
12
14
16
18
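
Printing a range shows only its definition, not its values. A minimal sketch, assuming the same r1 and r3 as above: converting a range with list() displays the numbers it would generate.

```python
r1 = range(10)
r3 = range(10, 20, 2)

# list() materializes the values a range would produce
r1_values = list(r1)
r3_values = list(r3)
print(r1_values)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(r3_values)  # [10, 12, 14, 16, 18]

# The stop value itself is never included
print(20 in r3)   # False
```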

list and for loop - range

friends = ['Joseph', 'Glenn', 'Sally']
print(len(friends))
3
print(range(len(friends)))
range(0, 3)
for x in friends:
    print('Happy New Year:',  x)
print('Done!')
Happy New Year: Joseph
Happy New Year: Glenn
Happy New Year: Sally
Done!
for i in range(len(friends)) :
    friend = friends[i]
    print(friend)
print('Done!')
Joseph
Glenn
Sally
Done!

list: adding and modifying

  • Lists are “mutable”
    • we can change an element of a list using the index operator
print(friends)
['Joseph', 'Glenn', 'Sally']
friends[0]='new friend'
print(friends)
['new friend', 'Glenn', 'Sally']

list: adding and modifying with +

We can create a new list by joining two existing lists with +

friend_cs = ['Joseph', 'Glenn', 'Sally']
friend_md = ['Alex', 'John', 'Ray']
print(friend_cs+friend_md)
['Joseph', 'Glenn', 'Sally', 'Alex', 'John', 'Ray']

list: adding and modifying with append

We can create an empty list and then add elements using the append method

print(friend_cs)
['Joseph', 'Glenn', 'Sally']
friend_cs.append('Megan')
print(friend_cs)
['Joseph', 'Glenn', 'Sally', 'Megan']
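
The slide appends to an existing list; the sketch below illustrates the "start from an empty list" case mentioned above. The name squares and the loop are made up for this illustration.

```python
squares = []                 # start with an empty list
for n in range(1, 6):
    squares.append(n * n)    # add one element at the end each time
print(squares)               # [1, 4, 9, 16, 25]
```

Unlike +, which builds a new list, append changes the existing list in place.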

list computations

list1 = [3.5, 4.6, 5.7]
print(list1)
[3.5, 4.6, 5.7]

sum(list_name): sum of the elements

sum(list1)
13.8

max(list_name): largest element

list1 = [3.5, 4.6, 5.7]
max(list1)
5.7

See the Python documentation for more functions

Tuple

  • Like a list
  • surrounded by parentheses ( )
  • elements separated by commas
  • Tuples are “immutable”, which makes them more efficient
    • No .append(), .sort(), etc.
tuple1 = (3.5, 4.6, 5.7, "111", "222")
print(tuple1)
(3.5, 4.6, 5.7, '111', '222')
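
To make "immutable" concrete, here is a minimal sketch using the same tuple1. Reading by index works as with a list, while assigning to an index raises a TypeError.

```python
tuple1 = (3.5, 4.6, 5.7, "111", "222")

# Reading by index works exactly as with a list
first = tuple1[0]

# Writing is not allowed: tuples are immutable
try:
    tuple1[0] = 100
    changed = True
except TypeError as err:
    changed = False
    print('Cannot modify a tuple:', err)

print(first, changed)
```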

Tuple assignment

We can also put a tuple on the left-hand side of an assignment statement

(x, y) = (4, 'fred')
print(y)
fred
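
Two common uses of tuple assignment, sketched here for illustration: swapping two variables, and unpacking pairs inside a for loop. The pairs list is invented for this example.

```python
# Parentheses are optional on both sides
x, y = 4, 'fred'

# Swap two variables without a temporary variable
x, y = y, x
print(x, y)   # fred 4

# Unpack (key, value) tuples while looping
pairs = [('id', 1), ('name', 'Ryan')]
for key, value in pairs:
    print(key, value)
```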

Hands-on

  • Create a sequence a containing the numbers 1 to 10
  • Create a sequence b containing all the even numbers from 1 to 20
  • Get the 4th value of sequence a

Mapping

Dictionaries (dict)

  • Dictionaries are Python’s most powerful data collection
  • Declared with curly braces { }
  • Contents are key : value pairs, separated by commas
  • Keys cannot repeat and are usually strings
dist1 = {"id":1, "name":"Ryan","age":20, "School":"NYCU"}
print(dist1)
{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}

mapping: accessing values with [ ]

  • Mappings include: dict
  • Use square brackets with a key to get a value: variable_name[key]
print(dist1)
{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}
print(dist1['age'])
20

mapping: accessing values with .keys()

[key1, key2, …]

print(dist1)
{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}
print(dist1.keys())
dict_keys(['id', 'name', 'age', 'School'])

mapping: accessing values with .values()

[value1, value2, …]

print(dist1)
{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}
print(dist1.values())
dict_values([1, 'Ryan', 20, 'NYCU'])

mapping: accessing values with .items()

[(key1, value1), (key2, value2),…]

print(dist1)
{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}
print(dist1.items())
dict_items([('id', 1), ('name', 'Ryan'), ('age', 20), ('School', 'NYCU')])

mapping: adding and modifying with .update()

  • Mappings include: dict
  • mapping_object.update() adds new key:value pairs to the mapping or modifies existing ones
print(dist1)
{'id': 1, 'name': 'Ryan', 'age': 20, 'School': 'NYCU'}
dist1.update({"age":25, "dept":"CS"})
print(dist1)
{'id': 1, 'name': 'Ryan', 'age': 25, 'School': 'NYCU', 'dept': 'CS'}

mapping: adding and modifying with [ ]

Read the value with [ ], then write the result back

print(dist1)
{'id': 1, 'name': 'Ryan', 'age': 25, 'School': 'NYCU', 'dept': 'CS'}
dist1['age']=dist1['age']+10
print(dist1)
{'id': 1, 'name': 'Ryan', 'age': 35, 'School': 'NYCU', 'dept': 'CS'}

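mapping: adding and modifying with .get()

  • .get(key, 0) returns the value stored under key
    • key: the key to look up
    • 0: the default returned when the key is missing (can be any value)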
print(dist1)
{'id': 1, 'name': 'Ryan', 'age': 35, 'School': 'NYCU', 'dept': 'CS'}
dist1['age']=dist1.get('age', 0)+10
print(dist1)
{'id': 1, 'name': 'Ryan', 'age': 45, 'School': 'NYCU', 'dept': 'CS'}
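
The example above only uses .get() on a key that exists. As a small sketch, the difference from [ ] shows up when the key is missing; the key 'visits' is made up for this illustration.

```python
dist1 = {"id": 1, "name": "Ryan", "age": 20, "School": "NYCU"}

# Existing key: .get() returns the stored value
age = dist1.get('age', 0)            # 20

# Missing key: .get() returns the default instead of raising an error
missing = dist1.get('visits', 0)     # 0

# Missing key with [ ]: raises KeyError
try:
    dist1['visits']
    raised = False
except KeyError:
    raised = True

print(age, missing, raised)          # 20 0 True
```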

Hands-on

  • Create a mapping dist1 with the following contents:
dist1 = {"id":[1,2,3,4], 
        "name":["Ryan","Tom","Emma","Amy"], 
        "School":"NYCU"}
  • Get the 3rd value in id
  • Get the 1st character (letter) of the 2nd value in name

mapping + for

  • Looping over a mapping (dict) yields its keys by default
  • Use .values() to loop over the values instead
print(dist1)
{'id': [1, 2, 3, 4], 'name': ['Ryan', 'Tom', 'Emma', 'Amy'], 'School': 'NYCU'}

Key

for i in dist1:
  print(i)
id
name
School

Value

for i in dist1.values():
  print(i)
[1, 2, 3, 4]
['Ryan', 'Tom', 'Emma', 'Amy']
NYCU

mapping + for - key and value

print(dist1)
{'id': [1, 2, 3, 4], 'name': ['Ryan', 'Tom', 'Emma', 'Amy'], 'School': 'NYCU'}

Get the key first, then look up its value

for key in dist1:
  print(key, dist1[key])
id [1, 2, 3, 4]
name ['Ryan', 'Tom', 'Emma', 'Amy']
School NYCU

Or use .items() directly

for key,value in dist1.items():
  print(key, value)
id [1, 2, 3, 4]
name ['Ryan', 'Tom', 'Emma', 'Amy']
School NYCU

Hands-on

Most Common Name? Edit the ???? parts

names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']

Use dict and for to get the most common name and its count

names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']
counts = dict()
for name in ??? : # edit here
    if name not in counts: 
      counts[name] = ??? # edit here
    else :
      ??? # edit here
print(counts)
names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']
counts = dict()
for name in ??? : # edit here
    counts[name] = ??? + 1 # edit here
print(counts)

Matrix / array

Plain matrix (nested lists)

  • Can be thought of as a stack of sequences (1-D arrays)
  • Each sequence forms one row
  • Nested square brackets separate the rows and columns of the matrix
Col1 Col2 Col3
1 2 3
4 5 6
matrix1 = [[1, 2, 3], 
          [4, 5, 6]]
print(matrix1)
[[1, 2, 3], [4, 5, 6]]
row1 = [1, 2, 3]
row2 = [4, 5, 6]
matrix2 = [row1, row2]
print(matrix2)
[[1, 2, 3], [4, 5, 6]]
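
A minimal sketch of reading values back out of a nested-list matrix: the first index selects the row, the second selects the column. The variable names other than matrix1 are invented for this example.

```python
matrix1 = [[1, 2, 3],
           [4, 5, 6]]

second_row = matrix1[1]        # [4, 5, 6]
value = matrix1[1][2]          # row index 1, column index 2 -> 6

# A column needs a loop (or comprehension) over the rows
first_col = [row[0] for row in matrix1]   # [1, 4]
print(second_row, value, first_col)
```

Because a nested list has no built-in notion of columns, column-wise work is where numpy arrays become more convenient.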

numpy array - 1-D

  • from the numpy library (Numerical Python)
  • like the list type, but arrays provide much more efficient storage and data operations (fixed-type)
  • Use np.array(list) to create a numpy array from a list
import numpy as np
a = np.array([0, 0.5, 1.0, 1.5, 2.0]) 
list_string = ['a', 'b', 'c']
b = np.array(list_string) 
print([a,b])
[array([0. , 0.5, 1. , 1.5, 2. ]), array(['a', 'b', 'c'], dtype='<U1')]

numpy array - 1-D

Or use np.arange(start, stop [not included], step) to generate a numpy array, in the same way as range()

c = np.arange(0, 10, 2) 
print(c)
[0 2 4 6 8]
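
One difference from range() worth a quick sketch: np.arange() also accepts a non-integer step, which range() rejects with a TypeError. The array d is made up for this illustration.

```python
import numpy as np

c = np.arange(0, 10, 2)        # integers, same arguments as range()
d = np.arange(0, 1, 0.25)      # float step, not possible with range()
print(c)   # [0 2 4 6 8]
print(d)   # [0.   0.25 0.5  0.75]
```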

numpy array - 2-D

  • A numpy array can be created from a combination of two sequences (a plain 2-D matrix)
  • By default the sequences are stacked as rows
two_d = np.array([row1,row2]) 
print(two_d)
[[1 2 3]
 [4 5 6]]

numpy array - Creating Arrays from Scratch

  • np.full(dim, value): repeated values
  • 2d dimension: (row, column)
np.full((2, 4), 3.14)
array([[3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14]])
  • np.random.random(dim): uniform random values in [0, 1)
  • np.random.normal(mean, sd, dim): normally distributed random values
np.random.random((2,3))
array([[0.24073359, 0.27632724, 0.82718012],
       [0.92877835, 0.90098544, 0.45848069]])
np.random.normal(0,1,(2,3))
array([[ 0.67648559, -0.28466734, -0.04841429],
       [ 1.30753465, -0.5097181 ,  0.81616902]])
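
The random values above change on every run. If reproducible numbers are needed, one option is a seeded generator; the sketch below uses np.random.default_rng, which is not part of the slides, with an arbitrary seed of 42.

```python
import numpy as np

rng_a = np.random.default_rng(42)   # 42 is an arbitrary seed
rng_b = np.random.default_rng(42)

u_a = rng_a.random((2, 3))          # uniform values in [0, 1)
u_b = rng_b.random((2, 3))
print(np.array_equal(u_a, u_b))     # True: same seed, same numbers

n = rng_a.normal(0, 1, (2, 3))      # mean 0, sd 1
print(n.shape)                      # (2, 3)
```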

numpy array - accessing values

Will be covered in Ch 5

numpy array - adding and modifying

Will be covered in Ch 5

numpy array - computations

sum(), mean(), std(), cumsum(), max(), min()

print(a)
[0.  0.5 1.  1.5 2. ]
print(a.sum())
5.0
print(a.mean())
1.0
print(a.std())
0.7071067811865476
print(a.cumsum())
[0.  0.5 1.5 3.  5. ]
print(two_d)
[[1 2 3]
 [4 5 6]]
print(two_d.sum()) 
21
print(two_d.mean()) 
3.5
print(two_d.std()) 
1.707825127659933
print(two_d.cumsum()) 
[ 1  3  6 10 15 21]

numpy array - computations

sum(), mean(), std(), cumsum(), max(), min()

axis=0: by column (one result per column); axis=1: by row (one result per row)

print(two_d)
[[1 2 3]
 [4 5 6]]
print(two_d.sum(axis=0))
[5 7 9]
print(two_d.sum(axis=1))
[ 6 15]
print(two_d.mean(axis=0))
[2.5 3.5 4.5]
print(two_d.mean(axis=1))
[2. 5.]
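
The axis argument works the same way for the other methods. A short sketch on the same two_d array, using max() and cumsum():

```python
import numpy as np

two_d = np.array([[1, 2, 3],
                  [4, 5, 6]])       # shape (2, 3): 2 rows, 3 columns

col_max = two_d.max(axis=0)   # collapse the rows: one result per column
row_max = two_d.max(axis=1)   # collapse the columns: one result per row
print(col_max)   # [4 5 6]
print(row_max)   # [3 6]

row_cumsum = two_d.cumsum(axis=1)   # running total within each row
print(row_cumsum)
```

One way to remember it: the axis you pass is the one that disappears from the result's shape.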

Hands-on

  • The first student's score is wrong (hint: use the index); the correct score is 73. Please correct it
  • Please compute the average score and determine which students passed (scores above a certain threshold)
import numpy as np
scores = np.array([75, 82, 90, 65, 88, 55, 66, 77, 44, 100])

pandas

Series, DataFrame, and Index

pandas Series

  • The pandas library must be imported before use
  • one-dimensional array of indexed data
  • pd.Series(list)
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
print(data[1])
0.5

pandas Series vs. numpy 1d array

Unlike a numpy 1d array, a Series has an explicit index, which can be set with the index argument

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
print(data['b'])
0.5
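
A small sketch of what the explicit index gives you, using the same data Series: the labels and the values can be inspected separately, and slices can be written with labels.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

print(data.index)     # the labels
print(data.values)    # the underlying numpy array

# Label-based slices include the end label, unlike list slices
subset = data['a':'c']
print(subset)
```

Here subset contains the entries for 'a', 'b', and 'c'.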

pandas Series from dictionary

pd.Series(dictionary)

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
print(population)
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

pandas DataFrame

  • a two-dimensional array with both flexible row indices and flexible column names
  • a sequence of aligned Series objects
  • heterogeneous types and/or missing data
df1 = pd.DataFrame({"ID":[1, 2, 3, 4],
                    "Name":["Tom","Emma","Ryan","Amy"]})
df1
   ID  Name
0   1   Tom
1   2  Emma
2   3  Ryan
3   4   Amy

pandas DataFrame from Series

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
area_dict = {'California': 423967, 
              'Texas': 695662, 
              'New York': 141297,
              'Florida': 170312,
              'Illinois': 149995}
area = pd.Series(area_dict)
states = pd.DataFrame({'population': population,
                       'area': area})
states
            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995

pandas DataFrame from numpy array

pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])
        foo       bar
a  0.637051  0.920335
b  0.691742  0.210806
c  0.303617  0.441685

pandas DataFrame - accessing values

dataframe_name.head(n): get the first n rows

df1.head(2)
   ID  Name
0   1   Tom
1   2  Emma

dataframe_name.tail(n): get the last n rows

df1.tail(2)
   ID  Name
2   3  Ryan
3   4   Amy

pandas DataFrame - accessing values (column)

dataframe_name[column_name]: returns a sequence (a pandas Series)

df1["Name"]
0     Tom
1    Emma
2    Ryan
3     Amy
Name: Name, dtype: object

dataframe_name[[column_name]]: returns a data frame

df1[["Name"]]
   Name
0   Tom
1  Emma
2  Ryan
3   Amy
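
To see the difference between single and double brackets, here is a minimal sketch that checks the type and shape of each result. The names names_series and names_frame are invented for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "Name": ["Tom", "Emma", "Ryan", "Amy"]})

names_series = df1["Name"]      # single brackets: 1-D Series
names_frame = df1[["Name"]]     # double brackets: 2-D DataFrame

print(type(names_series).__name__, names_series.shape)  # Series (4,)
print(type(names_frame).__name__, names_frame.shape)    # DataFrame (4, 1)

# With the default index, a Series can be indexed like a sequence
print(names_series[0])          # Tom
```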

pandas DataFrame - accessing values (row)

dataframe_name[row slice]: returns the selected rows

df1[0:1]
   ID Name
0   1  Tom

Hands-on

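  • Create a pandas data frame that stores student IDs, names, and scores, with 5 rows of data
  • Try to get the score column
  • Try to get both the name and score columns
  • Try to get the score of the 3rd student (hint: sequence)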

pandas DataFrame - setting the index

The default index is the sequence 0 to n-1; it can be changed with the index argument

df_index = pd.DataFrame({"ID":[1, 2, 3, 4],
                          "Name":["Tom","Emma","Ryan","Amy"]},
                          index=["a","b","c","d"])
df_index
   ID  Name
a   1   Tom
b   2  Emma
c   3  Ryan
d   4   Amy

pandas DataFrame - setting the index

The index can also be set from an existing data frame with .set_index(column_name)

df_index2 = df1.set_index("ID")
df_index2
    Name
ID
1    Tom
2   Emma
3   Ryan
4    Amy

pandas DataFrame - adding and modifying

pd.concat([pd_object1, pd_object2]) combines the objects, row-wise by default

df1 = pd.DataFrame({"ID":[1, 2, 3, 4],
                    "Name":["Tom","Emma","Ryan","Amy"]})
df2 = pd.DataFrame({"ID":[5,6,7],
                    "Name":["A","B","C"]})                 
newdf = pd.concat([df1,df2])
print(newdf)
   ID  Name
0   1   Tom
1   2  Emma
2   3  Ryan
3   4   Amy
0   5     A
1   6     B
2   7     C
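
In the output above the index labels 0, 1, 2 appear twice, because each data frame keeps its own index. A short sketch of one way to get a fresh 0 to n-1 index, using the ignore_index argument of pd.concat; the name newdf2 is made up here.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "Name": ["Tom", "Emma", "Ryan", "Amy"]})
df2 = pd.DataFrame({"ID": [5, 6, 7],
                    "Name": ["A", "B", "C"]})

# ignore_index=True discards the old labels and renumbers the rows
newdf2 = pd.concat([df1, df2], ignore_index=True)
print(newdf2)
```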

Check Data Type

Use the type() function to check which data structure an object is

type(dist1)
dict
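
As a recap sketch, type() can be applied to one example of each structure covered in this chapter. The examples list is invented for this illustration.

```python
import numpy as np
import pandas as pd

examples = [
    [1, 2, 3],                        # list
    (1, 2, 3),                        # tuple
    range(3),                         # range
    {"id": 1},                        # dict
    np.array([1, 2, 3]),              # numpy array
    pd.Series([1, 2, 3]),             # pandas Series
    pd.DataFrame({"ID": [1, 2, 3]}),  # pandas DataFrame
]

type_names = [type(obj).__name__ for obj in examples]
for name in type_names:
    print(name)
```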

Common Data Structures - Recap

  • Sequence
    • list [,]
    • tuple (,)
    • range range(s,e,i)
  • Mapping
    • dictionary (dict) {key:value,key:value}
  • Matrix / array
    • plain matrix (nested lists)
    • numpy array
  • Data frame
    • pandas DataFrame

References

  • Python for Everybody
    • Some contents are from Python for Everybody, and these contents are Copyright 2010- Charles R. Severance (www.dr-chuck.com) of the University of Michigan School of Information and open.umich.edu and made available under a Creative Commons Attribution 4.0 License.
  • Python Data Science Handbook

Questions?