Ostatnia aktualizacja: 2022-05-16 09:13:00
Pandas
Pandas jest biblioteką Pythona służącą do analizy i manipulowania danymi
Import:
import pandas as pd
Podstawowe byty
Seria - Series
Ramka danych - DataFrame
import pandas as pd
import numpy as np
= pd.Series([3, -5, 7, 4])
s print(s)
print(s.values)
print(type(s.values))
= np.sort(s.values)
t print(t)
print(s.index)
print(type(s.index))
## 0 3
## 1 -5
## 2 7
## 3 4
## dtype: int64
## [ 3 -5 7 4]
## <class 'numpy.ndarray'>
## [-5 3 4 7]
## RangeIndex(start=0, stop=4, step=1)
## <class 'pandas.core.indexes.range.RangeIndex'>
import pandas as pd
import numpy as np
= pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s print(s)
print(s['b'])
'b'] = 8
s[print(s)
print(s[s > 5])
print(s * 2)
print(np.sin(s))
## a 3
## b -5
## c 7
## d 4
## dtype: int64
## -5
## a 3
## b 8
## c 7
## d 4
## dtype: int64
## b 8
## c 7
## dtype: int64
## a 6
## b 16
## c 14
## d 8
## dtype: int64
## a 0.141120
## b 0.989358
## c 0.656987
## d -0.756802
## dtype: float64
import pandas as pd
= {'key1': 350, 'key2': 700, 'key3': 70}
d = pd.Series(d)
s print(s)
## key1 350
## key2 700
## key3 70
## dtype: int64
import pandas as pd
= {'key1': 350, 'key2': 700, 'key3': 70}
d = ['key0', 'key2', 'key3', 'key1']
k = pd.Series(d, index=k)
s print(s)
pd.isnull(s)
pd.notnull(s)
s.isnull()
s.notnull()= "Wartosc"
s.name = "Klucz"
s.index.name print(s)
## key0 NaN
## key2 700.0
## key3 70.0
## key1 350.0
## dtype: float64
## key0 True
## key2 False
## key3 False
## key1 False
## dtype: bool
## key0 False
## key2 True
## key3 True
## key1 True
## dtype: bool
## key0 True
## key2 False
## key3 False
## key1 False
## dtype: bool
## key0 False
## key2 True
## key3 True
## key1 True
## dtype: bool
## Klucz
## key0 NaN
## key2 700.0
## key3 70.0
## key1 350.0
## Name: Wartosc, dtype: float64
import pandas as pd
= {'Country': ['Belgium', 'India', 'Brazil'],
data 'Capital': ['Brussels', 'New Delhi', 'Brasília'],
'Population': [11190846, 1303171035, 207847528]}
= pd.DataFrame(data)
frame print(frame)
= pd.DataFrame(data, columns=['Country', 'Capital',
df 'Population'])
print(df)
print(df.iloc[[0], [0]])
print(df.loc[[0], ['Country']])
print(df.loc[2])
print(df.loc[:, 'Capital'])
print(df.loc[1, 'Capital'])
print(df[df['Population'] > 1200000000])
print(df.drop('Country', axis=1))
print(df.shape)
print(df.index)
print(df.columns)
print(df.info())
print(df.count())
## Country Capital Population
## 0 Belgium Brussels 11190846
## 1 India New Delhi 1303171035
## 2 Brazil Brasília 207847528
## Country Capital Population
## 0 Belgium Brussels 11190846
## 1 India New Delhi 1303171035
## 2 Brazil Brasília 207847528
## Country
## 0 Belgium
## Country
## 0 Belgium
## Country Brazil
## Capital Brasília
## Population 207847528
## Name: 2, dtype: object
## 0 Brussels
## 1 New Delhi
## 2 Brasília
## Name: Capital, dtype: object
## New Delhi
## Country Capital Population
## 1 India New Delhi 1303171035
## Capital Population
## 0 Brussels 11190846
## 1 New Delhi 1303171035
## 2 Brasília 207847528
## (3, 3)
## RangeIndex(start=0, stop=3, step=1)
## Index(['Country', 'Capital', 'Population'], dtype='object')
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 3 entries, 0 to 2
## Data columns (total 3 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Country 3 non-null object
## 1 Capital 3 non-null object
## 2 Population 3 non-null int64
## dtypes: int64(1), object(2)
## memory usage: 200.0+ bytes
## None
## Country 3
## Capital 3
## Population 3
## dtype: int64
Uzupełnianie braków
import pandas as pd
= pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
s2 print(s + s2)
print(s.add(s2, fill_value=0))
print(s.mul(s2, fill_value=2))
## a 10.0
## b NaN
## c 5.0
## d 7.0
## dtype: float64
## a 10.0
## b -5.0
## c 5.0
## d 7.0
## dtype: float64
## a 21.0
## b -10.0
## c -14.0
## d 12.0
## dtype: float64
Obsługa plików csv
Funkcja pandas.read_csv
Dokumentacja: link
Zapis pandas.DataFrame.to_csv
Dokumentacja: link
Obsługa plików z Excela
Funkcja pandas.read_excel
https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
** Ważne: trzeba zainstalować bibliotekę openpyxl
do
importu .xlsx oraz xlrd
do importu .xls (nie trzeba ich
importować w kodzie jawnie w większości wypadków)
Operacje manipulacyjne
merge
join
concat
https://pandas.pydata.org/docs/reference/api/pandas.concat.html?highlight=concat#pandas.concat
pivot
melt
“Tidy data”
Imię | Wiek | Wzrost | Kolor oczu |
---|---|---|---|
Adam | 26 | 167 | Brązowe |
Sylwia | 34 | 164 | Piwne |
Tomasz | 42 | 183 | Niebieskie |
- jedna obserwacja (jednostka statystyczna) = jeden wiersz w tabeli/macierzy/ramce danych
- wartosci danej cechy znajduja sie w kolumnach
- jeden typ/rodzaj obserwacji w jednej tabeli/macierzy/ramce danych
Obsługa brakujących danych
import numpy as np
import pandas as pd
= pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data print(string_data)
print(string_data.isnull())
print(string_data.dropna())
## 0 aardvark
## 1 artichoke
## 2 NaN
## 3 avocado
## dtype: object
## 0 False
## 1 False
## 2 True
## 3 False
## dtype: bool
## 0 aardvark
## 1 artichoke
## 3 avocado
## dtype: object
from numpy import nan as NA
import pandas as pd
= pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
data 6.5, 3.]])
[NA, NA, NA], [NA, = data.dropna()
cleaned print(cleaned)
print(data.dropna(how='all'))
4] = NA
data[print(data.dropna(how='all', axis=1))
print(data)
print(data.fillna(0))
print(data.fillna({1: 0.5, 2: 0}))
## 0 1 2
## 0 1.0 6.5 3.0
## 0 1 2
## 0 1.0 6.5 3.0
## 1 1.0 NaN NaN
## 3 NaN 6.5 3.0
## 0 1 2
## 0 1.0 6.5 3.0
## 1 1.0 NaN NaN
## 2 NaN NaN NaN
## 3 NaN 6.5 3.0
## 0 1 2 4
## 0 1.0 6.5 3.0 NaN
## 1 1.0 NaN NaN NaN
## 2 NaN NaN NaN NaN
## 3 NaN 6.5 3.0 NaN
## 0 1 2 4
## 0 1.0 6.5 3.0 0.0
## 1 1.0 0.0 0.0 0.0
## 2 0.0 0.0 0.0 0.0
## 3 0.0 6.5 3.0 0.0
## 0 1 2 4
## 0 1.0 6.5 3.0 NaN
## 1 1.0 0.5 0.0 NaN
## 2 NaN 0.5 0.0 NaN
## 3 NaN 6.5 3.0 NaN
Usuwanie duplikatów
import pandas as pd
= pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
data 'k2': [1, 1, 2, 3, 3, 4, 4]})
print(data)
print(data.duplicated())
print(data.drop_duplicates())
## k1 k2
## 0 one 1
## 1 two 1
## 2 one 2
## 3 two 3
## 4 one 3
## 5 two 4
## 6 two 4
## 0 False
## 1 False
## 2 False
## 3 False
## 4 False
## 5 False
## 6 True
## dtype: bool
## k1 k2
## 0 one 1
## 1 two 1
## 2 one 2
## 3 two 3
## 4 one 3
## 5 two 4
Zastępowanie wartościami
import pandas as pd
import numpy as np
= pd.Series([1., -999., 2., -999., -1000., 3.])
data print(data)
print(data.replace(-999, np.nan))
print(data.replace([-999, -1000], np.nan))
print(data.replace([-999, -1000], [np.nan, 0]))
print(data.replace({-999: np.nan, -1000: 0}))
## 0 1.0
## 1 -999.0
## 2 2.0
## 3 -999.0
## 4 -1000.0
## 5 3.0
## dtype: float64
## 0 1.0
## 1 NaN
## 2 2.0
## 3 NaN
## 4 -1000.0
## 5 3.0
## dtype: float64
## 0 1.0
## 1 NaN
## 2 2.0
## 3 NaN
## 4 NaN
## 5 3.0
## dtype: float64
## 0 1.0
## 1 NaN
## 2 2.0
## 3 NaN
## 4 0.0
## 5 3.0
## dtype: float64
## 0 1.0
## 1 NaN
## 2 2.0
## 3 NaN
## 4 0.0
## 5 3.0
## dtype: float64
Dyskretyzacja i podział na koszyki
import pandas as pd
= [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
ages = [18, 25, 35, 60, 100]
bins = pd.cut(ages, bins)
cats print(cats)
print(cats.codes)
print(cats.categories)
print(pd.value_counts(cats))
## [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
## Length: 12
## Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
## [0 0 0 1 0 0 2 1 3 2 2 1]
## IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')
## (18, 25] 5
## (25, 35] 3
## (35, 60] 3
## (60, 100] 1
## dtype: int64
import pandas as pd
= [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
ages = [18, 25, 35, 60, 100]
bins = pd.cut(ages, [18, 26, 36, 61, 100], right=False)
cats2 print(cats2)
= ['Youth', 'YoungAdult',
group_names 'MiddleAged', 'Senior']
print(pd.cut(ages, bins, labels=group_names))
## [[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
## Length: 12
## Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
## ['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
## Length: 12
## Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']
import pandas as pd
import numpy as np
= np.random.rand(20)
data print(pd.cut(data, 4, precision=2))
## [(0.51, 0.74], (0.51, 0.74], (0.038, 0.27], (0.51, 0.74], (0.038, 0.27], ..., (0.51, 0.74], (0.27, 0.51], (0.51, 0.74], (0.74, 0.98], (0.038, 0.27]]
## Length: 20
## Categories (4, interval[float64, right]): [(0.038, 0.27] < (0.27, 0.51] < (0.51, 0.74] <
## (0.74, 0.98]]
import pandas as pd
import numpy as np
= np.random.randn(1000)
data = pd.qcut(data, 4)
cats print(cats)
print(pd.value_counts(cats))
## [(0.684, 2.847], (0.684, 2.847], (0.684, 2.847], (-0.72, -0.0233], (0.684, 2.847], ..., (0.684, 2.847], (-3.5509999999999997, -0.72], (-3.5509999999999997, -0.72], (-0.0233, 0.684], (-3.5509999999999997, -0.72]]
## Length: 1000
## Categories (4, interval[float64, right]): [(-3.5509999999999997, -0.72] < (-0.72, -0.0233] < (-0.0233, 0.684] <
## (0.684, 2.847]]
## (-3.5509999999999997, -0.72] 250
## (-0.72, -0.0233] 250
## (-0.0233, 0.684] 250
## (0.684, 2.847] 250
## dtype: int64
Wykrywanie i filtrowanie elementów odstających
import pandas as pd
import numpy as np
= pd.DataFrame(np.random.randn(1000, 4))
data print(data.describe())
= data[2]
col print(col[np.abs(col) > 3])
print(data[(np.abs(data) > 3).any(1)])
## 0 1 2 3
## count 1000.000000 1000.000000 1000.000000 1000.000000
## mean -0.015821 -0.004140 0.004329 -0.037815
## std 0.997786 0.998337 1.000430 1.034338
## min -3.070206 -3.860502 -3.024088 -4.350507
## 25% -0.681008 -0.672984 -0.650718 -0.727218
## 50% -0.013860 -0.014777 0.001613 -0.090619
## 75% 0.662763 0.689373 0.634083 0.720808
## max 2.877078 3.121150 3.444146 2.906202
## 112 -3.024088
## 569 3.444146
## 620 3.188666
## 808 3.121459
## 913 3.098508
## Name: 2, dtype: float64
## 0 1 2 3
## 89 -0.936605 -3.390212 0.396429 -1.133436
## 101 -0.552159 -0.404804 1.429967 -4.350507
## 112 0.223958 0.707826 -3.024088 0.071877
## 114 -0.315249 0.620354 1.021745 -3.147838
## 172 -0.641095 -3.860502 2.465383 -0.237463
## 303 -3.070206 -1.947605 1.707185 0.913915
## 481 0.962003 3.121150 1.312531 0.249062
## 507 -0.922713 0.811356 0.501193 -3.539931
## 569 0.679868 0.243619 3.444146 2.063614
## 620 -0.170758 -0.865947 3.188666 -0.833480
## 791 1.310164 3.043830 1.394615 -1.177669
## 808 -0.118275 0.763414 3.121459 -2.251621
## 811 0.344929 0.422927 0.187496 -3.174217
## 913 1.313085 -1.270709 3.098508 -0.868183
## 966 0.400301 -3.323199 0.702132 -0.433960
Bibliografia:
- Dokumentacja biblioteki, https://pandas.pydata.org/, dostęp online 5.03.2021.
- Hannah Stepanek, Thinking in Pandas, How to Use the Python Data Analysis Library the Right Way, Apress, 2020.