Ostatnia aktualizacja: 2022-05-16 09:13:00

Pandas

Pandas jest biblioteką Pythona służącą do analizy i manipulowania danymi

Import:

import pandas as pd

Podstawowe byty

Seria - Series


Ramka danych - DataFrame

import pandas as pd
import numpy as np

s = pd.Series([3, -5, 7, 4])
print(s)
print(s.values)
print(type(s.values))
t = np.sort(s.values)
print(t)
print(s.index)
print(type(s.index))
## 0    3
## 1   -5
## 2    7
## 3    4
## dtype: int64
## [ 3 -5  7  4]
## <class 'numpy.ndarray'>
## [-5  3  4  7]
## RangeIndex(start=0, stop=4, step=1)
## <class 'pandas.core.indexes.range.RangeIndex'>
import pandas as pd
import numpy as np

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
print(s)
print(s['b'])
s['b'] = 8
print(s)
print(s[s > 5])
print(s * 2)
print(np.sin(s))
## a    3
## b   -5
## c    7
## d    4
## dtype: int64
## -5
## a    3
## b    8
## c    7
## d    4
## dtype: int64
## b    8
## c    7
## dtype: int64
## a     6
## b    16
## c    14
## d     8
## dtype: int64
## a    0.141120
## b    0.989358
## c    0.656987
## d   -0.756802
## dtype: float64
import pandas as pd

d = {'key1': 350, 'key2': 700, 'key3': 70}
s = pd.Series(d)
print(s)
## key1    350
## key2    700
## key3     70
## dtype: int64
import pandas as pd

d = {'key1': 350, 'key2': 700, 'key3': 70}
k = ['key0', 'key2', 'key3', 'key1']
s = pd.Series(d, index=k)
print(s)
pd.isnull(s)
pd.notnull(s)
s.isnull()
s.notnull()
s.name = "Wartosc"
s.index.name = "Klucz"
print(s)
## key0      NaN
## key2    700.0
## key3     70.0
## key1    350.0
## dtype: float64
## key0     True
## key2    False
## key3    False
## key1    False
## dtype: bool
## key0    False
## key2     True
## key3     True
## key1     True
## dtype: bool
## key0     True
## key2    False
## key3    False
## key1    False
## dtype: bool
## key0    False
## key2     True
## key3     True
## key1     True
## dtype: bool
## Klucz
## key0      NaN
## key2    700.0
## key3     70.0
## key1    350.0
## Name: Wartosc, dtype: float64
import pandas as pd

data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
frame = pd.DataFrame(data)
print(frame)
df = pd.DataFrame(data, columns=['Country', 'Capital',
                                 'Population'])
print(df)
print(df.iloc[[0], [0]])
print(df.loc[[0], ['Country']])
print(df.loc[2])
print(df.loc[:, 'Capital'])
print(df.loc[1, 'Capital'])
print(df[df['Population'] > 1200000000])
print(df.drop('Country', axis=1))
print(df.shape)
print(df.index)
print(df.columns)
print(df.info())
print(df.count())
##    Country    Capital  Population
## 0  Belgium   Brussels    11190846
## 1    India  New Delhi  1303171035
## 2   Brazil   Brasília   207847528
##    Country    Capital  Population
## 0  Belgium   Brussels    11190846
## 1    India  New Delhi  1303171035
## 2   Brazil   Brasília   207847528
##    Country
## 0  Belgium
##    Country
## 0  Belgium
## Country          Brazil
## Capital        Brasília
## Population    207847528
## Name: 2, dtype: object
## 0     Brussels
## 1    New Delhi
## 2     Brasília
## Name: Capital, dtype: object
## New Delhi
##   Country    Capital  Population
## 1   India  New Delhi  1303171035
##      Capital  Population
## 0   Brussels    11190846
## 1  New Delhi  1303171035
## 2   Brasília   207847528
## (3, 3)
## RangeIndex(start=0, stop=3, step=1)
## Index(['Country', 'Capital', 'Population'], dtype='object')
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 3 entries, 0 to 2
## Data columns (total 3 columns):
##  #   Column      Non-Null Count  Dtype 
## ---  ------      --------------  ----- 
##  0   Country     3 non-null      object
##  1   Capital     3 non-null      object
##  2   Population  3 non-null      int64 
## dtypes: int64(1), object(2)
## memory usage: 200.0+ bytes
## None
## Country       3
## Capital       3
## Population    3
## dtype: int64

Uzupełnianie braków

import pandas as pd

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
print(s + s2)
print(s.add(s2, fill_value=0))
print(s.mul(s2, fill_value=2))
## a    10.0
## b     NaN
## c     5.0
## d     7.0
## dtype: float64
## a    10.0
## b    -5.0
## c     5.0
## d     7.0
## dtype: float64
## a    21.0
## b   -10.0
## c   -14.0
## d    12.0
## dtype: float64

Obsługa plików csv

Funkcja pandas.read_csv

Dokumentacja: link

Zapis pandas.DataFrame.to_csv

Dokumentacja: link

Obsługa plików z Excela

Funkcja pandas.read_excel

https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

** Ważne: trzeba zainstalować bibliotekę openpyxl do importu .xlsx oraz xlrd do importu .xls (nie trzeba ich importować w kodzie jawnie w większości wypadków)

Operacje manipulacyjne

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html?highlight=join#pandas.DataFrame.join

https://pandas.pydata.org/docs/reference/api/pandas.concat.html?highlight=concat#pandas.concat

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html?highlight=pivot#pandas.DataFrame.pivot

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.melt.html?highlight=melt#pandas.DataFrame.melt

“Tidy data”

Imię Wiek Wzrost Kolor oczu
Adam 26 167 Brązowe
Sylwia 34 164 Piwne
Tomasz 42 183 Niebieskie

Obsługa brakujących danych

import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
print(string_data)
print(string_data.isnull())
print(string_data.dropna())
## 0     aardvark
## 1    artichoke
## 2          NaN
## 3      avocado
## dtype: object
## 0    False
## 1    False
## 2     True
## 3    False
## dtype: bool
## 0     aardvark
## 1    artichoke
## 3      avocado
## dtype: object

from numpy import nan as NA
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], 
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
print(cleaned)
print(data.dropna(how='all'))
data[4] = NA
print(data.dropna(how='all', axis=1))
print(data)
print(data.fillna(0))
print(data.fillna({1: 0.5, 2: 0}))
##      0    1    2
## 0  1.0  6.5  3.0
##      0    1    2
## 0  1.0  6.5  3.0
## 1  1.0  NaN  NaN
## 3  NaN  6.5  3.0
##      0    1    2
## 0  1.0  6.5  3.0
## 1  1.0  NaN  NaN
## 2  NaN  NaN  NaN
## 3  NaN  6.5  3.0
##      0    1    2   4
## 0  1.0  6.5  3.0 NaN
## 1  1.0  NaN  NaN NaN
## 2  NaN  NaN  NaN NaN
## 3  NaN  6.5  3.0 NaN
##      0    1    2    4
## 0  1.0  6.5  3.0  0.0
## 1  1.0  0.0  0.0  0.0
## 2  0.0  0.0  0.0  0.0
## 3  0.0  6.5  3.0  0.0
##      0    1    2   4
## 0  1.0  6.5  3.0 NaN
## 1  1.0  0.5  0.0 NaN
## 2  NaN  0.5  0.0 NaN
## 3  NaN  6.5  3.0 NaN

Usuwanie duplikatów

import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
print(data)
print(data.duplicated())
print(data.drop_duplicates())
##     k1  k2
## 0  one   1
## 1  two   1
## 2  one   2
## 3  two   3
## 4  one   3
## 5  two   4
## 6  two   4
## 0    False
## 1    False
## 2    False
## 3    False
## 4    False
## 5    False
## 6     True
## dtype: bool
##     k1  k2
## 0  one   1
## 1  two   1
## 2  one   2
## 3  two   3
## 4  one   3
## 5  two   4

Zastępowanie wartościami

import pandas as pd
import numpy as np

data = pd.Series([1., -999., 2., -999., -1000., 3.])
print(data)
print(data.replace(-999, np.nan))
print(data.replace([-999, -1000], np.nan))
print(data.replace([-999, -1000], [np.nan, 0]))
print(data.replace({-999: np.nan, -1000: 0}))
## 0       1.0
## 1    -999.0
## 2       2.0
## 3    -999.0
## 4   -1000.0
## 5       3.0
## dtype: float64
## 0       1.0
## 1       NaN
## 2       2.0
## 3       NaN
## 4   -1000.0
## 5       3.0
## dtype: float64
## 0    1.0
## 1    NaN
## 2    2.0
## 3    NaN
## 4    NaN
## 5    3.0
## dtype: float64
## 0    1.0
## 1    NaN
## 2    2.0
## 3    NaN
## 4    0.0
## 5    3.0
## dtype: float64
## 0    1.0
## 1    NaN
## 2    2.0
## 3    NaN
## 4    0.0
## 5    3.0
## dtype: float64

Dyskretyzacja i podział na koszyki

import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
print(cats)
print(cats.codes)
print(cats.categories)
print(pd.value_counts(cats))
## [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
## Length: 12
## Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
## [0 0 0 1 0 0 2 1 3 2 2 1]
## IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')
## (18, 25]     5
## (25, 35]     3
## (35, 60]     3
## (60, 100]    1
## dtype: int64

import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats2 = pd.cut(ages, [18, 26, 36, 61, 100], right=False)
print(cats2)
group_names = ['Youth', 'YoungAdult',
               'MiddleAged', 'Senior']
print(pd.cut(ages, bins, labels=group_names))
## [[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
## Length: 12
## Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
## ['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
## Length: 12
## Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

import pandas as pd
import numpy as np

data = np.random.rand(20)
print(pd.cut(data, 4, precision=2))
## [(0.51, 0.74], (0.51, 0.74], (0.038, 0.27], (0.51, 0.74], (0.038, 0.27], ..., (0.51, 0.74], (0.27, 0.51], (0.51, 0.74], (0.74, 0.98], (0.038, 0.27]]
## Length: 20
## Categories (4, interval[float64, right]): [(0.038, 0.27] < (0.27, 0.51] < (0.51, 0.74] <
##                                            (0.74, 0.98]]

import pandas as pd
import numpy as np

data = np.random.randn(1000)
cats = pd.qcut(data, 4)
print(cats)
print(pd.value_counts(cats))
## [(0.684, 2.847], (0.684, 2.847], (0.684, 2.847], (-0.72, -0.0233], (0.684, 2.847], ..., (0.684, 2.847], (-3.5509999999999997, -0.72], (-3.5509999999999997, -0.72], (-0.0233, 0.684], (-3.5509999999999997, -0.72]]
## Length: 1000
## Categories (4, interval[float64, right]): [(-3.5509999999999997, -0.72] < (-0.72, -0.0233] < (-0.0233, 0.684] <
##                                            (0.684, 2.847]]
## (-3.5509999999999997, -0.72]    250
## (-0.72, -0.0233]                250
## (-0.0233, 0.684]                250
## (0.684, 2.847]                  250
## dtype: int64

Wykrywanie i filtrowanie elementów odstających

import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(1000, 4))
print(data.describe())
col = data[2]
print(col[np.abs(col) > 3])
print(data[(np.abs(data) > 3).any(1)])
##                  0            1            2            3
## count  1000.000000  1000.000000  1000.000000  1000.000000
## mean     -0.015821    -0.004140     0.004329    -0.037815
## std       0.997786     0.998337     1.000430     1.034338
## min      -3.070206    -3.860502    -3.024088    -4.350507
## 25%      -0.681008    -0.672984    -0.650718    -0.727218
## 50%      -0.013860    -0.014777     0.001613    -0.090619
## 75%       0.662763     0.689373     0.634083     0.720808
## max       2.877078     3.121150     3.444146     2.906202
## 112   -3.024088
## 569    3.444146
## 620    3.188666
## 808    3.121459
## 913    3.098508
## Name: 2, dtype: float64
##             0         1         2         3
## 89  -0.936605 -3.390212  0.396429 -1.133436
## 101 -0.552159 -0.404804  1.429967 -4.350507
## 112  0.223958  0.707826 -3.024088  0.071877
## 114 -0.315249  0.620354  1.021745 -3.147838
## 172 -0.641095 -3.860502  2.465383 -0.237463
## 303 -3.070206 -1.947605  1.707185  0.913915
## 481  0.962003  3.121150  1.312531  0.249062
## 507 -0.922713  0.811356  0.501193 -3.539931
## 569  0.679868  0.243619  3.444146  2.063614
## 620 -0.170758 -0.865947  3.188666 -0.833480
## 791  1.310164  3.043830  1.394615 -1.177669
## 808 -0.118275  0.763414  3.121459 -2.251621
## 811  0.344929  0.422927  0.187496 -3.174217
## 913  1.313085 -1.270709  3.098508 -0.868183
## 966  0.400301 -3.323199  0.702132 -0.433960

Bibliografia: