pandas Tutorial Series (3): Grouping in pandas

2020-12-27 酷扯兒

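This article is reposted from the WeChat official account 五角錢的程式設計師 (ID: xianglin965) with the account's permission. To repost it, please contact the original author.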

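Contents

Chapter 3: Grouping
I. The SAC process
    1. Meaning
    2. The apply step
II. The groupby function
    1. Basics of the grouping function
    2. Characteristics of groupby objects
III. Aggregation, filtering, and transformation
    1. Aggregation
    2. Filtering (Filtration)
    3. Transformation
IV. The apply function
    1. The flexibility of apply
    2. Computing several statistics at once with apply

Chapter 3: Grouping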

import numpy as np

import pandas as pd

df = pd.read_csv('data/table.csv',index_col='ID')

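I. The SAC process

1. Meaning

1. SAC refers to the split-apply-combine process used in grouping operations.

2. Split means dividing the data into groups according to some rule, apply means running a function on each group independently, and combine means assembling the per-group results into a single data structure.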

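2. The apply step

In practice, this step usually involves one of four kinds of problems:

1. Aggregation: computing a statistic per group (for example the mean, or the number of elements in each group)

2. Transformation: operating on every element within its group (for example standardizing the elements)

3. Filtration: selecting groups according to some rule (for example keeping the groups in which some indicator is below 50)

4. Combined problems: a mixture of the three kinds above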
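The three SAC steps can be sketched by hand and compared with the groupby one-liner. This is a minimal illustration on a made-up table: the column names follow the article's dataset, but the values are hypothetical and `data/table.csv` is not used.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the article.
df = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math':   [60.0, 80.0, 30.0, 50.0, 70.0],
})

# split: one sub-table per distinct School value
pieces = {key: sub for key, sub in df.groupby('School')}

# apply: run the same function on every piece independently
means = {key: sub['Math'].mean() for key, sub in pieces.items()}

# combine: assemble the per-group results into one Series
manual = pd.Series(means, name='Math')

# groupby followed by a method performs all three steps at once
auto = df.groupby('School')['Math'].mean()

print(manual)
print(auto)
```

Both Series hold the same per-school means; the manual version only makes the intermediate steps visible.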

II. The groupby function

1. Basics of the grouping function

(a) Grouping by a single column

grouped_single = df.groupby('School')

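Calling groupby produces a groupby object. The object computes nothing by itself; work happens only when one of its methods is called.

For example, to pull out a single group: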

grouped_single.get_group('S_2').head()

(b) Grouping by several columns

grouped_mul = df.groupby(['School','Class'])

grouped_mul.get_group(('S_1','C_3'))

(c) Group sizes and number of groups

grouped_single.size()

grouped_mul.size()

grouped_single.ngroups

grouped_mul.ngroups

(d) Iterating over groups

for name,group in grouped_single:
    print(name)
    display(group.head())

(e) The level parameter (for a MultiIndex) and the axis parameter

df.set_index(['Gender','School']).groupby(level=1,axis=0).get_group('S_1')

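2. Characteristics of groupby objects

(a) Listing all callable methods

The listing printed below shows that a groupby object exposes a large number of methods and attributes, which makes it very flexible: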

print([attr for attr in dir(grouped_single) if not attr.startswith('_')])

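(b) head and first on a grouped object

Calling head on a grouped object returns the first few rows of each group, not the first few rows of the whole dataset.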

grouped_single.head(3)

first returns one row per group, indexed by the group key, holding the first non-null value of each column within that group.

grouped_single.first()
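The difference between head and first is easier to see on a small table. The sketch below uses made-up values (only the column names follow the article) and deliberately puts a missing value in the first row of one group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a NaN in the first S_1 row.
df = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2'],
    'Math':   [np.nan, 80.0, 65.0, 30.0, 50.0],
    'Height': [170, 165, 180, 175, 160],
})
grouped = df.groupby('School')

# head(2): the first 2 rows of every group, with the original row index kept
top2 = grouped.head(2)

# first(): one row per group, indexed by School, skipping missing values per column
firsts = grouped.first()

print(top2)
print(firsts)
```

Note that the S_1 row returned by first combines Math from the second row with Height from the first row, because the leading NaN in Math is skipped.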

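(c) What you can group by

groupby is very flexible about its grouping key: any list-like with the same length as the DataFrame works, and grouping by a function is also supported.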

np.random.choice(['a','b','c'],df.shape[0])

df.groupby(np.random.choice(['a','b','c'],df.shape[0])).get_group('b').head()

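# Equivalent to adding np.random.choice(['a','b','c'],df.shape[0]) as a new column and grouping by it

Under the hood, when a function is used as the key, what gets passed to it is each index label. This property can be used for more complex operations: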

df[:5].groupby(lambda x:print(x)).head(0)

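Grouping by odd and even rows:

# Rows are counted from 1 here, so position 0 (the first row) falls in 'odd row'
df.groupby(lambda x:'odd row' if not df.index.get_loc(x)%2==1 else 'even row').groups

With a MultiIndex, the input to the lambda is a tuple. The code below checks, for boys and girls in each school, whether the average Math score is a pass (at least 60).

Note: this only demonstrates what groupby can do; you would not write it this way in practice.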

math_score = df.set_index(['Gender','School'])['Math'].sort_index()

grouped_score = df.set_index(['Gender','School']).sort_index().groupby(lambda x:(x,'mean passes' if math_score[x].mean()>=60 else 'mean fails'))

for name,_ in grouped_score:
    print(name)

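(d) The [] operation on groupby objects

You can use [] to select one or several columns of a groupby object. The mean-score comparison above can be written concisely as: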

df.groupby(['Gender','School'])['Math'].mean()>=60

Use a list to select several columns:

df.groupby(['Gender','School'])[['Math','Height']].mean()

(e) Grouping by a continuous variable

For example, use the cut function to group by Math score:

bins = [0,40,60,80,90,100]

cuts = pd.cut(df['Math'],bins=bins) # optionally pass labels= to set custom bin labels

df.groupby(cuts)['Math'].count()
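As a small self-contained sketch of binning with custom labels: the scores and the label names below are hypothetical, while the bin edges are the ones used above. `observed=False` is passed explicitly so that empty bins would still appear in the result.

```python
import pandas as pd

# Hypothetical scores; the bin edges match the article's example.
df = pd.DataFrame({'Math': [35, 45, 58, 62, 75, 88, 95, 100]})
bins = [0, 40, 60, 80, 90, 100]
labels = ['0-40', '40-60', '60-80', '80-90', '90-100']  # illustrative label names

# Each interval is closed on the right by default, e.g. (40, 60]
cuts = pd.cut(df['Math'], bins=bins, labels=labels)

# Group by the resulting categorical and count the scores per bin
counts = df.groupby(cuts, observed=False)['Math'].count()
print(counts)
```

Because the intervals are right-closed, a score of exactly 100 lands in the last bin, and a score of exactly 0 would fall outside every bin unless `include_lowest=True` is set.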

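III. Aggregation, filtering, and transformation

1. Aggregation

(a) Common aggregation functions

Aggregation reduces a set of numbers to a single scalar, so mean/sum/size/count/std/var/sem/describe/first/last/nth/min/max are all aggregation functions (describe returns several such statistics at once).

To get familiar with these operations, we can verify the standard error function sem, whose formula is $\mathrm{sem} = \frac{\text{group standard deviation}}{\sqrt{\text{group size}}}$. The check is: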

group_m = grouped_single['Math']

# Compare with a tolerance, since exact == on floating-point results is fragile
np.isclose(group_m.std().values/np.sqrt(group_m.count().values), group_m.sem().values)
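The same check can be run without the article's CSV file. The sketch below uses made-up scores and compares the hand-computed standard error with the built-in one using a tolerance.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the article.
df = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2', 'S_2', 'S_2'],
    'Math':   [60.0, 80.0, 70.0, 30.0, 50.0, 70.0, 90.0],
})
group_m = df.groupby('School')['Math']

# Standard error by hand: sample standard deviation over sqrt(group size)
manual_sem = group_m.std() / np.sqrt(group_m.count())

# Compare against the built-in sem with a floating-point tolerance
matches = np.isclose(manual_sem.values, group_m.sem().values)
print(matches)
```

Both std and sem use the sample standard deviation (ddof=1) by default, which is why the two sides agree.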

(b) Using several aggregation functions at once

group_m.agg(['sum','mean','std'])

Renaming the result columns with tuples:

group_m.agg([('rename_sum','sum'),('rename_mean','mean')])

Specifying which functions apply to which columns:

grouped_mul.agg({'Math':['mean','max'],'Height':'var'})
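The three calling styles of agg shown above produce differently shaped results. The sketch below runs them on a hypothetical four-row table so the resulting column names can be inspected directly; the names `total` and `average` are illustrative.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the article.
df = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2'],
    'Math':   [60.0, 80.0, 30.0, 50.0],
    'Height': [170.0, 180.0, 160.0, 164.0],
})
grouped = df.groupby('School')

# A list of function names: one output column per function
multi = grouped['Math'].agg(['sum', 'mean'])

# A list of (new_name, function) tuples: same result with renamed columns
renamed = grouped['Math'].agg([('total', 'sum'), ('average', 'mean')])

# A dict mapping column -> function(s): two-level column labels
per_col = grouped.agg({'Math': ['mean', 'max'], 'Height': 'var'})

print(multi.columns.tolist())
print(renamed.columns.tolist())
print(per_col.columns.tolist())
```

The dict form returns two-level (column, function) labels, so a single result column is selected with a tuple such as `per_col[('Height', 'var')]`.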

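(c) Using custom functions

grouped_single['Math'].agg(lambda x:print(x.head(),'separator'))

# As the output shows, agg passes data to the function group by group and column by column; this property makes many things possible

groupby has no built-in method for the range (max minus min), but agg makes the within-group range easy to compute: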

grouped_single['Math'].agg(lambda x:x.max()-x.min())

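(d) Multiple aggregations with NamedAgg

Note: lambda functions may not be supported here in older pandas versions, but functions defined separately with def can be used.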

def R1(x):
    return x.max()-x.min()

def R2(x):
    return x.max()-x.median()

grouped_single['Math'].agg(min_score1=pd.NamedAgg(column='col1', aggfunc=R1),
                           max_score1=pd.NamedAgg(column='col2', aggfunc='max'),
                           range_score2=pd.NamedAgg(column='col3', aggfunc=R2)).head()
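When the groupby object still holds several columns, the `column` field of NamedAgg picks the source column for each output. The following is a minimal sketch on made-up data; the function `value_range` and the output names are hypothetical.

```python
import pandas as pd

def value_range(x):
    """Within-group range: max minus min."""
    return x.max() - x.min()

# Hypothetical toy data; only the column names come from the article.
df = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2'],
    'Math':   [60.0, 80.0, 30.0, 50.0],
    'Height': [170.0, 180.0, 160.0, 164.0],
})

# Each keyword names an output column; NamedAgg says which input column
# to read and which function to apply to it.
summary = df.groupby('School').agg(
    math_range=pd.NamedAgg(column='Math', aggfunc=value_range),
    math_max=pd.NamedAgg(column='Math', aggfunc='max'),
    height_mean=pd.NamedAgg(column='Height', aggfunc='mean'),
)
print(summary)
```

The result has flat, explicitly named columns, which avoids the two-level labels produced by the dict form of agg.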

(e) Aggregation functions with parameters

Check whether at least one Math score in each group lies between 50 and 52:

def f(s,low,high):
    return s.between(low,high).max()

grouped_single['Math'].agg(f,50,52)

If you need several functions and at least one of them takes parameters, wrap it in a closure:

def f_test(s,low,high):
    return s.between(low,high).max()

def agg_f(f_mul,name,*args,**kwargs):
    def wrapper(x):
        return f_mul(x,*args,**kwargs)
    wrapper.__name__ = name
    return wrapper

new_f = agg_f(f_test,'at_least_one_in_50_52',50,52)

grouped_single['Math'].agg([new_f,'mean']).head()
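The wrapper technique can be tried on a small table to see where the custom name ends up. This is a sketch with hypothetical data and hypothetical helper names (`any_between`, `with_args`); it follows the same closure pattern as above.

```python
import pandas as pd

def any_between(s, low, high):
    """True if at least one value of s lies in [low, high]."""
    return s.between(low, high).any()

def with_args(func, name, *args, **kwargs):
    """Bind extra arguments to func and give the result a readable name."""
    def wrapper(x):
        return func(x, *args, **kwargs)
    wrapper.__name__ = name  # agg uses this as the output column label
    return wrapper

# Hypothetical toy data; only the column names come from the article.
df = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2'],
    'Math':   [51.0, 80.0, 30.0, 49.0],
})

in_50_52 = with_args(any_between, 'any_in_50_52', 50, 52)
result = df.groupby('School')['Math'].agg([in_50_52, 'mean'])
print(result)
```

Setting `__name__` matters because a list passed to agg needs a distinct label for every function.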

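2. Filtering (Filtration)

filter selects whole groups (remember that the result contains every row of each group that passes), so the function passed in must return a single Boolean value per group.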

grouped_single[['Math','Physics']].filter(lambda x:(x['Math']>32).all()).head()
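A short sketch of group-level filtering on made-up data: a group is kept only if all of its Math scores exceed 32, and every row of a kept group is returned.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the article.
df = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2', 'S_3'],
    'Math':   [60.0, 80.0, 30.0, 90.0, 45.0],
})

# The lambda receives each group as a DataFrame and must return one bool.
# S_2 contains a 30, so the whole group is dropped, including its 90.
kept = df.groupby('School').filter(lambda g: (g['Math'] > 32).all())
print(kept)
```

This differs from the row-wise mask `df[df['Math'] > 32]`, which would keep the 90 from S_2.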

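3. Transformation

(a) What is passed in

The object passed to the transform function is each column within a group, and the return value must have exactly the same length as that column.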

grouped_single[['Math','Height']].transform(lambda x:x-x.min()).head()

If a scalar is returned, it is broadcast to every element of the group.

grouped_single[['Math','Height']].transform(lambda x:x.mean()).head()

(b) Within-group standardization with transform

grouped_single[['Math','Height']].transform(lambda x:(x-x.mean())/x.std()).head()

(c) Filling missing values with the group mean using transform

df_nan = df[['Math','School']].copy().reset_index()

df_nan.loc[np.random.randint(0,df.shape[0],25),['Math']]=np.nan

df_nan.head()

df_nan.groupby('School').transform(lambda x: x.fillna(x.mean())).join(df.reset_index()['School']).head()
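The transform behaviors described in this section (same-length output, scalar broadcasting, group-mean filling) can be checked together on a tiny table. The values below are hypothetical and chosen so that the results are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing score per school.
df = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_1', 'S_2', 'S_2', 'S_2'],
    'Math':   [60.0, np.nan, 80.0, 30.0, 50.0, np.nan],
})
math_by_school = df.groupby('School')['Math']

# Fill each missing value with the mean of its own group
filled = math_by_school.transform(lambda x: x.fillna(x.mean()))

# Standardize within each group
zscore = math_by_school.transform(lambda x: (x - x.mean()) / x.std())

# A scalar result is broadcast to every row of the group
group_mean = math_by_school.transform('mean')

print(pd.DataFrame({'filled': filled, 'zscore': zscore, 'group_mean': group_mean}))
```

All three results share the index of `df`, so they can be assigned straight back as new columns.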

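IV. The apply function

1. The flexibility of apply

Of all the grouping functions, apply is probably the most widely used, thanks to its flexibility.

As for its input, the printed output below shows that each group is passed to apply as a sub-DataFrame: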

df.groupby('School').apply(lambda x:print(x.head(1)))

Much of apply's flexibility comes from the variety of return values it accepts:

① Scalar return value (here, one scalar per column)

df[['School','Math','Height']].groupby('School').apply(lambda x:x.max())

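② List-like return value

# Select the numeric columns first so the subtraction never touches the string column School
df.groupby('School')[['Math','Height']].apply(lambda x:x-x.min()).head()

③ DataFrame return value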

df[['School','Math','Height']].groupby('School')\
    .apply(lambda x:pd.DataFrame({'col1':x['Math']-x['Math'].max(),
                                  'col2':x['Math']-x['Math'].min(),
                                  'col3':x['Height']-x['Height'].max(),
                                  'col4':x['Height']-x['Height'].min()})).head()

2. Computing several statistics at once with apply

OrderedDict offers a convenient way to collect the statistics:

from collections import OrderedDict

def f(df):
    data = OrderedDict()
    data['M_sum'] = df['Math'].sum()
    data['W_var'] = df['Weight'].var()
    data['H_mean'] = df['Height'].mean()
    return pd.Series(data)

grouped_single.apply(f)
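A self-contained variant of the same idea, assuming made-up data: the function returns a Series built from a plain dict, which keeps insertion order on Python 3.7 and later, so OrderedDict is optional there. The name `summarize` is hypothetical, and the columns are selected explicitly before apply.

```python
import pandas as pd

def summarize(group):
    """Return several per-group statistics as one labeled Series."""
    return pd.Series({
        'M_sum': group['Math'].sum(),
        'W_var': group['Weight'].var(),
        'H_mean': group['Height'].mean(),
    })

# Hypothetical toy data; only the column names come from the article.
df = pd.DataFrame({
    'School': ['S_1', 'S_1', 'S_2', 'S_2'],
    'Math':   [60.0, 80.0, 30.0, 50.0],
    'Weight': [50.0, 60.0, 70.0, 74.0],
    'Height': [170.0, 180.0, 160.0, 164.0],
})

# One row per school, one column per statistic
stats = df.groupby('School')[['Math', 'Weight', 'Height']].apply(summarize)
print(stats)
```

Each returned Series becomes one row of the result, and its labels become the column names.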
