Handy pandas tips that can boost your productivity: do you know them?

2021-02-08 在成長的風一

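❝

Having seen how vast heaven and earth are, I still cherish the green of the grass and trees.

❞
pandas is one of the most commonly used data-analysis libraries in Python. It also has so many features that even experienced users cannot be sure they are using it efficiently. The sections below collect a few practical tips for working with pandas more efficiently.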

import pandas as pd
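1. Create a DataFrame from the clipboard
pandas' read_clipboard() converts whatever is currently on the clipboard directly into a DataFrame. In other words, copied content (especially tabular data copied from Excel) can be turned into a DataFrame straight away, ready for analysis.
# After copying the table contents in Excel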
pd.read_clipboard()
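Since clipboard contents cannot be reproduced on the page, here is a minimal sketch of the same idea, assuming a tab-separated block of text such as a spreadsheet would place on the clipboard. read_clipboard() passes the clipboard text on to read_csv(), so parsing the same text from a string gives the same kind of result. The sample text and the name df_clip are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical text as it might sit on the clipboard after copying
# a small range from a spreadsheet (cells separated by tabs).
clipboard_text = "name\tage\tgraduated\nAnn\t22\tFalse\nBob\t23\tTrue\n"

# Parse the text directly with read_csv, so no clipboard is needed.
df_clip = pd.read_csv(io.StringIO(clipboard_text), sep="\t")

print(df_clip)
print(df_clip.dtypes)
```

Keyword arguments given to read_clipboard() are forwarded to read_csv(), so options such as sep or header can be used in the same way.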
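2. Select columns by data type
During analysis you may need to filter columns, for example keeping only the numeric ones.
# Create the data (using the method from tip 1: copy the table contents)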
df = pd.read_clipboard()
df

   姓名    年齡  興趣    畢業否
0  張三  22.0  學習  False
1  李四  23.0  健身   True
# Columns: 姓名 = name, 年齡 = age, 興趣 = hobby, 畢業否 = graduated
# Check the data type of each column
df.dtypes
姓名      object
年齡     float64
興趣      object
畢業否       bool
dtype: object
# Add one more column of int type to the dataset (存款 = savings)
df['存款'] = 100
df.dtypes
姓名      object
年齡     float64
興趣      object
畢業否       bool
存款       int64
dtype: object
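For readers without the spreadsheet at hand, the frame above can be rebuilt by hand and used to try out select_dtypes, the method this tip relies on for picking columns by dtype. This is only a sketch: the variable names are illustrative and the values are typed in from the output shown above.

```python
import pandas as pd

# Hand-built copy of the small frame shown above (not read from the clipboard).
people = pd.DataFrame({
    "姓名": ["張三", "李四"],
    "年齡": [22.0, 23.0],
    "興趣": ["學習", "健身"],
    "畢業否": [False, True],
})
people["存款"] = 100  # integer column

# 'number' covers integer and float columns; bool columns are not included.
numeric_cols = people.select_dtypes(include="number")
# exclude= keeps everything except the listed dtypes.
non_numeric_cols = people.select_dtypes(exclude="number")
# A list selects several dtypes at once.
bool_and_numeric = people.select_dtypes(include=["bool", "number"])

print(list(numeric_cols.columns))
print(list(non_numeric_cols.columns))
print(list(bool_and_numeric.columns))
```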
2.1. Use select_dtypes to select the numeric columns, i.e. the columns whose dtype is int64 or float64
df.select_dtypes(include='number')
2.2. Select every column except the int columns (use the exclude parameter)
df.select_dtypes(exclude='int64')

   姓名    年齡  興趣    畢業否
0  張三  22.0  學習  False
1  李四  23.0  健身   True
2.3. Select several types at once
df.select_dtypes(include=['object', 'int64', 'bool'])

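   姓名  興趣    畢業否   存款
0  張三  學習  False  100
1  李四  健身   True  100
3. Convert string columns to numeric
pandas offers two ways to convert strings to numbers: (a) the astype() method and (b) the to_numeric() function. The two behave differently.
# Create the data (every column holds strings, and sales contains the special value '-')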
df = pd.DataFrame({ 'product': ['A','B','C','D'], 
                   'price': ['10','20','30','40'],
                   'sales': ['20','-','60','-']
                  })
df

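  product price sales
0       A    10    20
1       B    20     -
2       C    30    60
3       D    40     -
Use astype() to convert price and sales separately and see how the results differ.
df['price'] = df['price'].astype(int)
df['sales'] = df['sales'].astype(int)  # raises ValueError (the column contains the string '-')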


ValueError                                Traceback (most recent call last)

<ipython-input-16-e16e56499e99> in <module>
----> 1 df['sales'] = df['sales'].astype(int)


d:\python\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)


d:\python\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):


d:\python\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 


d:\python\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):


d:\python\lib\site-packages\pandas\core\internals\blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    641                     # _astype_nansafe works fine with 1-d only
    642                     vals1d = values.ravel()
--> 643                     values = astype_nansafe(vals1d, dtype, copy=True, **kwargs)
    644 
    645                 # TODO(extension)


d:\python\lib\site-packages\pandas\core\dtypes\cast.py in astype_nansafe(arr, dtype, copy, skipna)
    705         # work around NumPy brokenness, #1987
    706         if np.issubdtype(dtype.type, np.integer):
--> 707             return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    708 
    709         # if we have a datetime/timedelta array of objects


pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()


ValueError: invalid literal for int() with base 10: '-'
df.dtypes  # price has been converted to an integer type
product    object
price       int32
sales      object
dtype: object
to_numeric() solves the problem above, where sales could not be converted: just set errors='coerce', and every value that cannot be parsed becomes NaN.
df['sales'] = pd.to_numeric(df['sales'], errors='coerce')
df

  product  price  sales
0       A     10   20.0
1       B     20    NaN
2       C     30   60.0
3       D     40    NaN
df.dtypes
product     object
price        int32
sales      float64
dtype: object
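The difference between the two conversions can be seen side by side in a short, self-contained sketch. It uses the same sales values as above. The final step, filling the NaN values with 0 and casting back to int, is an assumption added for illustration and is not part of the original walkthrough.

```python
import pandas as pd

sales = pd.Series(["20", "-", "60", "-"])

# astype(int) is all-or-nothing: a single bad value aborts the conversion.
try:
    sales.astype(int)
    astype_failed = False
except ValueError:
    astype_failed = True

# errors='coerce' turns unparseable entries into NaN instead of raising.
converted = pd.to_numeric(sales, errors="coerce")

# Illustrative follow-up (an assumption): treat '-' as zero sales.
filled = converted.fillna(0).astype(int)

print(astype_failed)
print(converted)
print(filled)
```

Because NaN is a float, the coerced column comes back as float64, which is why sales shows 20.0 and 60.0 in the output above.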
4. Detect and handle missing values
Detecting missing values
4.1. Count the non-missing values in each column
df = pd.read_clipboard()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
姓名     10 non-null object
興趣     10 non-null object
畢業否    10 non-null object
存款     5 non-null float64
dtypes: float64(1), object(3)
memory usage: 264.0+ bytes
4.2. The info() output is not very intuitive; df.isnull().sum() shows clearly how many missing values each column has
df.isnull().sum()
姓名     0
興趣     0
畢業否    0
存款     4
dtype: int64
4.3. Use isnull().sum().sum() to see how many missing values the whole dataset has
df.isnull().sum().sum()
4
4.4. Use df.isna().mean() to see the share of missing values in each column (isnull() is an alias of isna() and gives the same result)
df.isna().mean()
姓名     0.0
興趣     0.0
畢業否    0.0
存款     0.4
dtype: float64
df.isnull().mean()
姓名     0.0
興趣     0.0
畢業否    0.0
存款     0.4
dtype: float64
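The detection steps above can be reproduced without the clipboard. The sketch below assumes a small hand-made frame (the names demo, name, and deposit are illustrative) with two missing values out of five rows.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "deposit": [100.0, np.nan, 100.0, np.nan, 100.0],
})

missing_per_column = demo.isna().sum()        # number of NaN values per column
missing_total = int(demo.isna().sum().sum())  # number of NaN values in the whole frame
missing_share = demo.isna().mean()            # fraction of NaN values per column

print(missing_per_column)
print(missing_total)
print(missing_share)
```

The mean works here because isna() returns booleans, and averaging True/False values gives the proportion of True.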
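Handling missing values: dropping and replacing
4.1. Drop rows that contain missing values
df.dropna(axis=0)

   姓名  興趣   畢業否   存款
0  張三  學習  FALSE  100
1
2  張三  學習  FALSE  100
4  張三  學習  FALSE  100
8  張三  學習  FALSE  100
9  李四  健身   TRUE  100
4.2. Drop columns that contain missing values
df.dropna(axis=1)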

   姓名  興趣   畢業否
0  張三  學習  FALSE
1
2  張三  學習  FALSE
3  李四  TRUE    100
4  張三  學習  FALSE
5  健身  TRUE    100
6  張三  學習    100
7  李四  健身   TRUE
8  張三  學習  FALSE
9  李四  健身   TRUE
4.3. Drop columns in which more than 10% of the values are missing
df.dropna(thresh=len(df)*0.9, axis=1)

   姓名  興趣   畢業否
0  張三  學習  FALSE
1
2  張三  學習  FALSE
3  李四  TRUE    100
4  張三  學習  FALSE
5  健身  TRUE    100
6  張三  學習    100
7  李四  健身   TRUE
8  張三  學習  FALSE
9  李四  健身   TRUE
4.4. Replace missing values with a scalar
df.fillna(value=1)

   姓名  興趣   畢業否   存款
0  張三  學習  FALSE  100
1
2  張三  學習  FALSE  100
3  李四  TRUE    100    1
4  張三  學習  FALSE  100
5  健身  TRUE    100    1
6  張三  學習    100    1
7  李四  健身   TRUE    1
8  張三  學習  FALSE  100
9  李四  健身   TRUE  100
4.5. Replace missing values with the value from the previous row
df.fillna(axis=0, method='ffill')

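   姓名  興趣   畢業否   存款
0  張三  學習  FALSE  100
1
2  張三  學習  FALSE  100
3  李四  TRUE    100  100
4  張三  學習  FALSE  100
5  健身  TRUE    100  100
6  張三  學習    100  100
7  李四  健身   TRUE  100
8  張三  學習  FALSE  100
9  李四  健身   TRUE  100
4.6. Replace missing values with the value from the previous column
df.fillna(axis=1, method='ffill')

   姓名  興趣   畢業否    存款
0  張三  學習  FALSE   100
1
2  張三  學習  FALSE   100
3  李四  TRUE    100   100
4  張三  學習  FALSE   100
5  健身  TRUE    100   100
6  張三  學習    100   100
7  李四  健身   TRUE  TRUE
8  張三  學習  FALSE   100
9  李四  健身   TRUE   100
4.7. Replace missing values with the value from the next row
df.fillna(axis=0, method='bfill')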

   姓名  興趣   畢業否   存款
0  張三  學習  FALSE  100
1
2  張三  學習  FALSE  100
3  李四  TRUE    100  100
4  張三  學習  FALSE  100
5  健身  TRUE    100  100
6  張三  學習    100  100
7  李四  健身   TRUE  100
8  張三  學習  FALSE  100
9  李四  健身   TRUE  100
4.8. Replace missing values with the mean of a column
df['存款'] = pd.to_numeric(df['存款'], errors='coerce')
df['存款'] = df['存款'].fillna(df['存款'].mean())   # any other value could be used instead
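The dropping and filling options in this section can be compared on a small hand-made frame. The sketch below is illustrative only: the frame demo and its column names are assumptions. It uses ffill() and bfill() rather than fillna(method=...), because recent pandas releases deprecate the method argument of fillna, and it rounds the 90% threshold up with math.ceil so that thresh is an integer.

```python
import math

import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "deposit": [100.0, np.nan, 300.0, np.nan],
})

rows_kept = demo.dropna(axis=0)   # drop rows that contain NaN
cols_kept = demo.dropna(axis=1)   # drop columns that contain NaN

# Keep only columns with at least 90% non-missing values.
min_non_missing = math.ceil(len(demo) * 0.9)
mostly_complete = demo.dropna(thresh=min_non_missing, axis=1)

forward = demo.ffill()    # copy the value from the previous row
backward = demo.bfill()   # copy the value from the next row (the last row has none)

# Fill with the column mean, assigning the result back instead of using inplace=True.
mean_filled = demo.assign(deposit=demo["deposit"].fillna(demo["deposit"].mean()))

print(forward)
print(backward)
print(mean_filled)
```

Assigning the filled column back avoids calling fillna(inplace=True) on a column selected from the frame, which newer pandas versions warn about.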
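5. Discretize continuous data
During data preparation, existing features are often combined or transformed to create new ones. Discretizing continuous data, i.e. turning numeric values into categorical features, is one of the most important of these transformations.
df = pd.read_clipboard()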
df

     age
0     10
1     11
2     12
3     13
4     14
..   ...
106  116
107  117
108  118
109  119
110  120

111 rows × 1 columns
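Because the clipboard data cannot be reproduced here, the following self-contained sketch bins a handful of made-up ages with pd.cut. It assumes four groups (child up to 12, teen up to 18, adult up to 60, senior above 60) and English labels chosen for illustration. By default pd.cut treats bins as right-closed, so the sketch passes right=False to make each bin include its left edge instead.

```python
import pandas as pd

# Made-up ages that sit on both sides of every bin edge.
ages = pd.DataFrame({"age": [5, 12, 13, 18, 19, 60, 61, 90]})

# right=False gives left-closed bins: [0, 13), [13, 19), [19, 61), [61, inf)
ages["ageGroup"] = pd.cut(
    ages["age"],
    bins=[0, 13, 19, 61, float("inf")],
    right=False,
    labels=["child", "teen", "adult", "senior"],
)

print(ages)
```

With the default right=True and the same edges, an age of exactly 13 would fall into the first bin, (0, 13], which is worth checking whenever ages sit on a boundary.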
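Age is a continuous value. To group it into a categorical feature, for example (<=12, child), (<=18, teen), (<=60, adult), (>60, senior), use the cut function:
import sys
df['ageGroup'] = pd.cut(
    df['age'],
    bins=[0, 13, 19, 61, sys.maxsize],   # sys.maxsize serves as a very large upper bound
    right=False,                         # left-closed bins: [0, 13), [13, 19), [19, 61), [61, max)
    labels=['child', 'teen', 'adult', 'senior']
)

df.head()

   age ageGroup
0   10    child
1   11    child
2   12    child
3   13     teen
4   14     teen
6. Build one DataFrame from multiple files
Sometimes a dataset is spread across several Excel or CSV files but needs to be read into a single DataFrame. How can this be done?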
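The approach is to read the files one by one and then combine the resulting DataFrames into a single DataFrame.
The built-in glob module is used here to collect the file paths, which is concise and efficient.
from glob import glob
files = sorted(glob(r'D:\python\exercise_data\sale_data\sale*.xlsx'))  # find matching files in the given directory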
files
['D:\\python\\exercise_data\\sale_data\\sale01.xlsx',
 'D:\\python\\exercise_data\\sale_data\\sale02.xlsx']
6.1. Combine rows
df1 = pd.read_excel(r'D:\python\exercise_data\sale_data\sale01.xlsx', nrows=10)  # read only the first 10 rows
df1

  dealers    code  procount  proamout
0    YLQY  014127        10    363.00
1    YLQY  014127         2     77.22
2    YLQY  014127        60   2316.60
3    YLQY  014127        15    581.25
4    YLQY  014127        20    775.00
5    YLQY  014127        10    386.10
6    YLQY  014127        20    772.20
7    YLQY  014127       200   7750.00
8    YLQY  014127        23    891.25
9    YLQY  014127        27   1046.25
df2 = pd.read_excel(r'D:\python\exercise_data\sale_data\sale02.xlsx')
df2.head()  # shows the first 5 rows by default

  dealers    code  procount  proamout
0    YLQY  014127       2.0      36.8
1    YLQY  014127       8.0     147.2
2    YLQY  014127       2.0      37.0
3    YLQY  014127       5.0      92.5
4    YLQY  014127      30.0     552.0
# Read each file, then join the results with concat
files = sorted(glob(r'D:\python\exercise_data\sale_data\sale*.xlsx'))
pd.concat((pd.read_excel(file) for file in files), ignore_index=True)

   dealers    code  procount  proamout
0     YLQY  014127        10    363.00
1     YLQY  014127         2     77.22
2     YLQY  014127        60   2316.60
3     YLQY  014127        15    581.25
4     YLQY  014127        20    775.00
5     YLQY  014127        10    386.10
6     YLQY  014127        20    772.20
7     YLQY  014127       200   7750.00
8     YLQY  014127        23    891.25
9     YLQY  014127        27   1046.25
10    YLQY  014127        50   1937.50
11    YLQY  014127        10    387.50
12    YLQY  014127         2     43.00
13    YLQY  014127        10    386.10
14    YLQY  014127         2     36.80
15    YLQY  014127         8    147.20
16    YLQY  014127         2     37.00
17    YLQY  014127         5     92.50
18    YLQY  014127        30    552.00
19    YLQY  014127         2     36.80
20    YLQY  014127         2     36.80
21    YLQY  014127         3     55.20
22    YLQY  014127         5     92.00
23    YLQY  014127        10    183.00
24    YLQY  014127         3     55.20
25    YLQY  014127       400   7320.00
26    YLQY  014127         5     92.50
27    YLQY  014127        30    573.00
6.2. Combine columns
files = sorted(glob(r'D:\python\exercise_data\sale_data\sale*.xlsx'))
pd.concat((pd.read_excel(file) for file in files), axis=1)

   dealers    code  procount  proamout dealers    code  procount  proamout
0     YLQY  014127        10    363.00    YLQY  014127         2      36.8
1     YLQY  014127         2     77.22    YLQY  014127         8     147.2
2     YLQY  014127        60   2316.60    YLQY  014127         2      37.0
3     YLQY  014127        15    581.25    YLQY  014127         5      92.5
4     YLQY  014127        20    775.00    YLQY  014127        30     552.0
5     YLQY  014127        10    386.10    YLQY  014127         2      36.8
6     YLQY  014127        20    772.20    YLQY  014127         2      36.8
7     YLQY  014127       200   7750.00    YLQY  014127         3      55.2
8     YLQY  014127        23    891.25    YLQY  014127         5      92.0
9     YLQY  014127        27   1046.25    YLQY  014127        10     183.0
10    YLQY  014127        50   1937.50    YLQY  014127         3      55.2
11    YLQY  014127        10    387.50    YLQY  014127       400    7320.0
12    YLQY  014127         2     43.00    YLQY  014127         5      92.5
13    YLQY  014127        10    386.10    YLQY  014127        30     573.0
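The glob-and-concat pattern can be tried end to end without the original spreadsheets. The sketch below is an illustration under stated assumptions: it writes two tiny CSV files (hypothetical names sale01.csv and sale02.csv, made-up rows) into a temporary directory and uses read_csv in place of read_excel so that no Excel engine is required.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two small hypothetical files standing in for the sales workbooks.
    pd.DataFrame({"procount": [10, 2], "proamout": [363.0, 77.22]}).to_csv(
        Path(tmp_dir) / "sale01.csv", index=False
    )
    pd.DataFrame({"procount": [2, 8], "proamout": [36.8, 147.2]}).to_csv(
        Path(tmp_dir) / "sale02.csv", index=False
    )

    # sorted() keeps the file order stable, since glob gives no ordering guarantee.
    files = sorted(glob(str(Path(tmp_dir) / "sale*.csv")))
    frames = [pd.read_csv(path) for path in files]

stacked = pd.concat(frames, ignore_index=True)  # rows appended, index renumbered 0..3
side_by_side = pd.concat(frames, axis=1)        # columns placed next to each other

print(stacked)
print(side_by_side)
```

Without ignore_index=True the stacked result would keep each file's own index (0, 1, 0, 1), which usually is not what is wanted when rows are appended.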
