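Pandas is another very important library in machine learning, used above all for data preprocessing and data cleaning.
What does pandas offer over NumPy, and which features does it add on top of it?
1. pandas automatically aligns data by its labels, which prevents the mistakes that misaligned data causes during processing.
2. pandas handles missing data flexibly: a missing value can be filled with the mean of the available values, or with any value you choose.
3. pandas supports join operations similar to those in SQL.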
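The three points above can be sketched in a few lines. This is only a minimal illustration under assumed data: the variable names and sample values (a, b, scores, cities, codes) are made up for the example and are not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# 1. Alignment: arithmetic matches values by index label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])
aligned_sum = a + b  # x: 1 + 30, y: 2 + 20, z: 3 + 10

# 2. Missing data: fill the gap with the mean of the values that are present.
scores = pd.Series([80.0, np.nan, 90.0])
filled = scores.fillna(scores.mean())  # mean() skips NaN by default

# 3. SQL-style join: merge two tables on a shared key column.
cities = pd.DataFrame({"city": ["beijing", "shanghai"], "pop": [1.5, 2.1]})
codes = pd.DataFrame({"city": ["shanghai", "beijing"], "code": ["SH", "BJ"]})
joined = pd.merge(cities, codes, on="city", how="inner")

print(aligned_sum)
print(filled)
print(joined)
```

Note that the two Series in step 1 list their labels in opposite order, yet the sum still pairs x with x and z with z.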
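Installing Pandas
Pandas has to be installed before you can use it.
1) Open a cmd terminal and run pip install pandas (this can fail with a timeout; if it does, try pip --default-timeout=10000 install pandas)
2) The installation failed many times for me, so I also looked for other solutions. If method 1 does not work for you, see method 2, which can be found with a Baidu search (original address: http://www.wsmee.com/post/6, copyright notice: non-commercial, no derivatives, attribution required)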
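Basic Series operations
A one-dimensional array in pandas
from pandas import Series, DataFrame  # Series and DataFrame are the two core data structures
import pandas as pd  # a shorter alias; in most machine learning code, pd refers to pandas
obj = Series([4, 5, 6, -7])
print(obj)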
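Getting the index or the values on their own
from pandas import Series, DataFrame
import pandas as pd
obj = Series([4, 5, 6, -7])
print(obj)
print(obj.index)   # the index labels
print(obj.values)  # the underlying values
Note: a pandas index may contain duplicate labels, which is not the case for dictionary keys.
Dictionary keys go through a hash operation.
Hashing: a key, such as a short string, is run through a hash function to produce a hash value that determines where the entry is stored.
The keys of {'a':1,'b':2,'c':3} are each mapped to a long numeric value when the dictionary is stored in memory. A new key added to the dictionary is mapped the same way first; if it equals a key that is already stored, the new value overwrites the existing one, so the keys of a dictionary cannot repeat.
Can be used as dictionary keys: int, float, string, tuple (provided every element of the tuple is itself hashable)
Cannot be used as dictionary keys: list, set (Q: Why not? A: Their contents are mutable. A list can have its elements reassigned, and if its hash changed after such a change, the key could no longer be used to find its value. Python therefore does not allow these types to be hashed at all.)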
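A short sketch can make the contrast between dictionary keys and a Series index concrete. The names d, list_key_ok and dup are illustrative only, and the sample values are assumptions.

```python
from pandas import Series

# Immutable values are hashable, so they work as dictionary keys.
d = {1: "int", 2.5: "float", "a": "str", (1, 2): "tuple"}

# Assigning to an existing key overwrites the old value;
# no duplicate key is created.
d["a"] = "overwritten"

# A list is mutable and therefore unhashable:
# using it as a key raises TypeError.
try:
    d[[1, 2]] = "list"
    list_key_ok = True
except TypeError:
    list_key_ok = False

# A Series index, by contrast, may hold the same label more than once.
dup = Series([4, 5, 6], index=["a", "a", "b"])
print(dup["a"])             # both rows labelled 'a'
print(dup.index.is_unique)  # reports whether all labels are distinct
```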
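Specifying the index by hand
from pandas import Series, DataFrame
import pandas as pd
obj2 = Series([4, 5, 6, -7], index=['a', 'b', 'c', 'd'])
print(obj2)
The output follows the order in which the index was defined.
Assigning to index c
from pandas import Series, DataFrame
import pandas as pd
obj2 = Series([4, 5, 6, -7], index=['a', 'b', 'c', 'd'])
print(obj2)
obj2['c'] = 8  # replace the value stored under label 'c'
print(obj2)
Note that the value at c has changed.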
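A Series can be used like a dictionary: the in test returns True if the label is present and False otherwise
from pandas import Series, DataFrame
import pandas as pd
obj2 = Series([4, 5, 6, -7], index=['a', 'b', 'c', 'd'])
print(obj2)
obj2['c'] = 8
print(obj2)
print('a' in obj2)  # membership is tested against the index labels
Converting a dictionary to a Series
If the data is stored in a dictionary, can it be converted to a Series easily?
from pandas import Series, DataFrame
import pandas as pd
stade = {'beijing': 2100, 'shanghai': 3500, 'guangzhou': 9088, 'xuzhou': 3908}
obj3 = Series(stade)
print(obj3)
The dictionary keys become the index of the Series, and the dictionary values become the corresponding values.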
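Modifying the index
from pandas import Series, DataFrame
import pandas as pd
stade = {'beijing': 2100, 'shanghai': 3500, 'guangzhou': 9088, 'xuzhou': 3908}
obj3 = Series(stade)
print(obj3)
obj3.index = ['bj', 'sh', 'gz', 'xz']  # the new index must have the same length as the old one
print(obj3)
Basic DataFrame operations
A DataFrame is more like a spreadsheet
It holds multi-dimensional data
Q: How do you create a DataFrame?
A: Usually by passing in lists of equal length, or a NumPy array
Here we create a DataFrame from a dictionary of equal-length lists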
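from pandas import Series, DataFrame
import pandas as pd
data = {'city': ['shanghai', 'beijing', 'tianjin', 'beijing', 'shanghai', 'xuzhou'],
        'year': [2016, 2017, 2018, 2019, 2020, 2021],
        'pop': [1.1, 1.5, 2.3, 3.5, 2.1, 5.1]}
frame = DataFrame(data)
print(frame)
The column labels are the keys of data, the row index is generated automatically, and the list values fill in the table.
As you can see, the result resembles a spreadsheet.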
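The answer above also mentions NumPy arrays as an input. A minimal sketch of that route follows; the array contents and the name frame_np are assumptions chosen to mirror the year and pop columns.

```python
import numpy as np
from pandas import DataFrame

# A 2-D NumPy array: each row of the array becomes a row of the table.
# The array has a single dtype, so the years are stored as floats here.
arr = np.array([[2016, 1.1],
                [2017, 1.5],
                [2018, 2.3]])

# Column names are supplied separately, since an array has no keys.
frame_np = DataFrame(arr, columns=["year", "pop"])
print(frame_np)
print(frame_np.shape)
```

Because a NumPy array holds one dtype, mixed columns such as city names and numbers are usually easier to build from a dictionary of lists.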
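So which operations do people commonly perform on a spreadsheet?
Displaying columns in a given order
from pandas import Series, DataFrame
import pandas as pd
data = {'city': ['shanghai', 'beijing', 'tianjin', 'beijing', 'shanghai', 'xuzhou'],
        'year': [2016, 2017, 2018, 2019, 2020, 2021],
        'pop': [1.1, 1.5, 2.3, 3.5, 2.1, 5.1]}
frame = DataFrame(data)
frame2 = DataFrame(data, columns=['year', 'city', 'pop'])  # columns appear in the order listed
print(frame)
print(frame2)
Compare the output before and after reordering the columns.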
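A DataFrame is a two-dimensional table; selecting a single column gives one-dimensional data (a Series)
from pandas import Series, DataFrame
import pandas as pd
data = {'city': ['shanghai', 'beijing', 'tianjin', 'beijing', 'shanghai', 'xuzhou'],
        'year': [2016, 2017, 2018, 2019, 2020, 2021],
        'pop': [1.1, 1.5, 2.3, 3.5, 2.1, 5.1]}
frame = DataFrame(data)
frame2 = DataFrame(data, columns=['year', 'city', 'pop'])
print(frame)
print(frame2)
print(frame2['city'])  # select a column by key
print(frame2.city)     # the same column, via attribute access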
Adding a new column
from pandas import Series, DataFrame
import pandas as pd
data = {'city': ['shanghai', 'beijing', 'tianjin', 'beijing', 'shanghai', 'xuzhou'],
        'year': [2016, 2017, 2018, 2019, 2020, 2021],
        'pop': [1.1, 1.5, 2.3, 3.5, 2.1, 5.1]}
frame = DataFrame(data)
frame2 = DataFrame(data, columns=['year', 'city', 'pop'])
frame2['new'] = 100  # a scalar is broadcast to every row
print(frame2)
Generating a new column from a computation on the table
-- the new column is True where the city is shanghai and False otherwise
from pandas import Series, DataFrame
import pandas as pd
data = {'city': ['shanghai', 'beijing', 'tianjin', 'beijing', 'shanghai', 'xuzhou'],
        'year': [2016, 2017, 2018, 2019, 2020, 2021],
        'pop': [1.1, 1.5, 2.3, 3.5, 2.1, 5.1]}
frame = DataFrame(data)
frame2 = DataFrame(data, columns=['year', 'city', 'pop'])
frame2['SH'] = frame2.city == 'shanghai'  # element-wise comparison gives one boolean per row
print(frame2)
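The same kind of boolean column can also be used to filter rows. The sketch below is one possible follow-up, not part of the original steps; is_sh, sh_rows and the column name pop_x2 are hypothetical names introduced for illustration.

```python
from pandas import DataFrame

data = {'city': ['shanghai', 'beijing', 'tianjin', 'beijing', 'shanghai', 'xuzhou'],
        'year': [2016, 2017, 2018, 2019, 2020, 2021],
        'pop': [1.1, 1.5, 2.3, 3.5, 2.1, 5.1]}
frame2 = DataFrame(data, columns=['year', 'city', 'pop'])

# The comparison yields a boolean Series, one True/False per row.
is_sh = frame2['city'] == 'shanghai'

# Indexing with that mask keeps only the rows where it is True.
sh_rows = frame2[is_sh]
print(sh_rows)

# A derived numeric column computed from an existing one
# ('pop_x2' is a hypothetical column name).
frame2['pop_x2'] = frame2['pop'] * 2
print(frame2)
```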
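Building a DataFrame from nested dictionaries
from pandas import Series, DataFrame
import pandas as pd
pop = {'beijing': {2008: 1.5, 2020: 2.0},
       'shanghai': {2008: 2.0, 2020: 3.0}}
frame3 = DataFrame(pop)  # outer keys become columns, inner keys become the row index
print(frame3)
Swapping rows and columns: the transpose
from pandas import Series, DataFrame
import pandas as pd
pop = {'beijing': {2008: 1.5, 2020: 2.0},
       'shanghai': {2008: 2.0, 2020: 3.0}}
frame3 = DataFrame(pop)
print(frame3)
print(frame3.T)  # .T returns the transposed table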
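Reindexing with reindex (shown here on a Series)
from pandas import Series, DataFrame
import pandas as pd
obj4 = Series([4, 5, 6, -7], index=['c', 'd', 'b', 'a'])
obj5 = obj4.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj5)
The output shows NaN under e, because that label does not exist in the original index.
Null values like this may need to be cleaned up.
Filling the null values
from pandas import Series, DataFrame
import pandas as pd
obj4 = Series([4, 5, 6, -7], index=['c', 'd', 'b', 'a'])
obj5 = obj4.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)  # labels missing from obj4 get 0
print(obj5)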
Filling null values with adjacent values
from pandas import Series, DataFrame
import pandas as pd
obj6 = Series(['red', 'yellow', 'blue'], index=[0, 2, 4])
print(obj6.reindex(range(6)))                  # new labels 1, 3 and 5 are NaN
print(obj6.reindex(range(6), method='ffill'))  # forward fill: each gap takes the previous value
Carrying the following value backward
print(obj6.reindex(range(6), method='bfill'))  # backward fill: each gap takes the next value; label 5 has none and stays NaN
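To compare the two fill directions side by side, the results can be collected into one table. This is only a sketch; the names forward, backward and compare, and the column labels, are assumptions made for the example.

```python
import pandas as pd
from pandas import Series, DataFrame

obj6 = Series(['red', 'yellow', 'blue'], index=[0, 2, 4])

# Forward fill: a new label takes the value of the nearest earlier label.
forward = obj6.reindex(range(6), method='ffill')

# Backward fill: a new label takes the value of the nearest later label.
# Label 5 has no later label, so it remains missing.
backward = obj6.reindex(range(6), method='bfill')

# Put both results next to the unfilled version for comparison.
compare = DataFrame({'plain': obj6.reindex(range(6)),
                     'ffill': forward,
                     'bfill': backward})
print(compare)
```

Both methods require the original index to be sorted (monotonic), which is the case for 0, 2, 4.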
Dropping missing data from a Series
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data = Series([1, NA, 2])
print(data)
print(data.dropna())
Dropping missing data from a DataFrame
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data2 = DataFrame([[1, 1.5, 4, 5, -3.5],
                   [1, NA, NA, NA, 8],
                   [NA, NA, NA, NA, 2]])
print(data2.dropna())
By default, dropna removes every row that contains at least one NA.
Dropping only the rows that are entirely missing, keeping rows with partial data
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data2 = DataFrame([[1, 1.5, 4, 5, -3.5],
                   [1, NA, NA, NA, 8],
                   [NA, NA, NA, NA, NA]])
print(data2.dropna(how='all'))
Dropping the columns that are entirely missing
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data2 = DataFrame([[1, 1.5, 4, NA, -3.5],
                   [1, NA, NA, NA, 8],
                   [NA, NA, NA, NA, NA]])
print(data2)
print(data2.dropna(axis=1, how='all'))  # axis=1 applies the rule to columns instead of rows
Filling missing values with 0
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data2 = DataFrame([[1, 1.5, 4, NA, -3.5],
                   [1, NA, NA, NA, 8],
                   [NA, NA, NA, NA, NA]])
print(data2)
print(data2.dropna(axis=1, how='all'))
print(data2.fillna(0))  # returns a filled copy; data2 itself is unchanged
Modifying in place
from pandas import Series, DataFrame
import pandas as pd
from numpy import nan as NA
data2 = DataFrame([[1, 1.5, 4, NA, -3.5],
                   [1, NA, NA, NA, 8],
                   [NA, NA, NA, NA, NA]])
print(data2)
print(data2.dropna(axis=1, how='all'))
data2.fillna(0, inplace=True)  # modifies data2 directly and returns None, so do not print the call
print(data2)
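The introduction mentioned filling a gap with an average instead of a constant. One way to do that for a DataFrame is sketched below; the small table and the names col_means and filled are assumptions for illustration.

```python
import numpy as np
from pandas import DataFrame

data2 = DataFrame([[1.0, 1.5, 4.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, 2.5, 8.0]])

# data2.mean() gives one mean per column, ignoring NaN.
col_means = data2.mean()

# Passing a Series to fillna fills each column with its own value,
# matched by column label.
filled = data2.fillna(col_means)
print(filled)

# fillna returns a new object by default; data2 itself still has NaN.
print(data2.isna().sum().sum())
```

A column that is entirely missing has no mean, so it would stay NaN under this approach.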
Hierarchical indexing
Extracting data by index level
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
from numpy import nan as NA
data3 = Series(np.random.random(10),
               index=[['q', 'w', 'e', 'r', 't', 'q', 'q', 'a', 'a', 'f'],
                      [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]])
print(data3)
Extracting the entries under the outer label a
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
from numpy import nan as NA
data3 = Series(np.random.random(10),
               index=[['q', 'w', 'e', 'r', 't', 'q', 'q', 'a', 'a', 'f'],
                      [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]])
print(data3)
print(data3['a'])
Passing a range of outer labels
print(data3.sort_index()['a':'q'])  # label slicing on a hierarchical index needs a sorted index, so sort first
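A few more ways of pulling data out of a hierarchical index are sketched below, using a small index that is already sorted so the output is predictable. The data and the names inner_2 and table are assumptions for this example.

```python
import numpy as np
from pandas import Series

data3 = Series(np.arange(6),
               index=[['a', 'a', 'b', 'b', 'c', 'c'],
                      [1, 2, 1, 2, 1, 2]])

# Outer-level label: returns the sub-Series indexed by the inner level.
print(data3['a'])

# Several outer labels at once, given as a list.
print(data3.loc[['a', 'c']])

# Inner-level label across all outer labels.
inner_2 = data3.loc[:, 2]
print(inner_2)

# unstack() pivots the inner level into columns, giving a DataFrame.
table = data3.unstack()
print(table)
```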
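Deleting and filling missing data in Series and DataFrame structures is part of data preprocessing, and it is a critical step in data cleaning.
With the steps demonstrated above, the data can be preprocessed into a usable form, and then used for plotting and modeling.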