Learning Data Analysis Seriously and Systematically
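This article continues the series on data analysis with Python. Earlier installments are listed below for review:
Data Analysis Kickoff: A Simple Application (2019/11/04)
Data Analysis Essentials 2020 (1): NumPy Arrays
Data Analysis Essentials 2020 (2): NumPy Summary (with Python attached at the end of the article)
Data Analysis Essentials 2020 (3): Array Shapes and Attributes (with a bonus giveaway)
Data Analysis Essentials (4): Array Conversion, Views, Copies, Indexing, and Broadcasting (the "broadcasting" here is an application of arrays: processing an old mobile phone ringtone)
Data Analysis Essentials 2020 (5): Statistics and Linear Algebra (with NumPy and SciPy)
Data Analysis Essentials 2020 (6): Creating Discrete-Style Copies (using recent pork prices in Beijing as the example)
Data Analysis Essentials 2020 (7): Getting Started with pandas and Its Basic Data Structures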
Without further ado, on to the main content.
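1. How do you query data with pandas?
As the earlier installments showed, a pandas DataFrame is structured much like a table in a relational database, so querying it works in much the same way.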
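To make the analogy concrete, here is a minimal sketch on a tiny stand-in table. The column names `year` and `spots` are placeholders chosen for illustration, and the five values are the yearly means that appear in this article's output for 2014 to 2018. Each pandas expression is paired with the SQL statement it roughly corresponds to.

```python
import pandas as pd

# Hypothetical stand-in table; not the real downloaded dataset.
df = pd.DataFrame(
    {
        "year": [2014, 2015, 2016, 2017, 2018],
        "spots": [113.3, 69.8, 39.8, 21.7, 7.0],
    }
)

# SELECT * FROM df LIMIT 2
first_two = df.head(2)

# SELECT spots FROM df WHERE year >= 2017
recent = df.loc[df["year"] >= 2017, "spots"]

# SELECT * FROM df ORDER BY spots ASC LIMIT 1
quietest = df.sort_values("spots").head(1)

print(first_two)
print(recent.tolist())
print(quietest["year"].iloc[0])
```

The correspondence is loose: pandas rows carry an index and keep their order, which a relational table does not guarantee.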
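Background for the data (you may choose a different dataset as your example): sunspots
The example below uses the well-known sunspot activity data, as shown in the figure (source: Baidu).
Dark regions sometimes appear on the surface of the Sun's photosphere. They are places where the magnetic field is concentrated, and they are called sunspots.
Getting the sunspot data:
Sunspot data website: https://www.quandl.com/data/SIDC/SUNSPOTS_13-Total-Sunspot-Numbers-13-Month
Daily usage limit: data from this site can be downloaded through Python at most 50 times per day.
Command to install the Python module needed to download the sunspot data (press Win+R and run cmd to open a command prompt):
pip install Quandl
or
python -m pip install Quandl
The Quandl API is free to use, and Quandl is available as a Python module, which makes it convenient to import the sunspot data. Note: going over the usage limit requires registration.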
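(1) Download the data: the 13-month smoothed total sunspot number (monthly records from 1749 through October 2019)
(Downloading is slow; please be patient.)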
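Since the download is slow and the number of daily calls is limited, one option is to save the result to a local CSV file and reuse it on later runs. The sketch below is only an illustration: `load_cached` and `fake_fetch` are hypothetical helper names, and `fake_fetch` stands in for the real download so the example runs offline.

```python
import os
import tempfile

import pandas as pd


def load_cached(path, fetch):
    """Return the DataFrame stored at `path`; call `fetch()` only if the file is missing."""
    if os.path.exists(path):
        return pd.read_csv(path, index_col="Date", parse_dates=True)
    df = fetch()
    df.to_csv(path)
    return df


calls = []


def fake_fetch():
    # Stand-in for the real download; records each call so they can be counted.
    calls.append(1)
    idx = pd.DatetimeIndex(["2017-12-31", "2018-12-31"], name="Date")
    return pd.DataFrame({"Yearly Mean Total Sunspot Number": [21.7, 7.0]}, index=idx)


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "sunspots.csv")
    first = load_cached(cache_file, fake_fetch)   # cache miss: fetches and writes the file
    second = load_cached(cache_file, fake_fetch)  # cache hit: reads the file, no fetch

print(len(calls))
```

With the real data, the fetch argument could be something like `lambda: quandl.get("SIDC/SUNSPOTS_A")`, assuming the downloaded frame has an index named `Date` as in the output shown in this article.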
>>> import quandl
>>> # data through 2019: "SIDC/SUNSPOTS_13"; data through 2018: "SIDC/SUNSPOTS_A"
>>> sunspots = quandl.get("SIDC/SUNSPOTS_13")
>>> sunspots
            13-Month Smoothed Total Sunspot Number  ...  Definitive/Provisional Indicator
Date                                                ...
1749-01-31                                     NaN  ...                               1.0
1749-02-28                                     NaN  ...                               1.0
1749-03-31                                     NaN  ...                               1.0
1749-04-30                                     NaN  ...                               1.0
1749-05-31                                     NaN  ...                               1.0
...                                            ...  ...                               ...
2019-06-30                                     NaN  ...                               0.0
2019-07-31                                     NaN  ...                               0.0
2019-08-31                                     NaN  ...                               0.0
2019-09-30                                     NaN  ...                               0.0
2019-10-31                                     NaN  ...                               0.0
[3250 rows x 4 columns]
I also downloaded the yearly series (1700 through 2018), mainly because the data above contains NaN values:
>>> import quandl
>>> sunspots = quandl.get("SIDC/SUNSPOTS_A")
>>> sunspots
            Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1700-12-31                               8.3                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1702-12-31                              26.7                             NaN                     NaN                               1.0
1703-12-31                              38.3                             NaN                     NaN                               1.0
1704-12-31                              60.0                             NaN                     NaN                               1.0
...                                      ...                             ...                     ...                               ...
2014-12-31                             113.3                             8.0                  5273.0                               1.0
2015-12-31                              69.8                             6.4                  8903.0                               1.0
2016-12-31                              39.8                             3.9                  9940.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0
[319 rows x 4 columns]
(2) Showing the first and last few rows of the data
Method:
Use head(n) and tail(n) to return the first n rows and the last n rows of the data, respectively, where n is the number of rows you want.
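As a quick offline illustration of these two methods, the sketch below uses a made-up ten-row frame (the column name `value` is a placeholder) instead of the downloaded data. It also shows two details worth knowing: n defaults to 5 when omitted, and a negative n means "all rows except that many".

```python
import pandas as pd

# Made-up frame with row labels 0..9, standing in for the sunspot data.
df = pd.DataFrame({"value": range(10)})

first_three = df.head(3).index.tolist()    # labels of the first three rows
last_three = df.tail(3).index.tolist()     # labels of the last three rows
default_count = len(df.head())             # n defaults to 5
all_but_last_two = df.head(-2).index.tolist()  # negative n drops rows from the end

print(first_three, last_three, default_count, all_but_last_two)
```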
Suppose n = 4. Then:
import quandl
# data through 2019: "SIDC/SUNSPOTS_13"; data through 2018: "SIDC/SUNSPOTS_A"
sunspots = quandl.get("SIDC/SUNSPOTS_A")
print(sunspots)
print("First four rows:", sunspots.head(4))
print("Last four rows:", sunspots.tail(4))
Output:
>>> print("First four rows:", sunspots.head(4))
First four rows:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1700-12-31                               8.3                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1702-12-31                              26.7                             NaN                     NaN                               1.0
1703-12-31                              38.3                             NaN                     NaN                               1.0
>>> print("Last four rows:", sunspots.tail(4))
Last four rows:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
2015-12-31                              69.8                             6.4                  8903.0                               1.0
2016-12-31                              39.8                             3.9                  9940.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0
(3) Query the most recent record (the last day of 2018)
(Note: this returns a single record, and every record in this dataset is dated the last day of its period, here December 31.)
last_data = sunspots.index[-1]
print("Most recent data:", sunspots.loc[last_data])
Output:
>>> last_data = sunspots.index[-1]
>>> print("Most recent data:", sunspots.loc[last_data])
Most recent data: Yearly Mean Total Sunspot Number        7.0
Yearly Mean Standard Deviation          1.1
Number of Observations              12611.0
Definitive/Provisional Indicator        1.0
Name: 2018-12-31 00:00:00, dtype: float64
(4) Query the data within a given date range
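(Remember: write the bounds in year, month, day order. Label-based slicing includes an endpoint whenever that date exists in the index; in this example neither bound is an index entry, so neither appears in the result.)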
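The endpoint behaviour can be checked on a small stand-in Series. The sketch below uses four of the year-end values from this article's output and `.loc` slicing with date strings; it is an illustration on synthetic data, not the downloaded frame.

```python
import pandas as pd

idx = pd.to_datetime(["2015-12-31", "2016-12-31", "2017-12-31", "2018-12-31"])
s = pd.Series([69.8, 39.8, 21.7, 7.0], index=idx)

# Both bounds are existing index labels, so both rows are included.
exact = s.loc["2016-12-31":"2017-12-31"]

# Neither bound is an index label; only the rows that fall between them are returned.
between = s.loc["2015-06-01":"2018-01-01"]

print(exact.tolist())
print(between.tolist())
```

This differs from ordinary Python list slicing, where the stop position is always excluded.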
Suppose I want the year-end records (the last day of the last month of each year) between August 8, 2008 and January 1, 2018. The dates are written as in the code below:
print("Rows in the given date range:", sunspots["20080808":"20180101"])
Output (every row is dated December 31):
>>> print("Rows in the given date range:", sunspots["20080808":"20180101"])
Rows in the given date range:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
2008-12-31                               4.2                             2.5                  6644.0                               1.0
2009-12-31                               4.8                             2.5                  6465.0                               1.0
2010-12-31                              24.9                             3.4                  6328.0                               1.0
2011-12-31                              80.8                             6.7                  6077.0                               1.0
2012-12-31                              84.5                             6.7                  5753.0                               1.0
2013-12-31                              94.0                             6.9                  5347.0                               1.0
2014-12-31                             113.3                             8.0                  5273.0                               1.0
2015-12-31                              69.8                             6.4                  8903.0                               1.0
2016-12-31                              39.8                             3.9                  9940.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
(5) Query by specified index positions
print("Query by position:", sunspots.iloc[[2, 1, 0, -1, -2]])
Output:
>>> print("Query by position:", sunspots.iloc[[2, 1, 0, -1, -2]])
Query by position:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1702-12-31                              26.7                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1700-12-31                               8.3                             NaN                     NaN                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
Rows come back in the order the positions are listed, so with the positions in ascending order the output is:
>>> print("Query by position:", sunspots.iloc[[-2, -1, 0, 1, 2]])
Query by position:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
2017-12-31                              21.7                             2.5                 11444.0                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0
1700-12-31                               8.3                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1702-12-31                              26.7                             NaN                     NaN                               1.0
(6) Query a specific value
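In other words, look up the value at a given row and column position, much like reading the element at a given row and column of a matrix or array.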
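The sketch below contrasts position-based and label-based scalar lookup on a three-row stand-in frame. The column names `spots` and `indicator` are placeholders; the values are the first three yearly means from this article's output.

```python
import pandas as pd

df = pd.DataFrame(
    {"spots": [8.3, 18.3, 26.7], "indicator": [1.0, 1.0, 1.0]},
    index=pd.to_datetime(["1700-12-31", "1701-12-31", "1702-12-31"]),
)

# Position-based: row 2, column 1 (positions are zero-based).
by_position = df.iloc[1, 0]
# iat reads the same cell but only accepts a single row and column position.
by_position_scalar = df.iat[1, 0]
# Label-based counterpart: at takes an index label and a column name.
by_label = df.at[pd.Timestamp("1701-12-31"), "spots"]

print(by_position, by_position_scalar, by_label)
```

All three expressions read the same cell; `iat` and `at` are intended for single-value access, while `iloc` and `loc` also accept lists and slices.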
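print("Element in row 3, column 4:", sunspots.iloc[2, 3])
print("Element in row 2, column 1:", sunspots.iat[1, 0])
Output:
>>> print("Element in row 3, column 4:", sunspots.iloc[2, 3])
Element in row 3, column 4: 1.0
>>> print("Element in row 2, column 1:", sunspots.iat[1, 0])
Element in row 2, column 1: 18.3
(7) Boolean selection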
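This uses the mean function, mean().
The query below selects every value that is greater than the mean of its column; values that fail the test are shown as NaN: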
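Two kinds of boolean selection are easy to confuse, so here is a minimal sketch on a five-row stand-in frame (column names `spots` and `std` are placeholders; values are the 2014 to 2018 rows from this article's output). A boolean Series filters whole rows, whereas a boolean DataFrame masks individual cells and leaves NaN where the condition fails.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "spots": [113.3, 69.8, 39.8, 21.7, 7.0],
        "std": [8.0, 6.4, 3.9, 2.5, 1.1],
    },
    index=[2014, 2015, 2016, 2017, 2018],
)

# Row filter: keep only the rows whose "spots" value exceeds the column mean.
above_avg_rows = df[df["spots"] > df["spots"].mean()]

# Element-wise mask: same shape as df, NaN wherever a cell is not above its column mean.
masked = df[df > df.mean()]

print(above_avg_rows)
print(masked)
```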
print("Boolean selection", sunspots[sunspots > sunspots.mean()])
Output:
>>> print("Boolean selection", sunspots[sunspots > sunspots.mean()])
Boolean selection             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1700-12-31                               NaN                             NaN                     NaN                               NaN
1701-12-31                               NaN                             NaN                     NaN                               NaN
1702-12-31                               NaN                             NaN                     NaN                               NaN
1703-12-31                               NaN                             NaN                     NaN                               NaN
1704-12-31                               NaN                             NaN                     NaN                               NaN
...                                      ...                             ...                     ...                               ...
2014-12-31                             113.3                             8.0                  5273.0                               NaN
2015-12-31                               NaN                             NaN                  8903.0                               NaN
2016-12-31                               NaN                             NaN                  9940.0                               NaN
2017-12-31                               NaN                             NaN                 11444.0                               NaN
2018-12-31                               NaN                             NaN                 12611.0                               NaN
[319 rows x 4 columns]
2. Statistical calculations with the pandas DataFrame
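For convenience, here are brief descriptions of the statistical functions:
idxmin          index label of the minimum value
idxmax          index label of the maximum value
describe        several summary statistics computed in one call
count           number of non-NA values
min             minimum value
max             maximum value
argmin          integer position of the minimum value (available on Series)
argmax          integer position of the maximum value (available on Series)
sum             sum
mean            mean
median          median (50% quantile)
mad             mean absolute deviation from the mean (removed in pandas 2.0)
var             sample variance
std             sample standard deviation
skew            sample skewness (third moment)
kurt            sample kurtosis (fourth moment)
cumsum          cumulative sum
cummin, cummax  cumulative minimum and cumulative maximum
cumprod         cumulative product
diff            first-order difference
pct_change      percent change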
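Several of the less familiar entries in this list can be tried on a tiny made-up Series, where the results are easy to check by hand. The values and the labels `a` to `d` below are invented for illustration. The last line computes the mean absolute deviation explicitly, which is one way to get the `mad` result on pandas versions where that method no longer exists.

```python
import pandas as pd

s = pd.Series([4.0, 1.0, 3.0, 8.0], index=["a", "b", "c", "d"])

label_of_min = s.idxmin()      # index label of the smallest value
label_of_max = s.idxmax()      # index label of the largest value
position_of_min = s.argmin()   # integer position of the smallest value
position_of_max = s.argmax()   # integer position of the largest value

running_total = s.cumsum()     # cumulative sum
running_max = s.cummax()       # cumulative maximum
step = s.diff()                # difference from the previous element (first is NaN)
rel_change = s.pct_change()    # relative change from the previous element (first is NaN)

# Mean absolute deviation from the mean, written out by hand.
mean_abs_dev = (s - s.mean()).abs().mean()

print(label_of_min, label_of_max, position_of_min, position_of_max)
print(running_total.tolist(), running_max.tolist())
print(step.tolist(), rel_change.tolist(), mean_abs_dev)
```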
The code below demonstrates several of these functions; the others are used in the same way:
import quandl
# Statistical calculations
# Data from http:
# PyPi url https:
sunspots = quandl.get("SIDC/SUNSPOTS_A")
print("Describe", sunspots.describe())
print("Non NaN observations", sunspots.count())
# DataFrame.mad() was removed in pandas 2.0; on newer versions use (sunspots - sunspots.mean()).abs().mean()
print("MAD", sunspots.mad())
print("Median", sunspots.median())
print("Min", sunspots.min())
print("Max", sunspots.max())
print("Mode", sunspots.mode())
print("Standard Deviation", sunspots.std())
print("Variance", sunspots.var())
print("Skewness", sunspots.skew())
print("Kurtosis", sunspots.kurt())
Output:
>>> import quandl
>>> ... ... ...
>>> sunspots = quandl.get("SIDC/SUNSPOTS_A")
>>> print("Describe", sunspots.describe())
Describe        Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
count                        319.000000                      201.000000              201.000000                             319.0
mean                          78.970533                        7.947761             1572.751244                               1.0
std                           62.019871                        3.840522             2667.888556                               0.0
min                            0.000000                        1.100000              150.000000                               1.0
25%                           24.800000                        4.700000              365.000000                               1.0
50%                           65.800000                        7.600000              365.000000                               1.0
75%                          115.750000                       10.400000              366.000000                               1.0
max                          269.300000                       19.100000            12611.000000                               1.0
>>> print("Non NaN observations", sunspots.count())
Non NaN observations Yearly Mean Total Sunspot Number    319
Yearly Mean Standard Deviation      201
Number of Observations              201
Definitive/Provisional Indicator    319
dtype: int64
>>> print("MAD", sunspots.mad())
MAD Yearly Mean Total Sunspot Number      50.954279
Yearly Mean Standard Deviation         3.155848
Number of Observations              1990.750773
Definitive/Provisional Indicator       0.000000
dtype: float64
>>> print("Median", sunspots.median())
Median Yearly Mean Total Sunspot Number     65.8
Yearly Mean Standard Deviation        7.6
Number of Observations              365.0
Definitive/Provisional Indicator      1.0
dtype: float64
>>> print("Min", sunspots.min())
Min Yearly Mean Total Sunspot Number      0.0
Yearly Mean Standard Deviation        1.1
Number of Observations              150.0
Definitive/Provisional Indicator      1.0
dtype: float64
>>> print("Max", sunspots.max())
Max Yearly Mean Total Sunspot Number      269.3
Yearly Mean Standard Deviation         19.1
Number of Observations              12611.0
Definitive/Provisional Indicator        1.0
dtype: float64
>>> print("Mode", sunspots.mode())
Mode    Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
0                              18.3                             9.2                   365.0                               1.0
>>> print("Standard Deviation", sunspots.std())
Standard Deviation Yearly Mean Total Sunspot Number      62.019871
Yearly Mean Standard Deviation         3.840522
Number of Observations              2667.888556
Definitive/Provisional Indicator       0.000000
dtype: float64
>>> print("Variance", sunspots.var())
Variance Yearly Mean Total Sunspot Number    3.846464e+03
Yearly Mean Standard Deviation      1.474961e+01
Number of Observations              7.117629e+06
Definitive/Provisional Indicator    0.000000e+00
dtype: float64
>>> print("Skewness", sunspots.skew())
Skewness Yearly Mean Total Sunspot Number    0.810785
Yearly Mean Standard Deviation      0.546692
Number of Observations              1.972382
Definitive/Provisional Indicator    0.000000
dtype: float64
>>> print("Kurtosis", sunspots.kurt())
Kurtosis Yearly Mean Total Sunspot Number    -0.127610
Yearly Mean Standard Deviation      -0.252353
Number of Observations               2.728810
Definitive/Provisional Indicator     0.000000
dtype: float64
Coming next: how do you aggregate data?