2020 Data Analysis Essentials (8): Querying Data and Statistical Analysis with pandas (Short but Powerful)

2021-03-02 AI科技與算法編程

Learning data analysis seriously and systematically

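This article continues the Python data analysis series. Earlier installments, for review:

Data Analysis Kickoff: A Simple Application (2019/11/04)

2020 Data Analysis Essentials (1): NumPy Arrays

2020 Data Analysis Essentials (2): NumPy Summary (Python Attached at the End of the Article)

2020 Data Analysis Essentials (3): Array Shapes and Attributes (With a Giveaway)

Data Analysis Essentials (4): Array Conversion, Views, Copies, Indexing, and Broadcasting (the "broadcasting" here is an array application: processing old mobile-phone ringtones)

2020 Data Analysis Essentials (5): Statistics and Linear Algebra (with NumPy and SciPy)

2020 Data Analysis Essentials (6): Creating Discrete Copies (Using Recent Pork Prices in Beijing as an Example)

2020 Data Analysis Essentials (7): Getting Started with pandas and Data Structure Basics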

Without further ado, on to the material.

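1. How do you query data with pandas?

From the earlier installments we already know that a pandas DataFrame is structured much like a table in a relational database, so querying it works in much the same way.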
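To make the comparison with a relational database a little more concrete, here is a minimal sketch. The table and its column names (`year`, `mean_sunspots`) are made up for illustration; the values are copied from the yearly outputs shown later in this article. It mimics a simple `SELECT ... WHERE ...` query with a boolean filter and a column list.

```python
import pandas as pd

# Hypothetical table; column names are assumptions, values come from this article's outputs.
table = pd.DataFrame({
    "year": [2014, 2015, 2016, 2017, 2018],
    "mean_sunspots": [113.3, 69.8, 39.8, 21.7, 7.0],
})

# Roughly: SELECT year, mean_sunspots FROM table WHERE mean_sunspots > 50
active_years = table.loc[table["mean_sunspots"] > 50, ["year", "mean_sunspots"]]
print(active_years)
```

The row condition plays the role of the WHERE clause, and the column list plays the role of the SELECT list.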

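Background data (you may choose another dataset for the example): sunspots

The example below uses the well-known sunspot dataset.

[Figure: sunspots. Image source: Baidu]

Dark regions sometimes appear on the Sun's photosphere. They are places where the magnetic field is concentrated, and they are called sunspots.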

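Getting the sunspot data:

Sunspot data site: https://www.quandl.com/data/SIDC/SUNSPOTS_13-Total-Sunspot-Numbers-13-Month

Daily usage limit: downloading data from this site with Python is limited to 50 calls per day.

Command for installing the module needed to download the sunspot data with Python (press Win+R and run cmd to open a command prompt):

pip install Quandl or python -m pip install Quandl

The Quandl API is free to use, and for convenience it is available as a Python module that can be imported to fetch the sunspot data. Note: going over the usage limit requires registration.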

(1) Downloading the data: the 13-month smoothed total sunspot number (the series runs through 2019)

(The download is slow, so please be patient.)

>>> import quandl
>>> sunspots = quandl.get("SIDC/SUNSPOTS_13")  # series through 2019: "SIDC/SUNSPOTS_13"; series through 2018: "SIDC/SUNSPOTS_A"
>>> sunspots
            13-Month Smoothed Total Sunspot Number  ...  Definitive/Provisional Indicator
Date                                                ...
1749-01-31                                     NaN  ...                               1.0
1749-02-28                                     NaN  ...                               1.0
1749-03-31                                     NaN  ...                               1.0
1749-04-30                                     NaN  ...                               1.0
1749-05-31                                     NaN  ...                               1.0
...                                            ...  ...                               ...
2019-06-30                                     NaN  ...                               0.0
2019-07-31                                     NaN  ...                               0.0
2019-08-31                                     NaN  ...                               0.0
2019-09-30                                     NaN  ...                               0.0
2019-10-31                                     NaN  ...                               0.0

[3250 rows x 4 columns]

I also downloaded the yearly series, which runs through 2018, mainly because the data above contains NaN (missing) values.

>>> import quandl
>>> sunspots = quandl.get("SIDC/SUNSPOTS_A")
>>> sunspots
            Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1700-12-31                               8.3                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1702-12-31                              26.7                             NaN                     NaN                               1.0
1703-12-31                              38.3                             NaN                     NaN                               1.0
1704-12-31                              60.0                             NaN                     NaN                               1.0
...                                      ...                             ...                     ...                               ...
2014-12-31                             113.3                             8.0                  5273.0                               1.0
2015-12-31                              69.8                             6.4                  8903.0                               1.0
2016-12-31                              39.8                             3.9                  9940.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0

[319 rows x 4 columns]
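Since missing values came up, here is a small sketch of how one might count and drop them. It does not call Quandl; it builds a tiny stand-in DataFrame whose short column names (`mean_sunspots`, `std_dev`) are assumptions, with values copied from the yearly output above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the yearly series (column names are assumptions).
dates = pd.to_datetime(["1700-12-31", "1701-12-31", "2017-12-31", "2018-12-31"])
sample = pd.DataFrame(
    {"mean_sunspots": [8.3, 18.3, 21.7, 7.0], "std_dev": [np.nan, np.nan, 2.5, 1.1]},
    index=dates,
)

missing_per_column = sample.isna().sum()  # number of NaN values in each column
complete_rows = sample.dropna()           # keep only rows without any NaN

print(missing_per_column)
print(complete_rows)
```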

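(2) Viewing the first and last few rows of the data

Method:

The functions head(n) and tail(n) return the first n rows and the last n rows of the data respectively, where n is the number of rows you want.

With n=4, that gives:

import quandl

sunspots = quandl.get("SIDC/SUNSPOTS_A")  # series through 2019: "SIDC/SUNSPOTS_13"; series through 2018: "SIDC/SUNSPOTS_A"
print(sunspots)
print("First four rows:", sunspots.head(4))
print("Last four rows:", sunspots.tail(4))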

Output:

>>> print("First four rows:", sunspots.head(4))
First four rows:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1700-12-31                               8.3                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1702-12-31                              26.7                             NaN                     NaN                               1.0
1703-12-31                              38.3                             NaN                     NaN                               1.0
>>> print("Last four rows:", sunspots.tail(4))
Last four rows:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
2015-12-31                              69.8                             6.4                  8903.0                               1.0
2016-12-31                              39.8                             3.9                  9940.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0
>>>
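If you want to try head() and tail() without waiting for the download, a minimal offline sketch follows. The DataFrame is a hypothetical stand-in with an assumed column name (`mean_sunspots`); the values are the last five yearly means from the output above.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded data (column name is an assumption).
dates = pd.to_datetime(
    ["2014-12-31", "2015-12-31", "2016-12-31", "2017-12-31", "2018-12-31"]
)
sample = pd.DataFrame({"mean_sunspots": [113.3, 69.8, 39.8, 21.7, 7.0]}, index=dates)

first_two = sample.head(2)    # first 2 rows
last_two = sample.tail(2)     # last 2 rows
default_head = sample.head()  # with no argument, n defaults to 5

print(first_two)
print(last_two)
```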

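(3) Querying the most recent record: the last day of 2018

(Note: this returns a single record. The dates in the index always fall on the last day of a period, which is December 31 for this yearly series.)

last_data = sunspots.index[-1]
print("Latest data:", sunspots.loc[last_data])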

Output:

>>> last_data = sunspots.index[-1]
>>> print("Latest data:", sunspots.loc[last_data])
Latest data: Yearly Mean Total Sunspot Number        7.0
Yearly Mean Standard Deviation          1.1
Number of Observations              12611.0
Definitive/Provisional Indicator        1.0
Name: 2018-12-31 00:00:00, dtype: float64
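The lookup above goes through the index label. As a rough illustration, the sketch below shows that the same row can also be reached by position with iloc[-1]. The data is a hypothetical three-row stand-in with assumed column names, using values from the yearly output.

```python
import pandas as pd

# Hypothetical stand-in (column names are assumptions).
dates = pd.to_datetime(["2016-12-31", "2017-12-31", "2018-12-31"])
sample = pd.DataFrame(
    {"mean_sunspots": [39.8, 21.7, 7.0], "observations": [9940.0, 11444.0, 12611.0]},
    index=dates,
)

latest_label = sample.index[-1]      # the last index label, a Timestamp
by_label = sample.loc[latest_label]  # label-based lookup, returns a Series
by_position = sample.iloc[-1]        # position-based lookup of the same row

print(by_label)
```

Going through the label is handy when you also want to report the date itself; going by position is shorter when you only need the row.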

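(4) Querying data within a date range

(Remember: give the dates in year-month-day order. Label-based slices in pandas include both endpoints when those labels exist in the index. Here neither endpoint date is in the index, so the result contains only the rows that fall between them.)

Suppose I want the year-end records between August 8, 2008 and January 1, 2018. The dates are written in YYYYMMDD format.

Code:

print("Data in the given date range:", sunspots["20080808":"20180101"])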

Output (note that every row is dated December 31):

>>> print("Data in the given date range:", sunspots["20080808":"20180101"])
Data in the given date range:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
2008-12-31                               4.2                             2.5                  6644.0                               1.0
2009-12-31                               4.8                             2.5                  6465.0                               1.0
2010-12-31                              24.9                             3.4                  6328.0                               1.0
2011-12-31                              80.8                             6.7                  6077.0                               1.0
2012-12-31                              84.5                             6.7                  5753.0                               1.0
2013-12-31                              94.0                             6.9                  5347.0                               1.0
2014-12-31                             113.3                             8.0                  5273.0                               1.0
2015-12-31                              69.8                             6.4                  8903.0                               1.0
2016-12-31                              39.8                             3.9                  9940.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
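To see how the endpoints of a date slice behave, here is a small offline sketch. It uses a hypothetical four-row stand-in (assumed column name `mean_sunspots`, values copied from the output above), writes the dates in ISO form, and uses .loc for the slice, which is a choice made for this sketch rather than what the article's code does.

```python
import pandas as pd

# Hypothetical miniature of the yearly series (column name is an assumption).
dates = pd.to_datetime(["2015-12-31", "2016-12-31", "2017-12-31", "2018-12-31"])
sample = pd.DataFrame({"mean_sunspots": [69.8, 39.8, 21.7, 7.0]}, index=dates)

# Both endpoint labels exist in the index, so both rows are returned.
inclusive = sample.loc["2016-12-31":"2017-12-31"]

# Neither endpoint exists in the index; only the rows between them come back.
between = sample.loc["2015-01-01":"2017-01-01"]

# Partial date strings work too: everything dated within 2016 through 2017.
by_year = sample.loc["2016":"2017"]

print(inclusive)
print(between)
print(by_year)
```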

(5) Querying by index position

print("Query by position:", sunspots.iloc[[2,1,0,-1,-2]])

Output:

>>> print("Query by position:", sunspots.iloc[[2,1,0,-1,-2]])
Query by position:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1702-12-31                              26.7                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1700-12-31                               8.3                             NaN                     NaN                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0
2017-12-31                              21.7                             2.5                 11444.0                               1.0
>>>

The rows come back in the order the positions are listed, so the list can be rearranged to get a different order:

>>> print("Query by position:", sunspots.iloc[[-2,-1,0,1,2]])
Query by position:             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
2017-12-31                              21.7                             2.5                 11444.0                               1.0
2018-12-31                               7.0                             1.1                 12611.0                               1.0
1700-12-31                               8.3                             NaN                     NaN                               1.0
1701-12-31                              18.3                             NaN                     NaN                               1.0
1702-12-31                              26.7                             NaN                     NaN                               1.0
>>>
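Besides a list of positions, iloc also accepts a positional slice. The sketch below contrasts the two on a hypothetical five-row stand-in (assumed column name, values from the yearly output). Unlike the label-based date slice, a positional slice leaves out its stop position.

```python
import pandas as pd

# Hypothetical stand-in (column name is an assumption).
dates = pd.to_datetime(
    ["2014-12-31", "2015-12-31", "2016-12-31", "2017-12-31", "2018-12-31"]
)
sample = pd.DataFrame({"mean_sunspots": [113.3, 69.8, 39.8, 21.7, 7.0]}, index=dates)

picked = sample.iloc[[2, 0, -1]]  # rows returned in exactly the listed order
window = sample.iloc[1:3]         # positions 1 and 2; the stop position 3 is excluded

print(picked)
print(window)
```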

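(6) Querying a single value

In other words, look up the value at a given row and column position, much like looking up an element of a matrix or array by its row and column. Positions are zero-based, so row 3, column 4 is written [2, 3].

print("Element at row 3, column 4:", sunspots.iloc[2, 3])
print("Element at row 2, column 1:", sunspots.iat[1, 0])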

Output:

>>> print("Element at row 3, column 4:", sunspots.iloc[2, 3])
Element at row 3, column 4: 1.0
>>> print("Element at row 2, column 1:", sunspots.iat[1, 0])
Element at row 2, column 1: 18.3
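For a single value, iloc and iat take positions, while at takes labels. Here is a minimal sketch on a hypothetical three-row stand-in (assumed column names, values from the yearly output) showing that all three reach the same cell.

```python
import pandas as pd

# Hypothetical stand-in (column names are assumptions).
dates = pd.to_datetime(["2016-12-31", "2017-12-31", "2018-12-31"])
sample = pd.DataFrame(
    {"mean_sunspots": [39.8, 21.7, 7.0], "observations": [9940.0, 11444.0, 12611.0]},
    index=dates,
)

value_iloc = sample.iloc[1, 0]  # second row, first column (zero-based positions)
value_iat = sample.iat[1, 0]    # same cell; iat is meant for single-value access
value_at = sample.at[pd.Timestamp("2017-12-31"), "mean_sunspots"]  # same cell by label

print(value_iloc, value_iat, value_at)
```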

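(7) Boolean selection

This uses the mean function: mean()

The query below keeps the values that are greater than the mean of their column. Values that do not meet the condition show up as NaN.

print("Boolean selection", sunspots[sunspots > sunspots.mean()])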

Output:

>>> print("Boolean selection", sunspots[sunspots > sunspots.mean()])
Boolean selection             Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
Date
1700-12-31                               NaN                             NaN                     NaN                               NaN
1701-12-31                               NaN                             NaN                     NaN                               NaN
1702-12-31                               NaN                             NaN                     NaN                               NaN
1703-12-31                               NaN                             NaN                     NaN                               NaN
1704-12-31                               NaN                             NaN                     NaN                               NaN
...                                      ...                             ...                     ...                               ...
2014-12-31                             113.3                             8.0                  5273.0                               NaN
2015-12-31                               NaN                             NaN                  8903.0                               NaN
2016-12-31                               NaN                             NaN                  9940.0                               NaN
2017-12-31                               NaN                             NaN                 11444.0                               NaN
2018-12-31                               NaN                             NaN                 12611.0                               NaN

[319 rows x 4 columns]
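The output above keeps every row and blanks out the cells that fail the test. If what you want is only the rows where one column is above its mean, a one-column condition does that. The sketch below shows both variants on a hypothetical five-row stand-in (assumed column names, values from the yearly output).

```python
import pandas as pd

# Hypothetical stand-in (column names are assumptions).
dates = pd.to_datetime(
    ["2014-12-31", "2015-12-31", "2016-12-31", "2017-12-31", "2018-12-31"]
)
sample = pd.DataFrame(
    {
        "mean_sunspots": [113.3, 69.8, 39.8, 21.7, 7.0],
        "observations": [5273.0, 8903.0, 9940.0, 11444.0, 12611.0],
    },
    index=dates,
)

# DataFrame-wide mask: same shape as the input, NaN where the test is False.
masked = sample[sample > sample.mean()]

# One-column mask: drops whole rows that fail the test.
above_avg_rows = sample[sample["mean_sunspots"] > sample["mean_sunspots"].mean()]

print(masked)
print(above_avg_rows)
```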

2. Statistical calculations with the pandas DataFrame

For convenience, here are brief descriptions of the statistical functions:

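idxmin index label of the minimum value

idxmax index label of the maximum value

describe several summary statistics at once

count number of non-NA values

min minimum value

max maximum value

argmin integer position of the minimum value (a Series method)

argmax integer position of the maximum value (a Series method)

sum sum

mean mean

median median

mad mean absolute deviation from the mean (removed in pandas 2.0)

var sample variance

std sample standard deviation

skew sample skewness (third moment)

kurt sample kurtosis (fourth moment)

cumsum cumulative sum

cummin, cummax cumulative minimum and cumulative maximum

cumprod cumulative product

diff first-order difference

pct_change percentage change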
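Several entries in the list above (the cumulative functions, diff, pct_change, idxmax and idxmin) are easier to grasp on a tiny example. The series below is hypothetical, with numbers chosen so the arithmetic can be checked by hand; it is not sunspot data.

```python
import pandas as pd

# Hypothetical yearly counts, chosen so the arithmetic is easy to verify.
counts = pd.Series([10.0, 20.0, 15.0, 30.0], index=[2015, 2016, 2017, 2018])

print(counts.cumsum())      # running total: 10, 30, 45, 75
print(counts.cummax())      # running maximum: 10, 20, 20, 30
print(counts.diff())        # change from the previous year (first entry is NaN)
print(counts.pct_change())  # relative change from the previous year (first entry is NaN)
print(counts.idxmax(), counts.idxmin())  # index labels of the largest and smallest values
```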

A few examples of the functions above follow; the others are used in the same way. Code:

import quandl

# Statistical calculations
# Data from http:
# PyPi url https:
sunspots = quandl.get("SIDC/SUNSPOTS_A")
print("Describe", sunspots.describe())
print("Non-NaN observations", sunspots.count())
# Note: DataFrame.mad() was removed in pandas 2.0, so this line needs an older pandas.
print("Mean absolute deviation (MAD)", sunspots.mad())
print("Median", sunspots.median())
print("Min", sunspots.min())
print("Max", sunspots.max())
print("Mode", sunspots.mode())
print("Standard deviation", sunspots.std())
print("Variance", sunspots.var())
print("Skewness", sunspots.skew())
print("Kurtosis", sunspots.kurt())
Output:

>>> import quandl
>>> ... ... ...
>>> sunspots = quandl.get("SIDC/SUNSPOTS_A")
>>> print("Describe", sunspots.describe())
Describe        Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
count                        319.000000                      201.000000              201.000000                             319.0
mean                          78.970533                        7.947761             1572.751244                               1.0
std                           62.019871                        3.840522             2667.888556                               0.0
min                            0.000000                        1.100000              150.000000                               1.0
25%                           24.800000                        4.700000              365.000000                               1.0
50%                           65.800000                        7.600000              365.000000                               1.0
75%                          115.750000                       10.400000              366.000000                               1.0
max                          269.300000                       19.100000            12611.000000                               1.0
>>> print("Non-NaN observations", sunspots.count())
Non-NaN observations Yearly Mean Total Sunspot Number    319
Yearly Mean Standard Deviation      201
Number of Observations              201
Definitive/Provisional Indicator    319
dtype: int64
>>> print("Mean absolute deviation (MAD)", sunspots.mad())
Mean absolute deviation (MAD) Yearly Mean Total Sunspot Number      50.954279
Yearly Mean Standard Deviation         3.155848
Number of Observations              1990.750773
Definitive/Provisional Indicator       0.000000
dtype: float64
>>> print("Median", sunspots.median())
Median Yearly Mean Total Sunspot Number     65.8
Yearly Mean Standard Deviation        7.6
Number of Observations              365.0
Definitive/Provisional Indicator      1.0
dtype: float64
>>> print("Min", sunspots.min())
Min Yearly Mean Total Sunspot Number      0.0
Yearly Mean Standard Deviation        1.1
Number of Observations              150.0
Definitive/Provisional Indicator      1.0
dtype: float64
>>> print("Max", sunspots.max())
Max Yearly Mean Total Sunspot Number      269.3
Yearly Mean Standard Deviation         19.1
Number of Observations              12611.0
Definitive/Provisional Indicator        1.0
dtype: float64
>>> print("Mode", sunspots.mode())
Mode    Yearly Mean Total Sunspot Number  Yearly Mean Standard Deviation  Number of Observations  Definitive/Provisional Indicator
0                              18.3                             9.2                   365.0                               1.0
>>> print("Standard deviation", sunspots.std())
Standard deviation Yearly Mean Total Sunspot Number      62.019871
Yearly Mean Standard Deviation         3.840522
Number of Observations              2667.888556
Definitive/Provisional Indicator       0.000000
dtype: float64
>>> print("Variance", sunspots.var())
Variance Yearly Mean Total Sunspot Number    3.846464e+03
Yearly Mean Standard Deviation      1.474961e+01
Number of Observations              7.117629e+06
Definitive/Provisional Indicator    0.000000e+00
dtype: float64
>>> print("Skewness", sunspots.skew())
Skewness Yearly Mean Total Sunspot Number    0.810785
Yearly Mean Standard Deviation      0.546692
Number of Observations              1.972382
Definitive/Provisional Indicator    0.000000
dtype: float64
>>> print("Kurtosis", sunspots.kurt())
Kurtosis Yearly Mean Total Sunspot Number   -0.127610
Yearly Mean Standard Deviation     -0.252353
Number of Observations              2.728810
Definitive/Provisional Indicator    0.000000
dtype: float64
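One practical note on the MAD figure: the mad() method was removed in pandas 2.0, so on a recent install that call fails. The mean absolute deviation is simple enough to compute by hand, as in the sketch below. The data is a hypothetical five-row stand-in (assumed column name, values from the yearly output), so the result will not match the full-series figure above.

```python
import pandas as pd

# Hypothetical stand-in (column name is an assumption).
sample = pd.DataFrame({"mean_sunspots": [113.3, 69.8, 39.8, 21.7, 7.0]})

# Mean absolute deviation: average distance of each value from its column mean.
mad_by_hand = (sample - sample.mean()).abs().mean()

print(mad_by_hand)
```

For this small sample the column mean is 50.32 and the mean absolute deviation works out to 32.984.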
Coming up next: how do you aggregate data?