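I haven't been feeling well lately and paused updates for a few days. My apologies, and I hope you will all bear with me. On top of that, I have started working on to-C business at my company, which means handling fairly large datasets, so I have been busy learning and using pandas and SQL. Over the coming posts I will summarize and organize some commonly used knowledge points.
Topic of this section: two common ways to add new column(s) to a pd.DataFrame
Adding new column(s): direct assignment
Direct assignment is fairly simple, so let's go straight to the code.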
import pandas as pd
import numpy as np

df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=('Portland', 'Berkeley'))
df

          temp_c
Portland    17.0
Berkeley    25.0

df['temp_f'] = df.temp_c * 9 / 5 + 32
# equivalent to df['temp_f'] = df['temp_c'] * 9 / 5 + 32
df

          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

Direct assignment works much like key-value assignment on a dict: if the column does not exist, a new column is created; if the column already exists, its values are updated.
Adding new column(s): the assign method
assign is a built-in method of the pd.DataFrame class, used to add or update the column(s) of a pd.DataFrame.
This is the main topic for today.
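Before reading the source, here is a small sketch of the "add or update" behavior just described. It reuses the temp_c data from above; the variable names added and updated are only my own labels for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=['Portland', 'Berkeley'])

# temp_f does not exist yet, so assign adds it as a new column.
added = df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)
print(list(added.columns))         # ['temp_c', 'temp_f']

# temp_c already exists, so assign replaces its values in the result.
updated = df.assign(temp_c=df['temp_c'] + 1)
print(updated['temp_c'].tolist())  # [18.0, 26.0]
```

In both cases the keyword name picks the column and the keyword value supplies the data.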
Source definition of assign:
class DataFrame(NDFrame):
    def assign(self, **kwargs) -> "DataFrame":
        r"""
        Assign new columns to a DataFrame.

        Returns a new object with all original columns in addition to new ones.
        Existing columns that are re-assigned will be overwritten.

        Parameters
        ----------
        **kwargs : dict of {str: callable or Series}
            The column names are keywords. If the values are
            callable, they are computed on the DataFrame and
            assigned to the new columns. The callable must not
            change input DataFrame (though pandas doesn't check it).
            If the values are not callable, (e.g. a Series, scalar, or array),
            they are simply assigned.

        Returns
        -------
        DataFrame
            A new DataFrame with the new columns in addition to
            all the existing columns.

        Notes
        -----
        Assigning multiple columns within the same ``assign`` is possible.
        Later items in '\*\*kwargs' may refer to newly created or modified
        columns in 'df'; items are computed and assigned into 'df' in order.

        .. versionchanged:: 0.23.0

           Keyword argument order is maintained.

        Examples
        --------
        >>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
        ...                   index=['Portland', 'Berkeley'])
        >>> df
                  temp_c
        Portland    17.0
        Berkeley    25.0

        Where the value is a callable, evaluated on `df`:

        >>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
                  temp_c  temp_f
        Portland    17.0    62.6
        Berkeley    25.0    77.0

        Alternatively, the same behavior can be achieved by directly
        referencing an existing Series or sequence:

        >>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)
                  temp_c  temp_f
        Portland    17.0    62.6
        Berkeley    25.0    77.0

        You can create multiple columns within the same assign where one
        of the columns depends on another one defined within the same assign:

        >>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
        ...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
                  temp_c  temp_f  temp_k
        Portland    17.0    62.6  290.15
        Berkeley    25.0    77.0  298.15
        """
Basic explanation
What it does:
(1) "Returns a new object with all original columns in addition to new ones." assign returns a pd.DataFrame that keeps all the column(s) of the original DataFrame and adds the new column(s).
(2) "Existing columns that are re-assigned will be overwritten." If a column named in assign already exists, the values in that column are updated.
Parameters: "**kwargs : dict of {str: callable or Series}. The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn't check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned."
First, using **kwargs is a sensible design: the parameter name gives the column name, and the parameter value gives the column's values.
Second, the parameter value may be a callable, not just an existing, already computed data structure. This brings in the idea of first-class functions again: pandas adds a layer of wrapping that applies the callable to the current pd.DataFrame object and assigns the resulting data to the newly created column.
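To make that "layer of wrapping" concrete, here is a minimal sketch of the idea. The function assign_sketch is a hypothetical name of my own and a simplification, not the pandas source; it only mirrors what the docstring describes (copy the frame, then handle each keyword in order, calling the value if it is callable). The 'source' column is made up for the demo.

```python
import pandas as pd


def assign_sketch(frame, **kwargs):
    """Hypothetical, simplified stand-in for DataFrame.assign (not pandas source)."""
    result = frame.copy()               # work on a copy so the input stays untouched
    for name, value in kwargs.items():  # keywords are handled in the order given
        # A callable is called with the frame built so far; anything else is used as-is.
        result[name] = value(result) if callable(value) else value
    return result


df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=['Portland', 'Berkeley'])

ours = assign_sketch(df, temp_f=lambda x: x.temp_c * 9 / 5 + 32, source='sensor')
theirs = df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32, source='sensor')
print(ours.equals(theirs))
```

The comparison at the end is just a sanity check that the sketch and the real method agree on this small input.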
Code examples
Code example 1
import pandas as pd
import numpy as np

df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=('Portland', 'Berkeley'))
df

          temp_c
Portland    17.0
Berkeley    25.0

df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)

          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

df

          temp_c
Portland    17.0
Berkeley    25.0

The code is simple and needs little explanation: the parameter name temp_f becomes the name of the new column, and the parameter value df['temp_c'] * 9 / 5 + 32 supplies the values of the new column.
Note one thing: the original df is unchanged.
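Since the original df is unchanged, the result has to be kept somewhere. A small sketch, assuming you want the same name df to carry the new column; the names unchanged and rebound exist only for this demo.

```python
import pandas as pd

df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=['Portland', 'Berkeley'])

df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)       # return value is thrown away
unchanged = list(df.columns)

df = df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)  # rebind the name to keep the result
rebound = list(df.columns)

print(unchanged)  # ['temp_c']
print(rebound)    # ['temp_c', 'temp_f']
```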
Code example 2
import pandas as pd
import numpy as np

df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=('Portland', 'Berkeley'))
df

          temp_c
Portland    17.0
Berkeley    25.0

temp_f = lambda x: x.temp_c * 9 / 5 + 32
print(temp_f(df))
print(type(temp_f(df)))

Portland    62.6
Berkeley    77.0
Name: temp_c, dtype: float64
<class 'pandas.core.series.Series'>

df.assign(temp_f=temp_f)

          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

temp_f = lambda x: list(x.temp_c * 9 / 5 + 32)
print(temp_f(df))
print(type(temp_f(df)))
print(df.assign(temp_f=temp_f))

[62.6, 77.0]
<class 'list'>
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

Interesting, isn't it? Note the following three points:
(1) It is still the "parameter name = parameter value" form of assignment, where the parameter name defines the name of the new (or to-be-updated) column.
(2) Here temp_f is an anonymous function, and its argument is the current pd.DataFrame object, which assign passes in for you.
(3) How you define the callable is up to you. The first time I simply used arithmetic on a column of the pd.DataFrame object, and the second time I converted the result to a list. You can of course add more complex logic, but usually the callable computes some statistic from several columns of the current pd.DataFrame object.
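As an illustration of the last point, here is a sketch where the callables compute simple statistics from the current DataFrame's columns. The columns temp_am and temp_pm and their values are made up for this example.

```python
import pandas as pd

# Hypothetical readings: morning and evening temperatures per city.
df = pd.DataFrame({'temp_am': [12.0, 18.0], 'temp_pm': [22.0, 32.0]},
                  index=['Portland', 'Berkeley'])

result = df.assign(
    # row-wise mean of the two temperature columns
    temp_mean=lambda x: x[['temp_am', 'temp_pm']].mean(axis=1),
    # each city's evening reading minus the evening average over all cities
    pm_vs_avg=lambda x: x['temp_pm'] - x['temp_pm'].mean(),
)
print(result)
```

Each lambda receives the DataFrame as x, so it can mix row-wise and column-wise calculations freely.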
Code example 3
import pandas as pd
import numpy as np

df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=('Portland', 'Berkeley'))
temp_f = lambda x: list(x.temp_c * 9 / 5 + 32)
temp_k = lambda x: (x['temp_f'] + 459.67) * 5 / 9
df.assign(temp_f=temp_f, temp_k=temp_k)

          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15

This looks complicated, but it is actually simple. The point to note: "one of the columns depends on another one defined within the same assign", that is, among the columns being added, the values of one column can be generated from another column created earlier in the same call.
An interesting example
df = pd.DataFrame({'col1': list('abcde'), 'col2': range(5, 10), 'col3': [1.3, 2.5, 3.6, 4.6, 5.8]},
                  index=range(1, 6))
df

  col1  col2  col3
1    a     5   1.3
2    b     6   2.5
3    c     7   3.6
4    d     8   4.6
5    e     9   5.8

df.assign(C=pd.Series(list('def')))

  col1  col2  col3    C
1    a     5   1.3    e
2    b     6   2.5    f
3    c     7   3.6  NaN
4    d     8   4.6  NaN
5    e     9   5.8  NaN

Question: why do NaN values appear? (Hint: index alignment.) The indexes on the two sides of assign are different, so which one decides the index of the result?
df = pd.DataFrame({'col1': list('abcde'), 'col2': range(5, 10), 'col3': [1.3, 2.5, 3.6, 4.6, 5.8]},
                  index=range(1, 6))
df

  col1  col2  col3
1    a     5   1.3
2    b     6   2.5
3    c     7   3.6
4    d     8   4.6
5    e     9   5.8

pd.Series(list('def'))

0    d
1    e
2    f
dtype: object

df["C"] = pd.Series(list('def'))
df

  col1  col2  col3    C
1    a     5   1.3    e
2    b     6   2.5    f
3    c     7   3.6  NaN
4    d     8   4.6  NaN
5    e     9   5.8  NaN

OK, I think you get it by now.
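To spell out the answer: the DataFrame's index decides, and the Series is matched to it by label. The sketch below repeats the alignment on a smaller frame and shows two ways to control it. The relabeled Series and the list 'vwxyz' are my own illustrative inputs.

```python
import pandas as pd

df = pd.DataFrame({'col1': list('abcde')}, index=range(1, 6))
s = pd.Series(list('def'))  # default index 0, 1, 2

# Only labels 1 and 2 exist on both sides, so only those rows get values:
# 'e', 'f', then three missing values.
aligned = df.assign(C=s)
print(aligned['C'].tolist())

# Giving the Series the labels you intend makes the alignment explicit:
# 'd', 'e', 'f', then two missing values.
relabeled = df.assign(C=pd.Series(list('def'), index=[1, 2, 3]))
print(relabeled['C'].tolist())

# A plain list has no index, so it is placed by position (length must match).
by_position = df.assign(C=list('vwxyz'))
print(by_position['C'].tolist())  # ['v', 'w', 'x', 'y', 'z']
```

The value 'd' disappears in the first case because its label 0 does not exist in df's index.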
References
[1] pandas-docs reference link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html
Motto: keep advancing one step a day, and no effort is wasted.