The assign method of pd.DataFrame

2021-03-02 xiao樂

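I haven't been feeling well lately and paused updates for a few days. Sorry about that, and thanks for bearing with me. Also, at work I have started on to-C business, which means handling fairly large datasets, so I have been busy learning and using pandas and SQL. Over the coming posts I will summarize and organize some commonly used topics.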

Topic of this post: two common ways to add new column(s) to a pd.DataFrame

Adding new column(s): direct assignment

This one is fairly simple, so straight to the code

import pandas as pd
import numpy as np
df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=('Portland', 'Berkeley'))
df
          temp_c
Portland    17.0
Berkeley    25.0
df['temp_f'] = df.temp_c * 9 / 5 + 32
# equivalent to df['temp_f'] = df['temp_c'] * 9 / 5 + 32
df
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
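Direct assignment works much like key-value assignment on a dict.
If the column does not exist, a new column is created; if the column already exists, its values are updated.

Adding new column(s): the assign method

assign is a built-in method of the pd.DataFrame class, used to add or update column(s) of a pd.DataFrame.
This is the main topic of today's post.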
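To make the create-or-update behaviour concrete, here is a minimal sketch that reuses the temperature data from above. The `round()` step and the name `shifted` are only there for illustration; they are not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'temp_c': [17.0, 25.0]}, index=['Portland', 'Berkeley'])

# Direct assignment, column missing: the column is created on df itself.
df['temp_f'] = df['temp_c'] * 9 / 5 + 32
# Direct assignment, column present: its values are overwritten on df itself.
df['temp_f'] = df['temp_f'].round()

# assign with an existing column name: overwritten only in the returned copy.
shifted = df.assign(temp_c=df['temp_c'] + 1)

print(df['temp_f'].tolist())
print(df['temp_c'].tolist())       # df keeps its original temp_c values
print(shifted['temp_c'].tolist())  # only the returned DataFrame is changed
```

Both routes follow the same add-or-overwrite rule; the difference is whether the change lands on `df` itself or on a returned DataFrame.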
How assign is defined in the code:

class DataFrame(NDFrame):

    def assign(self, **kwargs) -> "DataFrame":
        r"""
        Assign new columns to a DataFrame.

        Returns a new object with all original columns in addition to new ones.
        Existing columns that are re-assigned will be overwritten.

        Parameters
        ----------
        **kwargs : dict of {str: callable or Series}
            The column names are keywords. If the values are
            callable, they are computed on the DataFrame and
            assigned to the new columns. The callable must not
            change input DataFrame (though pandas doesn't check it).
            If the values are not callable, (e.g. a Series, scalar, or array),
            they are simply assigned.

        Returns
        -------
        DataFrame
            A new DataFrame with the new columns in addition to
            all the existing columns.

        Notes
        -----
        Assigning multiple columns within the same ``assign`` is possible.
        Later items in '\*\*kwargs' may refer to newly created or modified
        columns in 'df'; items are computed and assigned into 'df' in order.

        .. versionchanged:: 0.23.0

           Keyword argument order is maintained.

        Examples
        --------
        >>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
        ...                   index=['Portland', 'Berkeley'])
        >>> df
                  temp_c
        Portland    17.0
        Berkeley    25.0

        Where the value is a callable, evaluated on `df`:

        >>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
                  temp_c  temp_f
        Portland    17.0    62.6
        Berkeley    25.0    77.0

        Alternatively, the same behavior can be achieved by directly
        referencing an existing Series or sequence:

        >>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)
                  temp_c  temp_f
        Portland    17.0    62.6
        Berkeley    25.0    77.0

        You can create multiple columns within the same assign where one
        of the columns depends on another one defined within the same assign:

        >>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
        ...           temp_k=lambda x: (x['temp_f'] +  459.67) * 5 / 9)
                  temp_c  temp_f  temp_k
        Portland    17.0    62.6  290.15
        Berkeley    25.0    77.0  298.15
        """
Basic explanation

What it does:
(1) Returns a new object with all original columns in addition to new ones. In other words, it returns a pd.DataFrame that keeps every column of the original DataFrame and also adds the new column(s).
(2) Existing columns that are re-assigned will be overwritten. In other words, if a column named in assign already exists, the values in that column are updated.

Parameters:
**kwargs : dict of {str: callable or Series}
The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn't check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.

First, using **kwargs makes good sense: the keyword name gives the column name, and the keyword value gives the column's values.
Second, the value is allowed to be a callable, not just an existing, already-computed data structure, which brings in the idea of first-class functions again. pandas adds a thin layer of wrapping: it applies the callable to the current pd.DataFrame object and assigns the resulting data to the new column.
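The behaviour described above can be sketched in a few lines. `assign_sketch` is a hypothetical helper written for this post, a simplified model of the documented behaviour and not pandas' actual implementation.

```python
import pandas as pd


def assign_sketch(frame, **kwargs):
    """Hypothetical, simplified model of DataFrame.assign (not pandas' code)."""
    out = frame.copy()                    # work on a copy, leave the input alone
    for name, value in kwargs.items():    # keyword order is preserved
        if callable(value):
            value = value(out)            # the callable sees columns added so far
        out[name] = value                 # create the column or overwrite it
    return out


df = pd.DataFrame({'temp_c': [17.0, 25.0]}, index=['Portland', 'Berkeley'])

mine = assign_sketch(
    df,
    temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
    temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9,
)
theirs = df.assign(
    temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
    temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9,
)

print(mine.equals(theirs))   # compare the sketch with the real method
print(list(df.columns))      # the input frame is untouched
```

Seen this way, a callable is just a value that is computed late, against whatever the DataFrame looks like at that point in the loop.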
Code examples

Code example 1
import pandas as pd
import numpy as np
df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=('Portland', 'Berkeley'))
df
          temp_c
Portland    17.0
Berkeley    25.0
df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
df
          temp_c
Portland    17.0
Berkeley    25.0
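The code is simple and needs little explanation: the keyword name temp_f becomes the new column name, and the keyword value df['temp_c'] * 9 / 5 + 32 becomes the values in that column.
One thing to note: the original df is unchanged.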
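Since the original df is left alone, the result of assign has to be captured somewhere to be of any use. A minimal sketch of two common ways to do that follows; the names `with_f` and `hottest_first` and the `sort_values` step are illustrative additions.

```python
import pandas as pd

df = pd.DataFrame({'temp_c': [17.0, 25.0]}, index=['Portland', 'Berkeley'])

# Option 1: bind the returned DataFrame to a name (a new one, or df itself).
with_f = df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)

# Option 2: keep chaining further methods on the returned DataFrame.
hottest_first = (
    df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32)
      .sort_values('temp_f', ascending=False)
)

print(list(df.columns))           # the original still has only temp_c
print(list(with_f.columns))
print(list(hottest_first.index))
```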
Code example 2
import pandas as pd
import numpy as np
df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=('Portland', 'Berkeley'))
df
          temp_c
Portland    17.0
Berkeley    25.0
temp_f = lambda x: x.temp_c * 9 / 5 + 32
print(temp_f(df))
print(type(temp_f(df)))
Portland    62.6
Berkeley    77.0
Name: temp_c, dtype: float64
<class 'pandas.core.series.Series'>
df.assign(temp_f=temp_f)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
temp_f = lambda x: list(x.temp_c * 9 / 5 + 32)
print(temp_f(df))
print(type(temp_f(df)))
# pass the list-returning callable itself to assign
print(df.assign(temp_f=temp_f))
[62.6, 77.0]
<class 'list'>
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0
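Kind of interesting, right? Note the following three points:
(1) It is still the keyword=value form of assignment; the keyword name defines the name of the new (or to-be-updated) column.
(2) Here temp_f is an anonymous function, and assign calls it with the current pd.DataFrame object as its argument.
(3) How you define the callable is up to you. The first time I simply used arithmetic on a column of the pd.DataFrame, and the second time I converted the result to a list. More complex logic is of course possible, but usually the result is some value computed from a few columns of the current pd.DataFrame.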
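As a sketch of the "free to define" point, the callable below is a named function that compares a column against its own mean, and a second callable returns a scalar. The Phoenix row, the function `above_mean` and the `n_rows` column are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'temp_c': [17.0, 25.0, 31.0]},
    index=['Portland', 'Berkeley', 'Phoenix'],   # Phoenix is a made-up extra row
)


def above_mean(frame):
    """Flag rows whose temp_c is above the mean of the frame it receives."""
    return frame['temp_c'] > frame['temp_c'].mean()


warm = (
    df[df['temp_c'] > 20]                  # intermediate frame with two rows
      .assign(above_mean=above_mean,       # a named function works like a lambda
              n_rows=lambda x: len(x))     # a scalar result fills the whole column
)

print(warm)
```

Because the callable receives the frame that assign is called on, the mean here is taken over the two filtered rows rather than over the original three.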
Code example 3
import pandas as pd
import numpy as np
df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=('Portland', 'Berkeley'))
temp_f = lambda x: list(x.temp_c * 9 / 5 + 32)
temp_k = lambda x: (x['temp_f'] + 459.67) * 5 / 9
df.assign(temp_f=temp_f, temp_k=temp_k)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15
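The example above works because the keyword arguments are computed in order, as the Notes section of the docstring says. A small sketch of what happens when the order is reversed follows; the names `to_f`, `to_k` and `failed` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'temp_c': [17.0, 25.0]}, index=['Portland', 'Berkeley'])

to_f = lambda x: x['temp_c'] * 9 / 5 + 32
to_k = lambda x: (x['temp_f'] + 459.67) * 5 / 9   # needs temp_f to exist already

# temp_f first, then temp_k: temp_k can read the freshly added temp_f.
ok = df.assign(temp_f=to_f, temp_k=to_k)

# temp_k first: temp_f has not been created yet, so the lookup fails.
try:
    df.assign(temp_k=to_k, temp_f=to_f)
    failed = False
except KeyError:
    failed = True

print(list(ok.columns))
print(failed)
```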
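Interesting. It looks complicated but is actually simple, and it comes down to this:
one of the columns depends on another one defined within the same assign. That is, among the columns being added, the values of one column can be generated from another, because the items are computed and assigned in order.

An interesting example

df = pd.DataFrame({'col1': list('abcde'), 'col2': range(5, 10), 'col3': [1.3, 2.5, 3.6, 4.6, 5.8]},
                  index=range(1, 6))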
df
  col1  col2  col3
1    a     5   1.3
2    b     6   2.5
3    c     7   3.6
4    d     8   4.6
5    e     9   5.8
df.assign(C=pd.Series(list('def')))
  col1  col2  col3    C
1    a     5   1.3    e
2    b     6   2.5    f
3    c     7   3.6  NaN
4    d     8   4.6  NaN
5    e     9   5.8  NaN
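Think about it: why do NaN values appear? (Hint: index alignment.) The indexes on the two sides of the assignment are different, so which one decides the index of the result?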
df = pd.DataFrame({'col1': list('abcde'), 'col2': range(5, 10), 'col3': [1.3, 2.5, 3.6, 4.6, 5.8]},
                  index=range(1, 6))
df
  col1  col2  col3
1    a     5   1.3
2    b     6   2.5
3    c     7   3.6
4    d     8   4.6
5    e     9   5.8
pd.Series(list('def'))
0    d
1    e
2    f
dtype: object
df["C"] = pd.Series(list('def'))
df
  col1  col2  col3    C
1    a     5   1.3    e
2    b     6   2.5    f
3    c     7   3.6  NaN
4    d     8   4.6  NaN
5    e     9   5.8  NaN
OK, I think you get it now.
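For completeness, here is a sketch of the alignment rule at work: a Series is matched to the DataFrame by index label, and the DataFrame's index is the one that survives. The two workarounds (giving the Series matching labels, or passing a plain array of the right length) are my own additions, as are the variable names.

```python
import pandas as pd

df = pd.DataFrame({'col1': list('abcde'), 'col2': range(5, 10)},
                  index=range(1, 6))

# Default index 0, 1, 2: only labels 1 and 2 also exist in df.
aligned = df.assign(C=pd.Series(list('def')))

# Same values, but labelled 1, 2, 3 so they line up with df's first three rows.
relabelled = df.assign(C=pd.Series(list('def'), index=[1, 2, 3]))

# A plain array has no index, so it is assigned by position (length must match).
positional = df.assign(C=pd.Series(list('defgh')).to_numpy())

print(aligned)
print(relabelled)
print(positional)
```

In all three cases the result keeps df's index 1 to 5; the label 0 from the default Series index is simply dropped.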
References
[1] pandas docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html

Motto: advance one step every day, and no effort is wasted.