Source: Python數據科學
Author: 東哥起飛
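The for loop is basic syntax in nearly every programming language, and beginners who want working code quickly tend to rely on it heavily. Measured by run time, though, it is often a poor choice in pandas. This article walks through several common ways to speed things up, each faster than the last; understanding how pandas works underneath is what lets you see where the speed comes from.
>>> import pandas as pd
>>> # Load the dataset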
>>> df = pd.read_csv('demand_profile.csv')
>>> df.head()
     date_time  energy_kwh
0  1/1/13 0:00       0.586
1  1/1/13 1:00       0.580
2  1/1/13 2:00       0.572
3  1/1/13 3:00       0.596
4  1/1/13 4:00       0.592
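The `date_time` column comes out of `read_csv` as plain strings, while the code further down calls `.hour` on it, so it has to be converted to datetime values first. Below is a minimal sketch of that conversion. The day/month/year reading of the sample rows and the resulting `format` string are assumptions, and the small hand-built frame only stands in for the CSV file.

```python
import pandas as pd

# Small stand-in for demand_profile.csv, using the rows shown by df.head().
df = pd.DataFrame({
    "date_time": ["1/1/13 0:00", "1/1/13 1:00", "1/1/13 2:00"],
    "energy_kwh": [0.586, 0.580, 0.572],
})

# Assumed layout: day/month/two-digit year, then hour:minute.
df["date_time"] = pd.to_datetime(df["date_time"], format="%d/%m/%y %H:%M")

# Once the column is datetime, the hour of day is available per row.
hours = df["date_time"].dt.hour.tolist()
print(df.dtypes)
print(hours)
```

Passing an explicit `format` spares pandas from guessing the layout of every string, which also tends to make the conversion faster on large files.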
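The goal is to compute the electricity cost for each hour, where the rate depends on the hour of day: 12 cents per kWh from 0:00 to 7:00, 20 from 7:00 to 17:00, and 28 from 17:00 to 24:00. If you don't know how to speed this up, the natural first idea is to write a function that encodes the time conditions and run it row by row.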
def apply_tariff(kwh, hour):
    """Calculate the electricity cost for one hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh
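A quick check of the function on a few hand-picked values can confirm that the band edges behave as intended. The inputs below are made up for illustration.

```python
def apply_tariff(kwh, hour):
    """Calculate the electricity cost for one hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

# Hours 7 and 17 sit on the band edges and belong to the higher band.
costs = [apply_tariff(1, h) for h in (0, 6, 7, 16, 17, 23)]
print(costs)

# An hour outside 0..23 is rejected.
try:
    apply_tariff(1, 24)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```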
The slowest version indexes the DataFrame by position inside a plain for loop:
>>> # Not recommended
>>> @timeit(repeat=3, number=100)
... def apply_tariff_loop(df):
...     """Compute the energy cost in a for loop and collect it in a list."""
...     energy_cost_list = []
...     for i in range(len(df)):
...         # Get the energy used and the hour of day
...         # (date_time must already hold datetime values)
...         energy_used = df.iloc[i]['energy_kwh']
...         hour = df.iloc[i]['date_time'].hour
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     df['cost_cents'] = energy_cost_list
...
>>> apply_tariff_loop(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_loop` ran in average of 3.152 seconds.
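The `@timeit` decorator is not defined in the code shown here. As a rough sketch of what such a helper might look like, the version below wraps the standard-library `timeit.repeat` and prints a report in the same shape as the output above. Its exact behavior, including returning the measured average, is an assumption and not necessarily the author's implementation.

```python
import functools
import timeit as _timeit  # standard-library module, aliased to avoid a name clash


def timeit(repeat=3, number=100):
    """Hypothetical decorator: run `number` calls per trial, `repeat` trials."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Each entry is the total time of one trial of `number` calls.
            trials = _timeit.repeat(lambda: func(*args, **kwargs),
                                    repeat=repeat, number=number)
            best_avg = min(trials) / number
            print(f"Best of {repeat} trials with {number} function calls per trial:")
            print(f"Function `{func.__name__}` ran in average of {best_avg:.3f} seconds.")
            return best_avg  # returned only so the caller can inspect it
        return wrapper
    return decorator


@timeit(repeat=2, number=5)
def busy():
    sum(range(1000))


result = busy()
```

Reporting the best trial instead of the mean reduces the influence of background noise on the machine.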
Iterating with iterrows avoids the repeated iloc lookups and is noticeably faster:
>>> @timeit(repeat=3, number=100)
... def apply_tariff_iterrows(df):
...     energy_cost_list = []
...     for index, row in df.iterrows():
...         # Get the energy used and the hour of day
...         energy_used = row['energy_kwh']
...         hour = row['date_time'].hour
...         # Append the cost to the list
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     df['cost_cents'] = energy_cost_list
...
>>> apply_tariff_iterrows(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_iterrows` ran in average of 0.713 seconds.
Calling apply with axis=1 removes the explicit loop and is faster again:
>>> @timeit(repeat=3, number=100)
... def apply_tariff_withapply(df):
...     df['cost_cents'] = df.apply(
...         lambda row: apply_tariff(
...             kwh=row['energy_kwh'],
...             hour=row['date_time'].hour),
...         axis=1)
...
>>> apply_tariff_withapply(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_withapply` ran in average of 0.272 seconds.
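To see that the `apply` version computes the same numbers as a row-by-row loop, the sketch below runs both on a three-row frame. The frame and its values are made up for illustration.

```python
import pandas as pd


def apply_tariff(kwh, hour):
    """Calculate the electricity cost for one hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh


# Illustrative data: one row in each rate band.
df = pd.DataFrame({
    "date_time": pd.to_datetime(["2013-01-01 00:00",
                                 "2013-01-01 08:00",
                                 "2013-01-01 18:00"]),
    "energy_kwh": [1.0, 2.0, 3.0],
})

# Row-by-row loop with iterrows.
loop_cost = [apply_tariff(row["energy_kwh"], row["date_time"].hour)
             for _, row in df.iterrows()]

# Same computation through apply; axis=1 passes each row as a Series.
apply_cost = df.apply(
    lambda row: apply_tariff(row["energy_kwh"], row["date_time"].hour),
    axis=1,
).tolist()

print(loop_cost)
print(apply_cost)
```

Both approaches still call the Python function once per row, which is why the gain from `apply` is modest compared with the vectorized versions.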
Vectorized selection goes further. With date_time as the index, Boolean masks built with isin pick out each rate band, and each band is computed in a single operation:
>>> # Set the date_time column as the DataFrame index
>>> df.set_index('date_time', inplace=True)
>>> @timeit(repeat=3, number=100)
... def apply_tariff_isin(df):
...     # Boolean arrays marking each hour range
...     peak_hours = df.index.hour.isin(range(17, 24))
...     shoulder_hours = df.index.hour.isin(range(7, 17))
...     off_peak_hours = df.index.hour.isin(range(0, 7))
...     # Same rates as defined in apply_tariff above
...     df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
...     df.loc[shoulder_hours, 'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
...     df.loc[off_peak_hours, 'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12
...
>>> apply_tariff_isin(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_isin` ran in average of 0.010 seconds.
Each mask, for example peak_hours, is a Boolean array with one entry per row:
[False, False, False, ..., True, True, True]
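The mask-and-assign pattern is easier to follow on a tiny frame. The sketch below uses three made-up rows, one per rate band, and shows the mask for the peak band before filling in all three bands.

```python
import pandas as pd

# Illustrative data with a DatetimeIndex: hours 3, 9 and 20.
idx = pd.to_datetime(["2013-01-01 03:00",
                      "2013-01-01 09:00",
                      "2013-01-01 20:00"])
df = pd.DataFrame({"energy_kwh": [1.0, 2.0, 3.0]}, index=idx)

# One Boolean entry per row: True where the hour falls in the band.
peak_hours = df.index.hour.isin(range(17, 24))
shoulder_hours = df.index.hour.isin(range(7, 17))
off_peak_hours = df.index.hour.isin(range(0, 7))
print(peak_hours)

# Each assignment touches only the rows selected by its mask.
df.loc[peak_hours, "cost_cents"] = df.loc[peak_hours, "energy_kwh"] * 28
df.loc[shoulder_hours, "cost_cents"] = df.loc[shoulder_hours, "energy_kwh"] * 20
df.loc[off_peak_hours, "cost_cents"] = df.loc[off_peak_hours, "energy_kwh"] * 12

peak_mask = peak_hours.tolist()
costs = df["cost_cents"].tolist()
print(costs)
```

The three masks must cover every hour from 0 to 23 exactly once; any row missed by all of them would be left as NaN in `cost_cents`.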
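pd.cut assigns the rate to every row in one step, with no separate mask per band:
>>> @timeit(repeat=3, number=100)
... def apply_tariff_cut(df):
...     # right=False makes the bins [0, 7), [7, 17), [17, 24),
...     # matching the conditions in apply_tariff
...     cents_per_kwh = pd.cut(x=df.index.hour,
...                            bins=[0, 7, 17, 24],
...                            right=False,
...                            labels=[12, 20, 28]).astype(int)
...     df['cost_cents'] = cents_per_kwh * df['energy_kwh']
...
>>> apply_tariff_cut(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_cut` ran in average of 0.003 seconds.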
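One detail of `pd.cut` deserves a check: bins are closed on the right by default, so hours 7 and 17 land in the lower band unless `right=False` is passed. The sketch below compares both settings on a handful of edge hours chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hours on and next to the band edges.
hours = np.array([0, 6, 7, 16, 17, 23])

# Default: bins are (0, 7], (7, 17], (17, 24], with 0 included by include_lowest.
right_closed = pd.cut(hours, bins=[0, 7, 17, 24],
                      include_lowest=True,
                      labels=[12, 20, 28]).astype(int).tolist()

# right=False: bins are [0, 7), [7, 17), [17, 24).
left_closed = pd.cut(hours, bins=[0, 7, 17, 24],
                     right=False,
                     labels=[12, 20, 28]).astype(int).tolist()

print(right_closed)
print(left_closed)
```

Only the left-closed result agrees with the `if`/`elif` conditions of `apply_tariff`, where 7:00 is charged at 20 and 17:00 at 28.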
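NumPy's digitize does the same binning directly on the underlying arrays:
>>> import numpy as np
>>> @timeit(repeat=3, number=100)
... def apply_tariff_digitize(df):
...     prices = np.array([12, 20, 28])
...     # digitize returns 0, 1 or 2: the index of the rate band for each hour
...     bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
...     df['cost_cents'] = prices[bins] * df['energy_kwh'].values
...
>>> apply_tariff_digitize(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_digitize` ran in average of 0.002 seconds.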
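To make the indexing trick concrete, the sketch below runs `np.digitize` on a few edge hours (chosen for illustration) and uses the resulting band indices to look up the prices.

```python
import numpy as np

hours = np.array([0, 6, 7, 16, 17, 23])
prices = np.array([12, 20, 28])

# With the default right=False, an hour h gets index i when
# bins[i-1] <= h < bins[i]; hours below 7 get index 0.
band = np.digitize(hours, bins=[7, 17, 24])
print(band.tolist())

# Fancy indexing maps each band index to its price in one step.
rates = prices[band].tolist()
print(rates)
```

Because the result is a plain integer array, the price lookup and the multiplication by `energy_kwh` both stay inside NumPy, with no per-row Python calls.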