Author | 高羊羊羊羊羊楊
Source | CSDN Blog
Header image | Licensed download from Visual China Group
Published by | CSDN (ID: CSDNnews)
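Not long ago, Lakers star Kobe Bryant was tragically killed. For countless fans the news was a bolt from the blue. He defied fate and still could not change it in the end, but fate never managed to change him either, and the Mamba Mentality will live on. With the arrival of the big-data era, it seems almost anything can be tied to the words "big data". Big-data analysis has long been widely used in athlete career planning, medicine, finance and other fields. In this article we use Python to analyze Kobe's career along multiple dimensions, as a tribute to "Lao Da" (the boss)!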
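Background
It was the early hours of January 27, 2020. I couldn't sleep; I tossed and turned in bed until 4 a.m. and was still awake. I unlocked my phone, squinted at the glaring screen and decided to scroll through Weibo, only to come across shocking news: Kobe Bryant had died in an accident. Normally I would have reported a clickbait headline like that on the spot, as it deserves, but I kept seeing more and more posts saying the same thing, until I finally saw the official announcement.
As I wrote in my post: I have never seen Los Angeles at 4 a.m., but at 4 a.m. I heard the news of your passing. 1978-2020.
As fans, all we can do is mourn and remember. Do not spread rumors, and do not exploit the "Mamba Mentality".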
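Data Acquisition
Source: a dataset of Kobe Bryant's nearly twenty-year career provided by the NBA (fairly large, about 30,000 rows)
Data Processing
A look through the file quickly shows that it contains many missing values. The crude fix is to delete every row that has a missing value, but that has to be weighed against sample completeness and the accuracy of the predictions.
First we run a simple outlier check on shot distance, presented as a box plot.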
```python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data_file = '2.csv'
# Read the data, using the "shot_id" column as the index
data = pd.read_csv(data_file, index_col='shot_id')

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly

plt.figure()
# Box plot of the shot distance column, drawn with the DataFrame method
p = data.boxplot(column='shot_distance', return_type='dict')
x = p['fliers'][0].get_xdata()           # 'fliers' holds the outlier points
y = np.sort(p['fliers'][0].get_ydata())  # sorted in ascending order
print('{} data points in total, {} of them outliers'.format(len(data), len(y)))

# Annotate the outliers. Labels of nearby points overlap and are hard to read,
# so the text is shifted horizontally according to the gap to the previous point.
for i in range(len(x)):
    if i > 0 and y[i] != y[i - 1]:
        plt.annotate(y[i], xy=(x[i], y[i]),
                     xytext=(x[i] + 0.05 - 0.8 / (y[i] - y[i - 1]), y[i]))
    else:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))

plt.show()  # show the box plot
```
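We get the following result:
By this check the column has 68 outliers. Here we drop the rows that contain them, and the other columns are handled the same way.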
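The row-dropping step is not shown in the article, so here is a minimal sketch of one way it could be done, assuming the usual 1.5 × IQR whisker rule that a default box plot uses. The helper name `drop_iqr_outliers` and the toy data are made up for illustration; they are not taken from the original script.

```python
import pandas as pd


def drop_iqr_outliers(df, column, k=1.5):
    """Return a copy of df without rows whose `column` value lies outside
    [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper, not from the article)."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    keep = df[column].between(lower, upper)
    return df[keep].copy()


# Toy data, made up for illustration: the 70 ft shot sits far outside the rest.
toy = pd.DataFrame({'shot_distance': [0, 2, 5, 8, 12, 15, 18, 20, 24, 70]})
cleaned = drop_iqr_outliers(toy, 'shot_distance')
print(len(toy), '->', len(cleaned))
```

Applying the same call to each numeric column in turn would mirror the "other columns are handled the same way" step, at the cost of shrinking the sample a little each time.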
Data Integration
Load the data, then filter it and add the new columns we need
```python
import pandas as pd

allData = pd.read_csv('data.csv')
# Keep only the shots whose outcome is known
data = allData[allData['shot_made_flag'].notnull()].reset_index()

# Add new columns
data['game_date_DT'] = pd.to_datetime(data['game_date'])
data['dayOfWeek'] = data['game_date_DT'].dt.dayofweek
data['dayOfYear'] = data['game_date_DT'].dt.dayofyear
data['secondsFromPeriodEnd'] = 60 * data['minutes_remaining'] + data['seconds_remaining']
data['secondsFromPeriodStart'] = 60 * (11 - data['minutes_remaining']) + (60 - data['seconds_remaining'])
data['secondsFromGameStart'] = (
    (data['period'] <= 4).astype(int) * (data['period'] - 1) * 12 * 60
    + (data['period'] > 4).astype(int) * ((data['period'] - 4) * 5 * 60 + 3 * 12 * 60)
    + data['secondsFromPeriodStart']
)

'''
Where:
secondsFromPeriodEnd    seconds remaining until the end of the period
secondsFromPeriodStart  seconds elapsed since the start of the period
                        (computed for 12-minute periods; in overtime the value carries a
                        7-minute offset that the secondsFromGameStart formula compensates for)
secondsFromGameStart    seconds elapsed since the start of the game
'''

# Sanity-check the data
print(data.loc[:10, ['period', 'minutes_remaining', 'seconds_remaining', 'secondsFromGameStart']])
```
Running it gives the following result:
Everything looks normal.
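To double-check the time arithmetic by hand, the same quantity can be written out for a single shot in plain Python. This is only a sketch under the assumption of 12-minute regular periods and 5-minute overtime periods; the function name `seconds_from_game_start` is made up here and does not appear in the original code.

```python
def seconds_from_game_start(period, minutes_remaining, seconds_remaining):
    """Seconds elapsed since tip-off, assuming 12-minute regular periods
    (1 to 4) and 5-minute overtime periods (5 and up)."""
    period_length = 12 * 60 if period <= 4 else 5 * 60
    elapsed_in_period = period_length - (60 * minutes_remaining + seconds_remaining)
    if period <= 4:
        periods_before = (period - 1) * 12 * 60
    else:
        periods_before = 4 * 12 * 60 + (period - 5) * 5 * 60
    return periods_before + elapsed_in_period


print(seconds_from_game_start(1, 10, 27))  # 93
print(seconds_from_game_start(5, 4, 30))   # 2910
```

Both values agree with what the vectorized `secondsFromGameStart` expression produces for the same inputs, which is a quick way to confirm that the overtime branch lines up with the end of the fourth period (2880 seconds).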
Plotting Shot Attempts
Plot the shot attempts as a function of time elapsed since the start of the game
We will use the matplotlib package here
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (16, 16)
plt.rcParams['font.size'] = 16
binsSizes = [24, 12, 6]
plt.figure()

for k, binSizeInSeconds in enumerate(binsSizes):
    timeBins = np.arange(0, 60 * (4 * 12 + 3 * 5), binSizeInSeconds) + 0.01
    attemptsAsFunctionOfTime, b = np.histogram(data['secondsFromGameStart'], bins=timeBins)

    maxHeight = max(attemptsAsFunctionOfTime) + 30
    barWidth = 0.999 * (timeBins[1] - timeBins[0])
    plt.subplot(len(binsSizes), 1, k + 1)
    plt.bar(timeBins[:-1], attemptsAsFunctionOfTime, align='edge', width=barWidth)
    plt.title(str(binSizeInSeconds) + ' second time bins')
    plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
                  4 * 12 * 60 + 3 * 5 * 60], ymin=0, ymax=maxHeight, colors='r')
    plt.xlim((-20, 3200))
    plt.ylim((0, maxHeight))
    plt.ylabel('attempts')
plt.xlabel('time [seconds from start of game]')
plt.show()
```
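Here is the result:
As the game goes on, Kobe's number of shot attempts tends to rise.
Plotting the Accuracy Comparison
Here we put attempts and accuracy side by side to judge how well Kobe shoots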
```python
# Plot shot accuracy as a function of time during the game
plt.rcParams['figure.figsize'] = (15, 10)
plt.rcParams['font.size'] = 16

binSizeInSeconds = 20
timeBins = np.arange(0, 60 * (4 * 12 + 3 * 5), binSizeInSeconds) + 0.01
attemptsAsFunctionOfTime, b = np.histogram(data['secondsFromGameStart'], bins=timeBins)
madeAttemptsAsFunctionOfTime, b = np.histogram(data.loc[data['shot_made_flag'] == 1, 'secondsFromGameStart'],
                                               bins=timeBins)
attemptsAsFunctionOfTime[attemptsAsFunctionOfTime < 1] = 1  # avoid division by zero
accuracyAsFunctionOfTime = madeAttemptsAsFunctionOfTime.astype(float) / attemptsAsFunctionOfTime
accuracyAsFunctionOfTime[attemptsAsFunctionOfTime <= 50] = 0  # zero accuracy in bins that don't have enough samples

maxHeight = max(attemptsAsFunctionOfTime) + 30
barWidth = 0.999 * (timeBins[1] - timeBins[0])

plt.figure()
plt.subplot(2, 1, 1)
plt.bar(timeBins[:-1], attemptsAsFunctionOfTime, align='edge', width=barWidth)
plt.xlim((-20, 3200))
plt.ylim((0, maxHeight))

# y axis of the top plot: shot attempts
plt.ylabel('attempts')
plt.title(str(binSizeInSeconds) + ' second time bins')
plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
              4 * 12 * 60 + 3 * 5 * 60], ymin=0, ymax=maxHeight, colors='r')
plt.subplot(2, 1, 2)
plt.bar(timeBins[:-1], accuracyAsFunctionOfTime, align='edge', width=barWidth)
plt.xlim((-20, 3200))
# y axis of the bottom plot: accuracy
plt.ylabel('accuracy')
plt.xlabel('time [seconds from start of game]')
plt.vlines(x=[0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60, 4 * 12 * 60 + 2 * 5 * 60,
              4 * 12 * 60 + 3 * 5 * 60], ymin=0.0, ymax=0.7, colors='r')
plt.show()
```
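Let's see how it looks:
The analysis shows that Kobe's shooting accuracy hovers around 0.4, but this is not yet the kind of insight we are after.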
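The per-bin accuracy calculation can be isolated into a small function to make the logic easier to follow. The sketch below is a variation rather than the article's exact method: it marks bins with too few shots as NaN instead of 0, so that a sparse bin is not mistaken for a 0% make rate. The function name `binned_accuracy` and the sample shots are made up for illustration.

```python
import numpy as np


def binned_accuracy(times, made, bin_edges, min_attempts=1):
    """Per-bin make rate; bins with fewer than min_attempts shots get NaN."""
    times = np.asarray(times, dtype=float)
    made = np.asarray(made, dtype=bool)
    attempts, _ = np.histogram(times, bins=bin_edges)
    makes, _ = np.histogram(times[made], bins=bin_edges)
    accuracy = np.full(len(attempts), np.nan)
    enough = attempts >= min_attempts
    accuracy[enough] = makes[enough] / attempts[enough]
    return attempts, accuracy


# Made-up shots: seconds from game start, and whether each one went in.
times = [5, 12, 18, 25, 31, 38, 70]
made = [1, 0, 1, 1, 0, 0, 1]
attempts, accuracy = binned_accuracy(times, made, bin_edges=[0, 20, 40, 60, 80], min_attempts=2)
print(attempts)  # [3 3 0 1]
print(accuracy)
```

With `min_attempts=2` the last two bins come out as NaN, and matplotlib simply leaves NaN bars undrawn.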
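To dig deeper into the data, we now need some algorithms.
GMM Clustering
So what is GMM clustering?
GMM is short for Gaussian mixture model. Roughly, the idea is that a distribution can be viewed as the combined result of several Gaussian distributions, so a complicated distribution can be approximated by a weighted sum of Gaussians.
We know that many natural phenomena follow a Gaussian (normal) distribution. In practice, though, a distribution is shaped by several factors, some of them man-made. Each factor may give rise to its own Gaussian, and the factors together produce a mixture of Gaussians. (This is my personal understanding.)
The principle behind Gaussian mixture clustering is therefore: from the samples, estimate the means and covariances (and mixing weights) of K Gaussians, which fully determines the K-component model. During clustering a sample is not assigned outright to a single class; instead we compute the probability that it came from each component.
A Gaussian mixture is usually fitted with the EM algorithm, which serves as its maximum-likelihood estimation procedure.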
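To make the EM idea concrete, here is a minimal one-dimensional sketch written with NumPy only. It is an illustration under simplifying assumptions (1-D data, a fixed number of iterations, a quantile-based start), not what scikit-learn does internally; the function name `fit_gmm_1d` and the synthetic data are made up for this example.

```python
import numpy as np


def fit_gmm_1d(x, k=2, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture with plain EM.
    Returns (weights, means, variances, responsibilities from the last E-step)."""
    x = np.asarray(x, dtype=float)
    # Simple deterministic start: spread the means over the data quantiles.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: probability that each component generated each sample.
        diff = x[:, None] - means[None, :]
        dens = np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2 * np.pi * variances)
        weighted = weights * dens
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from those soft assignments.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances, resp


# Synthetic data: two well-separated groups centred near 0 and 10.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(10, 1, 300)])
weights, means, variances, resp = fit_gmm_1d(x, k=2)
print(np.round(means, 2), np.round(weights, 2))
```

Each row of `resp` sums to 1, which is the "probability of belonging to each component" described above; a hard cluster label is just the column with the largest value.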
```python
from sklearn import mixture

'''
Let's continue the initial exploration and look at the spatial location of Kobe's shots.
We do this by building a Gaussian mixture model that gives a compact summary of his shot locations.
The GMM clusters his shot attempts by where on the court they were taken.
'''

numGaussians = 13
gaussianMixtureModel = mixture.GaussianMixture(n_components=numGaussians, covariance_type='full',
                                               init_params='kmeans', n_init=50,
                                               verbose=0, random_state=5)
gaussianMixtureModel.fit(data.loc[:, ['loc_x', 'loc_y']])

# Add the GMM cluster as a field in the dataset
data['shotLocationCluster'] = gaussianMixtureModel.predict(data.loc[:, ['loc_x', 'loc_y']])
```
Court Visualization
This borrows the draw_court() function from MichaelKrueger's excellent script
The draw_court() function
```python
import matplotlib.pyplot as plt
from matplotlib.patches import Arc, Circle, Rectangle


def draw_court(ax=None, color='black', lw=2, outer_lines=False):
    # If no axes object is provided to plot onto, use the current one
    if ax is None:
        ax = plt.gca()

    # Create the various parts of an NBA basketball court

    # The hoop: its diameter is 18" (radius 9"),
    # which corresponds to 7.5 in our coordinate system
    hoop = Circle((0, 0), radius=7.5, linewidth=lw, color=color, fill=False)

    # The backboard
    backboard = Rectangle((-30, -7.5), 60, -1, linewidth=lw, color=color)

    # The paint
    # Outer box of the paint, width=16ft, height=19ft
    outer_box = Rectangle((-80, -47.5), 160, 190, linewidth=lw, color=color,
                          fill=False)
    # Inner box of the paint, width=12ft, height=19ft
    inner_box = Rectangle((-60, -47.5), 120, 190, linewidth=lw, color=color,
                          fill=False)

    # Free throw top arc
    top_free_throw = Arc((0, 142.5), 120, 120, theta1=0, theta2=180,
                         linewidth=lw, color=color, fill=False)

    # Free throw bottom arc
    bottom_free_throw = Arc((0, 142.5), 120, 120, theta1=180, theta2=0,
                            linewidth=lw, color=color, linestyle='dashed')

    # Restricted zone: an arc with a 4ft radius from the center of the hoop
    restricted = Arc((0, 0), 80, 80, theta1=0, theta2=180, linewidth=lw,
                     color=color)

    # Three-point line
    # Side three-point lines, 14ft long
    corner_three_a = Rectangle((-220, -47.5), 0, 140, linewidth=lw,
                               color=color)
    corner_three_b = Rectangle((220, -47.5), 0, 140, linewidth=lw, color=color)

    # The three-point arc is centered on the hoop, 23'9" away from it.
    # theta1 and theta2 were adjusted until the arc lined up with the side lines.
    three_arc = Arc((0, 0), 475, 475, theta1=22, theta2=158, linewidth=lw,
                    color=color)

    # Center court
    center_outer_arc = Arc((0, 422.5), 120, 120, theta1=180, theta2=0,
                           linewidth=lw, color=color)
    center_inner_arc = Arc((0, 422.5), 40, 40, theta1=180, theta2=0,
                           linewidth=lw, color=color)

    # List of the court elements to be plotted onto the axes
    court_elements = [hoop, backboard, outer_box, inner_box, top_free_throw,
                      bottom_free_throw, restricted, corner_three_a,
                      corner_three_b, three_arc, center_outer_arc,
                      center_inner_arc]

    if outer_lines:
        # Draw the half-court line, baseline and side lines
        outer_lines = Rectangle((-250, -47.5), 500, 470, linewidth=lw,
                                color=color, fill=False)
        court_elements.append(outer_lines)

    # Add the court elements onto the axes
    for element in court_elements:
        ax.add_patch(element)

    return ax
```
2D Gaussian Plot
Define a function that draws the 2D Gaussians
Draw2DGaussians()
```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse


def Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages):
    fig, h = plt.subplots()
    for i, (mean, covarianceMatrix) in enumerate(zip(gaussianMixtureModel.means_, gaussianMixtureModel.covariances_)):
        # Eigenvalues (v) and eigenvectors (columns of w) of the covariance matrix
        v, w = np.linalg.eigh(covarianceMatrix)
        v = 2.5 * np.sqrt(v)  # go to units of standard deviation instead of variance

        # Compute the ellipse angle and the two axis lengths, then draw it
        u = w[:, 0] / np.linalg.norm(w[:, 0])  # eigenvector that belongs to v[0]
        angle = np.degrees(np.arctan2(u[1], u[0]))  # in degrees
        currEllipse = Ellipse(mean, v[0], v[1], angle=180 + angle, color=ellipseColors[i])
        currEllipse.set_alpha(0.5)
        h.add_artist(currEllipse)
        h.text(mean[0] + 7, mean[1] - 1, ellipseTextMessages[i], fontsize=13, color='blue')
```
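The step from a covariance matrix to an ellipse is easy to get wrong, so here is a small standalone sketch that can be checked numerically. It assumes the same convention as the function above (axis lengths of `n_std` standard deviations); the name `covariance_to_ellipse` and the test covariance are made up for illustration.

```python
import numpy as np


def covariance_to_ellipse(cov, n_std=2.5):
    """Return (width, height, angle_deg) of an ellipse for a 2x2 covariance.
    width and height are full axis lengths of n_std standard deviations;
    angle_deg is the direction of the width axis, folded into [0, 180)."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    width, height = n_std * np.sqrt(eigvals)
    axis = eigvecs[:, 0]                    # eigenvectors are the COLUMNS
    angle_deg = np.degrees(np.arctan2(axis[1], axis[0])) % 180.0
    return width, height, angle_deg


# Covariance with variance 9 along a direction tilted 30 degrees, 1 across it.
theta = np.radians(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
cov = rotation @ np.diag([9.0, 1.0]) @ rotation.T
width, height, angle_deg = covariance_to_ellipse(cov)
print(width, height, angle_deg)
```

Because `eigh` sorts eigenvalues in ascending order, the width axis here is the short one (pointing at 120 degrees) and the height axis is the long one along the 30 degree tilt.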
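Now we draw the 2D Gaussian plot of shot attempts. Each ellipse is drawn from the covariance of one Gaussian, with its axes scaled to 2.5 standard deviations, and each blue number is the percentage of shots attributed to that Gaussian.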
```python
# Show the Gaussian mixture ellipses of the shot attempts
plt.rcParams['figure.figsize'] = (13, 10)
plt.rcParams['font.size'] = 15

ellipseTextMessages = [str(100 * gaussianMixtureModel.weights_[x])[:4] + '%' for x in range(numGaussians)]
ellipseColors = ['red', 'green', 'purple', 'cyan', 'magenta', 'yellow', 'blue', 'orange', 'silver', 'maroon', 'lime',
                 'olive', 'brown', 'darkblue']
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('shot attempts')
plt.show()
```
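Here is the outcome:
In the colored 2D Gaussian plot we can see that Kobe took more shot attempts from the left side of the court (or the right side, from his point of view). This may be because he is right-handed. We can also see that a large share of his attempts (16.8%) came from directly under the basket, and a further 5.06% came from very close to it.
It is not perfect, but it does show something useful.
Next we plot the shot accuracy of each Gaussian cluster. The blue numbers now stand for the accuracy of the shots taken from that cluster, which gives a sense of which shots are easy and which are hard.
For each cluster, compute its accuracy and plot it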
```python
plt.rcParams['figure.figsize'] = (13, 10)
plt.rcParams['font.size'] = 15

variableCategories = data['shotLocationCluster'].value_counts().index.tolist()

clusterAccuracy = {}
for category in variableCategories:
    shotsAttempted = np.array(data['shotLocationCluster'] == category).sum()
    shotsMade = np.array(data.loc[data['shotLocationCluster'] == category, 'shot_made_flag'] == 1).sum()
    clusterAccuracy[category] = float(shotsMade) / shotsAttempted

ellipseTextMessages = [str(100 * clusterAccuracy[x])[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('shot accuracy')
plt.show()
```
Here is the resulting figure:
We can clearly see the relationship between shot distance and accuracy.
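The per-cluster accuracy loop can also be written without an explicit loop. The sketch below uses `np.bincount` on made-up labels and outcomes, and assumes every cluster has at least one shot (otherwise the division would produce NaN).

```python
import numpy as np

# Made-up cluster labels and outcomes for eight shots.
cluster = np.array([0, 0, 1, 1, 1, 2, 2, 2])
made = np.array([1, 1, 0, 1, 0, 0, 0, 1])

attempts = np.bincount(cluster, minlength=3)              # shots per cluster
makes = np.bincount(cluster, weights=made, minlength=3)   # made shots per cluster
accuracy = makes / attempts
print(accuracy)
```

The result is indexed by cluster number, so `accuracy[x]` plays the same role as `clusterAccuracy[x]` in the plotting code.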
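Plotting a 2D Spatio-temporal Chart
Another interesting fact: Kobe not only took more shot attempts from the right side (from his point of view), he was also more accurate on them.
Now let's draw a 2D spatio-temporal chart of Kobe's career. The x axis is time since the start of the game; the y axis is the index of the shot cluster (sorted by cluster accuracy); the brightness of each cell reflects how many attempts Kobe took from that cluster at that moment; the red vertical lines separate the periods of the game.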
```python
import operator

# Build a 2D spatio-temporal histogram over Kobe's whole career
plt.rcParams['figure.figsize'] = (18, 10)  # figure size
plt.rcParams['font.size'] = 18             # font size

# Sort the clusters by accuracy
sortedClustersByAccuracyTuple = sorted(clusterAccuracy.items(), key=operator.itemgetter(1), reverse=True)
sortedClustersByAccuracy = [x[0] for x in sortedClustersByAccuracyTuple]

binSizeInSeconds = 12
timeInUnitsOfBins = ((data['secondsFromGameStart'] + 0.0001) / binSizeInSeconds).astype(int)
locationInUintsOfClusters = np.array(
    [sortedClustersByAccuracy.index(data.loc[x, 'shotLocationCluster']) for x in range(data.shape[0])])

# Build the spatio-temporal histogram of Kobe's games
shotAttempts = np.zeros((gaussianMixtureModel.n_components, 1 + max(timeInUnitsOfBins)))
for shot in range(data.shape[0]):
    shotAttempts[locationInUintsOfClusters[shot], timeInUnitsOfBins[shot]] += 1

# Stretch the y axis so that each row is easier to see
shotAttempts = np.kron(shotAttempts, np.ones((5, 1)))

# Position of the end of each period
vlinesList = 0.5001 + np.array([0, 12 * 60, 2 * 12 * 60, 3 * 12 * 60, 4 * 12 * 60, 4 * 12 * 60 + 5 * 60]).astype(
    int) / binSizeInSeconds

plt.figure(figsize=(13, 8))  # width and height
plt.imshow(shotAttempts, cmap='copper', interpolation="nearest")  # nearest-neighbor: no smoothing between cells
plt.xlim(0, float(4 * 12 * 60 + 6 * 60) / binSizeInSeconds)
plt.vlines(x=vlinesList, ymin=-0.5, ymax=shotAttempts.shape[0] - 0.5, colors='r')
plt.xlabel('time from start of game [sec]')
plt.ylabel('cluster (sorted by accuracy)')
plt.show()
```
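Here is the output:
The clusters are sorted by accuracy in descending order: high-accuracy shots are at the top and low-accuracy half-court shots are at the bottom. We can now see that the "last-second shots" in the first, second and third periods are in fact desperation heaves from very far away. Interestingly, though, in the fourth period the last-second shots do not fall into that desperation cluster but into the regular 3-point clusters (still hard to make, but not hopeless).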
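The histogram behind this chart is filled one shot at a time in a Python loop. As a sketch of an equivalent vectorized approach, `np.add.at` can accumulate all (cluster, time bin) pairs in one call; the arrays below are made up and simply stand in for `locationInUintsOfClusters` and `timeInUnitsOfBins`.

```python
import numpy as np

# Made-up data: cluster rank (row) and time bin (column) for six shots.
cluster_rank = np.array([0, 0, 1, 2, 2, 2])
time_bin = np.array([0, 3, 3, 1, 1, 4])

counts = np.zeros((3, 5))
# np.add.at accumulates correctly even when the same (row, column) pair repeats,
# which plain fancy-index assignment (counts[rows, cols] += 1) does not.
np.add.at(counts, (cluster_rank, time_bin), 1)
print(counts)
```

On a dataset of about 30,000 shots the loop is fast enough either way, so this is mostly a matter of style.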
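In the analysis that follows, we will estimate shot difficulty from shot attributes (such as shot type and shot distance).
Below we create a new table for the shot difficulty model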
```python
def FactorizeCategoricalVariable(inputDB, categoricalVarName):
    # One 0/1 indicator column per category of the variable
    opponentCategories = inputDB[categoricalVarName].value_counts().index.tolist()

    outputDB = pd.DataFrame()
    for category in opponentCategories:
        featureName = categoricalVarName + ': ' + str(category)
        outputDB[featureName] = (inputDB[categoricalVarName] == category).astype(int)

    return outputDB


featuresDB = pd.DataFrame()
featuresDB['homeGame'] = data['matchup'].apply(lambda x: 1 if (x.find('@') < 0) else 0)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'opponent')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'action_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'combined_shot_type')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_basic')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_area')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shot_zone_range')], axis=1)
featuresDB = pd.concat([featuresDB, FactorizeCategoricalVariable(data, 'shotLocationCluster')], axis=1)

featuresDB['playoffGame'] = data['playoffs']
featuresDB['locX'] = data['loc_x']
featuresDB['locY'] = data['loc_y']
featuresDB['distanceFromBasket'] = data['shot_distance']
featuresDB['secondsFromPeriodEnd'] = data['secondsFromPeriodEnd']

# Cyclic encoding of the day of week and day of year
featuresDB['dayOfWeek_cycX'] = np.sin(2 * np.pi * (data['dayOfWeek'] / 7))
featuresDB['dayOfWeek_cycY'] = np.cos(2 * np.pi * (data['dayOfWeek'] / 7))
featuresDB['timeOfYear_cycX'] = np.sin(2 * np.pi * (data['dayOfYear'] / 365))
featuresDB['timeOfYear_cycY'] = np.cos(2 * np.pi * (data['dayOfYear'] / 365))

labelsDB = data['shot_made_flag']
```
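The sine/cosine pairs in the feature table are a cyclic encoding: they place each day on a circle, so that the last day of a cycle ends up next to the first one instead of far away from it. A minimal sketch of the idea follows; the helper names `cyclic_encode` and `circle_distance` are made up for this example.

```python
import numpy as np


def cyclic_encode(values, period):
    """Map values on a cycle of length `period` to points on the unit circle."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])


# pandas' dt.dayofweek uses 0 = Monday ... 6 = Sunday.
days = cyclic_encode([0, 1, 3, 6], period=7)  # Monday, Tuesday, Thursday, Sunday


def circle_distance(i, j):
    """Euclidean distance between two encoded days (rows of `days`)."""
    return np.linalg.norm(days[i] - days[j])


print(circle_distance(3, 0), circle_distance(0, 1), circle_distance(0, 2))
```

In the encoded space Sunday is exactly as close to Monday as Monday is to Tuesday, whereas the raw integers 6 and 0 would look maximally far apart.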
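Build a model from the featuresDB table and make sure it does not overfit (that is, the training error and the validation error should be about the same).
We use an ExtraTrees classifier.
Keep the model simple, and check that it is not overfitting.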
```python
import time

from sklearn import ensemble, model_selection
from sklearn.metrics import accuracy_score, log_loss

randomSeed = 1
numFolds = 4

stratifiedCV = model_selection.StratifiedKFold(n_splits=numFolds, shuffle=True, random_state=randomSeed)

mainLearner = ensemble.ExtraTreesClassifier(n_estimators=500, max_depth=5,
                                            min_samples_leaf=120, max_features=120,
                                            criterion='entropy', bootstrap=False,
                                            n_jobs=-1, random_state=randomSeed)

startTime = time.time()
trainAccuracy = []
validAccuracy = []
trainLogLosses = []
validLogLosses = []
for trainInds, validInds in stratifiedCV.split(featuresDB, labelsDB):
    # Split into training and validation sets
    X_train_CV = featuresDB.iloc[trainInds, :]
    y_train_CV = labelsDB.iloc[trainInds]
    X_valid_CV = featuresDB.iloc[validInds, :]
    y_valid_CV = labelsDB.iloc[validInds]

    # Train
    mainLearner.fit(X_train_CV, y_train_CV)

    # Predict
    y_train_hat_mainLearner = mainLearner.predict_proba(X_train_CV)[:, 1]
    y_valid_hat_mainLearner = mainLearner.predict_proba(X_valid_CV)[:, 1]

    # Store the results
    trainAccuracy.append(accuracy_score(y_train_CV, y_train_hat_mainLearner > 0.5))
    validAccuracy.append(accuracy_score(y_valid_CV, y_valid_hat_mainLearner > 0.5))
    trainLogLosses.append(log_loss(y_train_CV, y_train_hat_mainLearner))
    validLogLosses.append(log_loss(y_valid_CV, y_valid_hat_mainLearner))

print("-----------------------------------------------------")
print("total (train,valid) Accuracy = (%.5f,%.5f). took %.2f minutes" % (
    np.mean(trainAccuracy), np.mean(validAccuracy), (time.time() - startTime) / 60))
print("total (train,valid) Log Loss = (%.5f,%.5f). took %.2f minutes" % (
    np.mean(trainLogLosses), np.mean(validLogLosses), (time.time() - startTime) / 60))
print("-----------------------------------------------------")

mainLearner.fit(featuresDB, labelsDB)
data['shotDifficulty'] = mainLearner.predict_proba(featuresDB)[:, 1]

# For more insight, look at the feature importances
featureInds = mainLearner.feature_importances_.argsort()[::-1]
featureImportance = pd.DataFrame({
    'featureName': featuresDB.columns[featureInds],
    'importanceET': mainLearner.feature_importances_[featureInds],
})

print(featureImportance.iloc[:30, :])
```
Let's see the output:
```
total (train,valid) Accuracy = (0.67912,0.67860). took 0.29 minutes
total (train,valid) Log Loss = (0.60812,0.61100). took 0.29 minutes
-----------------------------------------------------
                         featureName importanceET
0              action_type: Jump Shot     0.578036
1             action_type: Layup Shot     0.173274
2            combined_shot_type: Dunk     0.113341
3                            homeGame    0.0288043
4              action_type: Dunk Shot    0.0161591
5              shotLocationCluster: 9    0.0136386
6           combined_shot_type: Layup   0.00949568
7                  distanceFromBasket    0.0084703
8          shot_zone_range: 16-24 ft.    0.0072107
9         action_type: Slam Dunk Shot   0.00690316
10      combined_shot_type: Jump Shot   0.00592586
11               secondsFromPeriodEnd   0.00589391
12     action_type: Running Jump Shot   0.00544904
13            shotLocationCluster: 11   0.00449125
14                               locY   0.00388509
15    action_type: Driving Layup Shot   0.00364757
16   shot_zone_range: Less Than 8 ft.   0.00349615
17       combined_shot_type: Tip Shot   0.00260399
18          shot_zone_area: Center(C)    0.0011585
19                      opponent: DEN  0.000882106
20     action_type: Driving Dunk Shot  0.000848156
21   shot_zone_basic: Restricted Area  0.000650022
22             shotLocationCluster: 2  0.000513476
23              action_type: Tip Shot  0.000489918
24         shot_zone_basic: Mid-Range  0.000487306
25      action_type: Pullup Jump shot  0.000453641
26          shot_zone_range: 8-16 ft.  0.000452574
27                    timeOfYear_cycX  0.000432267
28                     dayOfWeek_cycX   0.00039668
29             shotLocationCluster: 8  0.000254077
```
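For readers unfamiliar with the two metrics in the output, here is a plain NumPy sketch of what they measure for a binary outcome. The function names and the four example predictions are made up; this mirrors the standard definitions rather than scikit-learn's exact implementation.

```python
import numpy as np


def accuracy_at_half(y_true, p_hat):
    """Share of shots where thresholding the probability at 0.5 gives the right label."""
    y_true = np.asarray(y_true)
    return np.mean((np.asarray(p_hat) > 0.5) == (y_true == 1))


def binary_log_loss(y_true, p_hat, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_hat, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))


y = [1, 0, 1, 0]            # made-up outcomes
p = [0.8, 0.3, 0.4, 0.6]    # made-up predicted make probabilities
print(accuracy_at_half(y, p))            # 0.5
print(binary_log_loss(y, p))
print(binary_log_loss(y, [0.5] * 4))     # ln 2, the score of a coin-flip predictor
```

A useful yardstick: always predicting 0.5 gives a log loss of ln 2 ≈ 0.693, so a log loss below that means the model's probabilities carry some information.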
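Here I want to look at a few aspects of Kobe Bryant's decision-making. To do so we will collect two different groups of shots and analyze the differences between them:
Shots taken right after a made shot
Shots taken right after a missed shot
I collected some data conditioned on whether Kobe made or missed his previous shot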
```python
timeBetweenShotsDict = {}
timeBetweenShotsDict['madeLast'] = []
timeBetweenShotsDict['missedLast'] = []

changeInDistFromBasketDict = {}
changeInDistFromBasketDict['madeLast'] = []
changeInDistFromBasketDict['missedLast'] = []

changeInShotDifficultyDict = {}
changeInShotDifficultyDict['madeLast'] = []
changeInShotDifficultyDict['missedLast'] = []

afterMadeShotsList = []
afterMissedShotsList = []

for shot in range(1, data.shape[0]):

    # Make sure the current shot and the previous shot are in the same period of the same game
    sameGame = data.loc[shot, 'game_date'] == data.loc[shot - 1, 'game_date']
    samePeriod = data.loc[shot, 'period'] == data.loc[shot - 1, 'period']

    if samePeriod and sameGame:
        madeLastShot = data.loc[shot - 1, 'shot_made_flag'] == 1
        missedLastShot = data.loc[shot - 1, 'shot_made_flag'] == 0

        timeDifferenceFromLastShot = data.loc[shot, 'secondsFromGameStart'] - data.loc[shot - 1, 'secondsFromGameStart']
        distDifferenceFromLastShot = data.loc[shot, 'shot_distance'] - data.loc[shot - 1, 'shot_distance']
        shotDifficultyDifferenceFromLastShot = data.loc[shot, 'shotDifficulty'] - data.loc[shot - 1, 'shotDifficulty']

        # Check for corrupt data points (assuming all samples should be in chronological order)
        if timeDifferenceFromLastShot < 0:
            continue

        if madeLastShot:
            timeBetweenShotsDict['madeLast'].append(timeDifferenceFromLastShot)
            changeInDistFromBasketDict['madeLast'].append(distDifferenceFromLastShot)
            changeInShotDifficultyDict['madeLast'].append(shotDifficultyDifferenceFromLastShot)
            afterMadeShotsList.append(shot)

        if missedLastShot:
            timeBetweenShotsDict['missedLast'].append(timeDifferenceFromLastShot)
            changeInDistFromBasketDict['missedLast'].append(distDifferenceFromLastShot)
            changeInShotDifficultyDict['missedLast'].append(shotDifficultyDifferenceFromLastShot)
            afterMissedShotsList.append(shot)

afterMissedData = data.iloc[afterMissedShotsList, :]
afterMadeData = data.iloc[afterMadeShotsList, :]

shotChancesListAfterMade = afterMadeData['shotDifficulty'].tolist()
totalAttemptsAfterMade = afterMadeData.shape[0]
totalMadeAfterMade = np.array(afterMadeData['shot_made_flag'] == 1).sum()

shotChancesListAfterMissed = afterMissedData['shotDifficulty'].tolist()
totalAttemptsAfterMissed = afterMissedData.shape[0]
totalMadeAfterMissed = np.array(afterMissedData['shot_made_flag'] == 1).sum()
```
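The split into "after a make" and "after a miss" can also be sketched with a pandas `groupby` and `shift`, which keeps the same-game, same-period condition without a row-by-row loop. The tiny shot log below is made up, and the sketch assumes the rows are already in chronological order.

```python
import pandas as pd

# Made-up shot log, already sorted chronologically.
shots = pd.DataFrame({
    'game_date': ['g1', 'g1', 'g1', 'g1', 'g2', 'g2'],
    'period': [1, 1, 1, 2, 1, 1],
    'shot_made_flag': [1, 0, 1, 1, 0, 1],
})

# Outcome of the previous shot within the same game and period
# (NaN for the first shot of each period).
shots['prev_made'] = shots.groupby(['game_date', 'period'])['shot_made_flag'].shift(1)

# Keep only shots that have a predecessor, then compare the two groups.
followups = shots.dropna(subset=['prev_made'])
rates = followups.groupby('prev_made')['shot_made_flag'].agg(['count', 'mean'])
print(rates)
```

The `count` and `mean` columns correspond to the attempts and the make rate after a miss (`prev_made == 0`) and after a make (`prev_made == 1`), the same quantities held in `totalAttemptsAfterMissed`, `totalMadeAfterMissed` and their "after made" counterparts.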
Histograms
Plot a histogram of "time since last shot" for each of the two groups
```python
plt.rcParams['figure.figsize'] = (13, 10)

jointHist, timeBins = np.histogram(timeBetweenShotsDict['madeLast'] + timeBetweenShotsDict['missedLast'], bins=200)
barWidth = 0.999 * (timeBins[1] - timeBins[0])

timeDiffHist_GivenMadeLastShot, b = np.histogram(timeBetweenShotsDict['madeLast'], bins=timeBins)
timeDiffHist_GivenMissedLastShot, b = np.histogram(timeBetweenShotsDict['missedLast'], bins=timeBins)
maxHeight = max(max(timeDiffHist_GivenMadeLastShot), max(timeDiffHist_GivenMissedLastShot)) + 30

plt.figure()
plt.subplot(2, 1, 1)
# timeBins[:-1] are the left bin edges, so align the bars on their edges
plt.bar(timeBins[:-1], timeDiffHist_GivenMadeLastShot, align='edge', width=barWidth)
plt.xlim((0, 500))
plt.ylim((0, maxHeight))
plt.title('made last shot')
plt.ylabel('counts')
plt.subplot(2, 1, 2)
plt.bar(timeBins[:-1], timeDiffHist_GivenMissedLastShot, align='edge', width=barWidth)
plt.xlim((0, 500))
plt.ylim((0, maxHeight))
plt.title('missed last shot')
plt.xlabel('time since last shot')
plt.ylabel('counts')
plt.show()
```
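Here is the output:
The figure suggests that after making a shot Kobe is somewhat eager to take the next one. The flatter stretches in the plot are probably times when the other team has possession and it takes a while to win the ball back.
Cumulative Histograms
To make the difference between the histograms easier to see, let's look at the cumulative histograms.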
```python
plt.rcParams['figure.figsize'] = (13, 6)

timeDiffCumHist_GivenMadeLastShot = np.cumsum(timeDiffHist_GivenMadeLastShot).astype(float)
timeDiffCumHist_GivenMadeLastShot = timeDiffCumHist_GivenMadeLastShot / max(timeDiffCumHist_GivenMadeLastShot)
timeDiffCumHist_GivenMissedLastShot = np.cumsum(timeDiffHist_GivenMissedLastShot).astype(float)
timeDiffCumHist_GivenMissedLastShot = timeDiffCumHist_GivenMissedLastShot / max(timeDiffCumHist_GivenMissedLastShot)

maxHeight = max(timeDiffCumHist_GivenMadeLastShot[-1], timeDiffCumHist_GivenMissedLastShot[-1])

plt.figure()
madePrev = plt.plot(timeBins[:-1], timeDiffCumHist_GivenMadeLastShot, label='made Prev')
plt.xlim((0, 500))
missedPrev = plt.plot(timeBins[:-1], timeDiffCumHist_GivenMissedLastShot, label='missed Prev')
plt.xlim((0, 500))
plt.ylim((0, 1))
plt.title('cumulative density function - CDF')
plt.xlabel('time since last shot')
plt.legend(loc='lower right')
plt.show()
```
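The result looks like this:
A difference between the two curves can be observed, but it is not very clear, so let's go back to the Gaussian cluster view to display the data.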
```python
# Show where the shot attempts come from after a made shot and after a missed shot
plt.rcParams['figure.figsize'] = (13, 10)

variableCategories = afterMadeData['shotLocationCluster'].value_counts().index.tolist()
clusterFrequency = {}
for category in variableCategories:
    shotsAttempted = np.array(afterMadeData['shotLocationCluster'] == category).sum()
    clusterFrequency[category] = float(shotsAttempted) / afterMadeData.shape[0]

# .get(x, 0.0) covers clusters that have no shots in this subset
ellipseTextMessages = [str(100 * clusterFrequency.get(x, 0.0))[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('after made shots')

variableCategories = afterMissedData['shotLocationCluster'].value_counts().index.tolist()
clusterFrequency = {}
for category in variableCategories:
    shotsAttempted = np.array(afterMissedData['shotLocationCluster'] == category).sum()
    clusterFrequency[category] = float(shotsAttempted) / afterMissedData.shape[0]

ellipseTextMessages = [str(100 * clusterFrequency.get(x, 0.0))[:4] + '%' for x in range(numGaussians)]
Draw2DGaussians(gaussianMixtureModel, ellipseColors, ellipseTextMessages)
draw_court(outer_lines=True)
plt.ylim(-60, 440)
plt.xlim(270, -270)
plt.title('after missed shots')
plt.show()
```
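Let's look at the final result:
Conclusion
It is now clear that after missing a shot, Kobe is more likely to take his next shot from directly under the basket. The figures also show that after making a shot he is more likely to try a three-pointer next. However, nothing in this case study provides solid evidence that Kobe had a hot hand. It is easy to see that Kobe was a player who valued his game under the basket and around the free-throw line, and a supremely confident leader. He truly deserved to be our "Lao Da"!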
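Room for Improvement
The dataset used here is very large and rich in content; it even includes detailed data on every type of jump shot and layup. Readers who are interested can explore the information that this article has not yet mined, and there is surely plenty more to find.
Note: this analysis may contain mistakes. Corrections from readers are welcome, and thank you for reading.
Original link: https://blog.csdn.net/weixin_43656359/article/details/104586776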