3. Logistic Regression
Everyone knows the sign function:

f(x) = 0   (x = 0)
     = -1  (x < 0)
     = 1   (x > 0)
The following function is a variant of the sign function:

g(x) = 0.5 (x = 0)
     = 0   (x < 0)
     = 1   (x > 0)
The function g(z) = 1/(1+e^(-z)) agrees with this step function in its limiting behavior, and it is continuous. Its curve is shown below.
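A minimal sketch (our addition, assuming numpy and matplotlib) that draws the step variant g(x) alongside the logistic curve:

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 500)
sigmoid = 1 / (1 + np.exp(-z))                            # continuous logistic function
step = np.where(z > 0, 1.0, np.where(z < 0, 0.0, 0.5))    # the step variant g(x)

plt.plot(z, sigmoid, label='logistic: 1/(1+e^-z)')
plt.plot(z, step, '--', label='step variant g(x)')
plt.legend()
plt.show()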
We call this function the logistic function. From y = wx we let z = wx, so the model becomes g(z) = 1/(1+e^(-wx)). Then when y = 0, g(z) = 0.5; when y > 0, g(z) > 0.5 and tends to 1; when y < 0, g(z) < 0.5 and tends to 0, which accomplishes binary classification. sklearn.linear_model implements logistic regression through the LogisticRegression class.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

def useing_sklearn_datasets_for_LogisticRegression():
    cancer = datasets.load_breast_cancer()
    X = cancer.data
    y = cancer.target
    print("shape of X = {}, positive samples: {}, negative samples: {}".format(
        X.shape, y[y == 1].shape[0], y[y == 0].shape[0]))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = LogisticRegression()
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    print("breast cancer training set score: {trs:.2f}, test set score: {tss:.2f}".format(
        trs=train_score, tss=test_score))

useing_sklearn_datasets_for_LogisticRegression()
Output:

shape of X = (569, 30), positive samples: 357, negative samples: 212
breast cancer training set score: 0.95, test set score: 0.95

This result is quite satisfactory.
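To see the 0.5 threshold described above in action, here is a minimal sketch (our addition, reusing model and X_test from the block above): predict_proba returns the per-class probability g(z), and predict picks the class whose probability exceeds 0.5.

proba = model.predict_proba(X_test[:3])   # per-class probabilities for 3 samples
pred = model.predict(X_test[:3])          # class 1 whenever its probability > 0.5
print(proba)
print(pred)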
4. Ridge Regression
Ridge regression (also known as Tikhonov regularization) is a biased-estimation regression method designed for analyzing collinear data. It is essentially an improved least-squares estimator: by giving up the unbiasedness of ordinary least squares, it trades some information and precision for regression coefficients that are more realistic and more reliable, and it fits ill-conditioned data better than ordinary least squares. In other words, ridge regression sacrifices training set score to gain test set score.
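Concretely, sklearn's Ridge minimizes the ordinary least-squares loss plus an L2 penalty on the coefficients, with alpha controlling the penalty strength:

\min_w \; \|Xw - y\|_2^2 + \alpha \|w\|_2^2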
It is implemented by the Ridge class:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge

def useing_sklearn_datasets_for_Ridge():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lr = LinearRegression().fit(X_train, y_train)
    ridge = Ridge().fit(X_train, y_train)   # default alpha=1
    print('alpha=1, diabetes training set score: {:.2f}'.format(ridge.score(X_train, y_train)))
    print('alpha=1, diabetes test set score: {:.2f}'.format(ridge.score(X_test, y_test)))
    ridge10 = Ridge(alpha=10).fit(X_train, y_train)
    print('alpha=10, diabetes training set score: {:.2f}'.format(ridge10.score(X_train, y_train)))
    print('alpha=10, diabetes test set score: {:.2f}'.format(ridge10.score(X_test, y_test)))
    ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
    print('alpha=0.1, diabetes training set score: {:.2f}'.format(ridge01.score(X_train, y_train)))
    print('alpha=0.1, diabetes test set score: {:.2f}'.format(ridge01.score(X_test, y_test)))

useing_sklearn_datasets_for_Ridge()

Output:
alpha=1, diabetes training set score: 0.43
alpha=1, diabetes test set score: 0.43
alpha=10, diabetes training set score: 0.14
alpha=10, diabetes test set score: 0.16
alpha=0.1, diabetes training set score: 0.52
alpha=0.1, diabetes test set score: 0.47

Let's analyze the training and test set scores under each alpha with the table below.
alpha            | Training set score | Test set score
1                | 0.43               | 0.43
10               | 0.14               | 0.16
0.1              | 0.52               | 0.47
LinearRegression | 0.54               | 0.45
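Instead of comparing a handful of alpha values by hand, sklearn's RidgeCV can pick alpha by cross-validation. A minimal sketch (our addition, reusing X_train, X_test, y_train, y_test from the example above; the alpha grid here is an arbitrary choice of ours):

from sklearn.linear_model import RidgeCV

ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X_train, y_train)
print('best alpha:', ridge_cv.alpha_)   # alpha chosen by cross-validation
print('test score: {:.2f}'.format(ridge_cv.score(X_test, y_test)))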
Plot the coefficients obtained under each alpha:

import matplotlib.pyplot as plt

plt.plot(ridge.coef_, 's', label='Ridge alpha=1')
plt.plot(ridge10.coef_, '^', label='Ridge alpha=10')
plt.plot(ridge01.coef_, 'v', label='Ridge alpha=0.1')
plt.plot(lr.coef_, 'o', label='Linear Regression')
plt.xlabel('coefficient index')
plt.ylabel('coefficient magnitude')
plt.hlines(0, 0, len(lr.coef_))
plt.legend()
plt.show()

The larger alpha is, the more strongly the coefficients converge toward zero.
Next, draw learning curves for ridge and linear regression:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import learning_curve, KFold

def plot_learning_curve(est, X, y):
    training_set_size, train_scores, test_scores = learning_curve(
        est, X, y, train_sizes=np.linspace(.1, 1, 20),
        cv=KFold(20, shuffle=True, random_state=1))
    estimator_name = est.__class__.__name__
    line = plt.plot(training_set_size, train_scores.mean(axis=1), '--',
                    label="training " + estimator_name)
    plt.plot(training_set_size, test_scores.mean(axis=1), '-',
             label="test " + estimator_name, c=line[0].get_color())
    plt.xlabel('training set size')
    plt.ylabel('score')
    plt.ylim(0, 1.1)

# X, y: the diabetes data used throughout this section
X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
plot_learning_curve(Ridge(alpha=1), X, y)
plot_learning_curve(LinearRegression(), X, y)
plt.legend(loc=(0, 1.05), ncol=2, fontsize=11)
plt.show()
These results show that:

the training set score is higher than the test set score;
the ridge test set score is lower than the linear regression test set score;
the ridge test set score is close to its training set score;
when the training set is small, neither linear model learns much;
as the training set grows, the two scores converge.
5. Lasso Regression
Lasso regression is broadly similar to ridge regression. In practice, between the two, ridge regression is usually the first choice. However, if there are particularly many features and only some of them really matter, lasso, which performs feature selection, may be the better choice.
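The difference from ridge lies in the penalty: lasso penalizes the L1 norm of the coefficients, which is what pushes some of them exactly to zero. sklearn's documented Lasso objective is:

\min_w \; \frac{1}{2 n_{\mathrm{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1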
It is implemented by the Lasso class:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

def useing_sklearn_datasets_for_Lasso():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lasso = Lasso().fit(X_train, y_train)   # default alpha=1
    print('alpha=1, diabetes training set score: {:.2f}'.format(lasso.score(X_train, y_train)))
    print('alpha=1, diabetes test set score: {:.2f}'.format(lasso.score(X_test, y_test)))
    print('alpha=1, number of features used by lasso: {}'.format(np.sum(lasso.coef_ != 0)))
    lasso01 = Lasso(alpha=0.1, max_iter=100000).fit(X_train, y_train)
    print('alpha=0.1, max_iter=100000, diabetes training set score: {:.2f}'.format(lasso01.score(X_train, y_train)))
    print('alpha=0.1, max_iter=100000, diabetes test set score: {:.2f}'.format(lasso01.score(X_test, y_test)))
    print('alpha=0.1, number of features used by lasso: {}'.format(np.sum(lasso01.coef_ != 0)))
    lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
    print('alpha=0.0001, max_iter=100000, diabetes training set score: {:.2f}'.format(lasso00001.score(X_train, y_train)))
    print('alpha=0.0001, max_iter=100000, diabetes test set score: {:.2f}'.format(lasso00001.score(X_test, y_test)))
    print('alpha=0.0001, number of features used by lasso: {}'.format(np.sum(lasso00001.coef_ != 0)))

useing_sklearn_datasets_for_Lasso()

Output:
alpha=1, diabetes training set score: 0.37
alpha=1, diabetes test set score: 0.38
alpha=1, number of features used by lasso: 3
alpha=0.1, max_iter=100000, diabetes training set score: 0.52
alpha=0.1, max_iter=100000, diabetes test set score: 0.48
alpha=0.1, number of features used by lasso: 7
alpha=0.0001, max_iter=100000, diabetes training set score: 0.53
alpha=0.0001, max_iter=100000, diabetes test set score: 0.45
alpha=0.0001, number of features used by lasso: 10
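Which features survive the selection can be read off the nonzero coefficients. A minimal sketch (our addition, reusing the fitted lasso01 from the block above):

import numpy as np
from sklearn import datasets

feature_names = np.array(datasets.load_diabetes().feature_names)
print(feature_names[lasso01.coef_ != 0])   # names of the features lasso kept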
Now let's compare ridge regression with lasso:

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge

def Ridge_VS_Lasso():
    X, y = datasets.load_diabetes().data, datasets.load_diabetes().target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, test_size=0.3)
    lasso = Lasso(alpha=1, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso.coef_, 's', label='lasso alpha=1')
    lasso01 = Lasso(alpha=0.1, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso01.coef_, '^', label='lasso alpha=0.1')
    lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
    plt.plot(lasso00001.coef_, 'v', label='lasso alpha=0.0001')
    ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
    plt.plot(ridge01.coef_, 'o', label='ridge alpha=0.1')
    plt.legend(ncol=2, loc=(0, 1.05))
    plt.ylim(-1000, 750)
    plt.xlabel('Coefficient index')
    plt.ylabel('Coefficient magnitude')
    plt.show()

Ridge_VS_Lasso()

These results show that:
if the data has many features and only a small subset of them is truly important, use lasso; otherwise, use ridge regression.
6. Testing All Linear Models on sklearn Datasets

Create the file machinelearn_data_model.py:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR
import statsmodels.api as sm


class data_for_model:
    @staticmethod
    def machine_learn(data, model):
        # choose the dataset
        if data == "iris":
            mydata = datasets.load_iris()
        elif data == "wine":
            mydata = datasets.load_wine()
        elif data == "breast_cancer":
            mydata = datasets.load_breast_cancer()
        elif data == "diabetes":
            mydata = datasets.load_diabetes()
        elif data == "boston":
            mydata = datasets.load_boston()  # available in older sklearn versions
        elif data != "two_moon":
            return "Invalid dataset; valid choices: iris, wine, breast_cancer, diabetes, boston, two_moon"
        if data == "two_moon":
            X, y = datasets.make_moons(n_samples=200, noise=0.05, random_state=0)
        elif model in ("DecisionTreeClassifier", "RandomForestClassifier"):
            X, y = mydata.data[:, :2], mydata.target
        else:
            X, y = mydata.data, mydata.target
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        # choose the model
        if model == "LinearRegression":
            md = LinearRegression().fit(X_train, y_train)
        elif model == "LogisticRegression":
            if data == "boston":
                # LogisticRegression is a classifier, so the continuous
                # Boston house prices must be cast to int targets
                y_train = y_train.astype('int')
                y_test = y_test.astype('int')
            md = LogisticRegression().fit(X_train, y_train)
        elif model == "Ridge":
            md = Ridge(alpha=0.1).fit(X_train, y_train)
        elif model == "Lasso":
            md = Lasso(alpha=0.0001, max_iter=10000000).fit(X_train, y_train)
        elif model == "SVM":
            md = LinearSVR(C=2).fit(X_train, y_train)
        elif model == "sm":
            md = sm.OLS(y, X).fit()
        else:
            return "Invalid model; valid choices: LinearRegression, LogisticRegression, Ridge, Lasso, SVM, sm"
        if model == "sm":
            print("results.params:\n", md.params,
                  "\nresults.summary:\n", md.summary())
        else:
            print("model:", model, "data:", data,
                  "training set score: {:.2%}".format(md.score(X_train, y_train)))
            print("model:", model, "data:", data,
                  "test set score: {:.2%}".format(md.score(X_test, y_test)))
A few points to note here:

LogisticRegression is a classifier, so on the Boston house-price data it requires the target y to be of int type, hence the check and cast;
the Ridge model uses alpha = 0.1;
the Lasso model uses alpha = 0.0001 with a maximum of 10,000,000 iterations.

With this, we can quantitatively analyze any specified model on any specified dataset:
from machinelearn_data_model import data_for_model

def linear_for_all_data_and_model():
    datas = ["iris", "wine", "breast_cancer", "diabetes", "boston", "two_moon"]
    models = ["LinearRegression", "LogisticRegression", "Ridge", "Lasso", "SVM", "sm"]
    for data in datas:
        for model in models:
            data_for_model.machine_learn(data, model)

linear_for_all_data_and_model()
Let's compare the test results:

Data           | Model              | Training set score | Test set score
iris           | LinearRegression   | 92.7%              | 93.7%
iris           | LogisticRegression | 96.4%              | 100%
iris           | Ridge              | 93.1%              | 92.8%
iris           | Lasso              | 92.8%              | 93.1%
iris           | StatsModels OLS    | 0.972              |
wine           | LinearRegression   | 90.6%              | 85.1%
wine           | LogisticRegression | 97.7%              | 95.6%
wine           | Ridge              | 90.2%              | 86.8%
wine           | Lasso              | 91.0%              | 85.2%
wine           | StatsModels OLS    | 0.948              |
breast cancer  | LinearRegression   | 79.1%              | 68.9%
breast cancer  | LogisticRegression | 95.3%              | 93.0%
breast cancer  | Ridge              | 75.7%              | 74.5%
breast cancer  | Lasso              | 77.6%              | 71.4%
breast cancer  | StatsModels OLS    | 0.908              |
diabetes       | LinearRegression   | 52.5%              | 47.9%
diabetes       | LogisticRegression | 2.7%               | 0.0%
diabetes       | Ridge              | 51.5%              | 49.2%
diabetes       | Lasso              | 51.5%              | 50.2%
diabetes       | StatsModels OLS    | 0.106              |
Boston housing | LinearRegression   | 74.5%              | 70.9%
Boston housing | LogisticRegression | 20.8%              | 11.0%
Boston housing | Ridge              | 76.0%              | 62.7%
Boston housing | Lasso              | 73.5%              | 74.5%
Boston housing | StatsModels OLS    | 0.959              |
two moons      | LinearRegression   | 66.9%              | 63.0%
two moons      | LogisticRegression | 89.3%              | 86.0%
two moons      | Ridge              | 66.3%              | 64.3%
two moons      | Lasso              | 65.3%              | 65.2%
two moons      | StatsModels OLS    | 0.501              |

(StatsModels OLS reports a single figure, read from its model summary.)

From these numbers:

the iris data performs well under every model;
the wine data also performs well under every model, though slightly worse than iris;
the breast cancer data performs well only under logistic regression and OLS;
the diabetes data performs poorly under every model;
the Boston housing data performs well only under OLS and poorly under the other models, although under logistic regression it does somewhat better than diabetes;
the two-moons data performs best under LogisticRegression and not particularly well under the rest.
To summarize in one table, following the remarks above (good where a model performed well, poor where it did not):

Data           | LinearRegression | LogisticRegression | Ridge | Lasso | OLS
iris           | good             | good               | good  | good  | good
wine           | good             | good               | good  | good  | good
breast cancer  | poor             | good               | poor  | poor  | good
diabetes       | poor             | poor               | poor  | poor  | poor
Boston housing | poor             | poor               | poor  | poor  | good
two moons      | poor             | good               | poor  | poor  | poor
Finally, let's bundle in the KNN algorithm as well. machinelearn_data_model.py is modified as follows:
...
        elif model == "KNeighborsClassifier":
            if data == "boston":
                y_train = y_train.astype('int')
                y_test = y_test.astype('int')
            md = KNeighborsClassifier().fit(X_train, y_train)
        else:
            return "Invalid model; valid choices: LinearRegression, LogisticRegression, Ridge, Lasso, SVM, sm, KNeighborsClassifier"
...
Call the test program:

from machinelearn_data_model import data_for_model

def KNN_for_all_data_and_model():
    datas = ["iris", "wine", "breast_cancer", "diabetes", "boston", "two_moon"]
    models = ["KNeighborsClassifier"]
    for data in datas:
        for model in models:
            data_for_model.machine_learn(data, model)

KNN_for_all_data_and_model()
The test results:

Data           | Model                | Training set score | Test set score
iris           | KNeighborsClassifier | 95.5%              | 97.4%
wine           | KNeighborsClassifier | 76.7%              | 68.9%
breast cancer  | KNeighborsClassifier | 95.3%              | 90.2%
diabetes       | KNeighborsClassifier | 19.6%              | 0.0%
Boston housing | KNeighborsClassifier | 36.4%              | 4.7%
two moons      | KNeighborsClassifier | 100.00%            | 100.00%
As we can see, KNeighborsClassifier works quite well on the iris, breast cancer, and two-moons data.
In table form:

Data           | KNeighborsClassifier
iris           | good
wine           | poor
breast cancer  | good
diabetes       | poor
Boston housing | poor
two moons      | good
——————————————————————————————————
顧老師課程歡迎報名
軟體安全測試
https://study.163.com/course/courseMain.htm?courseId=1209779852&share=2&shareId=480000002205486
接口自動化測試
https://study.163.com/course/courseMain.htm?courseId=1209794815&share=2&shareId=480000002205486
DevOps和Jenkins之DevOps
https://study.163.com/course/courseMain.htm?courseId=1209817844&share=2&shareId=480000002205486
DevOps與Jenkins 2.0之詹金斯
https://study.163.com/course/courseMain.htm?courseId=1209819843&share=2&shareId=480000002205486
硒自動化測試
https://study.163.com/course/courseMain.htm?courseId=1209835807&share=2&shareId=480000002205486
性能測試第1季:性能測試基礎知識
https://study.163.com/course/courseMain.htm?courseId=1209852815&share=2&shareId=480000002205486
性能測試第2季:LoadRunner12使用
https://study.163.com/course/courseMain.htm?courseId=1209980013&share=2&shareId=480000002205486
性能測試第3季:JMeter工具使用
https://study.163.com/course/courseMain.htm?courseId=1209903814&share=2&shareId=480000002205486
性能測試第4季:監控與調優
https://study.163.com/course/courseMain.htm?courseId=1209959801&share=2&shareId=480000002205486
Django入門
https://study.163.com/course/courseMain.htm?courseId=1210020806&share=2&shareId=480000002205486
啄木鳥顧老師漫談軟體測試
https://study.163.com/course/courseMain.htm?courseId=1209958326&share=2&shareId=480000002205486