Tensorflow.keras Notes
Recognizing Sci-hub CAPTCHAs with a Convolutional Neural Network
Use the requests library and selenium to batch-download the sci_hub CAPTCHA dataset. Issues to consider:

1. Not every DOI submission triggers the CAPTCHA page.
2. Batch downloading can trip the site's crawler detection, so the script should pause for a random interval between downloads.
3. When a download fails, the run has to be resumable, so the download is driven by an outer loop.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re
import requests
import time
import numpy as np

def path(sci, doi, start, end, path):
    # Open Chrome, submit the DOI, and scrape the CAPTCHA image URL from the page source.
    browser = webdriver.Chrome(executable_path="D:/chromedriver.exe")
    for i in range(start, end):
        browser.get(sci)
        jpg = browser.find_element_by_name('request')
        jpg.send_keys(doi)
        jpg.send_keys(Keys.ENTER)
        try:
            a = sci + re.search(r'(src=")(.*)(.jpg)', browser.page_source).group(2) + '.jpg'
            b = requests.get(a, timeout=30)
            b.encoding = b.apparent_encoding
            load = path + 'test' + str(i + 1) + '.jpg'
            # Save the downloaded test image to disk.
            with open(load, 'wb') as f:
                f.write(b.content)
            print('Downloaded test{}.jpg'.format(str(i + 1)))
        except Exception:
            print('Download failed')
            break
        # Pause between requests to avoid crawler detection.
        # Note: np.random.randint(1) always returns 0; a range such as np.random.randint(1, 10)
        # would give an actual random pause.
        time.sleep(np.random.randint(1))
    browser.quit()
    return i

a = ['10.2174/1381612820666140825143610', '10.1037/ccp0000440',
     '10.1176/appi.psychotherapy.20190009', '10.1037/pas0000715',
     '10.4149/neo_2019_190131N92', '10.1016/j.oftal.2019.04.003',
     '10.1136/rmdopen-2019-000914', '10.1136/rmdopen-2019-001017',
     '10.1016/j.eimc.2019.01.014', '10.1136/bmjopen-2019-030598']
k = 1
for i in range(len(a)):
    # Resume from image k; move on to the next DOI whenever a download fails.
    k = path(sci=r'https://sci-hub.se',
             doi=a[i],
             start=k,
             end=500,
             path=r'D:/sci_hub_raw/test/')
    if k == 500:
        break

The sci-hub CAPTCHA images contain noise, so they are binarized. First, use Photoshop to check the difference between the grayscale values of the noise and of the letters, and binarize on that threshold range. Second, batch-load the images and save them as an npz array.

from PIL import Image
import numpy as np
import pandas as pd

def jpg_load(load, width, length, lower, upper):
    # Convert to grayscale, then keep only pixels whose gray value falls in [lower, upper].
    img = Image.open(load)
    img = img.convert('L')
    img = np.array(img)
    for i in range(width):
        for j in range(length):
            if lower <= img[i, j] <= upper:
                img[i, j] = 1
            else:
                img[i, j] = 0
    return img

load = 'D:/1_數據/Python/Tensorflow學習/數據/sci_hub_1/sci_hub_raw'
train_path0 = ['1_500', '501_1000', '1001_1500', '1501_2000', '2001_2500',
               '2501_3000', '3001_3500', '3501_4000', '4001_4500', '4501_5000',
               '5001_5500', '5501_6000', '6001_6500', '6501_7000', '7001_7500',
               '7501_8000', '8001_8500', '8501_9000', '9001_9500', '9501_10000']
train_path1 = np.arange(10000).reshape(20, 500) + 1
train = np.empty((10000, 200, 800))
for i in range(20):
    for j in range(500):
        a = train_path0[i]
        b = train_path1[i, j]
        c = b - 1
        path = load + '/' + str(a) + '/' + str(b) + '.jpg'
        # Letter strokes sit in the 115-125 gray range, so that band becomes 1 and everything else 0.
        img = jpg_load(path, 200, 800, 115, 125)
        img = img.reshape(1, 200, 800)
        train[c] = img
        if b % 10 == 0:
            print('Loaded image {0}'.format(b))
np.savez('D:/sci_hub_op/train.npz', train_x=train)

CNN modeling: each sci-hub CAPTCHA contains six letters, so this is a single-input, multi-output problem.
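The label array train_y used by the training code below is not built in the snippets above; it has to come from hand-labelled six-letter answers, one-hot encoded and stored alongside train_x. A minimal sketch of how it could be produced, assuming the answers live in a plain-text file answers.txt (hypothetical name, one lowercase six-letter string per line) and using the (6, N, 26) layout that the later code expects:

import numpy as np

letters = 'abcdefghijklmnopqrstuvwxyz'
letter = {ch: idx for idx, ch in enumerate(letters)}

# answers.txt is a hypothetical file holding one six-letter answer per CAPTCHA, e.g. 'qwerty'.
with open('answers.txt') as f:
    answers = [line.strip() for line in f if line.strip()]

n = len(answers)
# train_y[k, i] is the one-hot vector for the (k+1)-th letter of CAPTCHA i.
train_y = np.zeros((6, n, 26), dtype='float32')
for i, ans in enumerate(answers):
    for k, ch in enumerate(ans):
        train_y[k, i, letter[ch]] = 1.0

# Save images and labels together so np.load('train.npz')['train_y'] works in the training code.
np.savez('D:/sci_hub_op/train.npz', train_x=train, train_y=train_y)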
PS: This is a test model written while learning; the main point is being able to run the whole pipeline end to end, although the final results are also quite good (the image features are fairly simple).

1. Load the image data and label data and reshape them into a format the model can take as input.
2. Modeling: five convolution blocks with 32/64/128/256/512 filters, 3*3 kernels, relu activation, and 2*2 max pooling; the fully connected layer outputs six results directly.
3. Prediction: a CAPTCHA counts as fully recognized only if all six letters are predicted correctly.

import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
import os
from matplotlib import pyplot as plt
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, MaxPool2D, Dropout, Flatten, Dense
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, MaxPooling2D

np.set_printoptions(threshold=np.inf)

# train.npz is expected to hold both the images (train_x) and the one-hot labels (train_y, shape (6, 10000, 26)).
train = np.load('D:/sci_hub_op/train.npz')
train_x = train['train_x']
train_y = train['train_y']
train_x = train_x.astype('float32')
train_x = train_x.reshape(10000, 200, 800, 1)
# One (10000, 26) label array per character position, matching the six output heads.
b = [train_y[0], train_y[1], train_y[2], train_y[3], train_y[4], train_y[5]]

input_x = Input(shape=(200, 800, 1))
x = input_x
x = Conv2D(filters=32, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(filters=64, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(filters=128, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(filters=256, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(filters=512, kernel_size=(3, 3), activation='relu', padding='same')(x)
x = BatchNormalization()(x)
x = Activation('relu')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Dropout(0.2)(x)
x = Flatten()(x)
# Six softmax heads, one per character position, each over the 26 lowercase letters.
x = [Dense(26, activation='softmax', name='c%d' % (i + 1))(x) for i in range(6)]

model = Model(inputs=input_x, outputs=x)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
model.fit(train_x, b, batch_size=1, epochs=2, validation_split=0.1)
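Because the six output layers are named c1 through c6, Keras also accepts the labels (and per-output losses) as dictionaries keyed by those output names, which can read more clearly than a positional list. A sketch of an equivalent compile/fit call, under the same assumptions about train_y as above:

# Equivalent multi-output training call using output names instead of a positional label list.
labels = {'c%d' % (i + 1): train_y[i] for i in range(6)}

model.compile(optimizer='adam',
              loss={'c%d' % (i + 1): 'categorical_crossentropy' for i in range(6)},
              metrics=['categorical_accuracy'])
model.fit(train_x, labels, batch_size=1, epochs=2, validation_split=0.1)

A larger batch_size (for example 32) would normally train much faster if GPU memory allows; batch_size=1 is kept here only because that is what the original run used.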
word = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
        'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
letter = {}
for i, v in enumerate(word):
    letter[v] = i
# Reverse mapping: class index -> letter.
letter_T = dict(zip(letter.values(), letter.keys()))

result = []
for i in range(10000):
    # Predict one image at a time; the model returns a list of six (1, 26) probability arrays.
    x_predict = train_x[i].reshape(1, 200, 800, 1)
    c = model.predict(x_predict)
    c = np.array(c).reshape(6, 26)
    c = c.argmax(axis=1)
    if i % 100 == 0:
        print(i)
    result.append(c)
result = np.array(result)
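Calling model.predict once per image is slow. Keras can predict the whole array in batches, so the loop above can be collapsed into a vectorized equivalent; a sketch using the same variable names, assuming nothing beyond the trained model:

# Predict all 10000 images at once; preds is a list of six arrays, each of shape (10000, 26).
preds = model.predict(train_x, batch_size=32)
# Stack to (6, 10000, 26), take the most likely class per head, then put samples first: (10000, 6).
result = np.stack(preds).argmax(axis=2).T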
# Ground-truth class indices: shape (10000, 6), one column per character position.
y_result = train_y.transpose(1, 0, 2)
y_result = y_result.argmax(axis=2)
y_result = y_result[0:10000, :]

d = []
for i in range(10000):
    # A CAPTCHA only counts as correct when all six predicted letters match the ground truth.
    if list(result[i]) == list(y_result[i]):
        e = 1
        d.append(e)
    else:
        e = 0
        d.append(e)
d = np.array(d)
sum(d)

f = []
for j in range(10000):
    c = ''
    for i in range(6):
        a = result[j, i]
        b = letter_T[a]
        c = c + b
    f.append(c)
    print('Prediction for CAPTCHA {0}: {1}'.format(j + 1, c))
f = pd.DataFrame(f)
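The same whole-CAPTCHA accuracy can be computed in one line with NumPy instead of the explicit loop; a small sketch using the result and y_result arrays defined above:

# Per-character accuracy, and full-CAPTCHA accuracy where all six letters must match.
char_acc = np.mean(result == y_result)
captcha_acc = np.mean(np.all(result == y_result, axis=1))
print('per-character accuracy: {:.4f}, whole-CAPTCHA accuracy: {:.4f}'.format(char_acc, captcha_acc))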