70 High-Frequency NumPy Operations for Working Data Analysts (Part 2)

2021-02-20 pythonic生物人
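36. Compute the correlation coefficient between two columns of a numpy.ndarray
37. Check whether a numpy.ndarray contains null (NaN) values
38. Replace missing values in a numpy.ndarray with a given value
39. Count the frequency of each element in a numpy.ndarray
40. Convert numeric numpy.ndarray elements to categories
41. Derive a new column from existing numpy.ndarray columns
42. Probabilistic sampling from a numpy.ndarray
43. Find the second-largest element of a numpy.ndarray within one category
44. Sort a numpy.ndarray by one column
45. Find the most frequent element in a numpy.ndarray
46. Find the position of the first element of a numpy.ndarray greater than a given value
47. Replace elements of a numpy.ndarray that meet a condition with a given value
48. Get the positions and values of the n largest elements of a numpy.ndarray
49. Compute row-wise counts of a numpy.ndarray
50. Combine multiple numpy.ndarrays into one
51. Compute the one-hot encoding of a numpy.ndarray
52. Create row numbers grouped by a categorical variable
53. Create group IDs based on a given categorical variable
54. Rank the elements of a 1-D numpy.ndarray
55. Rank the elements of a multidimensional numpy.ndarray
56. Get the maximum element of each row of a numpy.ndarray
57. Compute the min-to-max ratio of each row of a numpy.ndarray
58. Flag duplicate elements of a numpy.ndarray (True for repeats, False for first occurrences)
59. Compute the mean of each group in a numpy.ndarray
60. Convert a PIL image to a numpy.ndarray
61. Drop all missing values from a numpy.ndarray
62. Compute the Euclidean distance between two numpy.ndarrays
63. Find the positions of local maxima in a numpy.ndarray
64. numpy.ndarray subtraction
65. Find the index of the nth occurrence of an element in a numpy.ndarray
66. Convert a numpy datetime64 value to datetime
67. Compute the moving average of a numpy.ndarray over a window of size n
68. Build a numpy.ndarray from a start value, a length, and a step
69. Fill in the missing dates of a non-contiguous time-series numpy.ndarray
70. Build a numpy.ndarray of sliding windows with a given stride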

36. Compute the correlation coefficient between two columns of a numpy.ndarray
import numpy as np

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])

# Method 1
np.corrcoef(iris[:, 0], iris[:, 2])[0, 1]

# Method 2 (pearsonr also returns the p-value)
from scipy.stats import pearsonr
corr, p_value = pearsonr(iris[:, 0], iris[:, 2])
print(corr)
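
As a cross-check that needs no download, the sketch below writes the Pearson formula out by hand and compares it with np.corrcoef. The arrays x and y are synthetic stand-ins invented for this illustration, not columns of the iris data.

```python
import numpy as np

# Synthetic stand-ins for two numeric columns (hypothetical data, not iris).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

# Pearson r = sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2)), with centered inputs.
xc = x - x.mean()
yc = y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r(x, y).
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

The two values should agree to floating-point precision.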

37. Check whether a numpy.ndarray contains null (NaN) values
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])

np.isnan(iris_2d).any()

38. Replace missing values in a numpy.ndarray with a given value
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
# Inject NaN at 20 random positions so there is something to replace
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan

iris_2d[np.isnan(iris_2d)] = 0  # replace missing values with 0
iris_2d[:4]

39. Count the frequency of each element in a numpy.ndarray
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

species = np.array([row.tolist()[4] for row in iris])

# Get the unique values and the counts
np.unique(species, return_counts=True)

40. Convert numeric numpy.ndarray elements to categories
'''
Goal:
Less than 3 --> 'small'
3-5 --> 'medium'
>=5 --> 'large'
'''

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

# Bin petallength 
petal_length_bin = np.digitize(iris[:, 2].astype('float'), [0, 3, 5, 10])

# Map it to respective category
label_map = {1: 'small', 2: 'medium', 3: 'large', 4: np.nan}
petal_length_cat = [label_map[x] for x in petal_length_bin]

# View
petal_length_cat[:4]
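
The dictionary lookup can also be replaced by indexing into an array of labels. This is only a sketch of that variant: the petal_length values are made up for the example, and the bin edges [3, 5] are chosen so that np.digitize returns 0, 1, or 2 directly.

```python
import numpy as np

# Hypothetical petal lengths (not read from iris.data).
petal_length = np.array([1.4, 3.0, 4.7, 5.0, 6.1])

# With edges [3, 5], digitize returns 0 for x < 3, 1 for 3 <= x < 5, 2 for x >= 5.
codes = np.digitize(petal_length, [3, 5])

# Index an array of labels with the bin codes to get one label per element.
labels = np.array(['small', 'medium', 'large'])
petal_length_cat = labels[codes]
print(petal_length_cat)
```

Because the labels live in an array, the result stays a numpy.ndarray instead of becoming a Python list.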

41. Derive a new column from existing numpy.ndarray columns
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='object')

# Compute the new column
sepallength = iris_2d[:, 0].astype('float')
petallength = iris_2d[:, 2].astype('float')
volume = (np.pi * petallength * (sepallength**2))/3

# Reshape to a column vector so it can be stacked next to iris_2d
volume = volume[:, np.newaxis]

# Append the new column
out = np.hstack([iris_2d, volume])
out[:4]

42. Probabilistic sampling from a numpy.ndarray
# Goal: sample species so that setosa is drawn twice as often as versicolor or as virginica
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Get the species column
species = iris[:, 4]

# Method 1
np.random.seed(100)
a = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
species_out = np.random.choice(a, 150, p=[0.5, 0.25, 0.25])

# Method 2
np.random.seed(100)
probs = np.r_[np.linspace(0, 0.500, num=50), np.linspace(0.501, .750, num=50), np.linspace(.751, 1.0, num=50)]
index = np.searchsorted(probs, np.random.random(150))
species_out = species[index]
print(np.unique(species_out, return_counts=True))
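
A third option, sketched here under assumptions, is to weight the rows themselves and let the sampler pick row indices. The species array below is a stand-in built with np.repeat rather than read from the URL, and np.random.default_rng is used in place of the legacy np.random.seed interface.

```python
import numpy as np

# Stand-in for the species column (hypothetical; the article reads it from iris.data).
species = np.repeat(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], 50)

# Give each setosa row twice the weight of any other row, then normalize to sum to 1.
weights = np.where(species == 'Iris-setosa', 2.0, 1.0)
p = weights / weights.sum()

# Draw 150 row indices with replacement according to p.
rng = np.random.default_rng(100)
index = rng.choice(species.size, size=150, p=p)
species_out = species[index]
print(np.unique(species_out, return_counts=True))
```

With these weights the setosa rows carry half of the total probability, so roughly half of the sampled labels should be setosa.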

43. Find the second-largest element of a numpy.ndarray within one category
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Petal lengths of the setosa rows only
petal_len_setosa = iris[iris[:, 4] == b'Iris-setosa', [2]].astype('float')

# Second-largest distinct value (np.unique returns sorted unique values)
np.unique(np.sort(petal_len_setosa))[-2]

44. Sort a numpy.ndarray by one column
import numpy as np
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')
# Sort by the first column; cast to float so the comparison is numeric rather than bytewise
print(iris[iris[:, 0].astype(float).argsort()][:20])

45. Find the most frequent element in a numpy.ndarray
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

vals, counts = np.unique(iris[:, 2], return_counts=True)
print(vals[np.argmax(counts)])

46. Find the position of the first element of a numpy.ndarray greater than a given value
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

np.argwhere(iris[:, 3].astype(float) > 1.0)[0]

47. Replace elements of a numpy.ndarray that meet a condition with a given value
# Goal: replace values greater than 30 with 30 and values less than 10 with 10
np.set_printoptions(precision=2)
np.random.seed(100)
a = np.random.uniform(1,50, 20)

# Method 1
np.clip(a, a_min=10, a_max=30)

# Method 2
print(np.where(a < 10, 10, np.where(a > 30, 30, a)))
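
48. Get the positions and values of the n largest elements of a numpy.ndarray
np.random.seed(100)
a = np.random.uniform(1,50, 20)

## Positions of the 5 largest elements
# Method 1 (positions ordered by ascending value)
print(a.argsort()[-5:])

# Method 2 (positions in no particular order)
np.argpartition(-a, 5)[:5]

## Values of the 5 largest elements
# Method 1
a[a.argsort()][-5:]

# Method 2
np.sort(a)[-5:]

# Method 3
np.partition(a, kth=-5)[-5:]

# Method 4
a[np.argpartition(-a, 5)][:5]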
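
The argpartition variants return the top elements in no guaranteed order. A possible way to get them largest-first, sketched on a small hand-written array (the values and the name n are illustrative only), is to partition first and then sort just the n selected items.

```python
import numpy as np

# Small illustrative array with distinct values (hypothetical data).
a = np.array([7.0, 1.0, 9.0, 3.0, 8.0, 2.0, 6.0])
n = 3

# argpartition places the indices of the n largest values in the last n slots, unordered.
top_idx = np.argpartition(a, -n)[-n:]

# Sort only those n candidates by value, largest first.
top_idx = top_idx[np.argsort(a[top_idx])[::-1]]
top_vals = a[top_idx]
print(top_idx, top_vals)
```

Only n items are fully sorted, which can matter when the array is large and n is small.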

49. Compute row-wise counts of a numpy.ndarray
np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
print(arr)
def counts_of_all_values_rowwise(arr2d):
    # Unique values and their counts, row by row
    num_counts_array = [np.unique(row, return_counts=True) for row in arr2d]

    # Count of every value in each row (0 if the value is absent from that row)
    return [[int(b[a == i][0]) if i in a else 0 for i in np.unique(arr2d)] for a, b in num_counts_array]

print(np.arange(1,11))
counts_of_all_values_rowwise(arr)
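
The same table of counts can be produced without Python-level loops by comparing every element against every distinct value through broadcasting. The sketch below uses a small hand-written array so the result is easy to verify; it trades memory (an intermediate of shape rows x columns x values) for brevity.

```python
import numpy as np

# Small illustrative input (hypothetical data).
arr = np.array([[1, 2, 2, 3],
                [3, 3, 3, 1],
                [2, 2, 2, 2]])

# Distinct values across the whole array, sorted: one output column per value.
values = np.unique(arr)

# arr[:, :, None] == values has shape (rows, cols, n_values);
# summing over the cols axis counts matches per row and per value.
counts = (arr[:, :, None] == values).sum(axis=1)
print(values)
print(counts)
```

Row i, column j of counts is the number of times values[j] appears in row i.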

50. Combine multiple numpy.ndarrays into one
arr1 = np.arange(3)
arr2 = np.arange(3,7)
arr3 = np.arange(7,10)

# The arrays differ in length, so dtype=object is required to nest them
array_of_arrays = np.array([arr1, arr2, arr3], dtype=object)
print('array_of_arrays: ', array_of_arrays)

# Method 1
arr_1d = np.array([a for arr in array_of_arrays for a in arr])

# Method 2
arr_1d = np.concatenate(array_of_arrays)
print(arr_1d)

51. Compute the one-hot encoding of a numpy.ndarray
np.random.seed(101)
arr = np.random.randint(1,4, size=6)
print(arr)

# Solution:
def one_hot_encodings(arr):
    uniqs = np.unique(arr)
    out = np.zeros((arr.shape[0], uniqs.shape[0]))
    # Column j stands for uniqs[j]; searchsorted maps each value to its column
    for i, k in enumerate(arr):
        out[i, np.searchsorted(uniqs, k)] = 1
    return out

one_hot_encodings(arr)
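
A loop-free variant, offered as a sketch rather than the article's own method, indexes an identity matrix with the position of each value among the sorted unique values. The input array is hand-written and deliberately has a gap (no 4) to show that the values need not be contiguous.

```python
import numpy as np

# Illustrative input with non-contiguous values (hypothetical data).
arr = np.array([2, 3, 2, 5, 2, 1])

# inverse[i] is the position of arr[i] within the sorted unique values.
uniqs, inverse = np.unique(arr, return_inverse=True)

# Row k of the identity matrix is the one-hot vector for category k.
one_hot = np.eye(uniqs.size, dtype=int)[inverse]
print(uniqs)
print(one_hot)
```

Each row has exactly one 1, in the column that corresponds to that element's value in uniqs.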

52. Create row numbers grouped by a categorical variable
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
species = np.genfromtxt(url, delimiter=',', dtype='str', usecols=4)
np.random.seed(100)
species_small = np.sort(np.random.choice(species, size=20))
print(species_small)

print([i for val in np.unique(species_small) for i, grp in enumerate(species_small[species_small==val])])

53. Create group IDs based on a given categorical variable
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
species = np.genfromtxt(url, delimiter=',', dtype='str', usecols=4)
np.random.seed(100)
species_small = np.sort(np.random.choice(species, size=20))
print(species_small)
output = [np.argwhere(np.unique(species_small) == s).tolist()[0][0] for val in np.unique(species_small) for s in species_small[species_small==val]]
output
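
Both the group IDs and the within-group row numbers can also be derived from a single np.unique call. This sketch assumes the input is already sorted by category, as species_small is above; the short array here is a hand-written stand-in.

```python
import numpy as np

# Sorted categorical column (hypothetical stand-in for species_small).
species_small = np.array(['Iris-setosa'] * 2 + ['Iris-versicolor'] * 3 + ['Iris-virginica'])

# group_ids[i] is the index of species_small[i] among the sorted unique categories.
uniq, group_ids, counts = np.unique(species_small, return_inverse=True, return_counts=True)

# For sorted input, a row's number within its group is its position
# minus the position where that group starts.
starts = np.cumsum(counts) - counts
row_numbers = np.arange(species_small.size) - starts[group_ids]
print(group_ids)
print(row_numbers)
```

The row-number step relies on the sorted order; the group IDs do not.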

54. Rank the elements of a 1-D numpy.ndarray
np.random.seed(10)
a = np.random.randint(20, size=10)
print('Array: ', a)


print(a.argsort().argsort())
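
The double argsort works because the second argsort inverts the permutation produced by the first. The inversion can also be written explicitly with one sort and one scatter assignment, as in this sketch on a small hand-written array of distinct values.

```python
import numpy as np

# Illustrative array with distinct values (hypothetical data).
a = np.array([9, 4, 15, 0, 17])

# order[k] is the index of the k-th smallest element.
order = a.argsort()

# Invert the permutation: the element at index order[k] receives rank k.
ranks = np.empty_like(order)
ranks[order] = np.arange(a.size)
print(ranks)
```

For arrays without ties this gives the same ranks as a.argsort().argsort() while sorting only once.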

55. Rank the elements of a multidimensional numpy.ndarray
np.random.seed(10)
a = np.random.randint(20, size=[2,5])
print(a)

print(a.ravel().argsort().argsort().reshape(a.shape))

56. Get the maximum element of each row of a numpy.ndarray
np.random.seed(100)
a = np.random.randint(1,10, [5,3])
print(a)

# Method 1
np.amax(a, axis=1)

# Method 2
np.apply_along_axis(np.max, arr=a, axis=1)

57. Compute the min-to-max ratio of each row of a numpy.ndarray
np.random.seed(100)
a = np.random.randint(1,10, [5,3])
print(a)

np.apply_along_axis(lambda x: np.min(x)/np.max(x), arr=a, axis=1)

58. Flag duplicate elements of a numpy.ndarray (True for repeats, False for first occurrences)
np.random.seed(100)
a = np.random.randint(0, 5, 10)

# There is no direct function to do this as of 1.13.3

# Create an all True array
out = np.full(a.shape[0], True)

# Find the index positions of unique elements
unique_positions = np.unique(a, return_index=True)[1]

# Mark those positions as False
out[unique_positions] = False

print(out)

59. Compute the mean of each group in a numpy.ndarray
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')


# No direct way to implement this. Just a version of a workaround.
numeric_column = iris[:, 1].astype('float')  # sepalwidth
grouping_column = iris[:, 4]  # species

# List comprehension version
[[group_val, numeric_column[grouping_column==group_val].mean()] for group_val in np.unique(grouping_column)]

# For Loop version
output = []
for group_val in np.unique(grouping_column):
    output.append([group_val, numeric_column[grouping_column==group_val].mean()])

output
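
A vectorized alternative to the loop is to turn the grouping column into integer codes and let np.bincount accumulate sums and counts per code. The two short arrays below are made-up stand-ins for grouping_column and numeric_column.

```python
import numpy as np

# Hypothetical stand-ins for the grouping and numeric columns.
grouping_column = np.array(['a', 'b', 'a', 'c', 'b', 'a'])
numeric_column = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 5.0])

# inverse[i] is the integer code of grouping_column[i] among the sorted groups.
groups, inverse = np.unique(grouping_column, return_inverse=True)

# bincount with weights sums the numeric values per code; without weights it counts rows.
sums = np.bincount(inverse, weights=numeric_column)
counts = np.bincount(inverse)
means = sums / counts
print(dict(zip(groups.tolist(), means.tolist())))
```

This makes a fixed number of passes over the data regardless of how many groups there are.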

60. Convert a PIL image to a numpy.ndarray
from io import BytesIO

import numpy as np
import requests
from PIL import Image

# Import image from URL
URL = 'https://upload.wikimedia.org/wikipedia/commons/8/8b/Denali_Mt_McKinley.jpg'
response = requests.get(URL)

# Read it as Image
I = Image.open(BytesIO(response.content))

# Optionally resize
I = I.resize((150, 150))

# Convert to numpy array
arr = np.asarray(I)

# Optionally convert it back to an image and show
im = Image.fromarray(np.uint8(arr))
im.show()

61. Drop all missing values from a numpy.ndarray
a = np.array([1,2,3,np.nan,5,6,7,np.nan])
print(a)
a[~np.isnan(a)]

62. Compute the Euclidean distance between two numpy.ndarrays
a = np.array([1,2,3,4,5])
b = np.array([4,5,6,7,8])

# Solution
dist = np.linalg.norm(a-b)
dist

63. Find the positions of local maxima in a numpy.ndarray
a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
doublediff = np.diff(np.sign(np.diff(a)))
peak_locations = np.where(doublediff == -2)[0] + 1
peak_locations

64. numpy.ndarray subtraction
# Goal: subtract the 1-D array b_1d from the 2-D array a_2d so that each item of b_1d is subtracted from the corresponding row of a_2d.
a_2d = np.array([[3,3,3],[4,4,4],[5,5,5]])
b_1d = np.array([1,2,3])

print(a_2d - b_1d[:,None])
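
The [:, None] index is what makes the subtraction run along rows. The short comparison below, using the same two arrays, shows what broadcasting does with and without it.

```python
import numpy as np

a_2d = np.array([[3, 3, 3], [4, 4, 4], [5, 5, 5]])
b_1d = np.array([1, 2, 3])

# b_1d[:, None] has shape (3, 1): b_1d[i] is subtracted from every item of row i.
per_row = a_2d - b_1d[:, None]

# Plain b_1d has shape (3,): b_1d[j] is subtracted from every item of column j.
per_column = a_2d - b_1d

print(per_row)
print(per_column)
```

Only the first form matches the stated goal; the second silently produces a different result of the same shape.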

65. Find the index of the nth occurrence of an element in a numpy.ndarray
x = np.array([1, 2, 1, 1, 3, 4, 3, 1, 1, 2, 1, 1, 2])
print(x)
n = 5

# Method 1: list comprehension
[i for i, v in enumerate(x) if v == 1][n-1]  # index of the 5th occurrence of 1

# Method 2
np.where(x == 1)[0][n-1]

66. Convert a numpy datetime64 value to datetime
dt64 = np.datetime64('2018-02-25 22:10:10')

# Method 1
from datetime import datetime
dt64.tolist()

# Method 2
dt64.astype(datetime)

67. Compute the moving average of a numpy.ndarray over a window of size n
def moving_average(a, n=3):
    # Running sums; subtracting the sum shifted by n leaves the sum of each window
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

np.random.seed(100)
Z = np.random.randint(10, size=10)
print('array: ', Z)

# Method 1
moving_average(Z, n=3).round(2)

# Method 2
np.convolve(Z, np.ones(3)/3, mode='valid')

68. Build a numpy.ndarray from a start value, a length, and a step
length = 10
start = 5
step = 3

def seq(start, length, step):
    end = start + (step*length)
    return np.arange(start, end, step)

seq(start, length, step)

69. Fill in the missing dates of a non-contiguous time-series numpy.ndarray
dates = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-02-25'), 2)
print(dates)

# Method 1
# np.concatenate also copes with gaps of unequal length
filled_in = np.concatenate([
    np.arange(date, (date + d)) for date, d in zip(dates, np.diff(dates))
])

output = np.hstack([filled_in, dates[-1]])
output

# Method 2
out = []
for date, d in zip(dates, np.diff(dates)):
    out.append(np.arange(date, (date + d)))

filled_in = np.concatenate(out)
output = np.hstack([filled_in, dates[-1]])
output
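
When the series is sorted and only the complete daily range is needed, a single np.arange from the first to the last date gives the same result. The sketch below assumes day resolution and also lists which dates were missing, using np.isin.

```python
import numpy as np

# Every other day, as in the example above.
dates = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-02-25'), 2)

# np.arange excludes the stop value, so add one day to include the last date.
filled = np.arange(dates[0], dates[-1] + np.timedelta64(1, 'D'))

# Dates present in the filled range but absent from the original series.
missing = filled[~np.isin(filled, dates)]
print(filled)
print(missing)
```

This assumes dates is sorted in ascending order; otherwise dates.min() and dates.max() could be used as the endpoints.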

70. Build a numpy.ndarray of sliding windows with a given stride
import numpy as np


def gen_strides(a, stride_len=5, window_len=5):
    n_strides = ((a.size - window_len) // stride_len) + 1
    # return np.array([a[s:(s+window_len)] for s in np.arange(0, a.size, stride_len)[:n_strides]])
    return np.array([
        a[s:(s + window_len)]
        for s in np.arange(0, n_strides * stride_len, stride_len)
    ])


print(gen_strides(np.arange(15), stride_len=2, window_len=4))
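
On NumPy 1.20 or later (an assumption about the installed version), the same windows can be taken with numpy.lib.stride_tricks.sliding_window_view, then thinned to the desired stride by slicing. This is a sketch of that alternative, not a replacement for gen_strides.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.arange(15)
window_len = 4
stride_len = 2

# All windows of length 4 with stride 1: shape (12, 4).
all_windows = sliding_window_view(a, window_shape=window_len)

# Keep every stride_len-th window: start positions 0, 2, 4, ...
windows = all_windows[::stride_len]
print(windows)
```

The result is a read-only view into a, so copy it with windows.copy() before modifying it.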