Mask R-CNN Source Code Walkthrough

2021-03-06 horizonheart

https://github.com/matterport/Mask_RCNN
This is a Keras-based implementation of Mask R-CNN, and the author did a very nice job. There are no superfluous files: all of the core source code lives in the mrcnn package, and the README gives a detailed introduction.

The best way to understand how Mask R-CNN runs is to step through the code in a debugger while it executes. Start from the coco script under samples:

The script first loads a number of configuration values. There is no need to study the config file in detail at this point; come back to those details when they are actually used.

The overall flow chart of Mask R-CNN is:

The first thing to look at is how the model is built, i.e., the basic network structure of Mask R-CNN.

1--Building the Model

Jump to the corresponding method.

1.1--In the build method, first create the model's input tensors.

Since the concrete sizes of the inputs are not known at this point, Keras Input layers are used as placeholders, and the actual sizes are determined by the data passed in at training time. Both training and inference build the model through the same build method, so they share one model; inference simply needs fewer inputs. gt_boxes are normalized, i.e., all box coordinates lie between 0 and 1, which avoids prediction errors caused by the absolute scale of the coordinates. A minimal sketch of these placeholders is shown below.
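
A minimal sketch of what these placeholders look like, showing only a few of the inputs and a simplified normalization (the real build() defines more inputs and normalizes with norm_boxes_graph, which also accounts for a small pixel shift):

import tensorflow as tf
import keras.layers as KL
import keras.backend as K

# Placeholders: shapes with None are filled in at run time.
input_image = KL.Input(shape=[None, None, 3], name="input_image")
input_gt_class_ids = KL.Input(shape=[None], name="input_gt_class_ids", dtype=tf.int32)
# Ground-truth boxes arrive in pixel coordinates: (y1, x1, y2, x2).
input_gt_boxes = KL.Input(shape=[None, 4], name="input_gt_boxes", dtype=tf.float32)

def to_normalized(boxes, image_shape):
    # Divide (y1, x1, y2, x2) by the image height/width so coordinates fall in [0, 1].
    h = tf.cast(image_shape[0], tf.float32)
    w = tf.cast(image_shape[1], tf.float32)
    return boxes / tf.stack([h, w, h, w])

gt_boxes = KL.Lambda(
    lambda x: to_normalized(x, K.shape(input_image)[1:3]))(input_gt_boxes)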

1.2--Building the Backbone Network

Build the feature extraction network. Mask R-CNN extracts features with a Feature Pyramid Network (FPN) on top of a ResNet-101 backbone. A visualization of the network is available at: http://ethereon.github.io/netscope/#/gist/b21e2aae116dc1ac7b50

The different pyramid levels are then used for different purposes; in the code, P2, P3, P4, P5 and P6 feed the RPN. Because every level passes through a 3x3 convolution, the feature maps of all levels end up with the same number of channels. A simplified sketch of the top-down pathway follows.
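
For orientation, here is an illustrative, trimmed-down sketch of the FPN top-down pathway (layer names and extra arguments are omitted; see build() in mrcnn/model.py for the real version):

import keras.layers as KL

def fpn_top_down(C2, C3, C4, C5, channels=256):
    # Lateral 1x1 convs bring C2..C5 to the same channel count; each level is
    # upsampled and added to the next lateral map.
    P5 = KL.Conv2D(channels, (1, 1))(C5)
    P4 = KL.Add()([KL.UpSampling2D(size=(2, 2))(P5), KL.Conv2D(channels, (1, 1))(C4)])
    P3 = KL.Add()([KL.UpSampling2D(size=(2, 2))(P4), KL.Conv2D(channels, (1, 1))(C3)])
    P2 = KL.Add()([KL.UpSampling2D(size=(2, 2))(P3), KL.Conv2D(channels, (1, 1))(C2)])
    # 3x3 convs smooth the merged maps and give every level the same depth.
    P2 = KL.Conv2D(channels, (3, 3), padding="same")(P2)
    P3 = KL.Conv2D(channels, (3, 3), padding="same")(P3)
    P4 = KL.Conv2D(channels, (3, 3), padding="same")(P4)
    P5 = KL.Conv2D(channels, (3, 3), padding="same")(P5)
    # P6 is used only by the RPN; it is a stride-2 subsample of P5.
    P6 = KL.MaxPooling2D(pool_size=(1, 1), strides=2)(P5)
    return P2, P3, P4, P5, P6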

With these feature maps in hand, anchors can be generated for them, as sketched below.
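
Anchor generation itself happens outside the Keras graph. A usage sketch, assuming config is an instance of the project's Config subclass (the same call appears later inside data_generator):

from mrcnn import utils
from mrcnn.model import compute_backbone_shapes

# `config` is assumed to be e.g. a CocoConfig instance.
backbone_shapes = compute_backbone_shapes(config, config.IMAGE_SHAPE)
anchors = utils.generate_pyramid_anchors(config.RPN_ANCHOR_SCALES,   # e.g. (32, 64, 128, 256, 512)
                                         config.RPN_ANCHOR_RATIOS,   # e.g. [0.5, 1, 2]
                                         backbone_shapes,            # [H, W] of each pyramid level
                                         config.BACKBONE_STRIDES,    # e.g. [4, 8, 16, 32, 64]
                                         config.RPN_ANCHOR_STRIDE)   # e.g. 1
# Result: [anchor_count, (y1, x1, y2, x2)] in pixel coordinates.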

1.3--The RPN Network

Build the RPN and feed each pyramid level from step 1.2 through it to obtain classification and regression outputs with the shapes below (a condensed sketch of the RPN head follows them).

rpn_logits: [batch, H, W, 2] Anchor classifier logits (before softmax)
rpn_probs: [batch, H, W, 2] Anchor classifier probabilities.
rpn_bbox: [batch, H, W, (dy, dx, log(dh), log(dw))] Deltas to be applied to anchors.
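
A condensed sketch of the RPN head (see rpn_graph in model.py for the full version): one shared 3x3 convolution followed by two sibling 1x1 convolutions for objectness and box regression; the same head is applied to every pyramid level with shared weights.

import tensorflow as tf
import keras.layers as KL

def rpn_head(feature_map, anchors_per_location=3, anchor_stride=1):
    shared = KL.Conv2D(512, (3, 3), padding='same', activation='relu',
                       strides=anchor_stride)(feature_map)
    # Objectness: 2 values (background/foreground) per anchor position.
    x = KL.Conv2D(2 * anchors_per_location, (1, 1))(shared)
    rpn_class_logits = KL.Lambda(lambda t: tf.reshape(t, [tf.shape(t)[0], -1, 2]))(x)
    rpn_probs = KL.Activation("softmax")(rpn_class_logits)
    # Box refinement: (dy, dx, log(dh), log(dw)) per anchor position.
    x = KL.Conv2D(4 * anchors_per_location, (1, 1))(shared)
    rpn_bbox = KL.Lambda(lambda t: tf.reshape(t, [tf.shape(t)[0], -1, 4]))(x)
    return rpn_class_logits, rpn_probs, rpn_bbox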

1.4--ProposalLayer

Apply the RPN outputs to the anchors from step 1.2. First sort the predicted foreground probabilities and keep only the top-scoring anchors (the exact number is configurable), then use the RPN regression outputs to refine those anchors for the first time. After refinement, non-maximum suppression removes a further portion of the anchors, yielding the final proposals. The four steps are summarized in the comments below, followed by a short sketch.

# ProposalLayer mainly does the following:
# 1. Take the ~6000 highest-scoring anchors according to the RPN
# 2. Refine these anchors with rpn_bbox
# 3. Keep the refined boxes inside the image; since anchor coordinates are normalized, this just means constraining them to the [0, 1] range
# 4. Apply non-maximum suppression to obtain the final proposals
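
Below is a hedged, single-image sketch of these four steps (the real ProposalLayer is batched and reads its limits, such as PRE_NMS_LIMIT and POST_NMS_ROIS_TRAINING, from the config; the helper names here are illustrative):

import tensorflow as tf

def apply_box_deltas(boxes, deltas):
    # Convert (y1, x1, y2, x2) to center/size, apply the deltas, convert back.
    h = boxes[:, 2] - boxes[:, 0]
    w = boxes[:, 3] - boxes[:, 1]
    cy = boxes[:, 0] + 0.5 * h
    cx = boxes[:, 1] + 0.5 * w
    cy += deltas[:, 0] * h
    cx += deltas[:, 1] * w
    h *= tf.exp(deltas[:, 2])
    w *= tf.exp(deltas[:, 3])
    return tf.stack([cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w], axis=1)

def propose(fg_scores, deltas, anchors, pre_nms_limit=6000,
            proposal_count=2000, nms_threshold=0.7):
    """fg_scores: [anchors] foreground probability; deltas: [anchors, 4];
    anchors: [anchors, (y1, x1, y2, x2)] in normalized coordinates."""
    # 1. Keep only the top-scoring anchors before NMS.
    limit = tf.minimum(pre_nms_limit, tf.shape(anchors)[0])
    ix = tf.nn.top_k(fg_scores, limit, sorted=True).indices
    fg_scores = tf.gather(fg_scores, ix)
    deltas = tf.gather(deltas, ix)
    anchors = tf.gather(anchors, ix)
    # 2. Apply the predicted (dy, dx, log(dh), log(dw)) refinements.
    boxes = apply_box_deltas(anchors, deltas)
    # 3. Coordinates are normalized, so clipping to [0, 1] keeps boxes inside the image.
    boxes = tf.clip_by_value(boxes, 0.0, 1.0)
    # 4. Non-maximum suppression keeps at most proposal_count proposals.
    keep = tf.image.non_max_suppression(boxes, fg_scores, proposal_count, nms_threshold)
    return tf.gather(boxes, keep)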

1.5--DetectionTargetLayer

The inputs to DetectionTargetLayer are target_rois, input_gt_class_ids, gt_boxes and input_gt_masks, where target_rois is the output of ProposalLayer. First, the IoU between every ROI in target_rois and every ground-truth box in gt_boxes is computed. An ROI whose highest IoU is at least 0.5 is treated as a positive sample; negatives are ROIs whose IoU is below 0.5 and that have little overlap with any crowd box. After selecting positives and negatives, the sampling is kept balanced (the ratio can be set in the config). Finally, for every positive ROI the closest ground-truth box is found, the box deltas between them are computed, and the corresponding mask is resized to 28x28 (presumably with bilinear interpolation, since mask values are only 0 for background and 1 for foreground). These are the targets that the later classification and mask heads are trained against. The main code of this layer is below:

# For each ROI, find the closest ground-truth box and compute the deltas to it,
# as well as the mask to predict; this information is consumed by the classifier and mask heads.
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
    """Generates detection targets for one image. Subsamples proposals and
    generates target class IDs, bounding box deltas, and masks for each.

    Inputs:
    proposals: [N, (y1, x1, y2, x2)] in normalized coordinates. Might
               be zero padded if there are not enough proposals.
    gt_class_ids: [MAX_GT_INSTANCES] int class IDs
    gt_boxes: [MAX_GT_INSTANCES, (y1, x1, y2, x2)] in normalized coordinates.
    gt_masks: [height, width, MAX_GT_INSTANCES] of boolean type.

    Returns: Target ROIs and corresponding class IDs, bounding box shifts,
    and masks.
    rois: [TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)] in normalized coordinates
    class_ids: [TRAIN_ROIS_PER_IMAGE]. Integer class IDs. Zero padded.
    deltas: [TRAIN_ROIS_PER_IMAGE, NUM_CLASSES, (dy, dx, log(dh), log(dw))]
            Class-specific bbox refinements.
    masks: [TRAIN_ROIS_PER_IMAGE, height, width]. Masks cropped to bbox
           boundaries and resized to neural network output size.

    Note: Returned arrays might be zero padded if not enough target ROIs.
    """
    # Assertions
    asserts = [
        tf.Assert(tf.greater(tf.shape(proposals)[0], 0), [proposals],
                  name="roi_assertion"),
    ]
    with tf.control_dependencies(asserts):
        proposals = tf.identity(proposals)

    # Remove zero padding
    proposals, _ = trim_zeros_graph(proposals, name="trim_proposals")
    # Also trim the ground-truth boxes, keeping only the real, meaningful boxes
    gt_boxes, non_zeros = trim_zeros_graph(gt_boxes, name="trim_gt_boxes")
    gt_class_ids = tf.boolean_mask(gt_class_ids, non_zeros,
                                   name="trim_gt_class_ids")
    gt_masks = tf.gather(gt_masks, tf.where(non_zeros)[:, 0], axis=2,
                         name="trim_gt_masks")

    # Handle COCO crowds
    # A crowd box in COCO is a bounding box around several instances. Exclude
    # them from training. A crowd box is given a negative class ID.
    crowd_ix = tf.where(gt_class_ids < 0)[:, 0]
    non_crowd_ix = tf.where(gt_class_ids > 0)[:, 0]
    crowd_boxes = tf.gather(gt_boxes, crowd_ix)
    crowd_masks = tf.gather(gt_masks, crowd_ix, axis=2)
    # What remains are the real instances in the image, used for training
    gt_class_ids = tf.gather(gt_class_ids, non_crowd_ix)
    gt_boxes = tf.gather(gt_boxes, non_crowd_ix)
    gt_masks = tf.gather(gt_masks, non_crowd_ix, axis=2)

    # Compute overlaps matrix [proposals, gt_boxes] (IoU values)
    overlaps = overlaps_graph(proposals, gt_boxes)

    # Compute overlaps with crowd boxes [anchors, crowds]
    crowd_overlaps = overlaps_graph(proposals, crowd_boxes)
    crowd_iou_max = tf.reduce_max(crowd_overlaps, axis=1)
    no_crowd_bool = (crowd_iou_max < 0.001)

    # Determine positive and negative ROIs
    roi_iou_max = tf.reduce_max(overlaps, axis=1)
    # 1. Positive ROIs are those with >= 0.5 IoU with a GT box
    positive_roi_bool = (roi_iou_max >= 0.5)
    positive_indices = tf.where(positive_roi_bool)[:, 0]
    # 2. Negative ROIs are those with < 0.5 with every GT box and little
    #    overlap with any crowd box. Skip crowds.
    negative_indices = tf.where(tf.logical_and(roi_iou_max < 0.5, no_crowd_bool))[:, 0]

    # Subsample ROIs. Aim for 33% positive
    # Positive ROIs
    positive_count = int(config.TRAIN_ROIS_PER_IMAGE *
                         config.ROI_POSITIVE_RATIO)
    positive_indices = tf.random_shuffle(positive_indices)[:positive_count]
    positive_count = tf.shape(positive_indices)[0]
    # Negative ROIs. Add enough to maintain positive:negative ratio.
    r = 1.0 / config.ROI_POSITIVE_RATIO
    negative_count = tf.cast(r * tf.cast(positive_count, tf.float32), tf.int32) - positive_count
    negative_indices = tf.random_shuffle(negative_indices)[:negative_count]
    # Gather selected ROIs (the chosen positive and negative samples)
    positive_rois = tf.gather(proposals, positive_indices)
    negative_rois = tf.gather(proposals, negative_indices)

    # Assign positive ROIs to GT boxes.
    # Find the ground-truth box closest to each positive sample
    positive_overlaps = tf.gather(overlaps, positive_indices)
    roi_gt_box_assignment = tf.cond(
        tf.greater(tf.shape(positive_overlaps)[1], 0),
        true_fn = lambda: tf.argmax(positive_overlaps, axis=1),
        false_fn = lambda: tf.cast(tf.constant([]), tf.int64)
    )
    roi_gt_boxes = tf.gather(gt_boxes, roi_gt_box_assignment)
    roi_gt_class_ids = tf.gather(gt_class_ids, roi_gt_box_assignment)

    # Compute bbox refinement for positive ROIs
    # i.e. the deltas from each ROI to its closest ground-truth box
    deltas = utils.box_refinement_graph(positive_rois, roi_gt_boxes)
    deltas /= config.BBOX_STD_DEV

    # Assign positive ROIs to GT masks
    # Permute masks to [N, height, width, 1]
    transposed_masks = tf.expand_dims(tf.transpose(gt_masks, [2, 0, 1]), -1)
    # Pick the right mask for each ROI (the mask of its assigned GT box)
    roi_masks = tf.gather(transposed_masks, roi_gt_box_assignment)

    # Compute mask targets
    boxes = positive_rois
    if config.USE_MINI_MASK:
        # Transform ROI coordinates from normalized image space
        # to normalized mini-mask space.
        y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1)
        gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1)
        gt_h = gt_y2 - gt_y1
        gt_w = gt_x2 - gt_x1
        y1 = (y1 - gt_y1) / gt_h
        x1 = (x1 - gt_x1) / gt_w
        y2 = (y2 - gt_y1) / gt_h
        x2 = (x2 - gt_x1) / gt_w
        boxes = tf.concat([y1, x1, y2, x2], 1)
    box_ids = tf.range(0, tf.shape(roi_masks)[0])
    # crop_and_resize is effectively the ROI pooling operation here
    masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes,
                                     box_ids,
                                     config.MASK_SHAPE)
    # Remove the extra dimension from masks.
    masks = tf.squeeze(masks, axis=3)

    # Threshold mask pixels at 0.5 to have GT masks be 0 or 1 to use with
    # binary cross entropy loss.
    masks = tf.round(masks)

    # Append negative ROIs and pad bbox deltas and masks that
    # are not used for negative ROIs with zeros.
    rois = tf.concat([positive_rois, negative_rois], axis=0)
    N = tf.shape(negative_rois)[0]
    P = tf.maximum(config.TRAIN_ROIS_PER_IMAGE - tf.shape(rois)[0], 0)
    rois = tf.pad(rois, [(0, P), (0, 0)])
    roi_gt_boxes = tf.pad(roi_gt_boxes, [(0, N + P), (0, 0)])
    roi_gt_class_ids = tf.pad(roi_gt_class_ids, [(0, N + P)])
    deltas = tf.pad(deltas, [(0, N + P), (0, 0)])
    masks = tf.pad(masks, [[0, N + P], (0, 0), (0, 0)])

    return rois, roi_gt_class_ids, deltas, masks

The final return values are:

rois: [TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)] in normalized coordinates
class_ids: [TRAIN_ROIS_PER_IMAGE]. Integer class IDs. Zero padded.
deltas: [TRAIN_ROIS_PER_IMAGE, NUM_CLASSES, (dy, dx, log(dh), log(dw))]
        Class-specific bbox refinements.
masks: [TRAIN_ROIS_PER_IMAGE, height, width]. Masks cropped to bbox
       boundaries and resized to neural network output size.

From the proposals produced by the RPN, positive and negative samples are selected; for the positives, the offsets to the ground-truth boxes and the target masks are computed. These are the ground-truth values needed later to compute the losses of the classification and mask heads.

1.6--Feature Pyramid Network Heads (fpn_classifier_graph)

This network is the final stage of Mask R-CNN; a mask branch runs in parallel with it. Let's look at the classification head first.

The ROIs produced in step 1.5 have different sizes, so something analogous to RoIPooling in Faster R-CNN is needed to turn them into fixed-size feature maps. Mask R-CNN uses PyramidROIAlign. PyramidROIAlign first decides, for each ROI, which pyramid level (P2 to P5) it should be pooled from, using Equation 1 of the FPN paper, k = 4 + round(log2(sqrt(w*h) / 224)), clamped to [2, 5], as implemented below:

# Assign each ROI to a level in the pyramid based on the ROI area.
y1, x1, y2, x2 = tf.split(boxes, 4, axis=2)
h = y2 - y1
w = x2 - x1
# Use shape of first image. Images in a batch must have the same size.
image_shape = parse_image_meta_graph(image_meta)['image_shape'][0]
# Equation 1 in the Feature Pyramid Networks paper. Account for
# the fact that our coordinates are normalized here.
# e.g. a 224x224 ROI (in pixels) maps to P4
# Compute which pyramid level each ROI should be pooled from
image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))
roi_level = tf.minimum(5, tf.maximum(
    2, 4 + tf.cast(tf.round(roi_level), tf.int32)))
roi_level = tf.squeeze(roi_level, 2)

The corresponding region is then cropped from that feature map and pooled with bilinear interpolation; PyramidROIAlign returns all ROIs resized to the same spatial size. A minimal single-level sketch follows.
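
The core of that pooling step is tf.image.crop_and_resize. A single-level sketch, assuming all boxes come from one image (the real layer dispatches boxes to P2..P5 and restores their original order afterwards):

import tensorflow as tf

def roi_align_single_level(feature_map, boxes, pool_size):
    """feature_map: [batch, H, W, C]; boxes: [num_boxes, (y1, x1, y2, x2)] normalized."""
    # In this simplified sketch every box is cropped from image 0 of the batch.
    box_indices = tf.zeros([tf.shape(boxes)[0]], dtype=tf.int32)
    # Bilinear sampling resizes each crop to a fixed pool_size x pool_size grid.
    return tf.image.crop_and_resize(feature_map, boxes, box_indices,
                                    [pool_size, pool_size], method="bilinear")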

The pooled feature blocks are fed into the fpn_classifier_graph network, which outputs classification and box regression values.

Below is the definition of fpn_classifier_graph:

def fpn_classifier_graph(rois, feature_maps, image_meta,
                         pool_size, num_classes, train_bn=True,
                         fc_layers_size=1024):
    """Builds the computation graph of the feature pyramid network classifier
    and regressor heads.

    rois: [batch, num_rois, (y1, x1, y2, x2)] Proposal boxes in normalized
          coordinates.
    feature_maps: List of feature maps from different layers of the pyramid,
                  [P2, P3, P4, P5]. Each has a different resolution.
    image_meta: [batch, (meta data)] Image details. See compose_image_meta()
    pool_size: The width of the square feature map generated from ROI Pooling.
    num_classes: number of classes, which determines the depth of the results
    train_bn: Boolean. Train or freeze Batch Norm layers
    fc_layers_size: Size of the 2 FC layers

    Returns:
        logits: [N, NUM_CLASSES] classifier logits (before softmax)
        probs: [N, NUM_CLASSES] classifier probabilities
        bbox_deltas: [N, (dy, dx, log(dh), log(dw))] Deltas to apply to
                     proposal boxes
    """
    # ROI Pooling
    # Shape: [batch, num_boxes, pool_height, pool_width, channels]
    x = PyramidROIAlign([pool_size, pool_size],
                        name="roi_align_classifier")([rois, image_meta] + feature_maps)
    # Two 1024 FC layers (implemented with Conv2D for consistency)
    x = KL.TimeDistributed(KL.Conv2D(fc_layers_size, (pool_size, pool_size), padding="valid"),
                           name="mrcnn_class_conv1")(x)
    x = KL.TimeDistributed(BatchNorm(), name='mrcnn_class_bn1')(x, training=train_bn)
    x = KL.Activation('relu')(x)
    x = KL.TimeDistributed(KL.Conv2D(fc_layers_size, (1, 1)),
                           name="mrcnn_class_conv2")(x)
    x = KL.TimeDistributed(BatchNorm(), name='mrcnn_class_bn2')(x, training=train_bn)
    x = KL.Activation('relu')(x)

    shared = KL.Lambda(lambda x: K.squeeze(K.squeeze(x, 3), 2),
                       name="pool_squeeze")(x)

    # Classifier head
    mrcnn_class_logits = KL.TimeDistributed(KL.Dense(num_classes),
                                            name='mrcnn_class_logits')(shared)
    mrcnn_probs = KL.TimeDistributed(KL.Activation("softmax"),
                                     name="mrcnn_class")(mrcnn_class_logits)

    # BBox head
    # [batch, boxes, num_classes * (dy, dx, log(dh), log(dw))]
    x = KL.TimeDistributed(KL.Dense(num_classes * 4, activation='linear'),
                           name='mrcnn_bbox_fc')(shared)
    # Reshape to [batch, boxes, num_classes, (dy, dx, log(dh), log(dw))]
    s = K.int_shape(x)
    mrcnn_bbox = KL.Reshape((s[1], num_classes, 4), name="mrcnn_bbox")(x)

    return mrcnn_class_logits, mrcnn_probs, mrcnn_bbox

The return values are:

Returns:
logits: [N, NUM_CLASSES] classifier logits (before softmax)
probs: [N, NUM_CLASSES] classifier probabilities
bbox_deltas: [N, (dy, dx, log(dh), log(dw))] Deltas to apply to
proposal boxes

1.7--build_fpn_mask_graph

The mask head takes the same inputs as the network in 1.6 and also goes through PyramidROIAlign (this step could be factored out and shared, placed before the two final heads).

# Build the mask head
def build_fpn_mask_graph(rois, feature_maps, image_meta,
                         pool_size, num_classes, train_bn=True):
    """Builds the computation graph of the mask head of Feature Pyramid Network.

    rois: [batch, num_rois, (y1, x1, y2, x2)] Proposal boxes in normalized
          coordinates.
    feature_maps: List of feature maps from different layers of the pyramid,
                  [P2, P3, P4, P5]. Each has a different resolution.
    image_meta: [batch, (meta data)] Image details. See compose_image_meta()
    pool_size: The width of the square feature map generated from ROI Pooling.
    num_classes: number of classes, which determines the depth of the results
    train_bn: Boolean. Train or freeze Batch Norm layers

    Returns: Masks [batch, roi_count, height, width, num_classes]
    """
    # ROI Pooling
    # Shape: [batch, boxes, pool_height, pool_width, channels]
    x = PyramidROIAlign([pool_size, pool_size],
                        name="roi_align_mask")([rois, image_meta] + feature_maps)

    # Conv layers
    x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv1")(x)
    x = KL.TimeDistributed(BatchNorm(),
                           name='mrcnn_mask_bn1')(x, training=train_bn)
    x = KL.Activation('relu')(x)

    x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv2")(x)
    x = KL.TimeDistributed(BatchNorm(),
                           name='mrcnn_mask_bn2')(x, training=train_bn)
    x = KL.Activation('relu')(x)

    x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv3")(x)
    x = KL.TimeDistributed(BatchNorm(),
                           name='mrcnn_mask_bn3')(x, training=train_bn)
    x = KL.Activation('relu')(x)

    x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
                           name="mrcnn_mask_conv4")(x)
    x = KL.TimeDistributed(BatchNorm(),
                           name='mrcnn_mask_bn4')(x, training=train_bn)
    x = KL.Activation('relu')(x)

    x = KL.TimeDistributed(KL.Conv2DTranspose(256, (2, 2), strides=2, activation="relu"),
                           name="mrcnn_mask_deconv")(x)
    x = KL.TimeDistributed(KL.Conv2D(num_classes, (1, 1), strides=1, activation="sigmoid"),
                           name="mrcnn_mask")(x)
    return x

One detail worth noting: the PyramidROIAlign in 1.6 produces 7x7 feature maps, while the one in 1.7 produces 14x14 feature maps (the 2x2 transposed convolution above then doubles 14x14 to the 28x28 mask output). The relevant config values and call sites are sketched below.
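
The difference comes purely from the pool_size argument each head receives. The snippet below paraphrases the call sites in build(); POOL_SIZE, MASK_POOL_SIZE and MASK_SHAPE are the default values from mrcnn/config.py, and rois, mrcnn_feature_maps and input_image_meta stand for the tensors constructed earlier in the graph:

POOL_SIZE = 7          # classifier head: PyramidROIAlign output is 7x7
MASK_POOL_SIZE = 14    # mask head: PyramidROIAlign output is 14x14
MASK_SHAPE = [28, 28]  # mask target size after the 2x2 deconv doubles 14x14

mrcnn_class_logits, mrcnn_class, mrcnn_bbox = fpn_classifier_graph(
    rois, mrcnn_feature_maps, input_image_meta,
    POOL_SIZE, num_classes, train_bn=True)
mrcnn_mask = build_fpn_mask_graph(
    rois, mrcnn_feature_maps, input_image_meta,
    MASK_POOL_SIZE, num_classes, train_bn=True)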

The final return value is:

Returns: Masks [batch, roi_count, height, width, num_classes]

1.8--Loss Functions

Mask R-CNN has five loss terms in total: two for the RPN, two for the detection head (classification and box regression), and one for the mask branch. The first four are the same as in Faster R-CNN. For the mask loss, the mask branch outputs K·m² values per ROI: K binary masks (one per class) at resolution m×m.
The authors therefore apply a per-pixel sigmoid and define L_mask as the average binary cross-entropy loss.
For an ROI whose ground-truth class is k, L_mask only considers the k-th mask (the other mask outputs contribute nothing to the loss). This definition lets the network generate a mask for every class without competition among classes. A simplified sketch of this loss follows.
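
A simplified, single-image sketch of that idea (the real mrcnn_mask_loss_graph additionally reshapes batched inputs and guards against the case of zero positive ROIs):

import tensorflow as tf
import keras.backend as K

def mask_loss_sketch(target_masks, target_class_ids, pred_masks):
    """target_masks: [rois, h, w] float (0 or 1); target_class_ids: [rois];
    pred_masks: [rois, h, w, num_classes] sigmoid outputs."""
    # Only positive ROIs (class id > 0) carry a mask target.
    positive_ix = tf.where(target_class_ids > 0)[:, 0]
    positive_class_ids = tf.cast(tf.gather(target_class_ids, positive_ix), tf.int64)
    # For each positive ROI, pick the predicted mask of its ground-truth class.
    pred_masks = tf.transpose(pred_masks, [0, 3, 1, 2])       # [rois, classes, h, w]
    indices = tf.stack([positive_ix, positive_class_ids], axis=1)
    y_pred = tf.gather_nd(pred_masks, indices)                 # [positives, h, w]
    y_true = tf.gather(target_masks, positive_ix)
    # Average per-pixel binary cross-entropy over all positive ROIs.
    return K.mean(K.binary_crossentropy(target=y_true, output=y_pred))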

# Losses
rpn_class_loss = KL.Lambda(lambda x: rpn_class_loss_graph(*x), name="rpn_class_loss")(
    [input_rpn_match, rpn_class_logits])
rpn_bbox_loss = KL.Lambda(lambda x: rpn_bbox_loss_graph(config, *x), name="rpn_bbox_loss")(
    [input_rpn_bbox, input_rpn_match, rpn_bbox])
class_loss = KL.Lambda(lambda x: mrcnn_class_loss_graph(*x), name="mrcnn_class_loss")(
    [target_class_ids, mrcnn_class_logits, active_class_ids])
bbox_loss = KL.Lambda(lambda x: mrcnn_bbox_loss_graph(*x), name="mrcnn_bbox_loss")(
    [target_bbox, target_class_ids, mrcnn_bbox])
mask_loss = KL.Lambda(lambda x: mrcnn_mask_loss_graph(*x), name="mrcnn_mask_loss")(
    [target_mask, target_class_ids, mrcnn_mask])

1.9--The Complete Model

The sections above covered the individual pieces; chaining them together forms the complete network, which is defined as follows:

# Model
inputs = [input_image, input_image_meta,
          input_rpn_match, input_rpn_bbox, input_gt_class_ids, input_gt_boxes, input_gt_masks]
if not config.USE_RPN_ROIS:
    inputs.append(input_rois)
outputs = [rpn_class_logits, rpn_class, rpn_bbox,
           mrcnn_class_logits, mrcnn_class, mrcnn_bbox, mrcnn_mask,
           rpn_rois, output_rois,
           rpn_class_loss, rpn_bbox_loss, class_loss, bbox_loss, mask_loss]
model = KM.Model(inputs, outputs, name='mask_rcnn')

2--Training the Model

During training, configuration values are read and the data is loaded into a Dataset object; the ground-truth anchor targets and masks of each image are computed and used when the RPN-stage losses are evaluated. The code also lets you choose which layers to train in each run, even just a few specific layers; a typical launch looks like the sketch below. The most important part of training is the data_generator, which is shown after the sketch:
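
A typical training launch, paraphrased from samples/coco/coco.py; the weight path, epoch count and excluded layers are illustrative, and config, dataset_train and dataset_val are assumed to be prepared beforehand:

import mrcnn.model as modellib

# Build the model in training mode and load pretrained weights.
model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])  # skip heads if the class count differs

# The `layers` argument selects which layers are trained: "heads" freezes the
# backbone, "3+"/"4+"/"5+" unfreeze ResNet stages 3/4/5 and up, "all" trains everything.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40,
            layers='heads')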

def data_generator(dataset, config, shuffle=True, augment=False, augmentation=None,
                   random_rois=0, batch_size=1, detection_targets=False,
                   no_augmentation_sources=None):
    """A generator that returns images and corresponding target class ids,
    bounding box deltas, and masks.

    dataset: The Dataset object to pick data from
    config: The model config object
    shuffle: If True, shuffles the samples before every epoch
    augment: (deprecated. Use augmentation instead). If true, apply random
        image augmentation. Currently, only horizontal flipping is offered.
    augmentation: Optional. An imgaug (https://github.com/aleju/imgaug) augmentation.
        For example, passing imgaug.augmenters.Fliplr(0.5) flips images
        right/left 50% of the time.
    random_rois: If > 0 then generate proposals to be used to train the
        network classifier and mask heads. Useful if training
        the Mask RCNN part without the RPN.
    batch_size: How many images to return in each call
    detection_targets: If True, generate detection targets (class IDs, bbox
        deltas, and masks). Typically for debugging or visualizations because
        in training detection targets are generated by DetectionTargetLayer.
    no_augmentation_sources: Optional. List of sources to exclude for
        augmentation. A source is string that identifies a dataset and is
        defined in the Dataset class.

    Returns a Python generator. Upon calling next() on it, the
    generator returns two lists, inputs and outputs. The contents
    of the lists differs depending on the received arguments:
    inputs list:
    - images: [batch, H, W, C]
    - image_meta: [batch, (meta data)] Image details. See compose_image_meta()
    - rpn_match: [batch, N] Integer (1=positive anchor, -1=negative, 0=neutral)
    - rpn_bbox: [batch, N, (dy, dx, log(dh), log(dw))] Anchor bbox deltas.
    - gt_class_ids: [batch, MAX_GT_INSTANCES] Integer class IDs
    - gt_boxes: [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)]
    - gt_masks: [batch, height, width, MAX_GT_INSTANCES]. The height and width
        are those of the image unless use_mini_mask is True, in which
        case they are defined in MINI_MASK_SHAPE.

    outputs list: Usually empty in regular training. But if detection_targets
        is True then the outputs list contains target class_ids, bbox deltas,
        and masks.
    """
    b = 0  # batch item index
    image_index = -1
    image_ids = np.copy(dataset.image_ids)
    error_count = 0
    no_augmentation_sources = no_augmentation_sources or []

    # Anchors
    # [anchor_count, (y1, x1, y2, x2)]; backbone_shapes here is e.g.
    # [[256 256] [128 128] [64 64] [32 32] [16 16]]
    backbone_shapes = compute_backbone_shapes(config, config.IMAGE_SHAPE)
    anchors = utils.generate_pyramid_anchors(config.RPN_ANCHOR_SCALES,
                                             config.RPN_ANCHOR_RATIOS,
                                             backbone_shapes,
                                             config.BACKBONE_STRIDES,
                                             config.RPN_ANCHOR_STRIDE)

    # Keras requires a generator to run indefinitely.
    while True:
        try:
            # Increment index to pick next image. Shuffle if at the start of an epoch.
            image_index = (image_index + 1) % len(image_ids)
            if shuffle and image_index == 0:
                np.random.shuffle(image_ids)

            # Get GT bounding boxes and masks for image.
            image_id = image_ids[image_index]

            # If the image source is not to be augmented pass None as augmentation
            if dataset.image_info[image_id]['source'] in no_augmentation_sources:
                image, image_meta, gt_class_ids, gt_boxes, gt_masks = \
                    load_image_gt(dataset, config, image_id, augment=augment,
                                  augmentation=None,
                                  use_mini_mask=config.USE_MINI_MASK)
            else:
                image, image_meta, gt_class_ids, gt_boxes, gt_masks = \
                    load_image_gt(dataset, config, image_id, augment=augment,
                                  augmentation=augmentation,
                                  use_mini_mask=config.USE_MINI_MASK)

            # Skip images that have no instances. This can happen in cases
            # where we train on a subset of classes and the image doesn't
            # have any of the classes we care about.
            if not np.any(gt_class_ids > 0):
                continue

            # RPN Targets. rpn_match marks each anchor as positive, negative or neutral.
            rpn_match, rpn_bbox = build_rpn_targets(image.shape, anchors,
                                                    gt_class_ids, gt_boxes, config)

            # Mask R-CNN Targets
            if random_rois:
                rpn_rois = generate_random_rois(
                    image.shape, random_rois, gt_class_ids, gt_boxes)
                if detection_targets:
                    rois, mrcnn_class_ids, mrcnn_bbox, mrcnn_mask =\
                        build_detection_targets(
                            rpn_rois, gt_class_ids, gt_boxes, gt_masks, config)

            # Init batch arrays
            if b == 0:
                batch_image_meta = np.zeros(
                    (batch_size,) + image_meta.shape, dtype=image_meta.dtype)
                batch_rpn_match = np.zeros(
                    [batch_size, anchors.shape[0], 1], dtype=rpn_match.dtype)
                batch_rpn_bbox = np.zeros(
                    [batch_size, config.RPN_TRAIN_ANCHORS_PER_IMAGE, 4], dtype=rpn_bbox.dtype)
                batch_images = np.zeros(
                    (batch_size,) + image.shape, dtype=np.float32)
                batch_gt_class_ids = np.zeros(
                    (batch_size, config.MAX_GT_INSTANCES), dtype=np.int32)
                batch_gt_boxes = np.zeros(
                    (batch_size, config.MAX_GT_INSTANCES, 4), dtype=np.int32)
                batch_gt_masks = np.zeros(
                    (batch_size, gt_masks.shape[0], gt_masks.shape[1],
                     config.MAX_GT_INSTANCES), dtype=gt_masks.dtype)
                if random_rois:
                    batch_rpn_rois = np.zeros(
                        (batch_size, rpn_rois.shape[0], 4), dtype=rpn_rois.dtype)
                    if detection_targets:
                        batch_rois = np.zeros(
                            (batch_size,) + rois.shape, dtype=rois.dtype)
                        batch_mrcnn_class_ids = np.zeros(
                            (batch_size,) + mrcnn_class_ids.shape, dtype=mrcnn_class_ids.dtype)
                        batch_mrcnn_bbox = np.zeros(
                            (batch_size,) + mrcnn_bbox.shape, dtype=mrcnn_bbox.dtype)
                        batch_mrcnn_mask = np.zeros(
                            (batch_size,) + mrcnn_mask.shape, dtype=mrcnn_mask.dtype)

            # If more instances than fits in the array, sub-sample from them.
            if gt_boxes.shape[0] > config.MAX_GT_INSTANCES:
                ids = np.random.choice(
                    np.arange(gt_boxes.shape[0]), config.MAX_GT_INSTANCES, replace=False)
                gt_class_ids = gt_class_ids[ids]
                gt_boxes = gt_boxes[ids]
                gt_masks = gt_masks[:, :, ids]

            # Add to batch
            batch_image_meta[b] = image_meta
            batch_rpn_match[b] = rpn_match[:, np.newaxis]
            batch_rpn_bbox[b] = rpn_bbox
            batch_images[b] = mold_image(image.astype(np.float32), config)
            batch_gt_class_ids[b, :gt_class_ids.shape[0]] = gt_class_ids
            batch_gt_boxes[b, :gt_boxes.shape[0]] = gt_boxes
            batch_gt_masks[b, :, :, :gt_masks.shape[-1]] = gt_masks
            if random_rois:
                batch_rpn_rois[b] = rpn_rois
                if detection_targets:
                    batch_rois[b] = rois
                    batch_mrcnn_class_ids[b] = mrcnn_class_ids
                    batch_mrcnn_bbox[b] = mrcnn_bbox
                    batch_mrcnn_mask[b] = mrcnn_mask
            b += 1

            # Batch full?
            if b >= batch_size:
                inputs = [batch_images, batch_image_meta, batch_rpn_match, batch_rpn_bbox,
                          batch_gt_class_ids, batch_gt_boxes, batch_gt_masks]
                outputs = []

                if random_rois:
                    inputs.extend([batch_rpn_rois])
                    if detection_targets:
                        inputs.extend([batch_rois])
                        # Keras requires that output and targets have the same number of dimensions
                        batch_mrcnn_class_ids = np.expand_dims(
                            batch_mrcnn_class_ids, -1)
                        outputs.extend(
                            [batch_mrcnn_class_ids, batch_mrcnn_bbox, batch_mrcnn_mask])

                yield inputs, outputs

                # start a new batch
                b = 0
        except (GeneratorExit, KeyboardInterrupt):
            raise
        except:
            # Log it and skip the image
            logging.exception("Error processing image {}".format(
                dataset.image_info[image_id]))
            error_count += 1
            if error_count > 5:
                raise

3--Model Inference

In the prediction stage we call the model's detect method, which loads and molds the images to predict and then calls Keras predict (a usage sketch is shown below, followed by the internal predict call).
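
For context, this is roughly how inference is driven from user code (the weight path and image path are illustrative, and config and MODEL_DIR are assumed to be set up as in the demo notebook):

import skimage.io
import mrcnn.model as modellib

model = modellib.MaskRCNN(mode="inference", config=config, model_dir=MODEL_DIR)
model.load_weights("mask_rcnn_coco.h5", by_name=True)

image = skimage.io.imread("some_image.jpg")
results = model.detect([image], verbose=1)   # one result dict per input image
r = results[0]
# r["rois"], r["class_ids"], r["scores"] and r["masks"] describe the detected instances.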

# Run object detection
detections, _, _, mrcnn_mask, _, _, _ =\
    self.keras_model.predict([molded_images, image_metas, anchors], verbose=0)

The predictions are in normalized coordinates and the predicted masks are only 28x28, so the masks must be resized back to the original image size with bilinear interpolation.


def unmold_detections(self, detections, mrcnn_mask, original_image_shape,
                      image_shape, window):
    """Reformats the detections of one image from the format of the neural
    network output to a format suitable for use in the rest of the
    application.

    detections: [N, (y1, x1, y2, x2, class_id, score)] in normalized coordinates
    mrcnn_mask: [N, height, width, num_classes]
    original_image_shape: [H, W, C] Original image shape before resizing
    image_shape: [H, W, C] Shape of the image after resizing and padding
    window: [y1, x1, y2, x2] Pixel coordinates of box in the image where the real
            image is excluding the padding.

    Returns:
    boxes: [N, (y1, x1, y2, x2)] Bounding boxes in pixels
    class_ids: [N] Integer class IDs for each bounding box
    scores: [N] Float probability scores of the class_id
    masks: [height, width, num_instances] Instance masks
    """
    # How many detections do we have?
    # Detections array is padded with zeros. Find the first class_id == 0.
    zero_ix = np.where(detections[:, 4] == 0)[0]
    N = zero_ix[0] if zero_ix.shape[0] > 0 else detections.shape[0]

    # Extract boxes, class_ids, scores, and class-specific masks
    boxes = detections[:N, :4]
    class_ids = detections[:N, 4].astype(np.int32)
    scores = detections[:N, 5]
    masks = mrcnn_mask[np.arange(N), :, :, class_ids]

    # Translate normalized coordinates in the resized image to pixel
    # coordinates in the original image before resizing
    window = utils.norm_boxes(window, image_shape[:2])
    wy1, wx1, wy2, wx2 = window
    shift = np.array([wy1, wx1, wy1, wx1])
    wh = wy2 - wy1  # window height
    ww = wx2 - wx1  # window width
    scale = np.array([wh, ww, wh, ww])
    # Convert boxes to normalized coordinates on the window
    boxes = np.divide(boxes - shift, scale)
    # Convert boxes to pixel coordinates on the original image
    boxes = utils.denorm_boxes(boxes, original_image_shape[:2])

    # Filter out detections with zero area. Happens in early training when
    # network weights are still random
    exclude_ix = np.where(
        (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) <= 0)[0]
    if exclude_ix.shape[0] > 0:
        boxes = np.delete(boxes, exclude_ix, axis=0)
        class_ids = np.delete(class_ids, exclude_ix, axis=0)
        scores = np.delete(scores, exclude_ix, axis=0)
        masks = np.delete(masks, exclude_ix, axis=0)
        N = class_ids.shape[0]

    # Resize masks to original image size and set boundary threshold.
    full_masks = []
    for i in range(N):
        # Convert neural network mask to full size mask
        full_mask = utils.unmold_mask(masks[i], boxes[i], original_image_shape)
        full_masks.append(full_mask)
    full_masks = np.stack(full_masks, axis=-1)\
        if full_masks else np.empty(original_image_shape[:2] + (0,))

    return boxes, class_ids, scores, full_masks

The following code maps a mask predicted by the network back onto the original image:

def unmold_mask(mask, bbox, image_shape):
    """Converts a mask generated by the neural network to a format similar
    to its original shape.
    mask: [height, width] of type float. A small, typically 28x28 mask.
    bbox: [y1, x1, y2, x2]. The box to fit the mask in.

    Returns a binary mask with the same size as the original image.
    """
    threshold = 0.5
    y1, x1, y2, x2 = bbox
    mask = skimage.transform.resize(mask, (y2 - y1, x2 - x1), order=1, mode="constant")
    mask = np.where(mask >= threshold, 1, 0).astype(np.bool)

    # Put the mask in the right location.
    full_mask = np.zeros(image_shape[:2], dtype=np.bool)
    full_mask[y1:y2, x1:x2] = mask
    return full_mask
