https://github.com/matterport/Mask_RCNN
這個是一個基於Keras寫的maskrcnn的源碼,作者寫的非常nice。沒有多餘的問文件,源碼都放在mrcnn中,readme裡面有詳細的介紹,
為了了解maskrcnn的運行流程,最好的辦法就是將代碼邊運行邊調試。從samples下面的coco文件開始運行:
前面首先會加載一些配置文件的數據,暫時用不到沒必要去具體看這些配置文件,等到用到的時候,再回頭去看這些細節。MaskRCNN的整體流程圖為:
首先要看的就是模型的搭建,也就是maskrcnn的基礎網絡結構。
1--模型搭建跳轉到對應的方法中
1.1--在build方法中,首先要創建一些模型的輸入變量。由於此時我們並不知道輸入變量的具體大小,因此我們可以使用Keras的變量進行佔位操作,等到具體訓練的時候根據傳入的參數確定變量的大小。train和inference的時候,都是先build的模型,因此他們共有一個模型,inference的時候,需要的輸入變量比較少。gt_boxes使用了歸一化的操作,也就是所有的邊框的坐標都是0到1之間,這樣做的好處是避免數值大小帶來的預測誤差。
1.2 --搭建backbone網絡搭建特徵提取網絡用於圖片的特徵的提取,maskrcnn使用了金字塔(FPN)的網絡方式進行特徵的提取,使用的網絡是resnet101 ,具體的網絡的可視化可以參考:http://ethereon.github.io/netscope/#/gist/b21e2aae116dc1ac7b50
然後利用不同的特徵去做不同的事情,代碼中的使用方式如下,P2,P3,P4,P5,P6用於rpn網絡提取信息。由於上面都進行了3*3的卷積操作,保證了不同層的特徵的channel數目一樣。
有了這些特徵圖,就可以用這些特徵圖生成anchor。
1.3--RPN網絡搭建rpn網絡,將1.2步得到的金字塔特徵分別輸入到rpn網絡中,得到網絡的分類和回歸值。
rpn_logits: [batch, H, W, 2] Anchor classifier logits (before softmax)
rpn_probs: [batch, W, W, 2] Anchor classifier probabilities.
rpn_bbox: [batch, H, W, (dy, dx, log(dh), log(dw))] Deltas to be
applied to anchors.1.4--ProposalLayer將rpn網路的輸出應用到1.2步得到的anchors,首先對輸出的概率進行排序,保留其中預測為前景色概率大的一部分(具體值可以在配置文件中進行配置),然後選取想對應的anchor,利用rpn的輸出回歸值對anchor進行第一次修正。修正完利用極大抑制方法,刪除其中的一部分anchor。獲的最後的anchor。
#ProposalLayer的作用主要
# 1. 根據rpn網絡,獲取score靠前的前6000個anchor
# 2. 利用rpn_bbox對anchors進行修正
# 3. 捨棄掉修正後邊框超過圖片大小的anchor,由於我們的anchor的坐標的大小是歸一化的,只要坐標不超過0 1區間即可
# 4. 利用非極大抑制的方法獲得最後的anchor1.5--DetectionTargetLayerDetectionTargetLayer的輸入包含了,target_rois, input_gt_class_ids, gt_boxes, input_gt_masks。其中target_rois是ProposalLayer輸出的結果。首先,計算target_rois中的每一個rois和哪一個真實的框gt_boxes iou值,如果最大的iou大於0.5,則被認為是正樣本,負樣本是是iou小於0.5並且和crowd box相交不大的anchor,選擇出了正負樣本,還要保證樣本的均衡性,具體可以才配置文件中進行配置。最後計算了正樣本中的anchor和哪一個真實的框最接近,用真實的框和anchor計算出偏移值,並且將mask的大小resize成28*28的(我猜測利用的是雙線性差值的方式,因為mask的值不是0就是1,0是背景,一是前景)這些都是後面的分類和mask網絡要用到的真實的值。下面是該層的主要代碼:
獲取的就是每個rois和哪個真實的框最接近,計算出和真實框的距離,以及要預測的mask,這些信息都會在網絡的頭的classify和mask網絡
#所使用
def detection_targets_graph(proposals, gt_class_ids, gt_boxes, gt_masks, config):
"""Generates detection targets for one image. Subsamples proposals and
generates target class IDs, bounding box deltas, and masks for each.
Inputs:
proposals: [N, (y1, x1, y2, x2)] in normalized coordinates. Might
be zero padded if there are not enough proposals.
gt_class_ids: [MAX_GT_INSTANCES] int class IDs
gt_boxes: [MAX_GT_INSTANCES, (y1, x1, y2, x2)] in normalized coordinates.
gt_masks: [height, width, MAX_GT_INSTANCES] of boolean type.
Returns: Target ROIs and corresponding class IDs, bounding box shifts,
and masks.
rois: [TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)] in normalized coordinates
class_ids: [TRAIN_ROIS_PER_IMAGE]. Integer class IDs. Zero padded.
deltas: [TRAIN_ROIS_PER_IMAGE, NUM_CLASSES, (dy, dx, log(dh), log(dw))]
Class-specific bbox refinements.
masks: [TRAIN_ROIS_PER_IMAGE, height, width). Masks cropped to bbox
boundaries and resized to neural network output size.
Note: Returned arrays might be zero padded if not enough target ROIs.
"""
# Assertions
asserts = [
tf.Assert(tf.greater(tf.shape(proposals)[0], 0), [proposals],
name="roi_assertion"),
]
with tf.control_dependencies(asserts):
proposals = tf.identity(proposals)
# Remove zero padding
proposals, _ = trim_zeros_graph(proposals, name="trim_proposals")
#去除非零的真實的框,也就是只留下真實存在的有意義的框
gt_boxes, non_zeros = trim_zeros_graph(gt_boxes, name="trim_gt_boxes")
gt_class_ids = tf.boolean_mask(gt_class_ids, non_zeros,
name="trim_gt_class_ids")
gt_masks = tf.gather(gt_masks, tf.where(non_zeros)[:, 0], axis=2,
name="trim_gt_masks")
# Handle COCO crowds
# A crowd box in COCO is a bounding box around several instances. Exclude
# them from training. A crowd box is given a negative class ID.
#在coco數據集中,有的框會標註很多的物體,在訓練中,去掉這些框
crowd_ix = tf.where(gt_class_ids < 0)[:, 0]
non_crowd_ix = tf.where(gt_class_ids > 0)[:, 0]
crowd_boxes = tf.gather(gt_boxes, crowd_ix)
crowd_masks = tf.gather(gt_masks, crowd_ix, axis=2)
#下面就是一張圖片中真實存在的物體用於訓練
gt_class_ids = tf.gather(gt_class_ids, non_crowd_ix)
gt_boxes = tf.gather(gt_boxes, non_crowd_ix)
gt_masks = tf.gather(gt_masks, non_crowd_ix, axis=2)
# Compute overlaps matrix [proposals, gt_boxes]
#計算iou的值
overlaps = overlaps_graph(proposals, gt_boxes)
# Compute overlaps with crowd boxes [anchors, crowds]
crowd_overlaps = overlaps_graph(proposals, crowd_boxes)
crowd_iou_max = tf.reduce_max(crowd_overlaps, axis=1)
no_crowd_bool = (crowd_iou_max < 0.001)
# Determine positive and negative ROIs
roi_iou_max = tf.reduce_max(overlaps, axis=1)
# 1. Positive ROIs are those with >= 0.5 IoU with a GT box
#和真實的框的iou值大於0.5時,被認為是正樣本
positive_roi_bool = (roi_iou_max >= 0.5)
positive_indices = tf.where(positive_roi_bool)[:, 0]
# 2. Negative ROIs are those with < 0.5 with every GT box. Skip crowds.
#負樣本是是iou小於0.5並且和crowd box相交不大的anchor
negative_indices = tf.where(tf.logical_and(roi_iou_max < 0.5, no_crowd_bool))[:, 0]
# Subsample ROIs. Aim for 33% positive
# Positive ROIs
positive_count = int(config.TRAIN_ROIS_PER_IMAGE *
config.ROI_POSITIVE_RATIO)
positive_indices = tf.random_shuffle(positive_indices)[:positive_count]
positive_count = tf.shape(positive_indices)[0]
# Negative ROIs. Add enough to maintain positive:negative ratio.
r = 1.0 / config.ROI_POSITIVE_RATIO
negative_count = tf.cast(r * tf.cast(positive_count, tf.float32), tf.int32) - positive_count
negative_indices = tf.random_shuffle(negative_indices)[:negative_count]
# Gather selected ROIs
#選擇出正負樣本
positive_rois = tf.gather(proposals, positive_indices)
negative_rois = tf.gather(proposals, negative_indices)
# Assign positive ROIs to GT boxes.
#計算正樣本和哪個真實的框最接近
positive_overlaps = tf.gather(overlaps, positive_indices)
roi_gt_box_assignment = tf.cond(
tf.greater(tf.shape(positive_overlaps)[1], 0),
true_fn = lambda: tf.argmax(positive_overlaps, axis=1),
false_fn = lambda: tf.cast(tf.constant([]),tf.int64)
)
roi_gt_boxes = tf.gather(gt_boxes, roi_gt_box_assignment)
roi_gt_class_ids = tf.gather(gt_class_ids, roi_gt_box_assignment)
# Compute bbox refinement for positive ROIs
#用最接近的真實框修正rpn網絡預測的框
deltas = utils.box_refinement_graph(positive_rois, roi_gt_boxes)
deltas /= config.BBOX_STD_DEV
# Assign positive ROIs to GT masks
# Permute masks to [N, height, width, 1]
transposed_masks = tf.expand_dims(tf.transpose(gt_masks, [2, 0, 1]), -1)
# Pick the right mask for each ROI
# 計算和每一個rois最接近的框的mask
roi_masks = tf.gather(transposed_masks, roi_gt_box_assignment)
# Compute mask targets
boxes = positive_rois
if config.USE_MINI_MASK:
# Transform ROI coordinates from normalized image space
# to normalized mini-mask space.
y1, x1, y2, x2 = tf.split(positive_rois, 4, axis=1)
gt_y1, gt_x1, gt_y2, gt_x2 = tf.split(roi_gt_boxes, 4, axis=1)
gt_h = gt_y2 - gt_y1
gt_w = gt_x2 - gt_x1
y1 = (y1 - gt_y1) / gt_h
x1 = (x1 - gt_x1) / gt_w
y2 = (y2 - gt_y1) / gt_h
x2 = (x2 - gt_x1) / gt_w
boxes = tf.concat([y1, x1, y2, x2], 1)
box_ids = tf.range(0, tf.shape(roi_masks)[0])
# crop_and_resize相當於roipolling的操作
masks = tf.image.crop_and_resize(tf.cast(roi_masks, tf.float32), boxes,
box_ids,
config.MASK_SHAPE)
# Remove the extra dimension from masks.
masks = tf.squeeze(masks, axis=3)
# Threshold mask pixels at 0.5 to have GT masks be 0 or 1 to use with
# binary cross entropy loss.
masks = tf.round(masks)
# Append negative ROIs and pad bbox deltas and masks that
# are not used for negative ROIs with zeros.
rois = tf.concat([positive_rois, negative_rois], axis=0)
N = tf.shape(negative_rois)[0]
P = tf.maximum(config.TRAIN_ROIS_PER_IMAGE - tf.shape(rois)[0], 0)
rois = tf.pad(rois, [(0, P), (0, 0)])
roi_gt_boxes = tf.pad(roi_gt_boxes, [(0, N + P), (0, 0)])
roi_gt_class_ids = tf.pad(roi_gt_class_ids, [(0, N + P)])
deltas = tf.pad(deltas, [(0, N + P), (0, 0)])
masks = tf.pad(masks, [[0, N + P], (0, 0), (0, 0)])
return rois, roi_gt_class_ids, deltas, masks最後返回的是:
rois: [TRAIN_ROIS_PER_IMAGE, (y1, x1, y2, x2)] in normalized coordinates
class_ids: [TRAIN_ROIS_PER_IMAGE]. Integer class IDs. Zero padded.
deltas: [TRAIN_ROIS_PER_IMAGE, NUM_CLASSES, (dy, dx, log(dh), log(dw))]
Class-specific bbox refinements.
masks: [TRAIN_ROIS_PER_IMAGE, height, width). Masks cropped to bbox
boundaries and resized to neural network output size.通過rpn網絡得到的anchor,選擇出來正負樣本,並計算出正樣本和真實框的差距,以及要預測的mask的值,這些都是在後面的網絡中計算損失函數需要的真實值。1.6--Feature Pyramid Network Heads(fpn_classifier_graph)該網絡是maskrcnn的最後一層,與之並行的還有一個mask分支,在這裡先介紹一下這個分類網絡。
由1.5得到的roi的大小並不一樣,因此,需要一個網絡類似於fasterrcnn中的roipooling,將rois轉換成大小一樣的特徵圖。maskrcnn中使用的是PyramidROIAlign。PyramidROIAlign首先根據下面的公司計算每一個roi來自於金字塔特徵的P2到P5的哪一層的特徵:
# Assign each ROI to a level in the pyramid based on the ROI area.
y1, x1, y2, x2 = tf.split(boxes, 4, axis=2)
h = y2 - y1
w = x2 - x1
# Use shape of first image. Images in a batch must have the same size.
image_shape = parse_image_meta_graph(image_meta)['image_shape'][0]
# Equation 1 in the Feature Pyramid Networks paper. Account for
# the fact that our coordinates are normalized here.
# e.g. a 224x224 ROI (in pixels) maps to P4
#計算每個roi映射到哪一層的金字塔特徵的輸出上
image_area = tf.cast(image_shape[0] * image_shape[1], tf.float32)
roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))
roi_level = tf.minimum(5, tf.maximum(
2, 4 + tf.cast(tf.round(roi_level), tf.int32)))
roi_level = tf.squeeze(roi_level, 2)然後從對應的特徵圖中取出坐標對應的區域,利用雙線性插值的方式進行pooling操作。PyramidROIAlign會返回resize成相同大小的rois。
將得到的特徵塊輸入到fpn_classifier_graph網絡中,得到分類和回歸值。
下面是fpn_classifier_graph網絡的定義,
def fpn_classifier_graph(rois, feature_maps, image_meta,
pool_size, num_classes, train_bn=True,
fc_layers_size=1024):
"""Builds the computation graph of the feature pyramid network classifier
and regressor heads.
rois: [batch, num_rois, (y1, x1, y2, x2)] Proposal boxes in normalized
coordinates.
feature_maps: List of feature maps from different layers of the pyramid,
[P2, P3, P4, P5]. Each has a different resolution.
- image_meta: [batch, (meta data)] Image details. See compose_image_meta()
pool_size: The width of the square feature map generated from ROI Pooling.
num_classes: number of classes, which determines the depth of the results
train_bn: Boolean. Train or freeze Batch Norm layers
fc_layers_size: Size of the 2 FC layers
Returns:
logits: [N, NUM_CLASSES] classifier logits (before softmax)
probs: [N, NUM_CLASSES] classifier probabilities
bbox_deltas: [N, (dy, dx, log(dh), log(dw))] Deltas to apply to
proposal boxes
"""
# ROI Pooling
# Shape: [batch, num_boxes, pool_height, pool_width, channels]
x = PyramidROIAlign([pool_size, pool_size],
name="roi_align_classifier")([rois, image_meta] + feature_maps)
# Two 1024 FC layers (implemented with Conv2D for consistency)
x = KL.TimeDistributed(KL.Conv2D(fc_layers_size, (pool_size, pool_size), padding="valid"),
name="mrcnn_class_conv1")(x)
x = KL.TimeDistributed(BatchNorm(), name='mrcnn_class_bn1')(x, training=train_bn)
x = KL.Activation('relu')(x)
x = KL.TimeDistributed(KL.Conv2D(fc_layers_size, (1, 1)),
name="mrcnn_class_conv2")(x)
x = KL.TimeDistributed(BatchNorm(), name='mrcnn_class_bn2')(x, training=train_bn)
x = KL.Activation('relu')(x)
shared = KL.Lambda(lambda x: K.squeeze(K.squeeze(x, 3), 2),
name="pool_squeeze")(x)
# Classifier head
mrcnn_class_logits = KL.TimeDistributed(KL.Dense(num_classes),
name='mrcnn_class_logits')(shared)
mrcnn_probs = KL.TimeDistributed(KL.Activation("softmax"),
name="mrcnn_class")(mrcnn_class_logits)
# BBox head
# [batch, boxes, num_classes * (dy, dx, log(dh), log(dw))]
x = KL.TimeDistributed(KL.Dense(num_classes * 4, activation='linear'),
name='mrcnn_bbox_fc')(shared)
# Reshape to [batch, boxes, num_classes, (dy, dx, log(dh), log(dw))]
s = K.int_shape(x)
mrcnn_bbox = KL.Reshape((s[1], num_classes, 4), name="mrcnn_bbox")(x)
return mrcnn_class_logits, mrcnn_probs, mrcnn_bbox返回值為:
Returns:
logits: [N, NUM_CLASSES] classifier logits (before softmax)
probs: [N, NUM_CLASSES] classifier probabilities
bbox_deltas: [N, (dy, dx, log(dh), log(dw))] Deltas to apply to
proposal boxes1.7--build_fpn_mask_graphmask網絡的輸入和1.6網絡的輸入值是一樣的,也會經過PyramidROIAlign(這個地方可以進行一個提取,放在最後的網絡之前)
#創建mask的分類頭
def build_fpn_mask_graph(rois, feature_maps, image_meta,
pool_size, num_classes, train_bn=True):
"""Builds the computation graph of the mask head of Feature Pyramid Network.
rois: [batch, num_rois, (y1, x1, y2, x2)] Proposal boxes in normalized
coordinates.
feature_maps: List of feature maps from different layers of the pyramid,
[P2, P3, P4, P5]. Each has a different resolution.
image_meta: [batch, (meta data)] Image details. See compose_image_meta()
pool_size: The width of the square feature map generated from ROI Pooling.
num_classes: number of classes, which determines the depth of the results
train_bn: Boolean. Train or freeze Batch Norm layers
Returns: Masks [batch, roi_count, height, width, num_classes]
"""
# ROI Pooling
# Shape: [batch, boxes, pool_height, pool_width, channels]
x = PyramidROIAlign([pool_size, pool_size],
name="roi_align_mask")([rois, image_meta] + feature_maps)
# Conv layers
x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
name="mrcnn_mask_conv1")(x)
x = KL.TimeDistributed(BatchNorm(),
name='mrcnn_mask_bn1')(x, training=train_bn)
x = KL.Activation('relu')(x)
x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
name="mrcnn_mask_conv2")(x)
x = KL.TimeDistributed(BatchNorm(),
name='mrcnn_mask_bn2')(x, training=train_bn)
x = KL.Activation('relu')(x)
x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
name="mrcnn_mask_conv3")(x)
x = KL.TimeDistributed(BatchNorm(),
name='mrcnn_mask_bn3')(x, training=train_bn)
x = KL.Activation('relu')(x)
x = KL.TimeDistributed(KL.Conv2D(256, (3, 3), padding="same"),
name="mrcnn_mask_conv4")(x)
x = KL.TimeDistributed(BatchNorm(),
name='mrcnn_mask_bn4')(x, training=train_bn)
x = KL.Activation('relu')(x)
x = KL.TimeDistributed(KL.Conv2DTranspose(256, (2, 2), strides=2, activation="relu"),
name="mrcnn_mask_deconv")(x)
x = KL.TimeDistributed(KL.Conv2D(num_classes, (1, 1), strides=1, activation="sigmoid"),
name="mrcnn_mask")(x)
return x有一個細節需要注意的就是1.6經過PyramidROIAlign得到的特徵圖是7*7大小的,二1.7經過PyramidROIAlign得到的特徵圖大小是14*14
最後的返回值是:
Returns: Masks [batch, roi_count, height, width, num_classes]1.8 --損失函數maskrcnn中總共有五個損失函數,分別是rpn網絡的兩個損失,分類的兩個損失,以及mask分支的損失函數。前四個損失函數與fasterrcnn的損失函數一樣,最後的mask損失函數的採用的是mask分支對於每個RoI有Km2 維度的輸出。K個(類別數)解析度為m*m的二值mask。
因此作者利用了a per-pixel sigmoid,並且定義 Lmask 為平均二值交叉熵損失(the average binary cross-entropy loss).
對於一個屬於第k個類別的RoI, Lmask 僅僅考慮第k個mask(其他的掩模輸入不會貢獻到損失函數中)。這樣的定義會允許對每個類別都會生成掩模,並且不會存在類間競爭。# Losses
rpn_class_loss = KL.Lambda(lambda x: rpn_class_loss_graph(*x), name="rpn_class_loss")(
[input_rpn_match, rpn_class_logits])
rpn_bbox_loss = KL.Lambda(lambda x: rpn_bbox_loss_graph(config, *x), name="rpn_bbox_loss")(
[input_rpn_bbox, input_rpn_match, rpn_bbox])
class_loss = KL.Lambda(lambda x: mrcnn_class_loss_graph(*x), name="mrcnn_class_loss")(
[target_class_ids, mrcnn_class_logits, active_class_ids])
bbox_loss = KL.Lambda(lambda x: mrcnn_bbox_loss_graph(*x), name="mrcnn_bbox_loss")(
[target_bbox, target_class_ids, mrcnn_bbox])
mask_loss = KL.Lambda(lambda x: mrcnn_mask_loss_graph(*x), name="mrcnn_mask_loss")(
[target_mask, target_class_ids, mrcnn_mask])1.9--總的模型上面只是介紹了模型的每一步,要把各個模型串聯起來,才可以形成一個整體的網絡,網絡的整體定義如下:
# Model
inputs = [input_image, input_image_meta,
input_rpn_match, input_rpn_bbox, input_gt_class_ids, input_gt_boxes, input_gt_masks]
if not config.USE_RPN_ROIS:
inputs.append(input_rois)
outputs = [rpn_class_logits, rpn_class, rpn_bbox,
mrcnn_class_logits, mrcnn_class, mrcnn_bbox, mrcnn_mask,
rpn_rois, output_rois,
rpn_class_loss, rpn_bbox_loss, class_loss, bbox_loss, mask_loss]
model = KM.Model(inputs, outputs, name='mask_rcnn')2--模型的訓練在模型訓練的時候,會根據配置文件讀取信息,將數據讀取到一個dataset的對象中,會計算出圖片真實的anchor和 mask在模型的rpn階段計算loss值使用。這個代碼可以設置每次訓練的層數,甚至可以只訓練某基層。訓練的時候最重要的是data_generator的生成:
def data_generator(dataset, config, shuffle=True, augment=False, augmentation=None,
random_rois=0, batch_size=1, detection_targets=False,
no_augmentation_sources=None):
"""A generator that returns images and corresponding target class ids,
bounding box deltas, and masks.
dataset: The Dataset object to pick data from
config: The model config object
shuffle: If True, shuffles the samples before every epoch
augment: (deprecated. Use augmentation instead). If true, apply random
image augmentation. Currently, only horizontal flipping is offered.
augmentation: Optional. An imgaug (https://github.com/aleju/imgaug) augmentation.
For example, passing imgaug.augmenters.Fliplr(0.5) flips images
right/left 50% of the time.
random_rois: If > 0 then generate proposals to be used to train the
network classifier and mask heads. Useful if training
the Mask RCNN part without the RPN.
batch_size: How many images to return in each call
detection_targets: If True, generate detection targets (class IDs, bbox
deltas, and masks). Typically for debugging or visualizations because
in trainig detection targets are generated by DetectionTargetLayer.
no_augmentation_sources: Optional. List of sources to exclude for
augmentation. A source is string that identifies a dataset and is
defined in the Dataset class.
Returns a Python generator. Upon calling next() on it, the
generator returns two lists, inputs and outputs. The contents
of the lists differs depending on the received arguments:
inputs list:
- images: [batch, H, W, C]
- image_meta: [batch, (meta data)] Image details. See compose_image_meta()
- rpn_match: [batch, N] Integer (1=positive anchor, -1=negative, 0=neutral)
- rpn_bbox: [batch, N, (dy, dx, log(dh), log(dw))] Anchor bbox deltas.
- gt_class_ids: [batch, MAX_GT_INSTANCES] Integer class IDs
- gt_boxes: [batch, MAX_GT_INSTANCES, (y1, x1, y2, x2)]
- gt_masks: [batch, height, width, MAX_GT_INSTANCES]. The height and width
are those of the image unless use_mini_mask is True, in which
case they are defined in MINI_MASK_SHAPE.
outputs list: Usually empty in regular training. But if detection_targets
is True then the outputs list contains target class_ids, bbox deltas,
and masks.
"""
b = 0 # batch item index
image_index = -1
image_ids = np.copy(dataset.image_ids)
error_count = 0
no_augmentation_sources = no_augmentation_sources or []
# Anchors
# [anchor_count, (y1, x1, y2, x2)] [[256 256] [128 128][ 64 64][ 32 32][ 16 16]]
backbone_shapes = compute_backbone_shapes(config, config.IMAGE_SHAPE)
anchors = utils.generate_pyramid_anchors(config.RPN_ANCHOR_SCALES,
config.RPN_ANCHOR_RATIOS,
backbone_shapes,
config.BACKBONE_STRIDES,
config.RPN_ANCHOR_STRIDE)
# Keras requires a generator to run indefinitely.
while True:
try:
# Increment index to pick next image. Shuffle if at the start of an epoch.
image_index = (image_index + 1) % len(image_ids)
if shuffle and image_index == 0:
np.random.shuffle(image_ids)
# Get GT bounding boxes and masks for image.
image_id = image_ids[image_index]
# If the image source is not to be augmented pass None as augmentation
if dataset.image_info[image_id]['source'] in no_augmentation_sources:
image, image_meta, gt_class_ids, gt_boxes, gt_masks = \
load_image_gt(dataset, config, image_id, augment=augment,
augmentation=None,
use_mini_mask=config.USE_MINI_MASK)
else:
image, image_meta, gt_class_ids, gt_boxes, gt_masks = \
load_image_gt(dataset, config, image_id, augment=augment,
augmentation=augmentation,
use_mini_mask=config.USE_MINI_MASK)
# Skip images that have no instances. This can happen in cases
# where we train on a subset of classes and the image doesn't
# have any of the classes we care about.
if not np.any(gt_class_ids > 0):
continue
# RPN Targets rpn_match代表anchor是正樣本還是負樣本還是中性樣本
rpn_match, rpn_bbox = build_rpn_targets(image.shape, anchors,
gt_class_ids, gt_boxes, config)
# Mask R-CNN Targets
if random_rois:
rpn_rois = generate_random_rois(
image.shape, random_rois, gt_class_ids, gt_boxes)
if detection_targets:
rois, mrcnn_class_ids, mrcnn_bbox, mrcnn_mask =\
build_detection_targets(
rpn_rois, gt_class_ids, gt_boxes, gt_masks, config)
# Init batch arrays
if b == 0:
batch_image_meta = np.zeros(
(batch_size,) + image_meta.shape, dtype=image_meta.dtype)
batch_rpn_match = np.zeros(
[batch_size, anchors.shape[0], 1], dtype=rpn_match.dtype)
batch_rpn_bbox = np.zeros(
[batch_size, config.RPN_TRAIN_ANCHORS_PER_IMAGE, 4], dtype=rpn_bbox.dtype)
batch_images = np.zeros(
(batch_size,) + image.shape, dtype=np.float32)
batch_gt_class_ids = np.zeros(
(batch_size, config.MAX_GT_INSTANCES), dtype=np.int32)
batch_gt_boxes = np.zeros(
(batch_size, config.MAX_GT_INSTANCES, 4), dtype=np.int32)
batch_gt_masks = np.zeros(
(batch_size, gt_masks.shape[0], gt_masks.shape[1],
config.MAX_GT_INSTANCES), dtype=gt_masks.dtype)
if random_rois:
batch_rpn_rois = np.zeros(
(batch_size, rpn_rois.shape[0], 4), dtype=rpn_rois.dtype)
if detection_targets:
batch_rois = np.zeros(
(batch_size,) + rois.shape, dtype=rois.dtype)
batch_mrcnn_class_ids = np.zeros(
(batch_size,) + mrcnn_class_ids.shape, dtype=mrcnn_class_ids.dtype)
batch_mrcnn_bbox = np.zeros(
(batch_size,) + mrcnn_bbox.shape, dtype=mrcnn_bbox.dtype)
batch_mrcnn_mask = np.zeros(
(batch_size,) + mrcnn_mask.shape, dtype=mrcnn_mask.dtype)
# If more instances than fits in the array, sub-sample from them.
if gt_boxes.shape[0] > config.MAX_GT_INSTANCES:
ids = np.random.choice(
np.arange(gt_boxes.shape[0]), config.MAX_GT_INSTANCES, replace=False)
gt_class_ids = gt_class_ids[ids]
gt_boxes = gt_boxes[ids]
gt_masks = gt_masks[:, :, ids]
# Add to batch
batch_image_meta[b] = image_meta
batch_rpn_match[b] = rpn_match[:, np.newaxis]
batch_rpn_bbox[b] = rpn_bbox
batch_images[b] = mold_image(image.astype(np.float32), config)
batch_gt_class_ids[b, :gt_class_ids.shape[0]] = gt_class_ids
batch_gt_boxes[b, :gt_boxes.shape[0]] = gt_boxes
batch_gt_masks[b, :, :, :gt_masks.shape[-1]] = gt_masks
if random_rois:
batch_rpn_rois[b] = rpn_rois
if detection_targets:
batch_rois[b] = rois
batch_mrcnn_class_ids[b] = mrcnn_class_ids
batch_mrcnn_bbox[b] = mrcnn_bbox
batch_mrcnn_mask[b] = mrcnn_mask
b += 1
# Batch full?
if b >= batch_size:
inputs = [batch_images, batch_image_meta, batch_rpn_match, batch_rpn_bbox,
batch_gt_class_ids, batch_gt_boxes, batch_gt_masks]
outputs = []
if random_rois:
inputs.extend([batch_rpn_rois])
if detection_targets:
inputs.extend([batch_rois])
# Keras requires that output and targets have the same number of dimensions
batch_mrcnn_class_ids = np.expand_dims(
batch_mrcnn_class_ids, -1)
outputs.extend(
[batch_mrcnn_class_ids, batch_mrcnn_bbox, batch_mrcnn_mask])
yield inputs, outputs
# start a new batch
b = 0
except (GeneratorExit, KeyboardInterrupt):
raise
except:
# Log it and skip the image
logging.exception("Error processing image {}".format(
dataset.image_info[image_id]))
error_count += 1
if error_count > 5:
raise3--模型的inference在模型的預測階段,我們調用的是模型的detect方法,首先讀取要預測的圖片。然後調用predict方法
# Run object detection
detections, _, _, mrcnn_mask, _, _, _ =\
self.keras_model.predict([molded_images, image_metas, anchors], verbose=0)得到的預測結果是歸一化以後的結果,預測的mask也是在28*28的大小的結果,因此,要利用雙線性插值的方式將mask的大小轉換到和原始圖片一樣的大小上。
def unmold_detections(self, detections, mrcnn_mask, original_image_shape,
image_shape, window):
"""Reformats the detections of one image from the format of the neural
network output to a format suitable for use in the rest of the
application.
detections: [N, (y1, x1, y2, x2, class_id, score)] in normalized coordinates
mrcnn_mask: [N, height, width, num_classes]
original_image_shape: [H, W, C] Original image shape before resizing
image_shape: [H, W, C] Shape of the image after resizing and padding
window: [y1, x1, y2, x2] Pixel coordinates of box in the image where the real
image is excluding the padding.
Returns:
boxes: [N, (y1, x1, y2, x2)] Bounding boxes in pixels
class_ids: [N] Integer class IDs for each bounding box
scores: [N] Float probability scores of the class_id
masks: [height, width, num_instances] Instance masks
"""
# How many detections do we have?
# Detections array is padded with zeros. Find the first class_id == 0.
zero_ix = np.where(detections[:, 4] == 0)[0]
N = zero_ix[0] if zero_ix.shape[0] > 0 else detections.shape[0]
# Extract boxes, class_ids, scores, and class-specific masks
boxes = detections[:N, :4]
class_ids = detections[:N, 4].astype(np.int32)
scores = detections[:N, 5]
masks = mrcnn_mask[np.arange(N), :, :, class_ids]
# Translate normalized coordinates in the resized image to pixel
# coordinates in the original image before resizing
window = utils.norm_boxes(window, image_shape[:2])
wy1, wx1, wy2, wx2 = window
shift = np.array([wy1, wx1, wy1, wx1])
wh = wy2 - wy1 # window height
ww = wx2 - wx1 # window width
scale = np.array([wh, ww, wh, ww])
# Convert boxes to normalized coordinates on the window
boxes = np.divide(boxes - shift, scale)
# Convert boxes to pixel coordinates on the original image
boxes = utils.denorm_boxes(boxes, original_image_shape[:2])
# Filter out detections with zero area. Happens in early training when
# network weights are still random
exclude_ix = np.where(
(boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) <= 0)[0]
if exclude_ix.shape[0] > 0:
boxes = np.delete(boxes, exclude_ix, axis=0)
class_ids = np.delete(class_ids, exclude_ix, axis=0)
scores = np.delete(scores, exclude_ix, axis=0)
masks = np.delete(masks, exclude_ix, axis=0)
N = class_ids.shape[0]
# Resize masks to original image size and set boundary threshold.
full_masks = []
for i in range(N):
# Convert neural network mask to full size mask
full_mask = utils.unmold_mask(masks[i], boxes[i], original_image_shape)
full_masks.append(full_mask)
full_masks = np.stack(full_masks, axis=-1)\
if full_masks else np.empty(original_image_shape[:2] + (0,))
return boxes, class_ids, scores, full_masks下面的代碼是將神經網絡預測得到的mask轉換到原始的圖片上:
def unmold_mask(mask, bbox, image_shape):
"""Converts a mask generated by the neural network to a format similar
to its original shape.
mask: [height, width] of type float. A small, typically 28x28 mask.
bbox: [y1, x1, y2, x2]. The box to fit the mask in.
Returns a binary mask with the same size as the original image.
"""
threshold = 0.5
y1, x1, y2, x2 = bbox
mask = skimage.transform.resize(mask, (y2 - y1, x2 - x1), order=1, mode="constant")
mask = np.where(mask >= threshold, 1, 0).astype(np.bool)
# Put the mask in the right location.
full_mask = np.zeros(image_shape[:2], dtype=np.bool)
full_mask[y1:y2, x1:x2] = mask
return full_mask