@vivounicorn 2020-03-24

Machine Learning and Artificial Intelligence Techniques - Chapter 8: Object Detection and Recognition

Chapter 8 · Machine Learning · Object Detection · Object Recognition



8. Object Detection and Recognition

The evolution of object detection has roughly followed the timeline below:


For a recognition task, such as deciding whether an image contains a car and which car it is, two problems usually have to be solved: object detection and object recognition. Detection in turn typically requires some form of image segmentation to produce candidate boxes first. The naive approach is to slide a fixed-size window across the whole image, then change the window size and repeat; its drawbacks are obvious: a great deal of redundant computation and limited precision and quality.

8.1 Selective Search

One way around these problems is to borrow ideas from heuristic search and make full use of human prior knowledge. In "Selective Search for Object Recognition", J.R.R. Uijlings et al. proposed a data-driven, class-agnostic heuristic generation method that fuses multiple complementary strategies. Images carry many kinds of information, such as size, shape, colour, texture and occlusion relationships between objects; no single cue solves most cases. For example:


The two cats on the left can be separated by colour but not by texture, whereas the chameleon on the right can only be separated by texture, not by colour.

8.1.1 Design Criteria for Heuristic Proposal Generation

In summary, the design criteria are:

Based on these criteria, the Selective Search algorithm is designed as follows:

The final similarity is a weighted sum of all strategies; the paper simply uses equal weights:

$s(r_i, r_j) = a_1 s_{colour}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j)$, with all $a_k$ equal.

8.1.3 Object Recognition with Selective Search

Training consists of: extracting candidate boxes, extracting features, generating positive and negative samples, and training the model, as illustrated below:


Early image feature extraction typically used HOG or Bag-of-Words features; today CNN features dominate almost completely.
Localization quality is evaluated with Average Best Overlap (ABO) and Mean Average Best Overlap (MABO):

$ABO = \frac{1}{|G^c|} \sum_{g_i^c \in G^c} \max_{l_j \in L} Overlap(g_i^c, l_j)$

where $c$ is a class label, $G^c$ is the set of ground-truth boxes of class $c$, and $L$ is the set of candidate boxes generated by Selective Search. MABO is the mean ABO over all classes.
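The following is a minimal Python sketch of ABO and MABO under these definitions; the box format (x1, y1, x2, y2) and the helper names (overlap, abo, mabo) are illustrative, not taken from the paper's code.

```python
# Minimal sketch of ABO / MABO; boxes are (x1, y1, x2, y2) tuples.
def overlap(a, b):
    """IoU-style overlap between two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def abo(gt_boxes, proposals):
    """Average Best Overlap for one class: for each ground truth, take the
    best overlap achieved by any proposal, then average over ground truths."""
    return sum(max(overlap(g, p) for p in proposals) for g in gt_boxes) / float(len(gt_boxes))

def mabo(gt_by_class, proposals):
    """Mean ABO over all classes; gt_by_class maps class name -> list of boxes."""
    return sum(abo(g, proposals) for g in gt_by_class.values()) / float(len(gt_by_class))
```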

8.1.4 Code in Practice

See AlpacaDB's selectivesearch implementation:

  1. # -*- coding: utf-8 -*-
  2. import skimage.io
  3. import skimage.feature
  4. import skimage.color
  5. import skimage.transform
  6. import skimage.util
  7. import skimage.segmentation
  8. import numpy
  9. # "Selective Search for Object Recognition" by J.R.R. Uijlings et al.
  10. #
  11. # - Modified version with LBP extractor for texture vectorization
  12. def _generate_segments(im_orig, scale, sigma, min_size):
  13. """
  14. segment smallest regions by the algorithm of Felzenswalb and
  15. Huttenlocher
  16. """
  17. # open the Image
  18. im_mask = skimage.segmentation.felzenszwalb(
  19. skimage.util.img_as_float(im_orig), scale=scale, sigma=sigma,
  20. min_size=min_size)
  21. # merge mask channel to the image as a 4th channel
  22. im_orig = numpy.append(
  23. im_orig, numpy.zeros(im_orig.shape[:2])[:, :, numpy.newaxis], axis=2)
  24. im_orig[:, :, 3] = im_mask
  25. return im_orig
  26. def _sim_colour(r1, r2):
  27. """
  28. calculate the sum of histogram intersection of colour
  29. """
  30. return sum([min(a, b) for a, b in zip(r1["hist_c"], r2["hist_c"])])
  31. def _sim_texture(r1, r2):
  32. """
  33. calculate the sum of histogram intersection of texture
  34. """
  35. return sum([min(a, b) for a, b in zip(r1["hist_t"], r2["hist_t"])])
  36. def _sim_size(r1, r2, imsize):
  37. """
  38. calculate the size similarity over the image
  39. """
  40. return 1.0 - (r1["size"] + r2["size"]) / imsize
  41. def _sim_fill(r1, r2, imsize):
  42. """
  43. calculate the fill similarity over the image
  44. """
  45. bbsize = (
  46. (max(r1["max_x"], r2["max_x"]) - min(r1["min_x"], r2["min_x"]))
  47. * (max(r1["max_y"], r2["max_y"]) - min(r1["min_y"], r2["min_y"]))
  48. )
  49. return 1.0 - (bbsize - r1["size"] - r2["size"]) / imsize
  50. def _calc_sim(r1, r2, imsize):
  51. return (_sim_colour(r1, r2) + _sim_texture(r1, r2)
  52. + _sim_size(r1, r2, imsize) + _sim_fill(r1, r2, imsize))
  53. def _calc_colour_hist(img):
  54. """
  55. calculate colour histogram for each region
  56. the size of output histogram will be BINS * COLOUR_CHANNELS(3)
  57. number of bins is 25 as same as [uijlings_ijcv2013_draft.pdf]
  58. extract HSV
  59. """
  60. BINS = 25
  61. hist = numpy.array([])
  62. for colour_channel in (0, 1, 2):
  63. # extracting one colour channel
  64. c = img[:, colour_channel]
  65. # calculate histogram for each colour and join to the result
  66. hist = numpy.concatenate(
  67. [hist] + [numpy.histogram(c, BINS, (0.0, 255.0))[0]])
  68. # L1 normalize
  69. hist = hist / len(img)
  70. return hist
  71. def _calc_texture_gradient(img):
  72. """
  73. calculate texture gradient for entire image
  74. The original SelectiveSearch algorithm proposed Gaussian derivative
  75. for 8 orientations, but we use LBP instead.
  76. output will be [height(*)][width(*)]
  77. """
  78. ret = numpy.zeros((img.shape[0], img.shape[1], img.shape[2]))
  79. for colour_channel in (0, 1, 2):
  80. ret[:, :, colour_channel] = skimage.feature.local_binary_pattern(
  81. img[:, :, colour_channel], 8, 1.0)
  82. return ret
  83. def _calc_texture_hist(img):
  84. """
  85. calculate texture histogram for each region
  86. calculate the histogram of gradient for each colours
  87. the size of output histogram will be
  88. BINS * ORIENTATIONS * COLOUR_CHANNELS(3)
  89. """
  90. BINS = 10
  91. hist = numpy.array([])
  92. for colour_channel in (0, 1, 2):
  93. # mask by the colour channel
  94. fd = img[:, colour_channel]
  95. # calculate histogram for each orientation and concatenate them all
  96. # and join to the result
  97. hist = numpy.concatenate(
  98. [hist] + [numpy.histogram(fd, BINS, (0.0, 1.0))[0]])
  99. # L1 Normalize
  100. hist = hist / len(img)
  101. return hist
  102. def _extract_regions(img):
  103. R = {}
  104. # get hsv image
  105. hsv = skimage.color.rgb2hsv(img[:, :, :3])
  106. # pass 1: count pixel positions
  107. for y, i in enumerate(img):
  108. for x, (r, g, b, l) in enumerate(i):
  109. # initialize a new region
  110. if l not in R:
  111. R[l] = {
  112. "min_x": 0xffff, "min_y": 0xffff,
  113. "max_x": 0, "max_y": 0, "labels": [l]}
  114. # bounding box
  115. if R[l]["min_x"] > x:
  116. R[l]["min_x"] = x
  117. if R[l]["min_y"] > y:
  118. R[l]["min_y"] = y
  119. if R[l]["max_x"] < x:
  120. R[l]["max_x"] = x
  121. if R[l]["max_y"] < y:
  122. R[l]["max_y"] = y
  123. # pass 2: calculate texture gradient
  124. tex_grad = _calc_texture_gradient(img)
  125. # pass 3: calculate colour histogram of each region
  126. for k, v in R.items():
  127. # colour histogram
  128. masked_pixels = hsv[:, :, :][img[:, :, 3] == k]
  129. R[k]["size"] = len(masked_pixels / 4)
  130. R[k]["hist_c"] = _calc_colour_hist(masked_pixels)
  131. # texture histogram
  132. R[k]["hist_t"] = _calc_texture_hist(tex_grad[:, :][img[:, :, 3] == k])
  133. return R
  134. def _extract_neighbours(regions):
  135. def intersect(a, b):
  136. if (a["min_x"] < b["min_x"] < a["max_x"]
  137. and a["min_y"] < b["min_y"] < a["max_y"]) or (
  138. a["min_x"] < b["max_x"] < a["max_x"]
  139. and a["min_y"] < b["max_y"] < a["max_y"]) or (
  140. a["min_x"] < b["min_x"] < a["max_x"]
  141. and a["min_y"] < b["max_y"] < a["max_y"]) or (
  142. a["min_x"] < b["max_x"] < a["max_x"]
  143. and a["min_y"] < b["min_y"] < a["max_y"]):
  144. return True
  145. return False
  146. R = regions.items()
  147. neighbours = []
  148. for cur, a in enumerate(R[:-1]):
  149. for b in R[cur + 1:]:
  150. if intersect(a[1], b[1]):
  151. neighbours.append((a, b))
  152. return neighbours
  153. def _merge_regions(r1, r2):
  154. new_size = r1["size"] + r2["size"]
  155. rt = {
  156. "min_x": min(r1["min_x"], r2["min_x"]),
  157. "min_y": min(r1["min_y"], r2["min_y"]),
  158. "max_x": max(r1["max_x"], r2["max_x"]),
  159. "max_y": max(r1["max_y"], r2["max_y"]),
  160. "size": new_size,
  161. "hist_c": (
  162. r1["hist_c"] * r1["size"] + r2["hist_c"] * r2["size"]) / new_size,
  163. "hist_t": (
  164. r1["hist_t"] * r1["size"] + r2["hist_t"] * r2["size"]) / new_size,
  165. "labels": r1["labels"] + r2["labels"]
  166. }
  167. return rt
  168. def selective_search(
  169. im_orig, scale=1.0, sigma=0.8, min_size=50):
  170. '''Selective Search
  171. Parameters
  172. ----------
  173. im_orig : ndarray
  174. Input image
  175. scale : int
  176. Free parameter. Higher means larger clusters in felzenszwalb segmentation.
  177. sigma : float
  178. Width of Gaussian kernel for felzenszwalb segmentation.
  179. min_size : int
  180. Minimum component size for felzenszwalb segmentation.
  181. Returns
  182. -------
  183. img : ndarray
  184. image with region label
  185. region label is stored in the 4th value of each pixel [r,g,b,(region)]
  186. regions : array of dict
  187. [
  188. {
  189. 'rect': (left, top, right, bottom),
  190. 'labels': [...]
  191. },
  192. ...
  193. ]
  194. '''
  195. assert im_orig.shape[2] == 3, "3ch image is expected"
  196. # load image and get smallest regions
  197. # region label is stored in the 4th value of each pixel [r,g,b,(region)]
  198. img = _generate_segments(im_orig, scale, sigma, min_size)
  199. if img is None:
  200. return None, {}
  201. imsize = img.shape[0] * img.shape[1]
  202. R = _extract_regions(img)
  203. # extract neighbouring information
  204. neighbours = _extract_neighbours(R)
  205. # calculate initial similarities
  206. S = {}
  207. for (ai, ar), (bi, br) in neighbours:
  208. S[(ai, bi)] = _calc_sim(ar, br, imsize)
  209. # hierarchal search
  210. while S != {}:
  211. # get highest similarity
  212. i, j = sorted(S.items(), cmp=lambda a, b: cmp(a[1], b[1]))[-1][0]
  213. # merge corresponding regions
  214. t = max(R.keys()) + 1.0
  215. R[t] = _merge_regions(R[i], R[j])
  216. # mark similarities for regions to be removed
  217. key_to_delete = []
  218. for k, v in S.items():
  219. if (i in k) or (j in k):
  220. key_to_delete.append(k)
  221. # remove old similarities of related regions
  222. for k in key_to_delete:
  223. del S[k]
  224. # calculate similarity set with the new region
  225. for k in filter(lambda a: a != (i, j), key_to_delete):
  226. n = k[1] if k[0] in (i, j) else k[0]
  227. S[(t, n)] = _calc_sim(R[t], R[n], imsize)
  228. regions = []
  229. for k, r in R.items():
  230. regions.append({
  231. 'rect': (
  232. r['min_x'], r['min_y'],
  233. r['max_x'] - r['min_x'], r['max_y'] - r['min_y']),
  234. 'size': r['size'],
  235. 'labels': r['labels']
  236. })
  237. return img, regions
  1. # -*- coding: utf-8 -*-
  2. import matplotlib
  3. matplotlib.use("Agg")
  4. import matplotlib.pyplot as plt
  5. import skimage.data
  6. import skimage.io
  7. from skimage.io import use_plugin,imread
  8. import matplotlib.patches as mpatches
  9. from matplotlib.pyplot import savefig
  10. import selectivesearch
  11. def main():
  12. # loading astronaut image
  13. #img = skimage.data.astronaut()
  14. use_plugin('pil')
  15. img = imread('car.jpg', as_grey=False)
  16. # perform selective search
  17. img_lbl, regions = selectivesearch.selective_search(
  18. img, scale=500, sigma=0.9, min_size=10)
  19. candidates = set()
  20. for r in regions:
  21. # excluding same rectangle (with different segments)
  22. if r['rect'] in candidates:
  23. continue
  24. # excluding regions smaller than 2000 pixels
  25. if r['size'] < 2000:
  26. continue
  27. # distorted rects
  28. x, y, w, h = r['rect']
  29. if w / h > 1.2 or h / w > 1.2:
  30. continue
  31. candidates.add(r['rect'])
  32. # draw rectangles on the original image
  33. plt.figure()
  34. fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(6, 6))
  35. ax.imshow(img)
  36. for x, y, w, h in candidates:
  37. print x, y, w, h
  38. rect = mpatches.Rectangle(
  39. (x, y), w, h, fill=False, edgecolor='red', linewidth=1)
  40. ax.add_patch(rect)
  41. #plt.show()
  42. savefig('MyFig.jpg')
  43. if __name__ == "__main__":
  44. main()

The original car.jpg:


The resulting detections:


8.2 OverFeat

Computer vision has three canonical tasks: classification (recognition), localization and detection; from left to right each task subsumes the previous one, so the difficulty increases. OverFeat, introduced in the 2014 paper "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks", is a CNN-based feature extraction framework. Its biggest contribution is solving classification, localization and detection within one unified framework, and showing that a point on a feature map can be mapped back to a region of the original image, so that operations on the original image can instead be carried out on the feature map; this idea had a lasting influence on later detection algorithms. OverFeat won the localization task (task 3) of ImageNet 2013 and also performed well on detection and classification.

8.2.1 OverFeat for Classification

The network borrows AlexNet's structure, with some architectural changes and improved online inference efficiency:


Compared with AlexNet the structure is almost identical; the differences are:

No LRN layers, i.e. no extra normalization
Non-overlapping pooling regions
Smaller strides in the first two layers, which produce larger feature maps and improve accuracy

Figure (a) shows a feature map with 20 activations after the 5th convolutional layer. With stride 3 and non-overlapping pooling there are three possible groupings (normally only the first is used):

Δ=0 groups: [1,2,3], [4,5,6], [7,8,9], ..., [16,17,18]
Δ=1 groups: [2,3,4], [5,6,7], [8,9,10], ..., [17,18,19]
Δ=2 groups: [3,4,5], [6,7,8], [9,10,11], ..., [18,19,20]

In the two-dimensional case, the input image passes through the (fully convolutional) network up to the 5th convolutional layer, giving a set of feature maps. A 3x3 filter is then slid over the feature map (note: not over the original image, which saves a large amount of computation). Following the scheme above, the sliding is performed 9 times, starting from offsets (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2). Each resulting feature map is passed through the three FC layers, giving several groups of features, which are finally concatenated into the feature vector used for classification.

Green denotes the convolution kernels and blue the feature maps. When the input is larger than the canonical size, extra computation happens in the yellow region, and the final output is a matrix rather than a single value; various strategies can produce the final result, the simplest being to use the matrix average as the final classification score.
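A small numpy sketch of the 1-D offset pooling described above (20 activations, window and stride 3, offsets Δ = 0, 1, 2); the variable names are illustrative.

```python
import numpy as np

# 1-D offset pooling: non-overlapping max pooling with window/stride 3,
# repeated for the three offsets delta = 0, 1, 2 shown above.
feat = np.arange(1, 21)          # stand-in for the 20 conv-5 activations
window = 3

for delta in range(3):
    shifted = feat[delta:]
    n = len(shifted) // window   # number of complete non-overlapping windows
    groups = shifted[:n * window].reshape(n, window)
    pooled = groups.max(axis=1)
    print(delta, groups.tolist(), pooled.tolist())
```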

8.2.2 OverFeat for Localization

The pooled output of layer 5 (256 channels) is taken as input. Viewed as a fully convolutional network, it passes through a 4096-channel fully connected layer and then a 1024-channel fully connected layer. As before, offset pooling and sliding windows are used, and for each class a 4-channel matrix is produced, the 4 channels being the coordinates of the four edges of the bounding box.


8.2.3 OverFeat for Detection

Similar to classification, but position information must also be considered. The same network is shared for feature extraction, and a "background" class is added to the classifier.

8.2.4 Code in Practice

See: OverFeat

8.3 R-CNN

For many years object detection relied on sliding windows, which is computationally inefficient. Meanwhile CNNs had shown outstanding results on ImageNet classification, so leveraging those models and the large amount of ImageNet training data became an attractive research direction. R-CNN was proposed by Ross Girshick et al. in "Rich feature hierarchies for accurate object detection and semantic segmentation"; to some extent OverFeat can be seen as a special case of R-CNN. R-CNN was highly influential in detection. Its highlights: use Selective Search instead of traditional sliding windows to generate candidate boxes and use a CNN to extract features; apply classification and regression jointly in detection; and, when training data are scarce, pre-train on related domain data (transfer learning) and then fine-tune on the target dataset.

8.3.1 IoU

IoU (intersection over union) measures bounding-box localization accuracy. Its definition resembles the Jaccard index: with A the manually annotated box and B the predicted box,

$IoU(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$
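A minimal Python sketch of IoU under this definition; the box format (x1, y1, x2, y2) is an assumption.

```python
# Minimal IoU sketch; boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
def iou(a, b):
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```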



8.3.2 NMS

NMS (non-maximum suppression) is used in detection to remove highly overlapping duplicate candidate boxes according to their confidence scores, improving the efficiency of the detector.
For example, the original image:


The image with candidate boxes:


After NMS:


Reference code: Non-Maximum Suppression for Object Detection in Python
nms.py

  1. # import the necessary packages
  2. import numpy as np
  3. # Felzenszwalb et al.
  4. def non_max_suppression_slow(boxes, overlapThresh):
  5. # if there are no boxes, return an empty list
  6. if len(boxes) == 0:
  7. return []
  8. # initialize the list of picked indexes
  9. pick = []
  10. # grab the coordinates of the bounding boxes
  11. x1 = boxes[:,0]
  12. y1 = boxes[:,1]
  13. x2 = boxes[:,2]
  14. y2 = boxes[:,3]
  15. scores = boxes[:, 4]
  16. # compute the area of the bounding boxes and sort the bounding
  17. # boxes by the bottom-right y-coordinate of the bounding box
  18. area = (x2 - x1 + 1) * (y2 - y1 + 1)
  19. idxs = np.argsort(scores)
  20. # keep looping while some indexes still remain in the indexes
  21. # list
  22. while len(idxs) > 0:
  23. # grab the last index in the indexes list, add the index
  24. # value to the list of picked indexes, then initialize
  25. # the suppression list (i.e. indexes that will be deleted)
  26. # using the last index
  27. last = len(idxs) - 1
  28. i = idxs[last]
  29. pick.append(i)
  30. suppress = [last]
  31. # loop over all indexes in the indexes list
  32. for pos in xrange(0, last):
  33. # grab the current index
  34. j = idxs[pos]
  35. # find the largest (x, y) coordinates for the start of
  36. # the bounding box and the smallest (x, y) coordinates
  37. # for the end of the bounding box
  38. xx1 = max(x1[i], x1[j])
  39. yy1 = max(y1[i], y1[j])
  40. xx2 = min(x2[i], x2[j])
  41. yy2 = min(y2[i], y2[j])
  42. # compute the width and height of the bounding box
  43. w = max(0, xx2 - xx1 + 1)
  44. h = max(0, yy2 - yy1 + 1)
  45. # compute the ratio of overlap between the computed
  46. # bounding box and the bounding box in the area list
  47. overlap = float(w * h) / area[j]
  48. # if there is sufficient overlap, suppress the
  49. # current bounding box
  50. if overlap > overlapThresh:
  51. suppress.append(pos)
  52. # delete all indexes from the index list that are in the
  53. # suppression list
  54. idxs = np.delete(idxs, suppress)
  55. # return only the bounding boxes that were picked
  56. return boxes[pick]

nms_slow.py

  1. # import the necessary packages
  2. from pyimagesearch.nms import non_max_suppression_slow
  3. import numpy as np
  4. import cv2
  5. # construct a list containing the images that will be examined
  6. # along with their respective bounding boxes
  7. # the last field is the classification confidence * 100
  8. images = [
  9. ("images/333.jpg", np.array([
  10. (285,293,713,679,96),
  11. (9,309,161,719,90),
  12. (703,259,959,659,93),
  13. (291,309,693,663,90),
  14. (1,371,155,621,80),
  15. (511,347,681,637,89),
  16. (293,587,721,671,70),
  17. (757,469,957,641,60)]))]
  18. # loop over the images
  19. for (imagePath, boundingBoxes) in images:
  20. # load the image and clone it
  21. print "[x] %d initial bounding boxes" % (len(boundingBoxes))
  22. image = cv2.imread(imagePath)
  23. orig = image.copy()
  24. # loop over the bounding boxes for each image and draw them
  25. for (startX, startY, endX, endY, c) in boundingBoxes:
  26. cv2.rectangle(orig, (startX, startY), (endX, endY), (0, 0, 255), 2)
  27. # perform non-maximum suppression on the bounding boxes
  28. pick = non_max_suppression_slow(boundingBoxes, 0.3)
  29. print "[x] after applying non-maximum, %d bounding boxes" % (len(pick))
  30. # loop over the picked bounding boxes and draw them
  31. for (startX, startY, endX, endY,c) in pick:
  32. cv2.rectangle(image, (startX, startY), (endX, endY), (0, 255, 0), 2)
  33. # display the images
  34. cv2.imshow("Original", orig)
  35. cv2.imshow("After NMS", image)
  36. cv2.waitKey(0)

8.3.3 mAP

First, what is AP? We use the definition adopted by the PASCAL VOC challenge from 2010 onwards.
Suppose there are $N$ test samples of which $M$ are positives. As positives are recovered one by one we obtain $M$ recall levels $\frac{1}{M}, \frac{2}{M}, \dots, \frac{M}{M}$. For each recall level take the maximum precision achieved at that recall or higher, then average these $M$ precision values to obtain AP.
As an example, consider a car / not-car classifier with 30 test samples, whose predictions and labels are:

No. Prediction Label
1 0.88 1
2 0.76 0
3 0.56 0
4 0.92 0
5 0.10 1
6 0.77 1
7 0.23 0
8 0.34 0
9 0.35 0
10 0.66 1
11 0.56 0
12 0.45 1
13 0.93 1
14 0.97 0
15 0.81 1
16 0.78 0
17 0.66 0
18 0.54 0
19 0.43 1
20 0.31 0
21 0.22 0
22 0.12 0
23 0.02 0
24 0.05 1
25 0.15 0
26 0.01 0
27 0.77 1
28 0.37 0
29 0.43 1
30 0.99 1

Sorted by predicted score in descending order:

No. Prediction Label
30 0.99 1
14 0.97 0
13 0.93 1
4 0.92 0
1 0.88 1
15 0.81 1
16 0.78 0
6 0.77 1
27 0.77 1
2 0.76 0
10 0.66 1
17 0.66 0
3 0.56 0
11 0.56 0
18 0.54 0
12 0.45 1
19 0.43 1
29 0.43 1
28 0.37 0
9 0.35 0
8 0.34 0
20 0.31 0
7 0.23 0
21 0.22 0
25 0.15 0
22 0.12 0
5 0.10 1
24 0.05 1
23 0.02 0
26 0.01 0

The AP computation is as follows (note the similarities and differences with AUC):

No. Prediction Label Precision Recall(r) Max Precision with Recall(r'≥r) AP
30 0.99 1 1/1=1 1/12=0.08 1 0.609
14 0.97 0 1/2=0.5 1/12=0.08
13 0.93 1 2/3=0.67 2/12=0.17 0.67
4 0.92 0 2/4=0.5 2/12=0.17
1 0.88 1 3/5=0.6 3/12=0.25 0.6
15 0.81 1 4/6=0.67 4/12=0.33 0.67
16 0.78 0 4/7=0.57 4/12=0.33
6 0.77 1 5/8=0.63 5/12=0.42 0.63
27 0.77 1 6/9=0.67 6/12=0.5 0.67
2 0.76 0 6/10=0.6 6/12=0.5
10 0.66 1 7/11=0.64 7/12=0.58 0.64
17 0.66 0 7/12=0.58 7/12=0.58
3 0.56 0 7/13=0.54 7/12=0.58
11 0.56 0 7/14=0.5 7/12=0.58
18 0.54 0 7/15=0.47 7/12=0.58
12 0.45 1 8/16=0.5 8/12=0.67 0.5
19 0.43 1 9/17=0.53 9/12=0.75 0.53
29 0.43 1 10/18=0.56 10/12=0.83 0.56
28 0.37 0 10/19=0.53 10/12=0.83
9 0.35 0 10/20=0.5 10/12=0.83
8 0.34 0 10/21=0.48 10/12=0.83
20 0.31 0 10/22=0.45 10/12=0.83
7 0.23 0 10/23=0.43 10/12=0.83
21 0.22 0 10/24=0.42 10/12=0.83
25 0.15 0 10/25=0.4 10/12=0.83
22 0.12 0 10/26=0.38 10/12=0.83
5 0.1 1 11/27=0.41 11/12=0.92 0.41
24 0.05 1 12/28=0.43 12/12=1 0.43
23 0.02 0 12/29=0.41 12/12=1
26 0.01 0 12/30=0.4 12/12=1

mAP is the arithmetic mean of the per-class AP values.
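As a sanity check, here is a minimal sketch of the AP computation defined above (for each positive, take the maximum precision at its recall or higher, then average). On the 30 predictions above it gives roughly 0.62, slightly higher than the tabulated 0.609, because the table takes each positive's own precision at a few rows rather than the running maximum; the function name and input format are illustrative.

```python
def average_precision(scored_labels):
    """scored_labels: iterable of (score, label) pairs, label in {0, 1}."""
    ranked = sorted(scored_labels, key=lambda x: -x[0])
    n_pos = sum(label for _, label in ranked)
    tp = 0
    prec_label = []                            # (precision after k predictions, label)
    for k, (_, label) in enumerate(ranked, 1):
        tp += label
        prec_label.append((tp / float(k), label))
    ap, running_max = 0.0, 0.0
    # walk from the lowest rank upwards so "max precision at recall >= r"
    # is simply a running maximum
    for prec, label in reversed(prec_label):
        running_max = max(running_max, prec)
        if label:
            ap += running_max
    return ap / n_pos
```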

8.3.4 How R-CNN Works

Training. The whole process has four steps:


These four steps are trained independently of one another. With hindsight, several improvements suggest themselves:
1) do classification and regression in one network with shared features;
2) make the network adapt to the input image size;
3) move candidate-box generation into the same network so it also shares features;
4) drop the SVM classifier and fold classification directly into the neural network;
5) avoid running feature extraction once per candidate box.

The test-time pipeline is as follows:

8.3.5 Code in Practice

The author's code is excellent; see: R-CNN: Region-based Convolutional Neural Networks

8.4 SPP-Net

SPP-Net was proposed by Kaiming He et al. in "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition". It mainly solves two problems:
1. The CNN input image no longer has to be a fixed size (FCNs later also address this);
2. Following OverFeat, features are extracted only once for the whole image; some operations are performed on the feature map rather than on the original image, and feature-map points can be mapped back to the original image.

8.4.1 Problem Recap

Earlier CNNs required a fixed-size input. The benefit is a relatively simple structure and low computation; the downside is that every image must be preprocessed, which loses information from the original image or introduces noise. The usual training and inference pipeline was:


The common resizing operations are cropping and warping, for example:


Looking at the structure of a CNN, convolutional and pooling layers impose no constraint on the input size; only the fully connected layers require fixed-size input, so the modification targets the input to the fully connected layers. Feature visualization also shows that feature maps retain the spatial information of the image, so the new method must preserve spatial information as well. The paper therefore introduces an SPP layer, and the pipeline becomes:


8.4.2 SPP in Detail

The problem can be viewed as finding a mapping with variable-size input, fixed-size output, and preserved spatial information. Three quantities are involved: the feature map size, the number of bins (borrowing the Bag-of-Words idea of "Video Google: A Text Retrieval Approach to Object Matching in Videos", this fixes the feature dimensionality), and the pooling stride. The feature map size is not fixed while the number of bins is, so the only quantity that can adapt is the pooling window and stride.
Assume the last convolutional layer produces an $a \times a$ feature map and $n \times n$ bins are desired; then the pooling window size is $\lceil a/n \rceil$ and the stride is $\lfloor a/n \rfloor$. For example:


Pooling within each bin can be max pooling or any other pooling.

SPP also supports multi-scale features; for example the three scales 4x4, 2x2 and 1x1 are concatenated into a 21x256-dimensional feature vector:
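A minimal numpy sketch of SPP on a single channel, assuming pyramid levels of 4x4, 2x2 and 1x1 bins, a feature map at least 4x4 in size, and the window $\lceil a/n \rceil$ / stride $\lfloor a/n \rfloor$ rule above; function and variable names are illustrative.

```python
import math
import numpy as np

def spp_one_channel(fmap, levels=(4, 2, 1)):
    """Pool one channel into 4*4 + 2*2 + 1 = 21 values, regardless of input size."""
    a_h, a_w = fmap.shape
    pooled = []
    for n in levels:
        win_h, str_h = math.ceil(a_h / n), math.floor(a_h / n)
        win_w, str_w = math.ceil(a_w / n), math.floor(a_w / n)
        for i in range(n):
            for j in range(n):
                patch = fmap[i * str_h : i * str_h + win_h,
                             j * str_w : j * str_w + win_w]
                pooled.append(patch.max())     # max pooling inside each bin
    return np.array(pooled)

print(spp_one_channel(np.random.rand(13, 13)).shape)  # (21,), independent of input size
```

Applied to all 256 channels of the conv5 feature map, this yields the fixed 21x256-dimensional vector mentioned above.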


8.4.3 Receptive Field

The term receptive field comes from biology. Levine and Shefner, in "Fundamentals of Sensation and Perception", define it as the region in which a stimulus causes a particular neuron to respond. For instance, when a person looks at part of an object, the stimulus is projected onto the retina and passed to the brain, activating a certain region (the area framed in orange).


Any point on any feature map produced by any convolutional or pooling layer of a CNN corresponds to a region of the original image; that region is the point's receptive field. For example, the red, green and orange points have different receptive fields:


The receptive-field size depends on the following two factors, but not on whether padding is used:
1. the filter sizes;
2. the strides.

8.4.4 Mapping Between the Feature Map and the Original Image

Because SPP extracts features from the original image only once, a large amount of repeated work is saved. Moreover, because feature points can be mapped back to the image, SPP pooling for all candidate boxes can be done directly on the feature map of the last convolutional layer (otherwise the feature maps over every box's receptive field would have to be computed, at enormous cost).
See "R-CNN minus R" for details.
The simple mapping method is:
Pad every convolutional and pooling layer of the CNN so that any point of the original image corresponds one-to-one to a point on the convolved or pooled map (so no border information is lost).
Assume:
1. any layer's kernel size is $k$;
2. each layer's padding is $\lfloor k/2 \rfloor$;
3. a point of the original image has (per-axis) coordinate $x$, and its position on some feature map is $x'$;
4. the product of all strides from the original image to that feature map is $S$.
Then the left-top corner of a candidate box maps onto any feature map as

$x' = \lfloor x / S \rfloor + 1$

and the right-bottom corner as

$x' = \lceil x / S \rceil - 1$
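A minimal sketch of this simple corner mapping, assuming every layer is padded with $\lfloor k/2 \rfloor$ and the accumulated stride is S = 16 (as for VGG16 up to conv5_3); the function name is illustrative.

```python
import math

def box_to_feature_map(x1, y1, x2, y2, S=16):
    """Map an original-image box to (1-based) feature-map coordinates."""
    fx1 = math.floor(x1 / S) + 1     # left-top corner: round down, then +1
    fy1 = math.floor(y1 / S) + 1
    fx2 = math.ceil(x2 / S) - 1      # right-bottom corner: round up, then -1
    fy2 = math.ceil(y2 / S) - 1
    return fx1, fy1, fx2, fy2

print(box_to_feature_map(64, 32, 320, 240))   # -> (5, 3, 19, 14)
```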

The general mapping method applies the following relation layer by layer (0-based coordinates):

$p_{l-1} = s_l \cdot p_l + \left(\frac{k_l - 1}{2} - padding_l\right)$

where:
$p_l$ is the receptive-field center coordinate of the feature point on the feature map of layer $l$ (layer 0 being the original image);
$l$ indexes the layer that produced the feature map the point lies on, and the relation is applied repeatedly down to the original image;
$s_l$ is the stride of layer $l$;
$k_l$ is the filter size of layer $l$;
$padding_l$ is the padding of layer $l$.
Conversely, this tells us where any candidate box of the original image lies on any feature map.

The receptive-field size is computed top-down, propagating from the current layer towards the input layer, as follows:
Assume the feature point lies on the feature map of layer $L$, and let $r_l$ denote its receptive-field size on layer $l$ (so $r_0$ is the size on the original image). Starting from $r_L = 1$:

$r_{l-1} = (r_l - 1) \cdot s_l + k_l$

Take the following two figures as examples:

8.4.5 Code in Practice

  1. # -*- coding: utf-8 -*-
  2. #一层表示为一个三元组: [filter size, stride, padding]
  3. import math
  4. def forword(conv, layerIn):
  5. n_in = layerIn
  6. k = conv[0]
  7. s = conv[1]
  8. p = conv[2]
  9. return math.floor((n_in - k + 2*p)/s) + 1
  10. def alexnet():
  11. convnet = [[],[11,4,0],[3,2,0],[5,1,2],[3,2,0],[3,1,1],[3,1,1],[3,1,1],[3,2,0],[6,1,0], [1, 1, 0]]
  12. layer_names = [['input'],'conv1','pool1','conv2','pool2','conv3','conv4','conv5','pool5','fc6-conv', 'fc7-conv']
  13. return [convnet, layer_names]
  14. def testnet():
  15. convnet = [[],[2,1,0],[3,3,1]]
  16. layer_names = [['input'],'conv1','conv2']
  17. return [convnet, layer_names]
  18. # layerid >= 1
  19. def receptivefield(net, layerid):
  20. if layerid > len(net[0]):
  21. print '[error] receptivefield:no such layerid!'
  22. return 0
  23. rf = 1
  24. for i in reversed(range(layerid)):
  25. filtersize, stride, padding = net[0][i+1]
  26. rf = (rf - 1)*stride + filtersize
  27. print ' 感受野大小为:%d.' % (int(rf))
  28. return rf
  29. def anylayerout(net, layerin, layerid):
  30. if layerid > len(net[0]):
  31. print '[error] anylayerout:no such layerid!'
  32. return 0
  33. for i in range(layerid):
  34. if i == 0:
  35. fout = forword(net[0][i+1], layerin)
  36. continue
  37. fout = forword(net[0][i+1], fout)
  38. print '当前层为:%s, 输出节点维度为:%d.' % (net[1][layerid], int(fout))
  39. #x,y>=1
  40. def receptivefieldcenter(net, layerid, x, y):
  41. if layerid > len(net[0]):
  42. print '[error] receptivefieldcenter:no such layerid!'
  43. return 0
  44. al = 1
  45. bl = 1
  46. for i in range(layerid):
  47. filtersize, stride, padding = net[0][i+1]
  48. al = al * stride
  49. ss = 1
  50. for j in range(i):
  51. fsize, std, pad = net[0][j+1]
  52. ss = ss * std
  53. bl = bl + ss * (float(filtersize-1)/2 - padding)
  54. xi0 = al * (x - 1) + float(bl)
  55. yi0 = al * (y - 1) + bl
  56. print ' 该层上的特征点(%d,%d)在原图的感受野中心坐标为:(%.1f,%.1f).' % (int(x), int(y), float(xi0), float(yi0))
  57. return (xi0, yi0)
  58. # net:为某个CNN网络
  59. # insize:为输入层大小
  60. # totallayers:为除了输入层外的所有层个数
  61. # x,y为某层特征点坐标
  62. def printlayer(net, insize, totallayers, x, y):
  63. for i in range(totallayers):
  64. # 计算每一层的输出大小
  65. anylayerout(net, insize, i+1)
  66. # 计算每层的感受野大小
  67. receptivefield(net, i+1)
  68. # 计算feature map上(x,y)点在原图感受野的中心位置坐标
  69. receptivefieldcenter(net, i+1, x, y)
  70. if __name__ == '__main__':
  71. #net = testnet()
  72. #printlayer(net, insize=6, totallayers=2, x=1, y=1)
  73. net = alexnet()
  74. printlayer(net, insize=227, totallayers=8, x=2, y=3)

8.5 Fast R-CNN

"Fast R-CNN" solved the following problems of R-CNN + SPP-net:

8.5.1 Algorithm Overview

The basic steps of the algorithm are:


A side-by-side comparison of the R-CNN and Fast R-CNN forward pipelines:


8.5.2 Training

The smooth L1 loss is insensitive to outliers (for large |x| it is a piecewise-linear rather than quadratic function):

$\text{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2 & |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$
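A small numpy sketch of smooth L1 under this definition:

```python
import numpy as np

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1, so outliers contribute less gradient."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)

print(smooth_l1([-3.0, -0.5, 0.0, 0.5, 3.0]))   # [2.5, 0.125, 0., 0.125, 2.5]
```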


8.5.3 Code in Practice

For the complete Fast R-CNN code, see rbgirshick/fast-rcnn

  1. // ------------------------------------------------------------------
  2. // Fast R-CNN
  3. // Copyright (c) 2015 Microsoft
  4. // Licensed under The MIT License [see fast-rcnn/LICENSE for details]
  5. // Written by Ross Girshick
  6. // ------------------------------------------------------------------
  7. #include <cfloat>
  8. #include "caffe/fast_rcnn_layers.hpp"
  9. using std::max;
  10. using std::min;
  11. namespace caffe {
  12. template <typename Dtype>
  13. // 以下参数解释以VGG16为例,即进入roi pooling前的网络结构采用经典VGG16.
  14. // 在Layer类中输入数据用bottom表示, 输出数据用top表示
  15. __global__ void ROIPoolForward(
  16. const int nthreads, // 任务数,对应通过roi pooling后的输出feature map的神经元节点总数,
  17. // 具体为:RoI的个数(m) × channel个数(VGG16的conv5_3的输出为512个) × roi pooling输出宽(配置为7) × roi pooling输出高(配置为7) = 25088×m个
  18. const Dtype* bottom_data, // 输入的feature map,原图经过各种卷积、pooling等前向传播后得到(VGG16的conv5_3卷积产生的feature map,大小为:512×14×14)
  19. const Dtype spatial_scale, // 由之前所有卷积层的strides相乘得到,在fast rcnn中为1/16,注:从原图往conv5_3的feature map上映射为缩小过程,所以乘以1/16,反之需要乘以16
  20. const int channels, // 输入层(VGG16为卷积层conv5_3)feature map的channel个数(512)
  21. const int height, // 输入层(VGG16为卷积层conv5_3)feature map的高(14)
  22. const int width, // 输入层(VGG16为卷积层conv5_3)feature map的宽(14)
  23. const int pooled_height, // roi pooling输出feature map的高,fast rcnn中配置为h=7
  24. const int pooled_width, // roi pooling输出feature map的宽,fast rcnn中配置为w=7
  25. const Dtype* bottom_rois, // 输入的roi信息,存储所有rois或一个batch的rois,数据结构为[batch_ind,x1,y1,x2,y2],包含roi的:索引、左上角坐标及右下角坐标
  26. Dtype* top_data, // 存储roi pooling后得到的feature map
  27. int* argmax_data) { // 为每个roi pooling后的feature map元素存储max pooling后对应conv5_3 feature map元素的索引信息,长度等于nthreads
  28. // index为线程索引,个数为roi pooling后的feature map上所有值的个数,索引范围为:[0,nthreads-1]
  29. CUDA_KERNEL_LOOP(index, nthreads) {
  30. // 该线程对应的top blob(N,C,H,W)中的W,输出roi pooling后feature map的中的宽的坐标,即feature map的第i=[0,k-1]列
  31. int pw = index % pooled_width;
  32. // 该线程对应的top blob(N,C,H,W)中的H,输出roi pooling后feature map的中的高的坐标,即feature map的第j=[0,k-1]行
  33. int ph = (index / pooled_width) % pooled_height;
  34. // 该线程对应的top blob(N,C,H,W)中的C,即第c个channel,channel数最大值为输入feature map的channel数(VGG16中为512).
  35. int c = (index / pooled_width / pooled_height) % channels;
  36. // 该线程对应的是第几个RoI,一共m个.
  37. int n = index / pooled_width / pooled_height / channels;
  38. // [start, end),指定RoI信息的存储范围,指针每次移动5的倍数是因为包含信息的数据结构大小为5,包含信息为:[batch_ind,x1,y1,x2,y2],含义同上
  39. bottom_rois += n * 5;
  40. // 将每个原图的RoI区域映射到feature map(VGG16为conv5_3产生的feature mao)上的坐标,bottom_rois第0个位置存放的是roi索引.
  41. int roi_batch_ind = bottom_rois[0];
  42. // 原图到feature map的映射为乘以1/16,这里采用粗映射而不是上文讲的精确映射,原因你懂的.
  43. int roi_start_w = round(bottom_rois[1] * spatial_scale);
  44. int roi_start_h = round(bottom_rois[2] * spatial_scale);
  45. int roi_end_w = round(bottom_rois[3] * spatial_scale);
  46. int roi_end_h = round(bottom_rois[4] * spatial_scale);
  47. // 强制把RoI的宽和高限制在1x1,防止出现映射后的RoI大小为0的情况
  48. int roi_width = max(roi_end_w - roi_start_w + 1, 1);
  49. int roi_height = max(roi_end_h - roi_start_h + 1, 1);
  50. // 根据原图映射得到的roi的高和配置的roi pooling的高(这里大小配置为7)自适应计算bin桶的高度
  51. Dtype bin_size_h = static_cast<Dtype>(roi_height)
  52. / static_cast<Dtype>(pooled_height);
  53. // 根据原图映射得到的roi的宽和配置的roi pooling的宽(这里大小配置为7)自适应计算bin桶的宽度
  54. Dtype bin_size_w = static_cast<Dtype>(roi_width)
  55. / static_cast<Dtype>(pooled_width);
  56. // 计算第(i,j)个bin桶在feature map上的坐标范围,需要依据它们确定后续max pooling的范围
  57. int hstart = static_cast<int>(floor(static_cast<Dtype>(ph)
  58. * bin_size_h));
  59. int wstart = static_cast<int>(floor(static_cast<Dtype>(pw)
  60. * bin_size_w));
  61. int hend = static_cast<int>(ceil(static_cast<Dtype>(ph + 1)
  62. * bin_size_h));
  63. int wend = static_cast<int>(ceil(static_cast<Dtype>(pw + 1)
  64. * bin_size_w));
  65. // 确定max pooling具体范围,注意由于RoI取自原图,其左上角不是从(0,0)开始,
  66. // 所以需要加上 roi_start_h 或 roi_start_w作为偏移量,并且超出feature map尺寸范围的部分会被舍弃
  67. hstart = min(max(hstart + roi_start_h, 0), height);
  68. hend = min(max(hend + roi_start_h, 0), height);
  69. wstart = min(max(wstart + roi_start_w, 0), width);
  70. wend = min(max(wend + roi_start_w, 0), width);
  71. bool is_empty = (hend <= hstart) || (wend <= wstart);
  72. // 如果区域为0返回错误代码
  73. Dtype maxval = is_empty ? 0 : -FLT_MAX;
  74. // If nothing is pooled, argmax = -1 causes nothing to be backprop'd
  75. int maxidx = -1;
  76. bottom_data += (roi_batch_ind * channels + c) * height * width;
  77. // 在给定bin桶的区域中做max pooling
  78. for (int h = hstart; h < hend; ++h) {
  79. for (int w = wstart; w < wend; ++w) {
  80. int bottom_index = h * width + w;
  81. if (bottom_data[bottom_index] > maxval) {
  82. maxval = bottom_data[bottom_index];
  83. maxidx = bottom_index;
  84. }
  85. }
  86. }
  87. // 为某个roi pooling的feature map元素记录其由对conv5_3(VGG16)的feature map做max pooling后产生元素的索引号及值
  88. top_data[index] = maxval;
  89. argmax_data[index] = maxidx;
  90. }
  91. }
  92. template <typename Dtype>
  93. void ROIPoolingLayer<Dtype>::Forward_gpu(
  94. const vector<Blob<Dtype>*>& bottom, // 以VGG16为例,bottom[0]为最后一个卷积层conv5_3产生的feature map,shape[1, 512, 14, 14],
  95. // bottom[1]为rois数据,shape[roi个数m, 5]
  96. const vector<Blob<Dtype>*>& top) { // top为输出层结构, top->count() = top.n(RoI的个数) × top.channel(channel数)
  97. // × top.w(输出feature map的宽) × top.h(输出feature map的高)
  98. const Dtype* bottom_data = bottom[0]->gpu_data();
  99. const Dtype* bottom_rois = bottom[1]->gpu_data();
  100. Dtype* top_data = top[0]->mutable_gpu_data();
  101. int* argmax_data = max_idx_.mutable_gpu_data();
  102. int count = top[0]->count();
  103. /*
  104. 参照caffe-fast-rcnn/src/caffe/layers/roi_pooling_layer.cpp中的代码:
  105. template <typename Dtype>
  106. void ROIPoolingLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
  107. const vector<Blob<Dtype>*>& top) {
  108. channels_ = bottom[0]->channels();
  109. height_ = bottom[0]->height();
  110. width_ = bottom[0]->width();
  111. top[0]->Reshape(bottom[1]->num(), channels_, pooled_height_, pooled_width_);
  112. max_idx_.Reshape(bottom[1]->num(), channels_, pooled_height_, pooled_width_);
  113. }*/
  114. /*
  115. 参照caffe-fast-rcnn/include/caffe/util/device_alternate.hpp中的代码:
  116. // CUDA_KERNEL_LOOP
  117. #define CUDA_KERNEL_LOOP(i, n) \
  118. for (int i = blockIdx.x * blockDim.x + threadIdx.x; \
  119. i < (n); \
  120. i += blockDim.x * gridDim.x)
  121. // CAFFE_GET_BLOCKS
  122. // CUDA: number of blocks for threads.
  123. inline int CAFFE_GET_BLOCKS(const int N) {
  124. return (N + CAFFE_CUDA_NUM_THREADS - 1) / CAFFE_CUDA_NUM_THREADS;
  125. }
  126. // CAFFE_CUDA_NUM_THREADS
  127. // CUDA: thread number configuration.
  128. // Use 1024 threads per block, which requires cuda sm_2x or above,
  129. // or fall back to attempt compatibility (best of luck to you).
  130. #if __CUDA_ARCH__ >= 200
  131. const int CAFFE_CUDA_NUM_THREADS = 1024;
  132. #else
  133. const int CAFFE_CUDA_NUM_THREADS = 512;
  134. #endif
  135. */
  136. ROIPoolForward<Dtype><<<CAFFE_GET_BLOCKS(count), CAFFE_CUDA_NUM_THREADS>>>(
  137. count, bottom_data, spatial_scale_, channels_, height_, width_,
  138. pooled_height_, pooled_width_, bottom_rois, top_data, argmax_data);
  139. CUDA_POST_KERNEL_CHECK;
  140. }
  141. template <typename Dtype>
  142. // 反向传播的过程与论文中"Back-propagation through RoI pooling layers"这一小节的公式完全一致
  143. __global__ void ROIPoolBackward(
  144. const int nthreads, // 输入feature map的元素数(VGG16为:512×14×14)
  145. const Dtype* top_diff, // roi pooling输出feature map所带的梯度信息∂L/∂y(r,j)
  146. const int* argmax_data, // 同前向,不解释
  147. const int num_rois, // 同前向,不解释
  148. const Dtype spatial_scale, // 同前向,不解释
  149. const int channels, // 同前向,不解释
  150. const int height, // 同前向,不解释
  151. const int width, // 同前向,不解释
  152. const int pooled_height, // 同前向,不解释
  153. const int pooled_width, // 同前向,不解释
  154. Dtype* bottom_diff, // 保留输入feature map每个元素通过梯度反向传播得到的梯度信息
  155. const Dtype* bottom_rois) { // 同前向,不解释
  156. // 含义同前向,需要注意的是这里表示的是输入feature map的元素数(反向传播嘛)
  157. CUDA_KERNEL_LOOP(index, nthreads) {
  158. // 同前向,不解释
  159. int w = index % width;
  160. int h = (index / width) % height;
  161. int c = (index / width / height) % channels;
  162. int n = index / width / height / channels;
  163. Dtype gradient = 0;
  164. // 同论文中公式,任何一个输入feature map的元素的梯度信息为:
  165. // 所有max pooling时被该元素落入且该元素值被选中(最大值)的
  166. // roi pooling feature map元素的梯度信息累加和
  167. // 遍历所有RoI,以判断是否满足上述条件
  168. for (int roi_n = 0; roi_n < num_rois; ++roi_n) {
  169. const Dtype* offset_bottom_rois = bottom_rois + roi_n * 5;
  170. int roi_batch_ind = offset_bottom_rois[0];
  171. // 如果RoI的索引号不满足条件则跳过
  172. if (n != roi_batch_ind) {
  173. continue;
  174. }
  175. // 找原图RoI在feature map上的映射位置,解释同前向传播
  176. int roi_start_w = round(offset_bottom_rois[1] * spatial_scale);
  177. int roi_start_h = round(offset_bottom_rois[2] * spatial_scale);
  178. int roi_end_w = round(offset_bottom_rois[3] * spatial_scale);
  179. int roi_end_h = round(offset_bottom_rois[4] * spatial_scale);
  180. // (h,w)不在RoI范围则跳过
  181. const bool in_roi = (w >= roi_start_w && w <= roi_end_w &&
  182. h >= roi_start_h && h <= roi_end_h);
  183. if (!in_roi) {
  184. continue;
  185. }
  186. int offset = (roi_n * channels + c) * pooled_height * pooled_width;
  187. const Dtype* offset_top_diff = top_diff + offset;
  188. const int* offset_argmax_data = argmax_data + offset;
  189. // 同前向
  190. int roi_width = max(roi_end_w - roi_start_w + 1, 1);
  191. int roi_height = max(roi_end_h - roi_start_h + 1, 1);
  192. // 同前向
  193. Dtype bin_size_h = static_cast<Dtype>(roi_height)
  194. / static_cast<Dtype>(pooled_height);
  195. Dtype bin_size_w = static_cast<Dtype>(roi_width)
  196. / static_cast<Dtype>(pooled_width);
  197. // 类比前向,看做一个逆过程
  198. int phstart = floor(static_cast<Dtype>(h - roi_start_h) / bin_size_h);
  199. int phend = ceil(static_cast<Dtype>(h - roi_start_h + 1) / bin_size_h);
  200. int pwstart = floor(static_cast<Dtype>(w - roi_start_w) / bin_size_w);
  201. int pwend = ceil(static_cast<Dtype>(w - roi_start_w + 1) / bin_size_w);
  202. phstart = min(max(phstart, 0), pooled_height);
  203. phend = min(max(phend, 0), pooled_height);
  204. pwstart = min(max(pwstart, 0), pooled_width);
  205. pwend = min(max(pwend, 0), pooled_width);
  206. // 累积所有与当前输入feature map上的元素相关的roi pooling元素的梯度信息
  207. for (int ph = phstart; ph < phend; ++ph) {
  208. for (int pw = pwstart; pw < pwend; ++pw) {
  209. if (offset_argmax_data[ph * pooled_width + pw] == (h * width + w)) {
  210. gradient += offset_top_diff[ph * pooled_width + pw];
  211. }
  212. }
  213. }
  214. }
  215. // 存储当前输入feature map上元素的反向传播梯度信息
  216. bottom_diff[index] = gradient;
  217. }
  218. }
  219. template <typename Dtype>
  220. void ROIPoolingLayer<Dtype>::Backward_gpu(
  221. const vector<Blob<Dtype>*>& top, // roi pooling输出feature map
  222. const vector<bool>& propagate_down, // 是否做反向传播,回忆前向传播时的那个bool值
  223. const vector<Blob<Dtype>*>& bottom) { // roi pooling输入feature map(VGG16中的conv5_3产生的feature map)
  224. if (!propagate_down[0]) {
  225. return;
  226. }
  227. const Dtype* bottom_rois = bottom[1]->gpu_data(); // 原始RoI信息
  228. const Dtype* top_diff = top[0]->gpu_diff(); // roi pooling feature map梯度信息
  229. Dtype* bottom_diff = bottom[0]->mutable_gpu_diff(); // 待写入的输入feature map梯度信息
  230. const int count = bottom[0]->count(); // 输入feature map元素总数
  231. caffe_gpu_set(count, Dtype(0.), bottom_diff);
  232. const int* argmax_data = max_idx_.gpu_data();
  233. // NOLINT_NEXT_LINE(whitespace/operators)
  234. ROIPoolBackward<Dtype><<<CAFFE_GET_BLOCKS(count), CAFFE_CUDA_NUM_THREADS>>>(
  235. count, top_diff, argmax_data, top[0]->num(), spatial_scale_, channels_,
  236. height_, width_, pooled_height_, pooled_width_, bottom_diff, bottom_rois);
  237. CUDA_POST_KERNEL_CHECK;
  238. }
  239. INSTANTIATE_LAYER_GPU_FUNCS(ROIPoolingLayer);
  240. } // namespace caffe

Reference implementations: GPU version roi_pooling_layer.cu and CPU version roi_pooling_layer.cpp

Configuration of conv5_3 and the RoI-related layers:

layer {
  name: "conv5_3"
  type: "Convolution"
  bottom: "conv5_2"
  top: "conv5_3"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 512
    pad: 1
    kernel_size: 3
  }
}
layer {
  name: "relu5_3"
  type: "ReLU"
  bottom: "conv5_3"
  top: "conv5_3"
}
layer {
  name: "roi_pool5"
  type: "ROIPooling"
  bottom: "conv5_3"
  bottom: "rois"
  top: "pool5"
  roi_pooling_param {
    pooled_w: 7
    pooled_h: 7
    spatial_scale: 0.0625 # 1/16
  }
}

8.6 Faster R-CNN

"Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks" introduced the Region Proposal Network (RPN), removing the need for region-based detectors to generate candidate boxes beforehand with Selective Search. Proposal generation, classification and bounding-box regression now share the same feature extraction network, making this family of detectors truly end to end.

8.6.1 Algorithm Overview

As noted above, Faster R-CNN introduces the RPN so that proposal generation shares the feature extraction network. The pipeline is:


The RPN generates the proposals; the rest is similar to Fast R-CNN. Again, the scanning for proposals happens on the feature map of the last convolutional layer (not on the original image), and any point of the feature map can be mapped back to the original image using the coordinate transform described earlier.

8.6.2 RPN

The structure of the RPN is:


1. The input to the RPN is the feature map produced by the last convolutional (pooling) layer of the feature extractor; for VGG16 this is the 512-channel feature map from conv5_3 (the figure uses 256 channels as an example).
2. An m x m sliding window then scans the feature map; if the feature map is h x w, the window is applied h x w times (once centred on each pixel). The paper uses m = 3; the value depends on the network structure, and different receptive fields lead to different initial proposal sizes.
3. Each sliding-window position generates k initial candidate boxes whose sizes come from the anchors (explained in 8.6.3) and whose centre is the centre of the sliding window, so for one window position all anchor-generated boxes share the same centre (the blue point in the figure). Note: the anchors and the boxes generated from them are positions in the original image.
4. Two branches follow. The first (left) is a binary classifier that decides whether a candidate box contains an object; with k anchor-generated boxes it outputs 2k values (a 2-vector per box: [probability of object, probability of background]). The second (right) is a regressor for the box centre, width and height (a 4-vector [x, y, w, h] per box); with k boxes it outputs 4k values. Proposal generation here is meant to be quick and rough; the fine-grained selection is left to the later network (see the sketch below for the resulting output sizes).
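To make those output sizes concrete, here is a back-of-the-envelope sketch; the 38 x 50 feature map is an illustrative assumption (a roughly 600 x 800 input downsampled by the accumulated stride of 16), and k = 9 as in the paper.

```python
H, W, k = 38, 50, 9
cls_channels = 2 * k        # [object, background] score pair per anchor
reg_channels = 4 * k        # (x, y, w, h) per anchor
num_candidates = H * W * k  # one set of k anchors per sliding-window position
print(cls_channels, reg_channels, num_candidates)  # 18 36 17100
```

These channel counts match the num_output values (18 and 36) of the rpn_cls_score and rpn_bbox_pred layers in the prototxt shown in 8.6.4.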

8.6.3 Anchor

A key concept in the RPN is the anchor, which can be understood as a template for generating candidate boxes; it is generated only once. Anchors take the original image as the frame of reference: starting from the tuple (0, 0, specified width, specified height), a set of box templates is produced by applying different aspect ratios and scales, and candidate boxes are then obtained by moving these templates to each sliding-window centre (x, y). One can also view this as the inverse of SPP: SPP turns one feature map into a pyramid of feature maps at multiple scales, and conversely any feature map can be expanded into multiple scales. The benefit is that no multi-scale resizing ever has to be done on the original image; everything happens on the feature map. This construction is also translation invariant (translation-invariant anchors): proposal generation and its prediction functions are reproducible. By contrast, if, say, 800 anchors were obtained by k-means clustering, repeating the experiment would not necessarily reproduce the same 800. This property also reduces model size and the risk of overfitting.

Take a base size of 16x16, i.e. the base anchor [0, 0, 15, 15], as an example:
1. Candidate boxes generated using only _ratio_enum:


2. Candidate boxes generated using only _scale_enum:


3. Candidate boxes generated using both:
This template generation is done only once; afterwards, everything works by shifting the templates to new centre points. (All other pixels have positive coordinates.)


Reference code, generate_anchors.py:

  1. # --------------------------------------------------------
  2. # Faster R-CNN
  3. # Copyright (c) 2015 Microsoft
  4. # Licensed under The MIT License [see LICENSE for details]
  5. # Written by Ross Girshick and Sean Bell
  6. # --------------------------------------------------------
  7. import numpy as np
  8. # Verify that we compute the same anchors as Shaoqing's matlab implementation:
  9. #
  10. # >> load output/rpn_cachedir/faster_rcnn_VOC2007_ZF_stage1_rpn/anchors.mat
  11. # >> anchors
  12. #
  13. # anchors =
  14. #
  15. # -83 -39 100 56
  16. # -175 -87 192 104
  17. # -359 -183 376 200
  18. # -55 -55 72 72
  19. # -119 -119 136 136
  20. # -247 -247 264 264
  21. # -35 -79 52 96
  22. # -79 -167 96 184
  23. # -167 -343 184 360
  24. #array([[ -83., -39., 100., 56.],
  25. # [-175., -87., 192., 104.],
  26. # [-359., -183., 376., 200.],
  27. # [ -55., -55., 72., 72.],
  28. # [-119., -119., 136., 136.],
  29. # [-247., -247., 264., 264.],
  30. # [ -35., -79., 52., 96.],
  31. # [ -79., -167., 96., 184.],
  32. # [-167., -343., 184., 360.]])
  33. # 生成多尺度anchors,默认实现是大小为16,起始anchor位置是(0, 0, 15, 15)[左下角和右上角坐标],宽高比例为1/2,1,2,尺度缩放倍数为8,16,32。
  34. def generate_anchors(base_size=16, ratios=[0.5, 1, 2],
  35. scales=2**np.arange(3, 6)):
  36. """
  37. Generate anchor (reference) windows by enumerating aspect ratios X
  38. scales wrt a reference (0, 0, 15, 15) window.
  39. """
  40. # 生成起始anchor位置是(0, 0, 15, 15)
  41. base_anchor = np.array([1, 1, base_size, base_size]) - 1
  42. # 枚举1/2,1,2三种宽高缩放比例
  43. ratio_anchors = _ratio_enum(base_anchor, ratios)
  44. # 在以上比例的基础上做8,16,32三类尺度缩放,最终生成9个anchor。
  45. anchors = np.vstack([_scale_enum(ratio_anchors[i, :], scales)
  46. for i in xrange(ratio_anchors.shape[0])])
  47. return anchors
  48. # 对给定anchor返回宽、高和中心点坐标(anchor存储的是左下角和右上角)
  49. def _whctrs(anchor):
  50. """
  51. Return width, height, x center, and y center for an anchor (window).
  52. """
  53. w = anchor[2] - anchor[0] + 1
  54. h = anchor[3] - anchor[1] + 1
  55. x_ctr = anchor[0] + 0.5 * (w - 1)
  56. y_ctr = anchor[1] + 0.5 * (h - 1)
  57. return w, h, x_ctr, y_ctr
  58. # 给定宽、高和中心点,输出anchor的左下角和右上角坐标
  59. def _mkanchors(ws, hs, x_ctr, y_ctr):
  60. """
  61. Given a vector of widths (ws) and heights (hs) around a center
  62. (x_ctr, y_ctr), output a set of anchors (windows).
  63. """
  64. ws = ws[:, np.newaxis]
  65. hs = hs[:, np.newaxis]
  66. anchors = np.hstack((x_ctr - 0.5 * (ws - 1),
  67. y_ctr - 0.5 * (hs - 1),
  68. x_ctr + 0.5 * (ws - 1),
  69. y_ctr + 0.5 * (hs - 1)))
  70. return anchors
  71. # 枚举anchor的三种宽高比 1:2,1:1,2:1
  72. def _ratio_enum(anchor, ratios):
  73. """
  74. Enumerate a set of anchors for each aspect ratio wrt an anchor.
  75. """
  76. w, h, x_ctr, y_ctr = _whctrs(anchor)
  77. size = w * h
  78. size_ratios = size / ratios
  79. ws = np.round(np.sqrt(size_ratios))
  80. hs = np.round(ws * ratios)
  81. anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
  82. return anchors
  83. # 枚举anchor的各种尺度,如:anchor为[0 0 15 15],尺度为[8 16 32]
  84. def _scale_enum(anchor, scales):
  85. """
  86. Enumerate a set of anchors for each scale wrt an anchor.
  87. """
  88. w, h, x_ctr, y_ctr = _whctrs(anchor)
  89. ws = w * scales
  90. hs = h * scales
  91. anchors = _mkanchors(ws, hs, x_ctr, y_ctr)
  92. return anchors
  93. if __name__ == '__main__':
  94. import time
  95. t = time.time()
  96. a = generate_anchors()
  97. print time.time() - t
  98. print a
  99. from IPython import embed; embed()

8.6.4 Code in Practice

This section focuses on the implementation of the RPN proposal layers, taking VGG16 as the feature extraction network on the pascal_voc dataset as an example.

  1. layer {
  2. name: "rpn_conv/3x3"
  3. type: "Convolution"
  4. bottom: "conv5_3"
  5. top: "rpn/output"
  6. param { lr_mult: 1.0 }
  7. param { lr_mult: 2.0 }
  8. convolution_param {
  9. num_output: 512
  10. kernel_size: 3 pad: 1 stride: 1
  11. weight_filler { type: "gaussian" std: 0.01 }
  12. bias_filler { type: "constant" value: 0 }
  13. }
  14. }
  15. layer {
  16. name: "rpn_relu/3x3"
  17. type: "ReLU"
  18. bottom: "rpn/output"
  19. top: "rpn/output"
  20. }
  21. layer {
  22. name: "rpn_cls_score"
  23. type: "Convolution"
  24. bottom: "rpn/output"
  25. top: "rpn_cls_score"
  26. param { lr_mult: 1.0 }
  27. param { lr_mult: 2.0 }
  28. convolution_param {
  29. num_output: 18 # 2(bg/fg) * 9(anchors)
  30. kernel_size: 1 pad: 0 stride: 1
  31. weight_filler { type: "gaussian" std: 0.01 }
  32. bias_filler { type: "constant" value: 0 }
  33. }
  34. }
  35. layer {
  36. name: "rpn_bbox_pred"
  37. type: "Convolution"
  38. bottom: "rpn/output"
  39. top: "rpn_bbox_pred"
  40. param { lr_mult: 1.0 }
  41. param { lr_mult: 2.0 }
  42. convolution_param {
  43. num_output: 36 # 4 * 9(anchors)
  44. kernel_size: 1 pad: 0 stride: 1
  45. weight_filler { type: "gaussian" std: 0.01 }
  46. bias_filler { type: "constant" value: 0 }
  47. }
  48. }
  49. layer {
  50. bottom: "rpn_cls_score"
  51. top: "rpn_cls_score_reshape"
  52. name: "rpn_cls_score_reshape"
  53. type: "Reshape"
  54. reshape_param { shape { dim: 0 dim: 2 dim: -1 dim: 0 } }
  55. }
  56. layer {
  57. name: 'rpn-data'
  58. type: 'Python'
  59. bottom: 'rpn_cls_score'
  60. bottom: 'gt_boxes'
  61. bottom: 'im_info'
  62. bottom: 'data'
  63. top: 'rpn_labels'
  64. top: 'rpn_bbox_targets'
  65. top: 'rpn_bbox_inside_weights'
  66. top: 'rpn_bbox_outside_weights'
  67. python_param {
  68. module: 'rpn.anchor_target_layer'
  69. layer: 'AnchorTargetLayer'
  70. param_str: "'feat_stride': 16"
  71. }
  72. }
  73. layer {
  74. name: "rpn_loss_cls"
  75. type: "SoftmaxWithLoss"
  76. bottom: "rpn_cls_score_reshape"
  77. bottom: "rpn_labels"
  78. propagate_down: 1
  79. propagate_down: 0
  80. top: "rpn_cls_loss"
  81. loss_weight: 1
  82. loss_param {
  83. ignore_label: -1
  84. normalize: true
  85. }
  86. }
  87. layer {
  88. name: "rpn_loss_bbox"
  89. type: "SmoothL1Loss"
  90. bottom: "rpn_bbox_pred"
  91. bottom: "rpn_bbox_targets"
  92. bottom: 'rpn_bbox_inside_weights'
  93. bottom: 'rpn_bbox_outside_weights'
  94. top: "rpn_loss_bbox"
  95. loss_weight: 1
  96. smooth_l1_loss_param { sigma: 3.0 }
  97. }
def setup(self, bottom, top):
    # parse the layer parameter string, which must be valid YAML
    layer_params = yaml.load(self.param_str_)
    # product of the strides of all feature-extraction layers (16 for VGG)
    self._feat_stride = layer_params['feat_stride']
    # default scale factors: 8, 16, 32
    anchor_scales = layer_params.get('scales', (8, 16, 32))
    # generate the anchor templates with the method described above
    self._anchors = generate_anchors(scales=np.array(anchor_scales))
    # number of anchors (e.g. 9)
    self._num_anchors = self._anchors.shape[0]
    if DEBUG:
        print 'feat_stride: {}'.format(self._feat_stride)
        print 'anchors:'
        print self._anchors
    # rois blob: holds R regions of interest, each is a 5-tuple
    # (n, x1, y1, x2, y2) specifying an image batch index n and a
    # rectangle (x1, y1, x2, y2)
    top[0].reshape(1, 5)
    # scores blob: holds scores for R regions of interest
    if len(top) > 1:
        top[1].reshape(1, 1, 1, 1)

Generating anchors centred at a point i from the templates works as follows (blue: the templates; red: anchors generated with i as the centre):


In implementation, the coordinates of centre point i are simply added to each coordinate of the anchor templates (the templates are centred at 0); the code looks like:

A = self._num_anchors
K = shifts.shape[0]
anchors = self._anchors.reshape((1, A, 4)) + \
          shifts.reshape((1, K, 4)).transpose((1, 0, 2))
anchors = anchors.reshape((K * A, 4))

8.6.5 Faster R-CNN Training Procedure

Training uses four-step alternating training (4-Step Alternating Training):
1. Initialize with ImageNet pre-trained weights and fine-tune an RPN;
2. Initialize with ImageNet pre-trained weights and train a separate Fast R-CNN detection model using the proposals produced in step 1 (no convolutional layers are shared at this point);
3. Build a new RPN initialized with the Fast R-CNN weights from step 2, freeze the layers shared by the RPN and Fast R-CNN, and fine-tune only the RPN-specific layers, so that the two now share the feature-extraction convolutions;
4. Keeping the shared convolutional layers fixed, train only the Fast R-CNN-specific layers.
Faster R-CNN is among the most accurate detection and classification models, but using it for real-time detection or deploying it on the client side requires extensive model pruning, compression and optimization, which I will cover later. Our own work is still preliminary: the model is compressed to about 10 MB with an accuracy loss under 1.5%, and online inference takes about 20 ms per request for images of roughly 500 KB on a single K80 GPU (under high concurrency, throughput is raised by batching requests and other techniques).
An unoptimized car-detection demo:




8.6.6 Faster R-CNN with Caffe

Source code: Faster R-CNN (rbgirshick's version). One Caffe pitfall (in my view an architectural design flaw that TensorFlow does not have): because custom layers and similar extensions are needed, everyone's Caffe may differ, so be careful when building. Here the Caffe checkout must be at 0dcd397, otherwise compilation fails, because it contains the custom proposal layer and its parameters.
The directory layout is:



Building and running Caffe and Faster R-CNN on CentOS 7

8. Install gflags

git clone https://github.com/gflags/gflags
cd gflags
mkdir build && cd build
export CXXFLAGS="-fPIC" && cmake ..
make VERBOSE=1 -j
sudo make install

9. Install glog

git clone https://github.com/google/glog
cd glog
./autogen.sh && ./configure && make && make install

10. Install lmdb

git clone https://github.com/LMDB/lmdb
cd lmdb/libraries/liblmdb
make -j
sudo make install

11. Install hdf5

wget https://support.hdfgroup.org/ftp/HDF5/current18/src/hdf5-1.8.19.tar.gz
tar -xvf hdf5-1.8.19.tar.gz
cd hdf5-1.8.19
./configure --prefix=/usr/local
make -j
sudo make install

12. Install leveldb

git clone https://github.com/google/leveldb
cd leveldb
make -j
sudo cp out-shared/libleveldb.so* /usr/local/lib
sudo cp out-static/*.a /usr/local/lib
sudo cp -r include/* /usr/local/include

1. Download the source code

cd py-faster-rcnn
git clone https://github.com/rbgirshick/caffe-fast-rcnn.git

Check that src/caffe/proto/caffe.proto is consistent with the attached caffe.proto file (54.1 kB).

2. Modify the configuration

cd caffe-fast-rcnn
cp Makefile.config.example Makefile.config
vim Makefile.config

Change the following settings:
1) Set CUDA_DIR, e.g.: CUDA_DIR := /usr/local/cuda
2) BLAS := open
3) WITH_PYTHON_LAYER := 1

3. Build caffe-fast-rcnn

make clean
make all -j
make test -j
make runtest -j
make pycaffe -j

4. Build the py-faster-rcnn lib

cd py-faster-rcnn/lib/
make

5. Configure environment variables
vim ~/.bashrc

export PYTHONPATH=/data/liyiran/py-R-FCN/tools/python:$PYTHONPATH
source ~/.bashrc

1. Download the pascal_voc dataset

cd py-faster-rcnn/data
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
tar -xvf VOCtrainval_06-Nov-2007.tar
mv VOCdevkit VOCdevkit2007

2. Download the pre-trained models

cd py-faster-rcnn/model
wget https://dl.dropboxusercontent.com/s/gstw7122padlf0l/imagenet_models.tgz?dl=0

3. Train VGG16 on pascal_voc 2007

sh experiments/scripts/faster_rcnn_end2end.sh 1 VGG16 pascal_voc

8.7 R-FCN

Recall that all region-based detectors so far share one trait: the network splits into a fully convolutional, region-independent subnetwork whose computation is shared, and a region-dependent subnetwork after RoI pooling whose computation is not shared (such as the RPN and the bounding-box regression head). Recall also that modern classification networks, especially the ResNet and GoogLeNet families, are essentially fully convolutional and already perform extremely well on classification; yet when used directly for detection the results are often poor, sometimes worse than VGG-16. The reason is clear: classification tends to ignore position; it only has to decide whether an object is present, so the extracted features should be translation invariant, robust to scaling and shifts, and convolution and pooling preserve this property well, with deeper networks becoming ever less sensitive to position. Detection, however, also needs features that capture position sharply, i.e. translation variance, which creates a conflict. Layers such as RoI pooling were inserted partly so that images of any size can be fed in but, more importantly, to compensate for the missing position information, which is why detection accuracy improved markedly. The side effect is that every region after RoI pooling must pass through the subsequent subnetwork separately, so computation is not shared and training and inference are slow. To address this, Jifeng Dai, Kaiming He and colleagues proposed the framework in "R-FCN: Object Detection via Region-based Fully Convolutional Networks", which replaces RoI pooling with position-sensitive RoI pooling. All computation is shared, translation invariance and translation variance are traded off nicely, and since the network is fully convolutional, training and inference are faster.
Using ResNet-101 as an example (image source):


8.7.1 Algorithm Overview

1. Core idea
As noted above, the heart of the algorithm is the position-sensitive RoI pooling layer, whose core idea is the following:


The feature maps here come from the fully convolutional feature extraction subnetwork that used to precede RoI pooling. The coloured cube that follows is the position-sensitive feature map: an ordinary convolutional layer whose weights are corrected during back-propagation through the position-sensitive RoI pooling layer. Suppose the position-sensitive (hereafter ps) feature map grid is k x k and there are C+1 detection classes (the +1 being background); then the ps feature map has k x k x (C+1) channels. With k = 3, each class has k x k = 9 feature maps, each encoding one position (top-left, top-centre, top-right, ..., bottom-right, shown in different colours). After ps RoI pooling, every RoI yields a k x k grid for each of the C+1 classes; each grid cell is scored and all cells then vote together. The result is a (C+1)-dimensional vector, which is fed into a softmax for classification.

2. Overall structure
Including the RPN subnetwork, the overall structure looks like this:


The regression branch works analogously: each bounding-box coordinate group (top-left corner, width and height) is treated as a "class", so its ps feature map has k x k x 4 channels.

3. Position-sensitive feature map
Taking ResNet-101 as the backbone, the following structural changes are made:

To encode position information explicitly, suppose the ps feature map grid is $k \times k$ and the RoI has size $w \times h$; then each bin is roughly $\frac{w}{k} \times \frac{h}{k}$. For the $(i, j)$-th bin ($0 \le i, j \le k-1$), position-sensitive RoI pooling computes

$r_c(i, j \mid \Theta) = \frac{1}{n} \sum_{(x, y) \in bin(i, j)} z_{i, j, c}(x + x_0, y + y_0 \mid \Theta)$

where:

  • $r_c(i, j)$ is the pooled response of class $c$ in bin $(i, j)$;
  • $z_{i, j, c}$ is one of the $k \times k \times (C+1)$ position-sensitive feature maps;
  • $(x_0, y_0)$ is the top-left corner of the RoI;
  • $n$ is the number of pixels in the bin;
  • $\Theta$ denotes all learnable parameters of the network;
  • $x$ and $y$ range over $\lfloor i\frac{w}{k} \rfloor \le x < \lceil (i+1)\frac{w}{k} \rceil$ and $\lfloor j\frac{h}{k} \rfloor \le y < \lceil (j+1)\frac{h}{k} \rceil$;
  • the pooling can be average, max, or any other custom operation.
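Below is a minimal numpy sketch of position-sensitive RoI pooling for the classification branch. It assumes the RoI has already been mapped onto the feature map as (x0, y0, w, h), lies inside the map with w, h >= k, and that the channels use a class-major layout (c * k + i) * k + j for class c, bin row i and bin column j, as in the Caffe kernel shown in 8.7.4; the function name is illustrative.

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k, num_classes):
    """score_maps: [k*k*num_classes, H, W]; roi: (x0, y0, w, h) on the feature map."""
    x0, y0, w, h = roi
    scores = np.zeros(num_classes)
    for c in range(num_classes):
        votes = []
        for i in range(k):                      # bin row
            for j in range(k):                  # bin column
                ch = (c * k + i) * k + j        # position-sensitive channel for this bin
                hs = y0 + int(np.floor(i * h / k))
                he = y0 + int(np.ceil((i + 1) * h / k))
                ws = x0 + int(np.floor(j * w / k))
                we = x0 + int(np.ceil((j + 1) * w / k))
                votes.append(score_maps[ch, hs:he, ws:we].mean())  # average pool in the bin
        scores[c] = np.mean(votes)              # vote: average the k*k bin responses
    return scores                               # feed into a softmax for classification

k, C = 3, 20
maps = np.random.rand(k * k * (C + 1), 14, 14)
print(ps_roi_pool(maps, roi=(2, 3, 9, 9), k=k, num_classes=C + 1).shape)   # (21,)
```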

4. Loss function
The loss is the sum of a classification term and a regression term:

$L(s, t_{x, y, w, h}) = L_{cls}(s_{c^*}) + \lambda\, [c^* > 0]\, L_{reg}(t, t^*)$

where:

  • $c^*$ is the ground-truth class label of the RoI, with $c^* = 0$ denoting the background class;
  • $L_{cls}(s_{c^*}) = -\log(s_{c^*})$ is the cross-entropy loss;
  • $L_{reg}$ is the bounding-box regression loss, defined as in Fast R-CNN (smooth L1), with $t^*$ the ground-truth box and $[c^* > 0]$ an indicator that enables regression only for foreground RoIs.

5. Visualization
A positive example:


A negative example:


8.7.2 Position-sensitive RoI Pooling


8.7.3 Model Training

1. Training with Online Hard Example Mining (OHEM)
OHEM is a boosting-style strategy that makes training more efficient. In short, instead of plain random sampling, it suppresses examples the model already finds easy and repeatedly feeds back the examples it finds hard.
In detection, a positive example is an RoI whose IoU with a ground-truth box is at least 0.5; all other RoIs are negatives. OHEM is applied roughly as follows (see the sketch below): run the forward pass for all proposals, compute each RoI's loss, sort by loss, and let only the hardest (highest-loss) RoIs contribute to the backward pass.
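A minimal sketch of the OHEM selection step; the per-RoI losses are assumed to be already computed in the forward pass, and B = 128 hard RoIs is an illustrative, configurable choice.

```python
import numpy as np

def select_hard_examples(roi_losses, B=128):
    """Return indices of the B highest-loss RoIs; only these are back-propagated."""
    order = np.argsort(-np.asarray(roi_losses))   # sort RoIs by loss, descending
    return order[:B]

losses = np.random.rand(300)                      # e.g. 300 proposals per image
hard_idx = select_hard_examples(losses, B=128)
```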

2. Training parameters

8.7.4 Code in Practice

The source is available at py-R-FCN; you also need to download the R-FCN branch of Caffe. The build procedure is similar to Faster R-CNN, and the directory layout is similar:


  1. // ------------------------------------------------------------------
  2. // R-FCN
  3. // Copyright (c) 2016 Microsoft
  4. // Licensed under The MIT License [see r-fcn/LICENSE for details]
  5. // Written by Yi Li
  6. // ------------------------------------------------------------------
  7. #include <cfloat>
  8. #include "caffe/rfcn_layers.hpp"
  9. #include "caffe/util/gpu_util.cuh"
  10. using std::max;
  11. using std::min;
  12. namespace caffe {
  13. template <typename Dtype>
  14. __global__ void PSROIPoolingForward(
  15. const int nthreads, // 任务数,对应通过roi pooling后的输出feature map的神经元节点总数,RoI的个数(m) × channel个数(21类) × psroi pooling输出宽(配置为7) × psroi pooling输出高(配置为7) = 1029×m
  16. const Dtype* bottom_data, // 输入的feature map,原图经过各种卷积、pooling等前向传播后得到(ResNet50rfcn_cls卷积产生的position sensitive feature map,大小为:1029×14×14
  17. const Dtype spatial_scale, // 由之前所有卷积层的strides相乘得到,在rfcn中为1/16,注:从原图往rfcn_clsfeature map上映射为缩小过程,所以乘以1/16,反之需要乘以16
  18. const int channels, // 输入层(ResNet50为卷积层rfcn_clsfeature mapchannel个数(k×k×(C+1)=7×7×21=1029)
  19. const int height, // feature map的宽度(14)
  20. const int width, // feature map的高度(14)
  21. const int pooled_height, // psroi pooling输出feature map的高,fast rcnn中配置为h=7
  22. const int pooled_width, // psroi pooling输出feature map的宽,fast rcnn中配置为w=7
  23. const Dtype* bottom_rois, // 输入的roi信息,存储所有rois或一个batchrois,数据结构为[batch_ind,x1,y1,x2,y2],包含roi的:索引、左上角坐标及右下角坐标
  24. const int output_dim, // 输出feature map的维度,psroipooled_cls_rois2121个类别),psroipooled_loc_rois8
  25. const int group_size, // k=7
  26. Dtype* top_data, // 存储psroi pooling后得到的feature map
  27. int* mapping_channel) {
  28. // index为线程索引,个数为psroi pooling后的feature map上所有值的个数,索引范围为:[0,nthreads-1]
  29. CUDA_KERNEL_LOOP(index, nthreads) {
  30. // 该线程对应的top blobN,C,H,W)中的W,输出roi poolingfeature map的中的宽的坐标,即feature map的第i=[0,k-1]列
  31. int pw = index % pooled_width;
  32. // 该线程对应的top blobN,C,H,W)中的H,输出roi poolingfeature map的中的高的坐标,即feature map的第j=[0,k-1]行
  33. int ph = (index / pooled_width) % pooled_height;
  34. // 该线程对应的top blobN,C,H,W)中的C,即第cchannelchannel数最大值为21(包含背景类的类别数)
  35. int ctop = (index / pooled_width / pooled_height) % output_dim;
  36. // 该线程对应的是第几个RoI,一共m个.
  37. int n = index / pooled_width / pooled_height / output_dim;
  38. // [start, end),指定RoI信息的存储范围,指针每次移动5的倍数是因为包含信息的数据结构大小为5,包含信息为:[batch_ind,x1,y1,x2,y2],含义同上
  39. bottom_rois += n * 5;
  40. // 将每个原图的RoI区域映射到feature map(VGG16conv5_3产生的feature mao)上的坐标,bottom_rois0个位置存放的是roi索引.
  41. int roi_batch_ind = bottom_rois[0];
  42. // 原图到feature map的映射为乘以1/16,这里采用粗映射而不是上文讲的精确映射,原因你懂的.
  43. Dtype roi_start_w = static_cast<Dtype>(round(bottom_rois[1])) * spatial_scale;
  44. Dtype roi_start_h = static_cast<Dtype>(round(bottom_rois[2])) * spatial_scale;
  45. Dtype roi_end_w = static_cast<Dtype>(round(bottom_rois[3]) + 1.) * spatial_scale;
  46. Dtype roi_end_h = static_cast<Dtype>(round(bottom_rois[4]) + 1.) * spatial_scale;
  47. // 强制把RoI的宽和高限制在1x1,防止出现映射后的RoI大小为0的情况
  48. Dtype roi_width = max(roi_end_w - roi_start_w, 0.1);
  49. Dtype roi_height = max(roi_end_h - roi_start_h, 0.1);
  50. // 根据原图映射得到的roi的高和配置的psroi pooling的高(这里大小配置为7)自适应计算bin桶的高度
  51. Dtype bin_size_h = roi_height / static_cast<Dtype>(pooled_height);
  52. // 根据原图映射得到的roi的宽和配置的psroi pooling的宽(这里大小配置为7)自适应计算bin桶的宽度
  53. Dtype bin_size_w = roi_width / static_cast<Dtype>(pooled_width);
  54. // 计算第(i,j)个bin桶在feature map上的坐标范围,需要依据它们确定后续pooling的范围
  55. int hstart = floor(static_cast<Dtype>(ph) * bin_size_h
  56. + roi_start_h);
  57. int wstart = floor(static_cast<Dtype>(pw)* bin_size_w
  58. + roi_start_w);
  59. int hend = ceil(static_cast<Dtype>(ph + 1) * bin_size_h
  60. + roi_start_h);
  61. int wend = ceil(static_cast<Dtype>(pw + 1) * bin_size_w
  62. + roi_start_w);
  63. // 确定max pooling具体范围,注意由于RoI取自原图,其左上角不是从(0,0)开始,
  64. // 所以需要加上 roi_start_h roi_start_w作为偏移量,并且超出feature map尺寸范围的部分会被舍弃
  65. hstart = min(max(hstart, 0), height);
  66. hend = min(max(hend, 0), height);
  67. wstart = min(max(wstart, 0),width);
  68. wend = min(max(wend, 0), width);
  69. bool is_empty = (hend <= hstart) || (wend <= wstart);
  70. int gw = pw;
  71. int gh = ph;
  72. // 计算第C类的(ph,pw)位置索引 = ctop×group_size×group_size + gh×gh×group_size + gw
  73. // 例如: ps feature map上第C[=1]类的第(i,j)[=(1,1)]位置,c=1×7×7 + 1×1×7+1=57
  74. int c = (ctop*group_size + gh)*group_size + gw;
  75. // 逐层做average pooling
  76. bottom_data += (roi_batch_ind * channels + c) * height * width;
  77. Dtype out_sum = 0;
  78. for (int h = hstart; h < hend; ++h){
  79. for (int w = wstart; w < wend; ++w){
  80. int bottom_index = h*width + w;
  81. out_sum += bottom_data[bottom_index];
  82. }
  83. }
  84. // 计算第(i,j)bin桶在feature map上的面积
  85. Dtype bin_area = (hend - hstart)*(wend - wstart);
  86. // 若第(i,j)bin桶宽高非法则设置为0,否则为平均值
  87. top_data[index] = is_empty? 0. : out_sum/bin_area;
  88. // 记录此次迭代计算ps feature map上的索引位置
  89. mapping_channel[index] = c;
  90. }
  91. }
  92. template <typename Dtype>
  93. void PSROIPoolingLayer<Dtype>::Forward_gpu(
  94. const vector<Blob<Dtype>*>& bottom, // ResNet50为例,bottom[0]为最后一个卷积层rfcn_cls产生的feature mapshape[1, 1029, 14, 14],
  95. // bottom[1]为rois数据,shape[roi个数m, 5]
  96. const vector<Blob<Dtype>*>& top) { // top为输出层结构, top->count() = top.nRoI的个数) × top.channel(channel数)
  97. // × top.w(输出feature map的宽) × top.h(输出feature map的高)
  98. const Dtype* bottom_data = bottom[0]->gpu_data();
  99. const Dtype* bottom_rois = bottom[1]->gpu_data();
  100. Dtype* top_data = top[0]->mutable_gpu_data();
  101. int* mapping_channel_ptr = mapping_channel_.mutable_gpu_data();
  102. int count = top[0]->count();
  103. caffe_gpu_set(count, Dtype(0), top_data);
  104. caffe_gpu_set(count, -1, mapping_channel_ptr);
  105. // NOLINT_NEXT_LINE(whitespace/operators)
  106. PSROIPoolingForward<Dtype> << <CAFFE_GET_BLOCKS(count), CAFFE_CUDA_NUM_THREADS >> >(
  107. count, bottom_data, spatial_scale_, channels_, height_, width_, pooled_height_,
  108. pooled_width_, bottom_rois, output_dim_, group_size_, top_data, mapping_channel_ptr);
  109. CUDA_POST_KERNEL_CHECK;
  110. }
  111. template <typename Dtype>
  112. __global__ void PSROIPoolingBackwardAtomic(
  113. const int nthreads, // 输入feature map的元素数
  114. const Dtype* top_diff, // psroi pooling输出feature map所带的梯度信息∂L/∂y(r,j)
  115. const int* mapping_channel, // 同前向,不解释
  116. const int num_rois, // 同前向,不解释
  117. const Dtype spatial_scale, // 同前向,不解释
  118. const int channels, // 同前向,不解释
  119. const int height, // 同前向,不解释
  120. const int width, // 同前向,不解释
  121. const int pooled_height, // 同前向,不解释
  122. const int pooled_width, // 同前向,不解释
  123. const int output_dim, // 同前向,不解释
  124. Dtype* bottom_diff, // 保留输入feature map每个元素通过梯度反向传播得到的梯度信息
  125. const Dtype* bottom_rois) { // 同前向,不解释
  126. // 含义同前向,需要注意的是这里表示的是输入feature map的元素数(反向传播嘛)
  127. CUDA_KERNEL_LOOP(index, nthreads) {
  128. // 同前向,不解释
  129. int pw = index % pooled_width;
  130. int ph = (index / pooled_width) % pooled_height;
  131. int n = index / pooled_width / pooled_height / output_dim;
  132. // 找原图RoIfeature map上的映射位置,解释同前向传播
  133. bottom_rois += n * 5;
  134. int roi_batch_ind = bottom_rois[0];
  135. Dtype roi_start_w = static_cast<Dtype>(round(bottom_rois[1])) * spatial_scale;
  136. Dtype roi_start_h = static_cast<Dtype>(round(bottom_rois[2])) * spatial_scale;
  137. Dtype roi_end_w = static_cast<Dtype>(round(bottom_rois[3]) + 1.) * spatial_scale;
  138. Dtype roi_end_h = static_cast<Dtype>(round(bottom_rois[4]) + 1.) * spatial_scale;
  139. // 同前向
  140. Dtype roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0
  141. Dtype roi_height = max(roi_end_h - roi_start_h, 0.1);
  142. // 同前向
  143. Dtype bin_size_h = roi_height / static_cast<Dtype>(pooled_height);
  144. Dtype bin_size_w = roi_width / static_cast<Dtype>(pooled_width);
  145. int hstart = floor(static_cast<Dtype>(ph)* bin_size_h
  146. + roi_start_h);
  147. int wstart = floor(static_cast<Dtype>(pw)* bin_size_w
  148. + roi_start_w);
  149. int hend = ceil(static_cast<Dtype>(ph + 1) * bin_size_h
  150. + roi_start_h);
  151. int wend = ceil(static_cast<Dtype>(pw + 1) * bin_size_w
  152. + roi_start_w);
  153. // 同前向
  154. hstart = min(max(hstart, 0), height);
  155. hend = min(max(hend, 0), height);
  156. wstart = min(max(wstart, 0), width);
  157. wend = min(max(wend, 0), width);
  158. bool is_empty = (hend <= hstart) || (wend <= wstart);
  159. // 取出前向传播时记录的ps feature map通道索引c,该bin的梯度会被平均分配回bin内每个位置
  160. int c = mapping_channel[index];
  161. Dtype* offset_bottom_diff = bottom_diff + (roi_batch_ind * channels + c) * height * width;
  162. Dtype bin_area = (hend - hstart)*(wend - wstart);
  163. Dtype diff_val = is_empty ? 0. : top_diff[index] / bin_area;
  164. for (int h = hstart; h < hend; ++h){
  165. for (int w = wstart; w < wend; ++w){
  166. int bottom_index = h*width + w;
  167. caffe_gpu_atomic_add(diff_val, offset_bottom_diff + bottom_index);
  168. }
  169. }
  170. }
  171. }
  172. template <typename Dtype>
  173. void PSROIPoolingLayer<Dtype>::Backward_gpu(
  174. const vector<Blob<Dtype>*>& top, // psroi pooling输出feature map
  175. const vector<bool>& propagate_down, // 是否做反向传播,回忆前向传播时的那个bool
  176. const vector<Blob<Dtype>*>& bottom) { // psroi pooling输入feature map(ResNet中的rfcn_cls产生的feature map)
  177. if (!propagate_down[0]) {
  178. return;
  179. }
  180. const Dtype* bottom_rois = bottom[1]->gpu_data(); // 原始RoI信息
  181. const Dtype* top_diff = top[0]->gpu_diff(); // psroi pooling feature map梯度信息
  182. Dtype* bottom_diff = bottom[0]->mutable_gpu_diff(); // 待写入的输入feature map梯度信息
  183. const int bottom_count = bottom[0]->count(); // 输入feature map元素总数
  184. const int* mapping_channel_ptr = mapping_channel_.gpu_data();
  185. caffe_gpu_set(bottom[1]->count(), Dtype(0), bottom[1]->mutable_gpu_diff());
  186. caffe_gpu_set(bottom_count, Dtype(0), bottom_diff);
  187. const int count = top[0]->count();
  188. // NOLINT_NEXT_LINE(whitespace/operators)
  189. PSROIPoolingBackwardAtomic<Dtype> << <CAFFE_GET_BLOCKS(count), CAFFE_CUDA_NUM_THREADS >> >(
  190. count, top_diff, mapping_channel_ptr, top[0]->num(), spatial_scale_,
  191. channels_, height_, width_, pooled_height_, pooled_width_, output_dim_,
  192. bottom_diff, bottom_rois);
  193. CUDA_POST_KERNEL_CHECK;
  194. }
  195. INSTANTIATE_LAYER_GPU_FUNCS(PSROIPoolingLayer);
  196. } // namespace caffe
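
为了把上面CUDA kernel中的索引与池化逻辑串起来,下面给出一个只用numpy实现的position-sensitive RoI pooling前向过程的极简示意(非R-FCN官方实现,函数名与张量尺寸均为示例假设),其中 (ctop, ph, pw) 到通道 c 的映射与kernel里的 c = (ctop*group_size + gh)*group_size + gw 一致:

# -*- coding: utf-8 -*-
# 极简numpy示意:position-sensitive RoI pooling前向过程(仅用于说明索引与平均池化逻辑)
import numpy as np

def ps_roi_pooling_naive(feat, roi, spatial_scale, output_dim, group_size):
    """feat: (output_dim*k*k, H, W);roi: 原图坐标(x1, y1, x2, y2);返回 (output_dim, k, k)"""
    k = group_size
    _, height, width = feat.shape
    # RoI从原图坐标映射到feature map坐标,round与加1的处理和kernel保持一致
    x1, y1, x2, y2 = [round(v) for v in roi]
    roi_start_w, roi_start_h = x1 * spatial_scale, y1 * spatial_scale
    roi_end_w, roi_end_h = (x2 + 1.) * spatial_scale, (y2 + 1.) * spatial_scale
    roi_w = max(roi_end_w - roi_start_w, 0.1)  # 避免出现0
    roi_h = max(roi_end_h - roi_start_h, 0.1)
    bin_h, bin_w = roi_h / k, roi_w / k
    out = np.zeros((output_dim, k, k), dtype=feat.dtype)
    for ctop in range(output_dim):
        for ph in range(k):
            for pw in range(k):
                hstart = min(max(int(np.floor(ph * bin_h + roi_start_h)), 0), height)
                hend = min(max(int(np.ceil((ph + 1) * bin_h + roi_start_h)), 0), height)
                wstart = min(max(int(np.floor(pw * bin_w + roi_start_w)), 0), width)
                wend = min(max(int(np.ceil((pw + 1) * bin_w + roi_start_w)), 0), width)
                if hend <= hstart or wend <= wstart:
                    continue  # 空bin输出0
                # 位置敏感:第(ctop, ph, pw)个bin只在第c个通道上做average pooling
                c = (ctop * k + ph) * k + pw
                out[ctop, ph, pw] = feat[c, hstart:hend, wstart:wend].mean()
    return out

if __name__ == '__main__':
    # 与正文一致:21类×7×7=1029个通道,feature map为14×14,spatial_scale=1/16
    feat = np.random.rand(21 * 7 * 7, 14, 14).astype(np.float32)
    pooled = ps_roi_pooling_naive(feat, (0, 0, 223, 223), 1.0 / 16, 21, 7)
    print(pooled.shape)  # (21, 7, 7)

下面是利用训练好的R-FCN模型做检测、并可视化位置敏感得分图的demo脚本: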
  1. #!/usr/bin/env python
  2. # -*- coding: utf-8 -*-
  3. """
  4. Demo script showing detections in sample images.
  5. See README.md for installation instructions before running.
  6. """
  7. import matplotlib
  8. matplotlib.use('Agg')
  9. import matplotlib.pyplot as plt
  10. import _init_paths
  11. from fast_rcnn.config import cfg
  12. from fast_rcnn.test import im_detect
  13. from fast_rcnn.nms_wrapper import nms
  14. from utils.timer import Timer
  15. import numpy as np
  16. import scipy.io as sio
  17. import caffe, os, sys, cv2
  18. import argparse
  19. CLASSES = ('__background__',
  20. 'aeroplane', 'bicycle', 'bird', 'boat',
  21. 'bottle', 'bus', 'car', 'cat', 'chair',
  22. 'cow', 'diningtable', 'dog', 'horse',
  23. 'motorbike', 'person', 'pottedplant',
  24. 'sheep', 'sofa', 'train', 'tvmonitor')
  25. NETS = {'ResNet-101': ('ResNet-101',
  26. 'resnet101_rfcn_final.caffemodel'),
  27. 'ResNet-50': ('ResNet-50',
  28. 'resnet50_rfcn_final.caffemodel')}
  29. def parse_args():
  30. """Parse input arguments."""
  31. parser = argparse.ArgumentParser(description='Faster R-CNN demo')
  32. parser.add_argument('--gpu', dest='gpu_id', help='GPU device id to use [0]',
  33. default=0, type=int)
  34. parser.add_argument('--cpu', dest='cpu_mode',
  35. help='Use CPU mode (overrides --gpu)',
  36. action='store_true')
  37. parser.add_argument('--net', dest='demo_net', help='Network to use [ResNet-101]',
  38. choices=NETS.keys(), default='ResNet-101')
  39. args = parser.parse_args()
  40. return args
  41. def vis_square(data, i):
  42. """Take an array of shape (n, height, width) or (n, height, width, 3)
  43. and visualize each (height, width) thing in a grid of size approx. sqrt(n) by sqrt(n)"""
  44. # normalize data for display
  45. data = (data - data.min()) / (data.max() - data.min())
  46. # force the number of filters to be square
  47. n = int(np.ceil(np.sqrt(data.shape[0])))
  48. padding = (((0, n ** 2 - data.shape[0]),
  49. (0, 1), (0, 1)) # add some space between filters
  50. + ((0, 0),) * (data.ndim - 3)) # don't pad the last dimension (if there is one)
  51. data = np.pad(data, padding, mode='constant', constant_values=1) # pad with ones (white)
  52. # tile the filters into an image
  53. data = data.reshape((n, n) + data.shape[1:]).transpose((0, 2, 1, 3) + tuple(range(4, data.ndim + 1)))
  54. data = data.reshape((n * data.shape[1], n * data.shape[3]) + data.shape[4:])
  55. plt.imshow(data); plt.axis('off')
  56. plt.savefig('feature-' + str(i) + '.jpg')
  57. def vis_demo(net, image_name):
  58. """可视化位置敏感特征图."""
  59. # Load the demo image
  60. im_file = os.path.join(cfg.DATA_DIR, 'demo', image_name)
  61. im = cv2.imread(im_file)
  62. # Detect all object classes and regress object bounds
  63. timer = Timer()
  64. timer.tic()
  65. scores, boxes = im_detect(net, im)
  66. timer.toc()
  67. print ('Detection took {:.3f}s for '
  68. '{:d} object proposals').format(timer.total_time, boxes.shape[0])
  69. conv = net.blobs['data'].data[0]
  70. ave = np.average(conv.transpose(1, 2, 0), axis=2)
  71. plt.imshow(ave); plt.axis('off')
  72. plt.savefig('featurex.jpg')
  73. # Visualize detections for each class
  74. CONF_THRESH = 0.8
  75. NMS_THRESH = 0.3
  76. for cls_ind, cls in enumerate(CLASSES[1:]):
  77. cls_ind += 1 # because we skipped background
  78. cls_boxes = boxes[:, 4:8]
  79. cls_scores = scores[:, cls_ind]
  80. dets = np.hstack((cls_boxes,
  81. cls_scores[:, np.newaxis])).astype(np.float32)
  82. keep = nms(dets, NMS_THRESH)
  83. dets = dets[keep, :]
  84. print cls_ind, ' ', cls
  85. # rfcn_cls[0, 0:49] 是第0类的7×7map,rfcn_cls[0, 49:98] 是第1类的7×7map,以此类推。
  86. feat = net.blobs['rfcn_cls'].data[0, cls_ind*49:(cls_ind+1)*49]
  87. vis_square(feat, cls)
  88. if __name__ == '__main__':
  89. cfg.TEST.HAS_RPN = True # Use RPN for proposals
  90. args = parse_args()
  91. prototxt = os.path.join(cfg.MODELS_DIR, NETS[args.demo_net][0],
  92. 'rfcn_end2end', 'test_agnostic.prototxt')
  93. caffemodel = os.path.join(cfg.DATA_DIR, 'rfcn_models',
  94. NETS[args.demo_net][1])
  95. if not os.path.isfile(caffemodel):
  96. raise IOError(('{:s} not found.\n').format(caffemodel))
  97. if args.cpu_mode:
  98. caffe.set_mode_cpu()
  99. else:
  100. caffe.set_mode_gpu()
  101. caffe.set_device(args.gpu_id)
  102. cfg.GPU_ID = args.gpu_id
  103. net = caffe.Net(prototxt, caffemodel, caffe.TEST)
  104. for layer_name, blob in net.blobs.iteritems():
  105. print layer_name + '\t' + str(blob.data.shape)
  106. print '\n\nLoaded network {:s}'.format(caffemodel)
  107. # Warmup on a dummy image
  108. im = 128 * np.ones((300, 500, 3), dtype=np.uint8)
  109. for i in xrange(2):
  110. _, _= im_detect(net, im)
  111. im_names = ['car.jpg']
  112. for im_name in im_names:
  113. print '~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~'
  114. print 'Demo for data/demo/{}'.format(im_name)
  115. vis_demo(net, im_name)
  116. # obtain the output probabilities
  117. output_prob = net.blobs['cls_prob'].data[0]
  118. print 'probabilities:'
  119. print output_prob

8.8 DenseNet

8.8.1 关于神经网络的深度

理论上,当我们有足够大量的数据、能够完全体现当前问题的数据分布时,仅需要一个简单线性模型,或最多用一个有单隐层的RBF神经网络就可以完美建模。但实际情况是没有那么多数据,自然就需要一个更高复杂度的模型来拟合样本;而如果模型复杂度过高、样本数又没有与之匹配,就会造成泛化性低下,即所谓过拟合问题。实际上,假设未来做testing的数据分布和training的数据分布一致,对一个有 $N$ 个节点、$W$ 个权重、使用线性阈值函数的前馈神经网络,在泛化误差为 $\epsilon$ 的前提下,所需训练数据规模见下式,详情可见论文《What Size Net Gives Valid Generalization》。
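按笔者对该论文结论的整理(非原文逐字表述):若网络有 $N$ 个节点、$W$ 个权重,要达到泛化误差 $\epsilon$,所需样本数 $m$ 大致满足

$$ m = O\!\left(\frac{W}{\epsilon}\log\frac{N}{\epsilon}\right) \quad \text{(充分)}, \qquad m = \Omega\!\left(\frac{W}{\epsilon}\right) \quad \text{(必要)} $$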
网络的深度则反映了模型的复杂度:深度即层数,并间接决定了节点数和权重数。网络加深意味着能得到更抽象的特征,但原始输入信号和梯度信息会随着深度增加而逐渐衰减甚至失效,所以这又是一个折中权衡。像之前讲的Highway Network、ResNet及其衍生模型的思路,是通过short path连接让前面层的信号能够直接传递到后面层,我认为这个思路是开创性的。

8.8.2 DenseNet思路

《Densely Connected Convolutional Networks》(CVPR 2017的最佳论文之一)提出的DenseNet则把ResNet的思路做得更加彻底:在一个Dense Block中,任意一个当前层都会与其后面的所有层直接连接,如图:


对一个共有 $L$ 层的Dense Block,由于每一层都接收它前面所有层(以及block输入)的feature map,block内的直接连接总数为 $\frac{L(L+1)}{2}$。
回顾之前对ResNet的分析以及《Deep Networks with Stochastic Depth》这篇论文的实验,可以得到以下信息:

  • 神经网络不一定非得是逐层递进的,任意一层可以接收它前面任意一层的输入而扔掉它前面的其它层,也就是说当前层feature map的提取可以只依赖更前面层的feature map;
  • 传统前馈神经网络架构可以被看做有一个状态维护机制,在层与层之间传递这个状态,后一层在接收前一层的状态后又加入自己的信息,修改状态后传给下一层;
  • ResNet网络在路径选择的思想下展开(见ResNet一章的分析)后,其实也说明它有一定的冗余性,适当地随机Drop掉一些层相当于扔掉了一些路径,从实际实验看还会提高网络Inference的泛化性。

基于以上认知,作者设计了DenseNet:让每一层都与后面所有层直接连接,达到特征复用的目的;同时这些连接也可以看做网络的全局状态,大家共同维护,不用传来传去;降低每一层feature map数,让网络结构变“窄”,达到去除冗余的目的。

与ResNet比较:

  • ResNet采用按照向量每个维度的Element-wise做加和的方式处理连接,而DenseNet采用按照每个通道的Channel-wise做直接向量拼接的方式处理连接。


    PS:注意图中C操作符的位置
    DenseNet的前向传播过程可以像这样展开:


    每一层的输入都包含所有前面层的feature map。
    形式化的对比如下:
    ResNet:第 $\ell$ 层的输出是 $x_\ell = H_\ell(x_{\ell-1}) + x_{\ell-1}$;
    DenseNet:第 $\ell$ 层的输出是 $x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])$。
    其中:$[\cdot]$ 为按通道的向量拼接操作,$H_\ell$ 是一个复合函数,文中是batch normalization (BN)+rectified linear
    unit (ReLU)+3×3 convolution (Conv)的复合,即 $H_\ell(x)=\mathrm{Conv}(\mathrm{ReLU}(\mathrm{BN}(x)))$(两种连接方式的差别可参见本节列表后的numpy小例子)。

  • dense blocks与transition layer
    DenseNet的拼接操作要求feature map大小具有一致性,但pooling下采样操作一定会改变feature map的大小,所以作者用dense blocks+transition layers的方式解决问题:
    1、dense blocks内部feature map大小都一致,借鉴Inception结构,利用bottleneck中的1×1卷积降低通道数,即BN+ReLU+Conv(1x1)+BN+ReLU+Conv(3x3)操作;
    2、dense blocks之间增加transition layer,同样借鉴Inception结构,利用1×1卷积降低通道数,即BN+ReLU+Conv(1×1)+AvgPooling(2x2)操作:


    transition layer可以起到压缩模型的作用:假设dense block输出 $m$ 个feature map,我们让紧接着的transition layer产生 $\lfloor \theta m \rfloor$ 个feature map,这里 $\theta \in (0,1]$ 为压缩系数。
    宏观来看,整个DenseNet如下:

  • 利用Growth Rate $k$ 和复合函数 $H_\ell$,DenseNet可以做得很“窄”:


    假设每个复合函数 $H_\ell$ 产生 $k$ 个feature map(即Growth Rate为 $k$),那么第 $\ell$ 层的输入feature map数为 $k_0 + k\times(\ell-1)$,其中 $k_0$ 为输入层的通道数。可见越往后的层输入feature map越多;当然由于全局feature map(状态)的存在,每层只有 $k$ 个feature map是自己新产生的,其余的都是共享的。
    显然,“窄”的好处是参数少、计算效率高,比较如下:

  • DenseNet结构使得特征更加具有多样性
    显然,由于最终用于预测的特征中混合了来自不同深度、复杂度各异的特征,特征具有很强的多样性,有利于提高模型的泛化性和鲁棒性。
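
为直观体会上面Element-wise加和与Channel-wise拼接在通道维度上的差别,下面给出一个numpy小例子(示意性质,张量尺寸、通道数均为假设):

# -*- coding: utf-8 -*-
# 示意:ResNet的element-wise加和 vs DenseNet的channel-wise拼接(通道放在最后一维)
import numpy as np

x = np.random.rand(1, 8, 8, 16)   # 假设上一层输出,16个feature map
h = np.random.rand(1, 8, 8, 16)   # 假设复合函数H(x)的输出,相加时通道数必须与x一致

resnet_out = x + h                               # ResNet:通道数不变,仍为16
densenet_out = np.concatenate([x, h], axis=-1)   # DenseNet:通道数累加,变为32
print(resnet_out.shape)    # (1, 8, 8, 16)
print(densenet_out.shape)  # (1, 8, 8, 32)

# 若每层新产生k个feature map(growth rate),堆叠L层后拼接出的通道数为 k0 + k*L
k0, k, L = 16, 12, 4
feats = [np.random.rand(1, 8, 8, k0)]
for _ in range(L):
    feats.append(np.random.rand(1, 8, 8, k))     # 这里用随机张量代替复合函数H的输出
print(np.concatenate(feats, axis=-1).shape)      # (1, 8, 8, 64),即16 + 12×4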

8.8.3 代码实践

看一个基于keras的简单例子,它比较好地重现了DenseNet的构建,看的时候对照着DenseNet的前向展开图更容易理解原理:

  1. # -*- coding: utf-8 -*-
  2. import keras
  3. import keras.backend as K
  4. from keras.models import Model
  5. from keras.layers import Input, merge, Activation, Dropout, Dense
  6. from keras.layers.convolutional import Convolution2D
  7. from keras.layers.pooling import AveragePooling2D, GlobalAveragePooling2D
  8. from keras.layers.normalization import BatchNormalization
  9. from keras.regularizers import l2
  10. from keras.optimizers import SGD
  11. from keras.callbacks import ModelCheckpoint
  12. from keras.preprocessing.image import ImageDataGenerator
  13. #增加一层并使用复合函数BN+ReLU+Conv(3x3)
  14. def add_layer(x, nb_channels, kernel_size=3, dropout=0., l2_reg=1e-4):
  15. out = BatchNormalization(gamma_regularizer=l2(l2_reg),
  16. beta_regularizer=l2(l2_reg))(x)
  17. out = Activation('relu')(out)
  18. out = Convolution2D(nb_channels, kernel_size, kernel_size,
  19. border_mode='same', init='he_normal',
  20. W_regularizer=l2(l2_reg), bias=False)(out)
  21. if dropout > 0:
  22. out = Dropout(dropout)(out)
  23. return out
  24. #指定层数和增长率,增加一个dense block
  25. def dense_block(x, nb_layers, growth_rate, dropout=0., l2_reg=1e-4):
  26. for i in range(nb_layers):
  27. # Get layer output
  28. out = add_layer(x, growth_rate, dropout=dropout, l2_reg=l2_reg)
  29. if K.image_dim_ordering() == 'tf':
  30. merge_axis = -1
  31. elif K.image_dim_ordering() == 'th':
  32. merge_axis = 1
  33. else:
  34. raise Exception('Invalid dim_ordering: ' + K.image_dim_ordering())
  35. # Concatenate input with layer output
  36. x = merge([x, out], mode='concat', concat_axis=merge_axis)
  37. return x
  38. #增加一个transition layer
  39. def transition_block(x, nb_channels, dropout=0., l2_reg=1e-4):
  40. x = add_layer(x, nb_channels, kernel_size=1, dropout=dropout, l2_reg=l2_reg)
  41. x = AveragePooling2D()(x)
  42. return x
  43. #指定dense block数量、层数、增长率,构建DenseNet
  44. def densenet_model(nb_blocks, nb_layers, growth_rate, dropout=0., l2_reg=1e-4,
  45. init_channels=16):
  46. n_channels = init_channels
  47. inputs = Input(shape=(32, 32, 3))
  48. x = Convolution2D(init_channels, 3, 3, border_mode='same',
  49. init='he_normal', W_regularizer=l2(l2_reg),
  50. bias=False)(inputs)
  51. for i in range(nb_blocks - 1):
  52. # Create a dense block
  53. x = dense_block(x, nb_layers, growth_rate,
  54. dropout=dropout, l2_reg=l2_reg)
  55. # Update the number of channels
  56. n_channels += nb_layers*growth_rate
  57. # Transition layer
  58. x = transition_block(x, n_channels, dropout=dropout, l2_reg=l2_reg)
  59. # Add last dense_block
  60. x = dense_block(x, nb_layers, growth_rate, dropout=dropout, l2_reg=l2_reg)
  61. # Add final BN-Relu
  62. x = BatchNormalization(gamma_regularizer=l2(l2_reg),
  63. beta_regularizer=l2(l2_reg))(x)
  64. x = Activation('relu')(x)
  65. # Global average pooling
  66. x = GlobalAveragePooling2D()(x)
  67. x = Dense(10, W_regularizer=l2(l2_reg))(x)
  68. x = Activation('softmax')(x)
  69. model = Model(input=inputs, output=x)
  70. return model
  71. if __name__ == '__main__':
  72. # 1个dense block,每个block里共2层,growth rate为3
  73. model = densenet_model(1, 2, 3)
  74. from keras.utils.vis_utils import plot_model
  75. plot_model(model, to_file="DenseNet.jpg", show_shapes=True)

生成网络结构为:


对应的前向展开为:


8.9 Mask R-CNN

Mask R-CNN是在《Mask R-CNN》一文中提出,可以看做是Faster R-CNN的升级加强版,结构上也可以理解为:Faster R-CNN+FCN。它是一个通用的检测、识别、语义分割、实例分割的框架,看本章内容前建议先回顾下Faster R-CNN和FCN的内容。

8.10 YOLO

8.11 SSD

8.12 YOLO 9000

References

如有遗漏请提醒我补充:
1、《Understanding the Bias-Variance Tradeoff》
http://scott.fortmann-roe.com/docs/BiasVariance.html
2、《Boosting Algorithms as Gradient Descent in Function Space》
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.51.6893&rep=rep1&type=pdf
3、《Optimal Action Extraction for Random Forests and Boosted Trees》
http://www.cse.wustl.edu/~ychen/public/OAE.pdf
4、《Applying Neural Network Ensemble Concepts for Modelling Project Success》
http://www.iaarc.org/publications/fulltext/Applying_Neural_Network_Ensemble_Concepts_for_Modelling_Project_Success.pdf
5、《Introduction to Boosted Trees》
https://homes.cs.washington.edu/~tqchen/data/pdf/BoostedTree.pdf
6、《Machine Learning: Perceptrons》
http://ml.informatik.uni-freiburg.de/_media/documents/teaching/ss09/ml/perceptrons.pdf
7、《An overview of gradient descent optimization algorithms》
http://sebastianruder.com/optimizing-gradient-descent/
8、《Ad Click Prediction: a View from the Trenches》
https://www.eecs.tufts.edu/~dsculley/papers/ad-click-prediction.pdf
9、《ADADELTA: AN ADAPTIVE LEARNING RATE METHOD》
http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf
10、《Improving the Convergence of Back-Propagation Learning with Second Order Methods》
http://yann.lecun.com/exdb/publis/pdf/becker-lecun-89.pdf
11、《ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION》
https://arxiv.org/pdf/1412.6980v8.pdf
12、《Adaptive Subgradient Methods for Online Learning and Stochastic Optimization》
http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
13、《Sparse Allreduce: Efficient Scalable Communication for Power-Law Data》
https://arxiv.org/pdf/1312.3020.pdf
14、《Asynchronous Parallel Stochastic Gradient Descent》
https://arxiv.org/pdf/1505.04956v5.pdf
15、《Large Scale Distributed Deep Networks》
https://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf
16、《Introduction to Optimization —— Second Order Optimization Methods》
https://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/13-Optimization/04-secondOrderOpt.pdf
17、《On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization》
http://www.maths.ed.ac.uk/ERGO/pubs/ERGO-09-013.pdf
18、《On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes》
http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf
19、《Parametric vs Nonparametric Models》
http://mlss.tuebingen.mpg.de/2015/slides/ghahramani/gp-neural-nets15.pdf
20、《XGBoost: A Scalable Tree Boosting System》
https://arxiv.org/abs/1603.02754
21、一个可视化CNN的网站
http://shixialiu.com/publications/cnnvis/demo/
22、《Computer vision: LeNet-5, AlexNet, VGG-19, GoogLeNet》
http://euler.stat.yale.edu/~tba3/stat665/lectures/lec18/notebook18.html
23、François Chollet在Quora上的专题问答
https://www.quora.com/session/Fran%C3%A7ois-Chollet/1
24、《将Keras作为tensorflow的精简接口》
https://keras-cn.readthedocs.io/en/latest/blog/keras_and_tensorflow/
25、《Upsampling and Image Segmentation with Tensorflow and TF-Slim》
https://warmspringwinds.github.io/tensorflow/tf-slim/2016/11/22/upsampling-and-image-segmentation-with-tensorflow-and-tf-slim/
26、《DENSELY CONNECTED CONVOLUTIONAL NETWORKS》
http://www.cs.cornell.edu/~gaohuang/papers/DenseNet-CVPR-Slides.pdf
27、https://github.com/vivounicorn/convnet-study
