Learning Deconvolution Network for Semantic Segmentation
The paper first reviews FCN-based semantic segmentation, citing papers such as:
Fully Convolutional Networks for Semantic Segmentation
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
FCN is described as follows:
fully connected layers in the standard CNNs are interpreted as convolutions with large receptive fields, and segmentation is achieved using coarse class score maps obtained by feedforwarding an input image.
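The reinterpretation of a fully connected layer as a convolution can be checked numerically. The following is a minimal NumPy sketch, not the FCN implementation: the sizes (`C`, `H`, `W`, `K`) and the function names are hypothetical and chosen only for illustration. The same weight matrix is used once as a dense layer and once as `K` filters slid over a larger feature map, which yields a coarse score map instead of a single score vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 4-channel 3x3 feature map feeding a 5-unit FC layer.
C, H, W, K = 4, 3, 3, 5
fc_weight = rng.standard_normal((K, C * H * W))

def fc_forward(feat):
    """Standard fully connected layer applied to a flattened C x H x W map."""
    return fc_weight @ feat.reshape(-1)

def conv_forward(feat):
    """Same weights viewed as K filters of size C x H x W, slid over a map."""
    filters = fc_weight.reshape(K, C, H, W)
    out_h = feat.shape[1] - H + 1
    out_w = feat.shape[2] - W + 1
    out = np.empty((K, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feat[:, i:i + H, j:j + W]
            # Contract over channel, row and column axes: one score per filter.
            out[:, i, j] = np.tensordot(filters, window, axes=3)
    return out

small = rng.standard_normal((C, H, W))   # input the FC layer was sized for
large = rng.standard_normal((C, 5, 6))   # larger input: only the conv view applies
score_map = conv_forward(large)          # coarse class score map, shape (K, 3, 4)
```

On an input of the original size the two views give the same numbers; on a larger input the convolutional view produces one score vector per spatial position.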
An interesting idea in this work is that a simple interpolation filter is employed for deconvolution and only the CNN part of the network is fine-tuned to learn deconvolution indirectly.
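To make the "simple interpolation filter" concrete, here is a minimal sketch of bilinear upsampling written as a fixed-weight deconvolution. The kernel size of 4, the stride of 2, and the function names are assumptions made for illustration; the notes do not state which settings FCN uses.

```python
import numpy as np

def bilinear_kernel(size):
    """Build a size x size bilinear interpolation kernel."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return (1 - abs(rows - center) / factor) * (1 - abs(cols - center) / factor)

def upsample2x(x):
    """Upsample a 2-D map by 2 with a fixed (not learned) bilinear kernel."""
    k = bilinear_kernel(4)
    h, w = x.shape
    full = np.zeros((2 * (h - 1) + 4, 2 * (w - 1) + 4))
    for i in range(h):
        for j in range(w):
            # Each input value is spread over a 4x4 output window.
            full[2 * i:2 * i + 4, 2 * j:2 * j + 4] += x[i, j] * k
    return full[1:-1, 1:-1]  # crop one pixel per side so the output is 2h x 2w

coarse = np.ones((3, 3))
up = upsample2x(coarse)
```

Because the kernel is fixed, nothing in this step is learned, which is the contrast the notes draw with the learned deconvolution filters discussed later.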
FCN has the following problems:
As shown in the figure:
First, the network can handle only single-scale semantics within an image because of its fixed-size receptive field.
Second, the detailed structures of an object are often lost or smoothed because the label map fed to the deconvolutional layer is too coarse and the deconvolution procedure is overly simple.
Introducing the deconvolution network:
Deconvolution network is introduced in [25] to reconstruct input images. As the reconstruction of an input image is non-trivial due to max pooling layers, it proposes the unpooling operation by storing the pooled location. Using the deconvolution network, the input image can be reconstructed from its feature representation.
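In short, recording the locations selected by max pooling is what allows the deconvolution network to reconstruct the image.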
Although pooling helps classification by retaining only robust activations in upper layers, spatial information within a receptive field is lost during pooling, and that information may be critical for the precise localization that semantic segmentation requires.
To resolve this issue, we employ unpooling layers in the deconvolution network, which perform the reverse operation of pooling and reconstruct the original size of the activations.
Unpooling records the locations of the maximum activations selected during the pooling operation in switch variables, which are then used to place each activation back at its original pooled location.
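The switch mechanism can be sketched in a few lines of NumPy. This is an illustrative sketch under simple assumptions (a single 2-D map, non-overlapping `k x k` windows, hypothetical function names), not the paper's implementation.

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """Max-pool a 2-D map and record where each maximum came from."""
    h, w = x.shape
    oh, ow = h // k, w // k
    pooled = np.empty((oh, ow), dtype=x.dtype)
    switches = np.empty((oh, ow), dtype=np.int64)  # flat indices into x
    for i in range(oh):
        for j in range(ow):
            window = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            pooled[i, j] = window[r, c]
            switches[i, j] = (i * k + r) * w + (j * k + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Place each pooled activation back at its recorded location."""
    out = np.zeros(shape, dtype=pooled.dtype)
    out.flat[switches.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 5., 2., 0.],
              [3., 4., 8., 6.],
              [7., 2., 1., 9.],
              [0., 3., 4., 5.]])
pooled, switches = max_pool_with_switches(x)
restored = unpool(pooled, switches, x.shape)
```

The restored map has the original size but is zero everywhere except at the four recorded maxima, which is the sparsity the next section addresses.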
Deconvolution:
The output of an unpooling layer is an enlarged yet sparse activation map.
The deconvolution layers densify the sparse activations obtained by unpooling through convolution-like operations with multiple learned filters.
However, contrary to convolutional layers, which connect multiple input activations within a filter window to a single activation, deconvolutional layers associate a single input activation with multiple outputs, as illustrated in the figure.
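The one-to-many mapping can be seen directly in a small sketch. The function below is a plain transposed convolution on a single 2-D map with a single filter; the name `deconv2d`, the kernel values, and the input are hypothetical and serve only to show how one activation is spread over a filter-sized output window.

```python
import numpy as np

def deconv2d(x, kernel, stride=1):
    """Transposed convolution: each input activation paints a scaled kernel copy."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            top, left = i * stride, j * stride
            out[top:top + kh, left:left + kw] += x[i, j] * kernel
    return out

# A sparse map such as unpooling would produce: a single active location.
sparse = np.zeros((4, 4))
sparse[1, 2] = 1.0
kernel = np.arange(1.0, 10.0).reshape(3, 3)  # stand-in for a learned filter
dense = deconv2d(sparse, kernel)
```

A single non-zero input produces nine non-zero outputs here; with many active inputs the painted windows overlap and sum, which is how the sparse map becomes dense.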
The filters in lower layers tend to capture the overall shape of an object, while the class-specific fine details are encoded in the filters in higher layers.
Unpooling captures example-specific structures by tracing the original locations with strong activations back to image space. As a result, it effectively reconstructs the detailed structure of an object in finer resolutions. On the other hand, learned filters in deconvolutional layers tend to capture class-specific shapes
We remove the dropout layers because batch normalization is used.
Training proceeds in two stages:
First, by limiting the variations in object location and size, we significantly reduce the search space for semantic segmentation and successfully train the network with far fewer training examples.
The proposed network is trained to perform semantic segmentation for individual instances. Given an input image, we first generate a sufficient number of candidate proposals and apply the trained network to obtain semantic segmentation maps of individual proposals. Then we aggregate the outputs of all proposals to produce semantic segmentation on the whole image.
The pixel-wise class score map of an image is constructed by aggregating the outputs of all proposals.
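One plausible way to aggregate is a pixel-wise maximum over all proposals that cover a pixel. The sketch below assumes that rule, assumes non-negative scores, and assumes every proposal lies fully inside the image; the function name and the proposal format are hypothetical. Averaging would be an equally simple alternative.

```python
import numpy as np

def aggregate_proposals(image_shape, num_classes, proposals):
    """Combine per-proposal score maps into one image-sized score map.

    proposals: list of ((top, left), scores), scores shaped (num_classes, h, w).
    Pixels not covered by any proposal keep a score of zero.
    """
    height, width = image_shape
    agg = np.zeros((num_classes, height, width))
    for (top, left), scores in proposals:
        _, ph, pw = scores.shape
        region = agg[:, top:top + ph, left:left + pw]  # view into agg
        np.maximum(region, scores, out=region)         # pixel-wise max, in place
    return agg

# Two overlapping toy proposals on a 4x4 image with 2 classes.
p1 = ((0, 0), np.full((2, 2, 2), 0.5))
p2 = ((1, 1), np.full((2, 3, 3), 0.8))
score_map = aggregate_proposals((4, 4), 2, [p1, p2])
```

Where the two proposals overlap, the larger score wins; pixels seen by only one proposal keep that proposal's score.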
Finally, we apply the fully-connected CRF [14] to the output maps for the final pixel-wise labeling, where the unary potentials are obtained from the pixel-wise class conditional probability maps.
Given two sets of class conditional probability maps of an input image computed independently by the proposed method and FCN, we compute the mean of both output maps and apply the CRF to obtain the final semantic segmentation
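The ensemble step described above amounts to averaging two probability maps before the CRF. The sketch below shows only the averaging; the CRF is omitted and replaced by a per-pixel argmax purely as a stand-in, and the toy probability values and names are made up for illustration.

```python
import numpy as np

def ensemble_mean(prob_a, prob_b):
    """Mean of two class-conditional probability maps of shape (classes, H, W)."""
    return 0.5 * (prob_a + prob_b)

# Toy maps: 2 classes on a 1x2 image.
prob_deconv = np.array([[[0.9, 0.2]],
                        [[0.1, 0.8]]])
prob_fcn = np.array([[[0.6, 0.7]],
                     [[0.4, 0.3]]])

mean_prob = ensemble_mean(prob_deconv, prob_fcn)
# Stand-in for the CRF: take the most probable class at each pixel.
labels = mean_prob.argmax(axis=0)
```

Since each input map sums to one over classes at every pixel, so does their mean, so it can be passed on as a probability map without renormalizing.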
Training:
We initialize the weights in the convolution network using VGG 16-layer net pre-trained on ILSVRC [4] dataset, while the weights in the deconvolution network are initialized with zero-mean Gaussians.
Testing:
We employ edge-box [26] to generate object proposals. For each testing image