Image Segmentation


A summary of current segmentation algorithms, with some implementations.
The Carvana_Image_Masking_Challenge on Kaggle is used to get familiar with segmentation tasks.
Keywords Plus:      semantic segmentation      instance segmentation

Introduction

        Segmentation can be divided into semantic segmentation and instance segmentation, as shown in the figure below:

[Figure: semantic segmentation vs. instance segmentation]

FCN (Fully Convolutional Networks)

        Paper: FCN
        FCN (Fully Convolutional Networks) is, as the name suggests, a network built entirely from convolutions. Compared with a network used for classification, FCN removes the global pooling layer and replaces all fully connected layers with convolutional layers, in order to obtain a dense prediction (a classification task assigns one label to the whole image, whereas segmentation classifies every pixel in the image).

         Notice that after repeated convolutions (and pooling) the feature maps become smaller and smaller and the resolution keeps dropping (a coarse image). How, then, does FCN obtain a class for every pixel of the image? To recover the original resolution from this coarse, low-resolution map, FCN uses upsampling. For example, after 5 convolution (and pooling) stages, the resolution of the image is reduced by factors of 2, 4, 8, 16 and 32 in turn. The output of the last layer therefore needs 32x upsampling to return to the original image size.

[Figure: FCN architecture]

         This upsampling is implemented with deconvolution (transposed convolution). Deconvolving the output of the 5th stage back to the original size (32x upsampling) is still not precise enough, and some details cannot be recovered. Deconvolving the outputs of the 4th and 3rd stages as well, which need 16x and 8x upsampling respectively, gives finer results. The figure below shows this convolution and deconvolution-upsampling process:

[Figure: the convolution and deconvolution-upsampling process (FCN-32s/16s/8s)]
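        As a hedged illustration of this fusion idea, here is a minimal Keras sketch of an FCN-8s-style head (not the authors' Caffe code; the pool3/pool4/pool5 shapes and the class count are assumptions):

```python
from keras.layers import Input, Conv2D, Conv2DTranspose, Add
from keras.models import Model

n_classes = 21  # e.g. PASCAL VOC

# assumed encoder outputs at 1/8, 1/16 and 1/32 of a 512x512 input
pool3 = Input((64, 64, 256))
pool4 = Input((32, 32, 512))
pool5 = Input((16, 16, 512))

score5 = Conv2D(n_classes, 1)(pool5)                   # coarse 1/32 prediction
up5 = Conv2DTranspose(n_classes, 4, strides=2, padding='same')(score5)
fuse4 = Add()([up5, Conv2D(n_classes, 1)(pool4)])      # refine with 1/16 features
up4 = Conv2DTranspose(n_classes, 4, strides=2, padding='same')(fuse4)
fuse3 = Add()([up4, Conv2D(n_classes, 1)(pool3)])      # refine with 1/8 features
out = Conv2DTranspose(n_classes, 16, strides=8, padding='same',
                      activation='softmax')(fuse3)     # final 8x upsampling
fcn8s_head = Model([pool3, pool4, pool5], out)
fcn8s_head.summary()
```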

SegNet

        Paper: SegNet

        SegNet transfers the max-pooling indices to the decoder, which improves segmentation resolution. Where FCN copies the full encoder feature maps, SegNet copies only the max-pooling indices, which makes SegNet more memory-efficient than FCN.

[Figure: SegNet architecture]
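        A hedged sketch of the idea in plain TensorFlow 2 ops (not the authors' implementation): the encoder records the argmax indices of each max-pooling, and the decoder scatters values back to exactly those positions instead of copying whole feature maps:

```python
import tensorflow as tf

def pool_with_indices(x):
    # x: [N, H, W, C]; also returns the flattened positions of each maximum
    return tf.nn.max_pool_with_argmax(
        x, ksize=2, strides=2, padding='SAME', include_batch_in_index=True)

def unpool_with_indices(pooled, indices, output_shape):
    # scatter the pooled values back to the recorded positions; every other
    # position stays zero, so only the indices need to be stored
    n = output_shape[0] * output_shape[1] * output_shape[2] * output_shape[3]
    flat = tf.scatter_nd(tf.reshape(indices, [-1, 1]),
                         tf.reshape(pooled, [-1]), [n])
    return tf.reshape(flat, output_shape)

x = tf.random.normal([1, 8, 8, 3])
pooled, idx = pool_with_indices(x)                  # 1x4x4x3 plus indices
y = unpool_with_indices(pooled, idx, [1, 8, 8, 3])  # sparse 1x8x8x3 map
```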

U-Net

        Paper: U-Net

[Figure: U-Net architecture]

        U-Net simply concatenates the encoder feature maps onto the upsampled decoder feature maps at every stage, forming a ladder-like structure. The architecture is very similar to the Ladder Network family of architectures.

Fully Convolutional DenseNet

        Paper: Fully Convolutional DenseNet

[Figure: Fully Convolutional DenseNet architecture]

        The Fully Convolutional DenseNet uses DenseNet as its base encoder and, in a manner very similar to U-Net, concatenates encoder and decoder features at every level.
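        For intuition, here is a minimal Keras sketch of a dense block as I read it from the FC-DenseNet paper (the layer count and growth rate are assumptions): each layer receives the concatenation of all previous feature maps, and the block outputs the concatenation of the newly produced ones:

```python
from keras.layers import Input, Conv2D, concatenate, BatchNormalization, Activation
from keras.models import Model

def dense_block(x, n_layers=4, growth_rate=16):
    new_feats = []
    for _ in range(n_layers):
        y = BatchNormalization()(x)
        y = Activation('relu')(y)
        y = Conv2D(growth_rate, 3, padding='same')(y)
        new_feats.append(y)
        x = concatenate([x, y])    # the next layer sees everything so far
    return concatenate(new_feats)  # block output: only the new features

inp = Input((64, 64, 48))
Model(inp, dense_block(inp)).summary()
```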

E-Net

        Paper: E-Net

[Figure: ENet architecture]

        In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low-latency operation. ENet is up to 18x faster, requires 75x fewer FLOPs, has 79x fewer parameters, and provides similar or better accuracy than existing models. We tested it on the CamVid, Cityscapes and SUN datasets, report comparisons with existing state-of-the-art methods, and examine the trade-off between accuracy and processing time.

Mask R-CNN

        Paper: Mask R-CNN

[Figure: Mask R-CNN architecture]

        Mask R-CNN adds an auxiliary branch on top of Faster R-CNN to perform semantic segmentation of each detected instance.
        The per-instance RoIPool operation is replaced by RoIAlign, which avoids the spatial quantization of feature extraction, because keeping the spatial features intact at the highest resolution is important for segmentation.
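        The spirit of RoIAlign can be sketched, as a hedged approximation, with tf.image.crop_and_resize, which bilinearly samples a box from the feature map without rounding its coordinates to the grid (the real op additionally averages several sample points per output bin):

```python
import tensorflow as tf

feats = tf.random.normal([1, 64, 64, 256])       # backbone feature map
boxes = tf.constant([[0.1, 0.2, 0.5, 0.7]])      # [y1, x1, y2, x2], normalized
roi = tf.image.crop_and_resize(
    feats, boxes, box_indices=tf.constant([0]),  # which batch image each box is from
    crop_size=(14, 14))                          # fixed-size features per RoI
print(roi.shape)                                 # (1, 14, 14, 256)
```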

PSPNet

        Paper: PSPNet

[Figure: PSPNet architecture with the pyramid pooling module]

        In the paper, we exploit the capability of global context information by aggregating context from different regions through our pyramid pooling module, together with the pyramid scene parsing network (PSPNet). Our global prior representation produces good-quality results on the scene parsing task, while PSPNet provides a superior framework for pixel-level prediction. The method achieves state-of-the-art performance on multiple datasets: it ranked first in the 2016 ImageNet scene parsing challenge, the PASCAL VOC 2012 benchmark and the Cityscapes benchmark.

        PSPNet modifies the base ResNet architecture with dilated convolutions: after the initial pooling, features are processed at the same resolution throughout the encoder network (1/4 of the original input) until they reach the spatial pyramid pooling module.

        An auxiliary loss is introduced at an intermediate layer of the ResNet to improve overall optimization.

        Spatial pyramid pooling on top of the modified ResNet encoder aggregates global context.

[Figure: receptive fields vs. layer sizes]

        The figure illustrates how important global spatial context is for semantic segmentation; it shows the relationship between receptive fields and layer sizes.
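        A hedged Keras sketch of the pyramid pooling module (the 1/2/3/6 bin sizes are from the paper; the channel count and feature-map size here are assumptions):

```python
from keras.layers import Input, AveragePooling2D, Conv2D, UpSampling2D, concatenate
from keras.models import Model

def pyramid_pooling_module(x, feat_hw, bins=(1, 2, 3, 6), depth=512):
    branches = [x]
    for b in bins:
        size = feat_hw // b
        p = AveragePooling2D(pool_size=size, strides=size)(x)  # b x b grid
        p = Conv2D(depth, 1, activation='relu')(p)             # reduce channels
        p = UpSampling2D(size, interpolation='bilinear')(p)    # back to feat_hw
        branches.append(p)
    return concatenate(branches)  # local features + multi-scale global context

feat = Input((60, 60, 2048))      # assumed encoder output
Model(feat, pyramid_pooling_module(feat, 60)).summary()
```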

RefineNet

        Paper: RefineNet

        The paper presents RefineNet, a generic multi-path refinement network that explicitly exploits all the information available along the downsampling process to enable high-resolution prediction using long-range residual connections. In this way, the deeper layers that capture high-level semantic features can be refined directly using fine-grained features from earlier convolutions. The individual components of RefineNet use residual connections following the identity-mapping idea, which allows effective end-to-end training.

[Figure: RefineNet architecture]

        Multiple input resolutions are used; the extracted features are fused together and passed on to the next stage.
        Chained residual pooling is introduced, which captures background context from a large image region: it efficiently pools features with multiple window sizes and fuses them using residual connections and learned weights (see the sketch after this list).
        All feature fusion is done by summation (ResNet style), allowing end-to-end training.
        Plain ResNet residual layers are used, without the computationally expensive dilated convolutions.
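        A minimal Keras sketch of chained residual pooling as I read it from the paper (window size and block count follow the paper's defaults; treat the details as assumptions):

```python
from keras.layers import Input, Activation, MaxPooling2D, Conv2D, Add
from keras.models import Model

def chained_residual_pooling(x, n_blocks=2, channels=256):
    out = Activation('relu')(x)
    path = out
    for _ in range(n_blocks):
        # stride-1 pooling keeps resolution while growing the receptive field
        path = MaxPooling2D(pool_size=5, strides=1, padding='same')(path)
        path = Conv2D(channels, 3, padding='same')(path)
        out = Add()([out, path])  # residual fusion with learned conv weights
    return out

feat = Input((32, 32, 256))
Model(feat, chained_residual_pooling(feat)).summary()
```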

G-FRNet

        Paper: G-FRNet

        The paper proposes the Gated Feedback Refinement Network (G-FRNet), an end-to-end deep learning framework for dense labeling tasks that addresses the limitations of existing methods. G-FRNet first makes a coarse prediction and then progressively refines the details by efficiently integrating local and global contextual information during the refinement stages. Gate units are introduced to control the information passed forward, in order to filter out ambiguity.

[Figure: G-FRNet architecture]

[Figure: gated refinement unit]

        It is not certain that the information passed from a high-resolution (hard-to-discriminate) encoder layer to the corresponding upsampled decoder feature map is actually useful for segmentation. At every stage, the gated refinement feedback units therefore control the information flow from encoder to decoder, which helps the decoder resolve ambiguities and form a more relevant gated spatial context.

DecoupledNet

        Paper: DecoupledNet

        In contrast to existing approaches that pose semantic segmentation as region-based classification in a single task, our algorithm decouples classification and segmentation and learns a separate network for each task. In this architecture, the labels associated with an image are identified by the classification network, and binary segmentation is then performed for each identified label by the segmentation network. It effectively reduces the search space for segmentation by exploiting the class-specific activation maps obtained from bridging layers.

[Figure: DecoupledNet architecture]

        Decoupling the classification and segmentation tasks lets pre-trained classification networks be used plug and play.
        Bridging layers between the classification and segmentation networks generate class-salient feature maps (for the k identified classes), which are fed to the segmentation network to produce one binary segmentation map per class.
        However, segmenting k classes in one image requires k forward passes of this method.

Carvana_Image_Masking_Challenge

Introduction

Carvana Image Masking Challenge–1st Place Winner’s Interview

                Kaggle

[Figure: sample images from the competition]

       In this competition, you’re challenged to develop an algorithm that automatically removes the photo studio background. This will allow Carvana to superimpose cars on a variety of backgrounds. You’ll be analyzing a dataset of photos, covering different vehicles with a wide variety of year, make, and model combinations.

Dataset

Data analysis

File descriptions

  • /train/ - this folder contains the training set images
  • /test/ - this folder contains the test set images. You must predict the mask (in run-length encoded format) for each of the images in this folder
  • /train_masks/ - this folder contains the training set masks in .gif format
  • train_masks.csv - for convenience, this file gives a run-length encoded version of the training set masks (see the decoding sketch after this list).
  • sample_submission.csv - shows the correct submission format
  • metadata.csv - contains basic information about all the cars in the dataset. Note that some values are missing.
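       Since the masks are distributed run-length encoded, here is a hedged NumPy sketch of decoding and encoding them (the column-major pixel ordering follows Kaggle's usual convention; verify it against the competition's evaluation page):

```python
import numpy as np

def rle_decode(rle, shape=(1280, 1918)):
    # rle: 'start length start length ...' with 1-indexed pixel starts
    s = np.asarray(rle.split(), dtype=int)
    starts, lengths = s[0::2] - 1, s[1::2]
    mask = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for st, ln in zip(starts, lengths):
        mask[st:st + ln] = 1
    # pixels are usually numbered top-to-bottom, then left-to-right
    # (column-major), hence order='F'
    return mask.reshape(shape, order='F')

def rle_encode(mask):
    # inverse transform, for writing a submission
    pixels = np.concatenate([[0], mask.flatten(order='F'), [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1  # positions of changes
    runs[1::2] -= runs[0::2]                           # end - start = length
    return ' '.join(str(r) for r in runs)
```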

Keras_U-net

Network

Alt text

       The main idea is to supplement a usual contracting network by successive layers, where pooling operators are replaced by upsampling operators. Hence, these layers increase the resolution of the output. In order to localize, high-resolution features from the contracting path are combined with the upsampled output.

Loss

       The cross entropy:

$$E = \sum_{\mathbf{x} \in \Omega} w(\mathbf{x}) \log\left(p_{\ell(\mathbf{x})}(\mathbf{x})\right)$$

       where ℓ : Ω → {1, …, K} is the true label of each pixel and w : Ω → ℝ is a weight map introduced to give some pixels more importance in training.

       The Dice coefficient:

$$\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}$$

       where X is the predicted set of pixels and Y is the ground truth. The Dice coefficient is defined to be 1 when both X and Y are empty.
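       In Keras this is commonly implemented as a "soft" Dice on the predicted probabilities; the smoothing term below is an assumption that also makes the empty/empty case evaluate to 1, matching the definition above:

```python
from keras import backend as K

def dice_coef(y_true, y_pred, smooth=1.0):
    # soft Dice: overlap and sums are computed on probabilities
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2. * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    # minimized when the predicted mask matches the ground truth
    return 1. - dice_coef(y_true, y_pred)
```

       Either can be passed to model.compile as the loss, or combined with binary cross entropy.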

Code

       VGG

from keras.applications.vgg16 import VGG16 as PTModel

# t0_img is a batch of training images loaded earlier in the kernel;
# only its shape is used here. The VGG16 encoder is frozen.
base_pretrained_model = PTModel(input_shape=t0_img.shape[1:], include_top=False, weights='imagenet')
base_pretrained_model.trainable = False
base_pretrained_model.summary()

       Collect Interesting Layers for Model

from collections import defaultdict, OrderedDict
from keras.models import Model

# group the encoder layers by the spatial size of their outputs, so that
# one skip connection per resolution can be taken later
layer_size_dict = defaultdict(list)
inputs = []
for lay_idx, c_layer in enumerate(base_pretrained_model.layers):
    if not c_layer.__class__.__name__ == 'InputLayer':
        layer_size_dict[c_layer.get_output_shape_at(0)[1:3]] += [c_layer]
    else:
        inputs += [c_layer]
# freeze dict
layer_size_dict = OrderedDict(layer_size_dict.items())
for k, v in layer_size_dict.items():
    print(k, [w.__class__.__name__ for w in v])

       Build the U-Net

from keras.layers import Input, Conv2D, concatenate, UpSampling2D, BatchNormalization, Activation, Cropping2D, ZeroPadding2D

x_wid, y_wid = t0_img.shape[1:3]
in_t0 = Input(t0_img.shape[1:], name='T0_Image')
# pretrained_encoder (defined earlier in the kernel) wraps the frozen VGG16 so
# that it returns one output per feature-map size in layer_size_dict
wrap_encoder = lambda i_layer: {k: v for k, v in zip(layer_size_dict.keys(), pretrained_encoder(i_layer))}

t0_outputs = wrap_encoder(in_t0)
lay_dims = sorted(t0_outputs.keys(), key=lambda x: x[0])
skip_layers = 2
last_layer = None
for k in lay_dims[skip_layers:]:
    cur_layer = t0_outputs[k]
    channel_count = cur_layer._keras_shape[-1]
    cur_layer = Conv2D(channel_count // 2, kernel_size=(3, 3), padding='same', activation='linear')(cur_layer)
    cur_layer = BatchNormalization()(cur_layer)  # gotta keep an eye on that internal covariate shift
    cur_layer = Activation('relu')(cur_layer)

    if last_layer is None:
        x = cur_layer
    else:
        last_channel_count = last_layer._keras_shape[-1]
        x = Conv2D(last_channel_count // 2, kernel_size=(3, 3), padding='same')(last_layer)
        x = UpSampling2D((2, 2))(x)
        x = concatenate([cur_layer, x])  # U-Net skip: merge encoder features with the upsampled decoder features
    last_layer = x
# dm_img is the batch of training masks; its channel count sets the output depth
final_output = Conv2D(dm_img.shape[-1], kernel_size=(1, 1), padding='same', activation='sigmoid')(last_layer)
# crop and re-pad the borders to suppress edge artifacts
crop_size = 20
final_output = Cropping2D((crop_size, crop_size))(final_output)
final_output = ZeroPadding2D((crop_size, crop_size))(final_output)
unet_model = Model(inputs=[in_t0], outputs=[final_output])
unet_model.summary()

Result

[Figure: example predictions from the Keras U-Net]

Pytorch_U-net

       Relevant blog: Carvana Image Masking Challenge–1st Place Winner’s Interview

Introduction

Artsiom’s approach

First network: UNet from scratch
       Input images were resized to 1024x1024, and the predicted masks were upscaled back to the original resolution at the inference step.

       When calculating the BCE loss, each pixel of the mask was weighted according to its distance from the boundary of the car, a trick proposed by Heng CherKeng: pixels on the boundary had a 3x larger weight than pixels deep inside the car region, as in the sketch below.
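       A hedged PyTorch sketch of that trick (the 3x boundary weight is from the write-up; the boundary width and the distance-transform construction are my assumptions):

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def boundary_weight_map(mask, w_boundary=3.0, width=5):
    # mask: (H, W) binary numpy array; the distance of every pixel to the
    # car/background boundary is the sum of the two distance transforms
    dist = distance_transform_edt(mask) + distance_transform_edt(1 - mask)
    w = np.ones_like(dist, dtype=np.float32)
    w[dist <= width] = w_boundary   # upweight a band around the boundary
    return w

def weighted_bce(logits, target, weights):
    # per-pixel weights broadcast into the elementwise BCE
    return F.binary_cross_entropy_with_logits(logits, target, weight=weights)

mask = np.zeros((8, 8), dtype=np.uint8); mask[2:6, 2:6] = 1
w = torch.from_numpy(boundary_weight_map(mask))
logits = torch.zeros(8, 8)
loss = weighted_bce(logits, torch.from_numpy(mask).float(), w)
```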

Second network: UNet-VGG-11

Vladimir’s Approach

       The original images had resolution (1918, 1280) and were padded to (1920, 1280), so that each side is divisible by 32 (five pooling stages each halve the resolution).
       Another approach, used by other participants, was to downscale the input images, but this could lead to some loss in accuracy. Since the scores were so close to each other, I did not want to lose a single pixel in this transformation (recall the 0.000001 margin between first and second place on the private leaderboard).

       In the model I used the following loss function:

$$L = H - \log J$$

       where H is the binary cross entropy and J is the soft Jaccard (intersection over union) between the predicted and ground-truth masks, as also described in the follow-up TernausNet paper.

Alexander’s approach

       I started to look for new architectures and found a machine learning training video showing how to use LinkNet for image segmentation. I found the source paper and tried it out.

LinkNet is a classical encoder-decoder segmentation architecture with the following properties:

       1. As an encoder, it uses different layers of lightweight networks such as ResNet 34 or ResNet 18.
       2. The decoder consists of 3 blocks: a 1x1 convolution with n // 4 filters, a 3x3 transposed convolution with stride 2 and n // 4 filters, and finally another 1x1 convolution to match the number of filters with the input size (see the sketch after this list).
       3. Encoder and decoder layers with matching feature map sizes are connected through a plus operation. I also tried concatenating them along the filter dimension and using a 1x1 convolution to decrease the number of filters in the following layers - it works a bit better.
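       A hedged PyTorch sketch of the decoder block described in point 2 (my own reading of the LinkNet paper, not Alexander's code):

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    # 1x1 conv to n//4 channels, 3x3 transposed conv with stride 2 (2x
    # upsampling), then 1x1 conv to the requested output width
    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = in_ch // 4
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid, mid, kernel_size=3, stride=2,
                               padding=1, output_padding=1),  # doubles H and W
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```

       For example, DecoderBlock(256, 128) halves the channels while doubling the resolution, matching the stage widths of a ResNet-34 encoder.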

[Figure: LinkNet architecture]

Carvana-challenge

       GitHub: Carvana-challenge

Feedback and Suggestions
