In this article and the next, we will take a close look at two fascinating computer vision subfields: Image Segmentation and Image Super-Resolution.
Two years ago, after I had finished the Andrew Ng course, I came across one of the most interesting papers I had read on segmentation (at the time), entitled Bilateral Segmentation Network (BiSeNet). It in turn served as a starting point for this blog to grow, because a lot of you, my viewers, were also fascinated by and interested in the topic of semantic segmentation.
More we understand something, less complicated it becomes.
– Unknown
I did my best at the time to code the architecture, but to be honest, I knew little back then about how to preprocess the data and train the model. There were a lot of gaps in my knowledge: I understood semantic segmentation at a high level but not at a low level.
Real knowledge is to know the extent of one’s ignorance. – Confucius
Fig 1: These are the outputs from my attempts at recreating BiSeNet using TF Keras from 2 years ago 😂. A true work of art!!!
Pretty amazing, aren’t they? I knew this was just the beginning of my journey, and that eventually I would make it work if I didn’t give up, or perhaps I would use the model to produce abstract art.
With that said, this is a revised version of that article, which I have been working on recently thanks to the FastAI 18 Course.
We change from inputting an image and getting a categorical output to having images as both input and output. This is done by cutting off the classification head and replacing it with an upsampling path (architectures of this type are called fully convolutional networks).
Don’t worry if you don’t understand it yet; bear with me.
Something interesting happened during my testing. I’m not fully sure whether it is due to the new PyTorch v1 or fastai v1, but previously, for multi-class segmentation tasks, you could have your model output an image of size (H x W x 1). As you can see in Fig 6, the segmentation mask has shape (960 x 720 x 1) and holds one integer class index per pixel, starting at 0. With PyTorch v1 or fastai v1, your model must instead output something like (960 x 720 x Classes), otherwise the loss functions (nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss(), etc.) won’t work: you get a CUDA device-side assert error on GPU and a size mismatch error on CPU.
Fig 8. output tensor
The only case where I found outputting (H x W x 1) useful was segmentation on a mask with 2 classes, where you have an object and a background.
This happens because the loss functions now essentially one-hot encode the target image (the segmentation mask) along the channel dimension, creating a binary matrix (pixel values 0 or 1) for each possible class, and then compare it with the output of the model as a per-class binary classification. If that output doesn’t have the proper shape (H x W x C), you get an error.
This setup is only needed for training. Afterwards, during testing and inference, you can argmax the result along the class dimension to get back (H x W x 1), with pixel values that are class indices again.
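To make the shapes concrete, here is a minimal sketch in plain PyTorch. The tiny sizes and the variable names are made up for illustration, and note that PyTorch itself stores tensors channel-first, so (H x W x Classes) becomes (batch, Classes, H, W). It checks that nn.CrossEntropyLoss gives the same value as an explicit one-hot version of the target, then uses argmax to get a single-channel mask back.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Tiny, made-up sizes so the example runs instantly on CPU.
n_classes, height, width = 4, 6, 8

# Model output: one raw score (logit) per class and pixel, channel-first.
logits = torch.randn(1, n_classes, height, width)

# Target mask: one integer class index per pixel, no class dimension.
target = torch.randint(0, n_classes, (1, height, width))

# Training: the loss takes class scores and integer class indices.
loss = nn.CrossEntropyLoss()(logits, target)

# Same computation written out with an explicit one-hot target.
one_hot = F.one_hot(target, n_classes).permute(0, 3, 1, 2).float()
log_probs = F.log_softmax(logits, dim=1)
manual_loss = -(one_hot * log_probs).sum(dim=1).mean()

# Inference: collapse the class dimension into one index per pixel.
pred_mask = logits.argmax(dim=1)

print(loss.item(), manual_loss.item(), tuple(pred_mask.shape))
```

If the model only produced a single channel here, the loss would have nothing to compare the class indices against, which is where the shape errors come from.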
Fig 9. My outputs using the architecture described above
Unet
Fig 10. U-net Arch from the paper
This architecture consists of two paths: a downsampling path (left side) and an upsampling path (right side).
This approach works much better than the one described in the section above.
The main contribution of this paper is the U-shaped architecture: to produce better results, the high-resolution features from the downsampling path are combined (concatenated) with the equivalent upsampled output block, and a successive convolution layer can then learn to assemble a more precise output based on this information.
Another important modification to the architecture is the use of a large number of feature channels in the earlier upsampling layers, which allows the network to propagate context information to the subsequent higher-resolution upsampling layers.
Context information: information that provides a sufficient receptive field. In the semantic segmentation task, the receptive field is of great significance for performance.
This strategy allows the seamless segmentation of arbitrary size images.
Downsampling Path
The downsampling path can be any typical ConvNet architecture without the classification head, e.g. the ResNet family, Xception, MobileNet, etc. At each downsampling step, we double the number of feature channels (32, 64, 128, 256…).
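As a rough illustration of the channel doubling, here is a toy downsampling path in PyTorch. It is not a real backbone: the down_block helper, the single convolution per step, and the 64x64 input are assumptions I picked to keep the sketch short.

```python
import torch
import torch.nn as nn


def down_block(in_ch, out_ch):
    """One hypothetical downsampling step: conv, ReLU, then halve H and W."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )


# 3 input channels (RGB), then 32, 64, 128, 256 as in the text.
channels = [3, 32, 64, 128, 256]
encoder = nn.ModuleList(
    [down_block(c_in, c_out) for c_in, c_out in zip(channels, channels[1:])]
)

x = torch.randn(1, 3, 64, 64)
shapes = []
for block in encoder:
    x = block(x)
    shapes.append(tuple(x.shape))  # keep the shape after every step

for shape in shapes:
    print(shape)
```

Each step doubles the channels and halves the spatial size, so the last feature map here is (1, 256, 4, 4).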
Upsampling Path
Every step of the upsampling path consists of an upsampling followed by a 2x2 convolution (an “up-convolution”) that halves the number of feature channels (256, 128, 64), a concatenation with the correspondingly cropped (optional) feature map from the downsampling path, and two 3x3 convolutions, each followed by a ReLU.
The authors of the paper specify that cropping is necessary due to the loss of border pixels in every convolution, but I believe adding reflection padding can fix it, so cropping is optional. At the final layer, the authors use a 1x1 convolution to map each 64-component feature vector to the desired number of classes, although we don’t do this in the notebook you will find at the end of this article.
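Here is a minimal sketch of one upsampling step, following the cropped variant described above. UpBlock and center_crop are names I made up, I use a transposed convolution for the 2x2 up-convolution (one common choice, not necessarily what every implementation does), and I assume the skip feature map has the same number of channels as the upsampled one.

```python
import torch
import torch.nn as nn


def center_crop(feat, target_hw):
    """Crop a (N, C, H, W) feature map around its center to target_hw."""
    _, _, h, w = feat.shape
    th, tw = target_hw
    top, left = (h - th) // 2, (w - tw) // 2
    return feat[:, :, top:top + th, left:left + tw]


class UpBlock(nn.Module):
    """Hypothetical upsampling step: up-conv, crop + concat, two 3x3 convs."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 2x2 up-convolution: doubles H and W, halves the channels.
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        # Unpadded 3x3 convs, so each one loses a 1-pixel border.
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch * 2, out_ch, kernel_size=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)
        skip = center_crop(skip, x.shape[-2:])  # match spatial sizes
        return self.conv(torch.cat([skip, x], dim=1))


block = UpBlock(128, 64)
deep = torch.randn(1, 128, 16, 16)   # coarse features from below
skip = torch.randn(1, 64, 40, 40)    # larger map from the downsampling path
out = block(deep, skip)

# Final 1x1 convolution: 64 features per pixel -> one score per class.
n_classes = 5
scores = nn.Conv2d(64, n_classes, kernel_size=1)(out)
print(tuple(out.shape), tuple(scores.shape))
```

With padding=1 on the 3x3 convolutions (zero or reflection), the spatial size would be preserved and the crop would become a no-op, which is the sense in which cropping is optional.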
Fig 11. My outputs using a Unetish arch.
It’s a module that builds a U-Net dynamically from any model (backbone) pretrained on ImageNet. Since it’s dynamic, it can also automatically infer the intermediate sizes and the number of input and output features.
The difference from the original U-Net is that the downsampling path is a pretrained model.
This learner comes packed with most, if not all, of the image segmentation best-practice tricks to improve the quality of the output segmentation masks.
This learner is composed of:
Class DynamicUnet
Class UnetBlock
DynamicUnet
This U-Net sits on top of a backbone (which can be a pretrained model) and has a final output of n_classes. During initialization, it uses hooks to determine the sizes of the intermediate features by passing a dummy input through the model, and creates the upward path automatically.
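The dummy-input trick can be sketched with plain PyTorch forward hooks. This is not the fastai code, just the same idea under my own assumptions: the three-stage toy backbone, the record_shape function, and the 64x64 dummy input are all made up for illustration.

```python
import torch
import torch.nn as nn

# Toy stand-in for a backbone: three stages, each halving H and W.
backbone = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
)

feature_shapes = []


def record_shape(module, inputs, output):
    """Forward hook: remember the output shape of the hooked stage."""
    feature_shapes.append(tuple(output.shape))


# Attach one hook per stage, run a dummy batch, then clean up the hooks.
handles = [stage.register_forward_hook(record_shape) for stage in backbone]
backbone.eval()
with torch.no_grad():
    backbone(torch.zeros(1, 3, 64, 64))
for handle in handles:
    handle.remove()

for shape in feature_shapes:
    print(shape)
```

Once the shapes are known, the channel counts tell you how many input features each upsampling block needs for its cross-connection, without hard-coding anything about the backbone.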
Arguments:
Blur: a flag that enables blurring to avoid checkerboard artifacts at each layer.
Self_Attention: an attention mechanism is applied to selectively give more importance to some locations of the image than to others.
Bottle: determines whether we use a bottleneck for the cross-connection from the downsampling path to the upsampling path.
UnetBlock
A quasi-U-Net block that uses PixelShuffle upsampling and ICNR weight initialisation, both of which are best-practice techniques for eliminating checkerboard artifacts in fully convolutional architectures. Introduced in the .
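To give a feel for what PixelShuffle plus ICNR does, here is a small sketch in plain PyTorch. It is my own simplified version, not the fastai implementation: icnr_ and PixelShuffleUp are hypothetical names, and I zero the bias only so the effect is easy to check. The idea is that every group of scale² output channels starts with identical weights, so right after initialization the shuffled output looks like nearest-neighbour upsampling, with no checkerboard pattern.

```python
import torch
import torch.nn as nn


def icnr_(weight, scale=2, init=nn.init.kaiming_normal_):
    """ICNR-style init: each group of scale**2 output filters is identical."""
    out_ch, in_ch, kh, kw = weight.shape
    sub_kernel = torch.empty(out_ch // scale ** 2, in_ch, kh, kw)
    init(sub_kernel)
    with torch.no_grad():
        # Repeat each filter scale**2 times in a row along the output axis.
        weight.copy_(sub_kernel.repeat_interleave(scale ** 2, dim=0))
    return weight


class PixelShuffleUp(nn.Module):
    """Hypothetical upsampling layer: 1x1 conv, then nn.PixelShuffle."""

    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * scale ** 2, kernel_size=1)
        icnr_(self.conv.weight, scale)
        nn.init.zeros_(self.conv.bias)  # assumption: keeps the check exact
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))


torch.manual_seed(0)
up = PixelShuffleUp(in_ch=8, out_ch=4)
y = up(torch.randn(1, 8, 5, 5))
print(tuple(y.shape))

# Every 2x2 output block holds a single value right after initialization.
same_block = torch.allclose(y[:, :, 0::2, 0::2], y[:, :, 1::2, 1::2])
print(same_block)
```

During training the filters in a group are free to diverge, so the layer can still learn a sharper upsampling than nearest-neighbour.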
It uses hooks to store the output of each block needed for the cross-connection from the backbone model.
There are 3 key takeaways:
Thank you very much for reading; you are really amazing. I do this for you.