Unsupervised Segmentation

目次

Introduction:

Unsupervised training is an inherently difficult problem with negligible precedent. Even though it is considered the future of “AI” by big leagues such as Yoshio Bengio [1] and Yann Lecun, it is currently a difficult topic exacerbated further when dealing with segmentation. The success of modern deep learning algorithms heavily rely on supervision signals and without this supervisory signal there is no feedback on how well the algorithm is doing and thus no way to correct it.

Annotation is an expensive and time consuming process especially in dense pixel segmentation. In this research blog the goal would be to explore options, frameworks and algorithms which would allow segmentation to be done without manual pixel wise labels. An ideal solution to the problem would be that it includes the following properties:

  • Discernible and fairly straight forward to grasp
  • Trained from scratch (i.e does not used pretrained models from imagenet)
  • Can be extended into a semi supervised setting

The dataset chosen to explore unsupervised segmentation is the Spacenet dataset. The reason is that it is open-sourced and contains a large amount of samples which would be needed for unsupervised training. Firstly, the available literature will be explored and discussed following this a direction will be taken and investigated for its efficacy. Finally some thoughts and final remarks will be presented.

Available Literature:

As mentioned above there is little precedent for unsupervised segmentation using deep learning. In general unsupervised and semi-supervised techniques tend to focus on first performing representational learning which is fully unsupervised followed by some form of clustering in the embedded space in the former and fine-tuning on the down stream task in the latter. While this is an excellent research direction there are many pit falls such as the representation learnt might not be suited for the downstream task. Self supervised is definitely an interesting avenue to purse however to date most algorithms are only shown through toy example or specific domain examples, generality is often an issue. As such only unsupervised segmentation literature will be presented.

An interesting non-deep learning approach [6] first does contour detection via multi scale local brightness, color, and texture cues to form a powerful globalization framework using spectral clustering. They then link this contour detector with a generic grouping algorithm. They use normalized cuts from spectral theory to form regions from the contour detection and subsequently group from there. In [4] they form an unsupervised architecture by concatenating two U-net models together. The intermediate representation is the segmentation. They incorporate a soft version of the normalized cut loss so as to have some sort of consistency and smoothness in the segmentation layer. They use CRFs to perform post processing. In [3] they implement an expectation maximization like algorithm whereby features are first extracted by a CNN then each pixel’s embedding are grouped with a superpixel refinement strategy. Grouping of the superpixels is done via hyperparameters and the features from the CNN. Like EM ist follows a label assignment and then an update to the features weights. Both [4] and [3] operate on the BSD500 dataset. In [4] they even compared their results to [6] and show that it only performs equally and in [3] they do not compare but visually it seems to perform the same if not worse.

An interesting paper [5] from NeurIPS uses the idea of scene composure to perform segmentation. They implement a fairly complex GAN architecture in which a segmenter network is trained to segment parts of an image in which a generator then fills the masked part. The discriminator is then trained to of course distinguish real from fake. An interesting point is how they ensure the segmenter network does not produce blank masks by enforcing that the generators output image must contain the noise information used to generate it, similar to infoGAN. It can take a while to work through the paper as the architecture is fairly complex yet the idea is a very promising direction as it allows it to work in an agnostic manner by theoretically being able to segment very different objects. It was shown to work on two relatively simple datasets however of course any GAN training is difficult. From the NeurIPS reproducibility challenge 2 papers [2][8] attempted to reproduce the results in a different deep learning framework. They pointed out several issues which were rectified with the original authors but they were unable to reproduce their results.

In [7] they propose a straight forward approach to unsupervised classification and segmentation. It is based on maximising the mutual information between two samples. They prove it first via classification and then extend it to segmentation. If there is a data pair which contains the same object the goal would be to learn a function which preserves what is in common between the two while discarding instance-specific details. The former can be achieved by maximising the mutual information between the outputs of the function. While the latter can be achieved by using a neural network with a small output capacity such as classes. Without a bottleneck the former can be achieved by just setting the function to the identity as this would be the maximum of the mutual information between the two samples. Mutual information expands to I(z, z^’) = H(z) − H(z|z^’). So maximising the mutual information function is a trade off between minimizing the conditional entropy term, H(z|z^’),  and maximizing the entropy term H(z). The smallest value of the conditional entropy term is obtained when the cluster assignments are exactly predictable from each other [7]. The reader is referred to the paper and its supplementary material for more information as to why this avoids degenerate solutions.

Chosen Direction:

Out of the available literature explored the approach which stands out as not only interesting but also feasibly promising is the approach in the paper [7] Invariant Information Clustering for Unsupervised Image Classification and Segmentation. It satisfies the criteria listed in the Introduction and it can be extended into the semi-supervised setting fairly easily. The training pipeline consists of data pairs which are formed using the original image and a transformed version. The batch of data pairs are fed through the network with shared weights to output a softmax over a predefined number of classes. The outputs are fed into the objective function and backprograted through the network. Segmentation is essentially per pixel classification and thus in order to compare the outputs of the input data pairs output must be transformed back into the original input space so as to preserve spatial consistency.

Overall training pipeline [7]

For example, in the unsupervised setting we would ideally like to segment buildings, roads and vegetation into three classes. However, in order to enable the network to learn a rich set of features it must have the capacity to separate the image into more classes such as buildings, trees, roads, cars, lakes etc. and so another output head is added which has a greater capacity which is referred to as overclustering.

Objective Function Explained:

The mutual information function between two random variables is given as:

Since the last layer of the model is a softmax, the output Φ(x) is a distribution of a discrete random variable, z, over C classes:

The goal is to maximize the mutual information between sample pairs in a batch.

So after marginalization over batch, the joint probability distribution is given a the C × C matrix:

Plugging in this equation into the mutual information function results with the following objective function to maximise:

In order to extend this to segmentation there are a few more tricks needed such as reverting the transformation in the softmax output so as to keep spatial consistency. Please read the author’s paper and supplementary section for more information and look at their code implementation of the loss function [7].

The model:

The model is a standard VGG net style architecture, it is shown below:

  • 1 x conv @ 64 with kernal = 3 and dilation = 1
  • 1 x conv @ 128 with kernal = 3 and dilation = 1
  • 1 x MaxPool 2D
  • 2 x conv @ 256 with kernal = 3 and dilation = 2
  • 2 x conv @ 512 with kernal = 3 and dilation = 2
  • (Main Head) bilinear upsample, 1 x conv @ 512 -> 3 with kernal = 1
  • (Over-clustering Head) bilinear upsample, 1 x conv @ 512 -> 24 with kernal = 1

As you can see the different heads take the encoding, upsample it to the original image resolution and reduce the channels, via 1×1 convolutions, to the desired number of classes to separate into.

Results:

The SpaceNet dataset consists of a large corpus of multi-spectral pan-sharpened satellite images but only the annotations are provided through challenges and so it is not possible to have access to both the building and road annotations at the same time. Thus the v2 building segmentation challenge is chosen specifically the Las Vagas area. The data is preprocessed to only include the RGB and IR channels for simplicity. The number of training samples with resolution 650×650 after post-processing is ~2300 and the number of test samples is ~700.

The main head (3 class output) would ideally output the three main classes being buildings, roads, and vegetation which then need to be matched to the ground truth with a one to one matching in order to calculate the accuracy performance. However, since only the building mask is available, it must be a many to one mapping. I.e building->building and roads&vegetation->not building. The same would go for the over-clustering head (24 class output) which will most likely contain multiple classes that represent the building ground truth class. This matching can be done manually by looking at the outputs but this would be time consuming so a many to one matching algorithm is used based on only 50 ground annotated examples.

The accuracy of the main head is ~79% whereas the over-clustering head is ~82%. As a comparison the model trained in a supervised fashion with 10% of the available data reaches ~90%. Below are some non-cherry picked outputs:

Top, left to right : RGB Input,  Main Head Class Output, Over-Clustering Class Output

Bottom, left to right : Building Groundtruth, Main Head Matched Output, Over-Clustering Matched Output

Final Remarks:

I thought I would end this blog post with some final remarks about what I observed whilst implementing and testing different ideas.

– The loss function tends to more base separating classes based on colours. This coincides with the author’s findings whereby they incorporate two additional sobel edge pre-processed channels as inputs to the model. However, like the author’s findings this only seemed to improve the classification part of the paper and not the segmentation. It is most likely due to the different types of datasets used. For example, satellite images contain significantly more edges than say the STL-10 classification dataset. Out of curiosity I decided to pre-process the images with ZCA whitening as this decorrelates the pixels and forces the model to learn higher level distinguishing features, which is a common approach to unsupervised reconstruction such as auto-encoders. However, this actually made the performance worse which supports the notion of it learning to distinguish mainly between colours and not semantic objects in the scene.

– Following the idea of unsupervised reconstruction, I drew inspiration from the W-Net paper in which they concatenate another model to take the pixel-wise class encoding outputs and try to reconstruct the original image. The idea is that it forces the model to learn semantic object classes because it has to use the class map to fill in the “textures” in order to make it similar to the input image. Additionally I also included the soft normalized cut loss to force the model to learn class boundaries that better match the input image better. Including these losses did not hinder the performance of the model however, what can be noticed in the class encoded output is that a number of the classes appear to be a scattered within the main object classes. Not including the soft normalized cut loss causes the encoding to be less spatial consistent. This is in line with the W-Net paper in which they have to use CRFs and hierarchical grouping algorithms to produce consistent class encodings.  An approach not tried in this blog post would be to include an additional loss which is based on the MSE between adjacent pixel’s class encodings. This would force the encoding to trade off between spatial consistent class encoding and grouping semantically similar objects through the mutual information loss or reconstruction loss.

– The W-Net paper and the ReDo paper are both based on a similar idea of scene composition, meaning a scene should be able to be separate in non-overlapping regions representing different objects. The difference between the two is the ReDO takes on an adversarial approach to take the object classes to “reproduce” the input. This allows the network to be less susceptible to the same object in different colour and texture forms. However, the major draw back apart from the mode collapse caused from the adversarial training is the segmenter network part which can struggle to initially produce semantically consistent outputs. As was mentioned in the introduction they use a few tricks to help guide the network but the training still remains extremely difficult especially with more complex datasets. An interesting avenue to pursue would be incorporating the mutual information loss at the output of the segmenter network part which could help to reduce the initial mode collapses.

References:

[1]  https://www.ibm.com/watson/advantage-reports/future-of-artificial-intelligence/yoshua-bengio.html

[2]  https://openreview.net/pdf?id=Bye09vnGpB

[3] Unsupervised image segmentation by backpropagation, Kanezaki, Asako

[4] W-net: A deep model for fully unsupervised image segmentation, Xia, Xide and Kulis, Brian

[5] Unsupervised object segmentation by redrawing, Chen, Micka and Arti`eres, Thierry and Denoyer, Ludovic

[6] Contour detection and hierarchical image segmentation, Arbelaez, Pablo and Maire, Michael and Fowlkes, Charless and Malik, Jitendra

[7] Invariant information clustering for unsupervised image classification and segmentation, Ji, Xu and Henriques, Jaao F and Vedaldi, Andrea

[8] Reproducibility Challenge@ NeurIPS 2019: Unsupervised Object Segmentation by Redrawing, Chmielewski-Anders, Adrian M and Steinweg, Mats and Straathof, Bas T

Other blog

Deep Learning for Image Segmentation of Tomato Plants

MotivationAt Incubit we are always looking for new ways to apply AI and machine learning technology to interesting problems. A recent project has found us applying image segmentation to the problem of identifying parts of a tomato plant. Tomate pruning is a technique often used to increase the yield of tomato plants by removing small, non-tomato-blooming branches from the plant. Here we describe a method to determine the location of prunable tomato branches, as well as critical parts of the tomato plant which should not be touched, such as primary trunks, branches supporting tomatoes, and the tomatoes themselves.Figure 1: Pruning non-critical branches from a tomato plant. Photo courtesy of gardeningknowhow.com.In this project we apply techniques for Image Segmentation to locate objects of interest. The goal of this analysis method is to locate different segments, or contiguous sets of pixels, within an image which denote some meaningful entity within the image. There are various ways to accomplish this using both computer vision and model-based approaches. We opted for the supervised model-based approach, with the hopes of obtaining both higher accuracy and greater generalization to new images.Obtaining labeled dataThe first step in developing an image model is obtaining labeled training data. For this task we used Incubit’s AI Platform (IAP) to create segmentation labels. Figure 2 shows an example of how we annotated an image to show segments of four classes of interest: main trunk, sucker branch, tomato branch, and tomato.Figure 2: On left – raw image. On right – annotated image. Annotations are: red = main trunk, blue = sucker branch, purple = tomato branch, yellow = tomato.These annotations were stored as JSON files and used to create segmentation masks. Figure 3 shows an example of these masks, created from a crop of the above annotations, which was fed directly into the model as labels:Figure 3: Masks created from the annotated data. The white pixels in each mask act as the target for the segmentation of each respective class.The modelWe based on model architecture on SegNet, a well-known deep neural network which excels in image segmentation applications. Figure 4 shows a simplified overview of the architecture used.A high-level summary of the architecture is this: the original image is passed through a number of encoding blocks, each of which consists of several convolutional layers, batch normalization, and ReLU activation layers, followed by a pooling layer. The reduced features are then passed through a series of upsampling layers. Loss is computed from the cross entropy between the sigmoid output of the final convolutional layer and the segmentation targets (labels).TrainingTraining was performed with a constant learning rate of 0.00001 until there was no improvement in the test error rate for 10 consecutive epochs. A random parameter search yielded the following combination of optimal hyperparameters: 5 encoding and decoding blocks, 32 initial filters, a dropout rate of 0.25, and independent pooling indexes between the encoding and decoding layers.Image augmentation was used to both increase the size of the training pool and to help generalize the model. A combination of flips, crops, random noise, Gaussian blur, fine and coarse dropout, perspective transformations, piecewise affines, rotations, shears, and elastic transformations were used from the imgaug library to reach this end.Figure 5 shows an example of an annotated output frame produced by the trained model.Figure 5: Annotated output showing different segmentation classes of a tomato plant.Post-ProcessingOne of the expected outcomes of this project is the ability to automatically locate the origin and direction of branches growing from the main trunk. To do this, we can utilize the segmentation outputs for the trunk and branch classes. We wrote an algorithm to detect branches which are attached to the main trunk and to extract this information. The steps are:Use the connected components algorithm to identify individual branches and trunks within the image.For each possible pair of branch and trunk segments which partially overlap (full overlaps represent non-connected branch/trunk pairs), record these as connected pairs.For each connected pair, mark the base of the branch as the centroid of the overlapped region. This acts as the starting point of the branch direction vector.Define the end point of the branch direction vector as the halfway point of the shortest line connecting the following entities:The least-squares fit line of the branch segmentThe centroid of the branch segmentFigure 6 shows an example of a branch direction vector drawn from the base of the branch, at the trunk, to the midpoint of the branch:Figure 6: Drawing a branch direction vector.Branch vectors, along with the segmented classes, are superimposed on the original raw image.ResultHere is a video showing the results of this analysis on a tomato garden. Visible are the different class segments and the branch direction vectors.We look forward to applying this technology to other interesting and novel use cases.

Reinforcement Learning Research

MotivationReinforcement learning (RL) is a field of machine learning experiencing rapid change. At Incubit, part of our time is spent keeping up to date with the latest research so that we can deliver the best possible AI solutions to our customers. RL applications are not yet as prevalent as other areas of machine learning. However, the skills acquired while solving these problems are not only invaluable to have as an AI engineer, but also overlap heavily with these other areas.We chose to recreate some of the recent results produced for Atari games, both because it is fascinating and technically challenging, and because open-source simulations are available. The results shown are only for Space Invaders, but the agent trained on this game also performs well on other Atari games.Deep Q NetworkThe agent used is based upon a deep Q network architecture. This approach utilizes several powerful methods which allow the agent to choose an action given a raw image:Convolutional filtering to create a vector representation of features present in an imageCombination of consecutive framesAn intelligent replay mechanism which utilizes important memories to accelerate trainingA value iteration method (Q-learning) to map states to future rewardsSeparate determination of value and advantage within the networkUse of separate networks to choose an action and to generate target Q values for that actionArchitectureIn general, a Q value is a function of the value at a specific state and the advantage to be gained by following each of the possible actions. To obtain a state representation, we have to reduce the dimensionality of the input data. Each input is a set of 4 frames of size 210 x 160 pixels with 3 color channels. With 256 values per pixel, this gives us a state space of 10 ^ 971004. We make this problem tractable through the use of a 4-layer convolutional neural network, which reduces this input space to a vector of size 512, a much more manageable number.Multiple frames are utilized because it is generally not possible to determine the velocity of moving objects from a single frame. See two consecutive frames below:By running multiple frames through the network, we can obtain a state representation not only of the current frame, but also of the velocity and acceleration of objects within the frame.Here is a representation of the networks used to generate actions from a batch of consecutive frames:The output of the last convolutional layer is fully-connected to a vector of size 512. This is split down the middle, and each resulting vector of size 256 is multiplied by a weight matrix to obtain the value and advantage functions. These are combined to obtain the final Q-value vector of length 6 for each of the actions. The action taken by the agent corresponds to the maximum Q-value in this output vector.The second Target network is introduced for training stability. For each training step, the Main network generates the action to choose, and the Target network generates the target Q value for that action. The Main network is trained every step by minimizing the loss between generated and target Q values, and the Target network is updated with the network weights of the Main network every 500 steps.Experience ReplayAn experience buffer along with a prioritized replay scheme was used to accelerate training. The buffer consists of a maximum of 100,000 individual game frames, along with the action taken and the reward realized at that frame. During a network training step, a batch of replay examples is sampled from the buffer corresponding to a priority value assigned to each entry.This value is a function of the difference between the Q values predicted by the network at time t-1 and the Q values predicted at time t. This is a measure of how “surprising” the transition is to the network. More surprising transitions are more likely to be chosen during the network training step.ExplorationExploration was achieved by using a simple random strategy: choose a random action with probability e. The value e was reduced from 1.0 to 0.025 linearly over 100,000 frames. Maintaining this minimum exploration helps to prevent the agent from settling on a sub-optimal policy. For Space Invaders, it is also useful for learning correct behavior near the end of a stage, where there is oftentimes only one alien ship left and it moves quickly. In this scenario, it is useful to learn both correct and incorrect behaviors (e.g. shooting and killing the alien vs. hesitating and letting it touch down).TrainingTraining steps were performed every 8 time steps on a batch of 32 frames sampled from the priority replay buffer. The target network was fully updated with the weights from the primary network every 500 time steps. Training was arbitrarily performed for 100,000,000 frames, which took about 32 hours of runtime on a Titan X GPU.ResultsTo get an idea of what the network is doing at every frame, we plotted the gameplay as well as the following information at every frame:ValueAdvantage for each of the 6 actionsA running percentage of the actions taken by the modelHA handful of the activations at each of the layers in the convolutional networkHere’s a video showing an agent, which was trained for 15 hours, playing a game of Space Invaders:Here are a few key takeaways from the visualizations:The value is a function of the amount of enemies on the map, the missiles fired by the avatar, and the location of the avatar. If the avatar moves beneath a shield, for example, the value decreases.Most actions constitute a move, a missile firing, or both. This means that the agent rarely spends any time simply waiting.Oftentimes the advantage of a certain action is only slightly higher than others. In the screenshot above, for example, the highest advantage is given to Move Left & Fire Missile. However, the constituent actions are also considered advantageous to the agent.The convolutional layers show glimpses into some of the features that the agent sees. Missiles being fired are visible as activated neurons, as are the enemy ships and the avatar itself.It has been a fascinating and challenging project, and has been delightful to catch a glimpse of what is happening inside of the trained network. In the future we hope to continue training AI agents to play Atari games, as well as other benchmark simulations.