Train Image Segmentation Models to Accept User Feedback via Voronoi Tiling, Part 2

How to train an off-the-shelf image segmentation model to respond to user feedback

Florin Andrei
Towards Data Science


This is Part 2 of a series of articles about training image segmentation models so that the models respond to user feedback and adjust their predictions based on the feedback (mouse clicks).

In Part 1 we described the general strategy for training off-the-shelf image segmentation models to respond to user feedback. The problem identified at the end of Part 1 was that manually generating the clicks needed to train the models is tedious, time-consuming, and may not be feasible at all if the datasets are very large and/or the models need to be re-trained often. Generating the clicks needs to be automated — this is the topic of this article.

The Problem

Let’s take a look again at the problem we’re trying to solve:

source: Dataset of Breast Ultrasound Images

The left-hand frame is the image with the ground-truth label; the region of interest (RoI) was marked in yellow by a human expert; this is the ideal shape of the prediction we expect from the model. The center frame is the actual prediction from the model. The right-hand frame shows true positive areas (where the label and the prediction coincide), false positive areas (where the model predicts an RoI but there is no such thing in the label), and false negative areas (where the model predicts nothing, but there is an actual RoI). TP areas are shown in white, FP in green, and FN in red.

To steer the model’s predictions, we’ve placed positive (green) clicks in the TP and FN areas, and negative (red) clicks in the FP areas, and then trained new models with the clicks included in the images.

Placing the clicks is intuitively easy for a human operator. But if you break down the process into separate logical steps and criteria, it gets quite convoluted:

  • split TP, FP, FN areas into separate contiguous segments
  • discard very small segments as irrelevant
  • for each remaining segment, decide on the total number of clicks that will be placed within, based on the segment area
  • the clicks cannot be placed too close to each other
  • the clicks cannot be placed too close to the edge of the segment

The last two criteria are hard. The ambiguity (“not too close”) and the fact that these two criteria pull in opposite directions make it seem difficult to generate the clicks in a way that is guaranteed to converge to a solution imitating what a human operator would do.

However, we will show there is such a method, one that combines a mathematical concept (Voronoi tessellation) with hints from physics (energy and simulated annealing) to produce the desired result.

Voronoi Tessellation

The Wikipedia page explains the concept quite well, and this may seem familiar if you’ve studied the algorithm behind k-means clustering, but let me add a few words here as well.

In the left-hand frame, we have a square area with a few seed points. For any seed point, there must be a region in the frame (a tile) in which all pixels are closer to that seed than to all the other seeds. In the right-hand frame we show these tiles, color-coded to match the seeds. Each tile is called a Voronoi cell, and the result of the process of finding the tiles is called Voronoi tessellation.

The seeds in this example are chosen randomly, so the tessellation we get is not uniform. To get a uniform tessellation, the seeds would also have to be the centroids of their corresponding tiles (or close to the centroids) — this is called centroidal Voronoi tessellation. Here is one trivial example out of many:

To find the click coordinates (the seeds) that would lead to a centroidal Voronoi tessellation of an area, or an approximation of one, something like Lloyd’s algorithm could be used, and it’s very fast (it is the standard solver for k-means clustering). Here is a simulator for Lloyd’s algorithm that works in your browser in real time. But there are two problems:

  • Lloyd’s algorithm is generally used to tile rectangular areas. It is not clear how (or whether) it would generalize to the arbitrary area shapes we need to tile for our models.
  • We only want centroidal tiling when the shapes (area and tiles) are convex. When the shapes are concave, the centroids may fall outside the area we’re tiling, and that’s not what we want at all (the clicks would be outside the area).
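As a point of reference, a minimal Lloyd iteration on a rectangular area can be sketched in plain NumPy; each pass assigns every pixel to its nearest seed, then moves each seed to the centroid of its Voronoi cell. The sizes and seed count here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# pixel coordinates of a 100x100 rectangular area
ys, xs = np.mgrid[0:100, 0:100]
pixels = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)

# start with random seeds inside the area
seeds = rng.uniform(0, 100, size=(6, 2))

for _ in range(50):
    # assign every pixel to its nearest seed (its Voronoi cell)
    d = np.linalg.norm(pixels[:, None, :] - seeds[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    # move each seed to the centroid of its cell
    # (keep a seed in place if its cell happens to be empty)
    seeds = np.array([pixels[nearest == k].mean(axis=0)
                      if (nearest == k).any() else seeds[k]
                      for k in range(len(seeds))])
```

After a few dozen iterations the seeds settle near the centroids of roughly equal-sized tiles; this is exactly the behavior that does not carry over to arbitrary, possibly concave shapes.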

So we need something that works with arbitrary shapes, keeps the clicks inside the area even when the shapes are concave, and behaves like Lloyd’s algorithm for trivial rectangular areas: a generalization of centroidal Voronoi tiling that works with arbitrary areas, even concave ones. That’s the topic of the next section.

Simulated Annealing for Energy

Consider this tiling:

The click distribution is uniform enough, and the click coordinates are not far from the centroids (all shapes are convex). It’s not a bad distribution for our purposes. Can we find an objective function of the click coordinates that, when maximized or minimized iteratively, would lead us to such a distribution?

Let’s look at the space around the clicks:

heat map of pixel energy

Imagine every pixel in the image is assigned an “energy”. Only one click contributes to the pixel’s energy — the closest click. All other clicks do not contribute at all. The energy of any pixel is inversely proportional to its distance to its nearest click. So to find the energy of any pixel, we need to:

  • find the nearest click
  • calculate the distance from the pixel to the click
  • calculate the inverse of the distance, which is the pixel’s energy

The image shown above is just the heat map of pixel energies, given the click distribution. The edges of the Voronoi tiles are already suggested by the darkest areas mid-way between neighboring clicks.
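The energy definition above translates almost directly into code. This is a sketch, not the article's implementation; the +1 in the denominator is my own choice to avoid division by zero at the click pixels themselves:

```python
import numpy as np
from scipy.spatial import cKDTree

def energy_map(shape, clicks):
    """Energy of each pixel: inverse distance to its nearest click.
    (The +1 in the denominator avoids division by zero at the click
    locations -- an illustrative choice, not the article's code.)"""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    pixels = np.stack([ys.ravel(), xs.ravel()], axis=1)
    # distance from every pixel to its nearest click
    dist, _ = cKDTree(clicks).query(pixels)
    return (1.0 / (dist + 1.0)).reshape(shape)

clicks = np.array([[20, 20], [20, 60], [60, 40]])
emap = energy_map((80, 80), clicks)
```

Plotting `emap` as an image reproduces a heat map like the one above: bright peaks at the clicks, with the darkest valleys midway between neighboring clicks.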

If we calculate the total energy of all pixels, and then move the clicks around, seeking the click positions that give the highest total energy, would that lead to a uniform tiling of the area? As a matter of fact, that’s exactly how the click distribution shown above was obtained:

  • start with completely random click coordinates
  • compute the total energy of all pixels in the region of interest
  • apply simulated annealing to the click coordinates, so as to find the coordinates that maximize total energy
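The loop above can be sketched as follows: propose a small random move for one click, always accept moves that increase total energy, accept energy-decreasing moves with a temperature-dependent probability, and cool down. The step size, click count, and cooling schedule are illustrative, not the article's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def total_energy(mask_pixels, clicks):
    """Sum over segment pixels of 1/(1 + distance to nearest click)."""
    d = np.linalg.norm(
        mask_pixels[:, None, :] - clicks[None, :, :], axis=2).min(axis=1)
    return (1.0 / (1.0 + d)).sum()

# the segment is a 40x40 square, for illustration only
ys, xs = np.mgrid[0:40, 0:40]
mask_pixels = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)

# start with completely random click coordinates
clicks = rng.uniform(0, 40, size=(4, 2))
e = total_energy(mask_pixels, clicks)

temp = 1.0
for step in range(1000):
    cand = clicks.copy()
    i = rng.integers(len(clicks))
    # jitter one click, keeping it inside the segment
    cand[i] = np.clip(cand[i] + rng.normal(0, 2, size=2), 0, 39)
    e_cand = total_energy(mask_pixels, cand)
    # accept improvements always; worse moves with Boltzmann probability
    if e_cand > e or rng.random() < np.exp((e_cand - e) / temp):
        clicks, e = cand, e_cand
    temp *= 0.997  # cool down
```

For a real segment, `mask_pixels` would be the coordinates of the segment's pixels (e.g. from `np.argwhere` on its mask) and the clip step would be replaced by rejecting moves that leave the segment.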

The full code is shown here. The algorithm is robust and can handle concave shapes just fine — here is an example of placing clicks in a visually uniform way within a sickle-shaped segment:

placing clicks in a concave shape

The segment is the lighter shade of blue, the shape of a sickle, that stands in contrast with the dark background. The clicks are the brightest spots.

The clicks do not get too close to the edge of the segment, because that would reduce the total energy (pixels beyond the segment’s edge have no energy). They do not get too close to each other, because that would not “energize” pixels far from the tight group of clicks. The algorithm is self-regulating.

Note: this problem has similarities to the cell phone tower coverage problem, where you try to place N cell phone towers on a map such that the signal is as strong as possible in most areas.

Back to Segmentation Models

To recap, we’re trying to train image segmentation models to make them responsive to user feedback (mouse clicks). The overall process is:

  • split the image dataset into 5 folds
  • train a segmentation model for each fold; this produces the set of 5 baseline models
  • use the baseline models to make predictions on all images; each model makes predictions on images it has not seen in training
  • compare predictions with labels; extract all areas containing true positive, false positive, false negative predictions
  • split the TP, FP, FN areas into contiguous segments; discard the smallest segments (less than 100 pixels or so)
  • for each segment, generate uniform clicks as shown in this article; the number of clicks depends on the area of the segment: bigger segments receive more clicks, up to a reasonable limit (4 … 5 or so, for images 512x512 pixels in size)
  • move all image information into the B channel, vacate the R and G channels, embed the clicks into the R and G channels; clicks for the TP and FN areas are in the G channel (positive clicks); clicks for the FP areas are in the R channel (negative clicks)
  • using the click-augmented images, train a new set of 5 models on the same folds; these are the click-trained models, which are the final artifact of this whole project
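The channel-packing step can be sketched as follows, assuming grayscale ultrasound images (so the whole image fits into the B channel); `embed_clicks`, the disc radius, and the click coordinates are illustrative, not the article's code:

```python
import numpy as np

def embed_clicks(gray, pos_clicks, neg_clicks, radius=3):
    """Pack a grayscale image into the B channel and draw positive
    clicks (TP/FN) in G, negative clicks (FP) in R, as small discs."""
    h, w = gray.shape
    rgb = np.zeros((h, w, 3), dtype=np.uint8)
    rgb[..., 2] = gray                     # image -> B channel
    ys, xs = np.mgrid[0:h, 0:w]
    for channel, clicks in ((1, pos_clicks), (0, neg_clicks)):
        for cy, cx in clicks:
            disc = (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2
            rgb[disc, channel] = 255       # G = positive, R = negative
    return rgb

gray = np.full((64, 64), 100, dtype=np.uint8)
out = embed_clicks(gray, pos_clicks=[(10, 10)], neg_clicks=[(40, 40)])
```

The click-trained models then see a 3-channel image where R and G carry only feedback, so at inference time user clicks can be injected the same way.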

Here are some examples of uniform clicks generated for image segments from the actual baseline model predictions. We’ve picked 3 images from the dataset, made predictions with the baseline models, and we look at each image’s TP, FP, FN segments. Each segment is a lighter shade of blue than the background, and the clicks are the brightest spots in each segment.

source: Dataset of Breast Ultrasound Images

The clicks end up being placed more or less where a human operator would place them: not too close to each other, not too close to the edge, favoring bulk or wide areas in each segment. The click distribution appears to be visually uniform.

Final Thoughts

We’ve trained off-the-shelf image segmentation models to respond to user feedback, without altering their architecture in any way, and without re-training them from scratch on ImageNet.

The click-trained models do not necessarily perform better than the baseline models. Training with clicks as shown here only enables the models to respond to user feedback. Sure, the click-trained models will outperform the baseline models by a large margin on the dataset used to create the 5 training folds. That’s because creating the clicks essentially leaks data between training and testing. On data that is 100% previously unseen, the click-trained models and the baseline models perform the same.

Here are a couple more examples of the way the click-trained models respond to user feedback.

Two different images are shown in the video. In both images you can see the model has high confidence in its predicted RoI. Attempts to place negative clicks in the predicted RoI are not very successful — the model continues to predict that region as an RoI.

The model does accept suggestions for other areas as potential RoIs.

In both cases you can see two kinds of outputs from the model: pure segmentation, and heat map. The heat map is simply a map of likelihood for the RoI.

Links, Citations, Comments

This project is an extension of my capstone project from the final semester of my MS in Data Science studies: https://github.com/FlorinAndrei/datascience_capstone_project

Both the capstone and this work were done within the Computer Aided Diagnosis for Breast Ultrasound Imagery (CADBUSI) project at the University of Wisconsin-La Crosse, under the supervision of Dr. Jeff Baggett. https://datascienceuwl.github.io/CADBUSI/

The GitHub repository with code for this article: https://github.com/FlorinAndrei/segmentation_click_train

All ultrasound images used in this article are part of the Dataset of Breast Ultrasound Images, available under the CC BY 4.0 license. Citation link:

Al-Dhabyani, W., Gomaa, M., Khaled, H., & Fahmy, A. (2019). Dataset of Breast Ultrasound Images. Data in Brief. Retrieved May 1, 2023, from https://www.sciencedirect.com/science/article/pii/S2352340919312181

Other links, citations and comments:

Liu, Q., Zheng, M., Planche, B., Karanam, S., Chen, T., Niethammer, M., & Wu, Z. (2022). PseudoClick: Interactive Image Segmentation with Click Imitation. arXiv.org. Retrieved May 1, 2023, from https://arxiv.org/abs/2207.05282

Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., & Luo, P. (2021). SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv.org. Retrieved May 1, 2023, from https://arxiv.org/abs/2105.15203

Pretrained SegFormer models at HuggingFace: https://huggingface.co/docs/transformers/model_doc/segformer

Images in this article that are not part of the Dataset of Breast Ultrasound Images are created by the author.
