
Use Cascade Models to Get Better Speed and Accuracy in Computer Vision Tasks

by Argo Saakyan, November 8th, 2022

Too Long; Didn't Read

Object detection is a very common task in Computer Vision, but a fast detector such as YOLOv5s can’t be the best out of the box for every task. With a cascade of two models it is easy to get the best of both worlds: a fast detector and an accurate classifier. The detector can be trained only on the target (damaged road in our example), while the classifier is easier to retrain on new data, because labelling and training are faster. This solution is also a lot faster to deliver than customising the detector architecture and retraining a model from scratch.


Object detection is a very common task in Computer Vision, and it is used in a huge number of fields. In the last couple of years we got models that really can work in real time and have close to SOTA performance. But they can’t be the best out of the box for every task. Let’s dig into it.

What’s our task?

This task is just an example; the idea can be used in a ton of cases. Let’s say we have to detect damage on the road. The camera is inside a car, and the car is going pretty fast, so we would like our detector to be fast too. That is why we took YOLOv5s.

It’s good, but sometimes it produces a False Positive: it detects a shadow (or anything else that looks similar to damage) instead of damage. Or it produces a False Negative: it misses real damage.

We can play with the confidence threshold to deal with that. We can add more data to the training dataset, add more background images, customise the architecture for our case, or tinker with hyperparameters and augmentation. But all of that might not give the needed results, or it might be impractical, time-consuming, or expensive.

What is the other way?

Let’s use two models. The first one is a detector with a low confidence threshold, to get rid of False Negatives. The second one is a classifier, to get rid of False Positives. Let’s discuss our architecture a little bit more.

First step - use a fast detector. It detects an object that should be damage on the road, and then we crop that object out of the frame.

Second step - use a classifier on that crop to validate whether it really is damage. It’s a good idea to try different classifiers to get the best results. EfficientNet is a good start (but there are a lot of pretrained models in both PyTorch and TensorFlow). It is also important to tune it a little bit (choose how many layers to retrain and how to rebuild the head).
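The two steps can be sketched in a few lines of plain Python. This is a minimal sketch, not the code behind this article: the names `Detection`, `crop` and `cascade`, both threshold values, and the stand-in models `fake_detect` and `fake_classify` are all hypothetical. In a real system the two callables would wrap a detector such as YOLOv5s and a fine-tuned classifier such as EfficientNet.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, x2/y2 exclusive


@dataclass
class Detection:
    box: Box
    score: float  # detector confidence in [0, 1]


def crop(image: Image, box: Box) -> Image:
    """Cut the box out of the image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def cascade(
    image: Image,
    detect: Callable[[Image], Sequence[Detection]],
    classify: Callable[[Image], float],
    det_threshold: float = 0.1,  # assumed value, deliberately low to favour recall
    cls_threshold: float = 0.5,  # assumed value, the classifier decides what is kept
) -> List[Detection]:
    """Run the detector, then confirm each candidate with the classifier."""
    confirmed = []
    for det in detect(image):
        if det.score < det_threshold:
            continue  # too weak even for the permissive first stage
        # The classifier only sees candidate crops, never the full frame.
        if classify(crop(image, det.box)) >= cls_threshold:
            confirmed.append(det)
    return confirmed


# Stand-in models (hypothetical) so the sketch runs without any ML library.
def fake_detect(image: Image) -> List[Detection]:
    return [
        Detection((0, 0, 2, 2), 0.30),
        Detection((2, 2, 4, 4), 0.15),
        Detection((0, 2, 2, 4), 0.05),
    ]


def fake_classify(patch: Image) -> float:
    # Pretend that dark patches are damage and bright patches are background.
    pixels = [p for row in patch for p in row]
    return 1.0 if sum(pixels) / len(pixels) < 100 else 0.0


frame = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [200, 200, 220, 220],
    [200, 200, 220, 220],
]
kept = cascade(frame, fake_detect, fake_classify)
print(kept)
```

Only the first candidate survives here: the second one passes the low detector threshold but is rejected by the classifier, and the third one is dropped by the detector threshold itself.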

In this case the detector can be trained only on damaged road, while the classifier should be trained on damaged road, background and other things, like dirt and shadows, that the detector could mistake for damage.
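One possible way to collect such a training set for the classifier is to run the detector over already labelled frames and label each of its candidate boxes by overlap with the ground truth. This is an assumption about the workflow, not something described in this article: the helper `label_candidates`, the label names and the 0.5 IoU cut-off are hypothetical choices.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_candidates(
    candidates: Sequence[Box],
    ground_truth: Sequence[Box],
    match_iou: float = 0.5,  # assumed cut-off for "this box covers real damage"
) -> List[Tuple[Box, str]]:
    """Label each detector candidate as 'damage' or 'background'."""
    labelled = []
    for box in candidates:
        is_damage = any(iou(box, gt) >= match_iou for gt in ground_truth)
        labelled.append((box, "damage" if is_damage else "background"))
    return labelled


ground_truth = [(0, 0, 10, 10)]
candidates = [(1, 1, 10, 10), (20, 20, 30, 30)]
print(label_candidates(candidates, ground_truth))
```

The "background" crops produced this way are exactly the shadows and dirt that the detector confuses with damage, which is what the classifier needs to see during training.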

Why is it better?

Accuracy: The main goal of our detector is to find an object, which it does pretty well. The problem is in the classification part of the detector model (it can classify background as a target, aka a False Positive). Here comes the classifier, which is tuned for classification in our exact case. The classifier can be trained on several labels, such as target (defect) and background (dirt, shadows or other stuff). The classifier is also easier to retrain on new data (labelling and training are faster).
Speed: Every frame is processed by the detector (first model), but the classifier (second model) is used only when possible damage was detected. So in this case we still have real-time speed, as we are bottlenecked only by the speed of the first model.
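A rough way to check the speed argument is to estimate the cost per frame. The sketch below assumes the classifier processes crops one by one (no batching); the function name `expected_frame_time_ms` and all timings are hypothetical numbers chosen for the arithmetic, not measurements.

```python
def expected_frame_time_ms(
    detector_ms: float, classifier_ms: float, crops_per_frame: float
) -> float:
    """Detector runs on every frame, classifier once per candidate crop."""
    return detector_ms + crops_per_frame * classifier_ms


# Hypothetical timings: 10 ms detector, 4 ms classifier per crop.
average_ms = expected_frame_time_ms(10.0, 4.0, crops_per_frame=0.25)
busy_frame_ms = expected_frame_time_ms(10.0, 4.0, crops_per_frame=3)
print(average_ms, busy_frame_ms)
```

With these made-up numbers the average cost stays close to the detector alone, because candidates are rare. A single frame with several candidates is noticeably slower, so it is worth measuring the busiest frames as well as the average.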

Let’s see an example of precision and recall using this technique:

So we drastically reduced the False Positive rate, but lost a little bit of Recall. This technique might be especially useful in cases where False Positives are more critical than False Negatives, as you really don’t want your system to spam alerts (that might easily happen with systems working 24/7).
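The trade-off follows directly from the definitions of precision and recall. In the sketch below the helper names are hypothetical, and the counts are made up purely to show the arithmetic; they are not the measurements from this article.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def after_classifier(
    tp: int, fp: int, fn: int, tp_kept: int, fp_kept: int
) -> Tuple[int, int, int]:
    """Counts after the second stage accepts tp_kept and fp_kept candidates."""
    # Every true positive the classifier rejects becomes a false negative.
    return tp_kept, fp_kept, fn + (tp - tp_kept)


# Made-up counts for a detector running alone with a low threshold.
detector_only = (90, 60, 10)  # (TP, FP, FN)
# Assume the classifier keeps 85 of the 90 TPs and only 5 of the 60 FPs.
with_classifier = after_classifier(*detector_only, tp_kept=85, fp_kept=5)

print(precision_recall(*detector_only))
print(precision_recall(*with_classifier))
```

The classifier can only remove detections, so recall can never go up in the second stage. Precision improves as long as the classifier rejects a larger share of the False Positives than of the True Positives.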

So what did we get in the end?

With this architecture it is easy to get the best of both worlds - a fast detector and an accurate classifier. You don’t have to settle for a model that is fast but not really accurate. And this solution is a lot faster to deliver than customising the detector architecture and retraining the model from scratch, especially in real-world tasks, where data is always a bottleneck.

And this architecture is pretty universal. It works well when you really want to classify your target accurately in your detection tasks.

It’s also important to say that, once you have a baseline or even a good working solution, you will always want to keep tuning your models for best performance: retrain on new data and tune hyperparameters.

Thanks for your attention!