Why Deep Learning is not Enough for Video Content Analysis

by Pavel Saskovec, July 15th, 2022
Deep Learning gets a ton of traction from technology enthusiasts. But can it match the effectiveness standards that the public holds it to?


The notion sounds powerful, doesn’t it? Both “Artificial Intelligence” and “Deep Learning” often seem capable of almost anything, just like humans are. But as claimed by David Ferrucci, the lead investigator at IBM’s Watson, AI is most effective at narrow tasks, and that effectiveness is what makes the public perceive AI as all-mighty.


Unfortunately, that is not the case.

What’s more, it never was.


Deep Learning ain’t as deep as you might think.


When we talk about Deep Learning, most of the time we mean Deep Neural Networks with a ton of node layers. The number of layers is what makes those networks “deep”, though many might think that the notion refers to the level of content understanding.


The layers that compose a deep neural network usually are (see the sketch after the list):


  • The input layer;
  • 2+ hidden layers;
  • The output layer.
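
A minimal sketch of that layer stack in PyTorch (assuming PyTorch is installed; the layer sizes are illustrative, not from the article):

```python
import torch.nn as nn

# One input layer, two hidden layers (the "deep" part), one output layer.
deep_net = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),   # input layer feeding the first hidden layer
    nn.Linear(128, 64), nn.ReLU(),    # second hidden layer
    nn.Linear(64, 10),                # output layer, e.g. 10 classes
)
```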


They allow the network to solve tasks faster and more accurately than humans, but here’s an unpleasant truth: that accuracy breaks down on conflicting data.

That picture of a dog demonstrates the issue perfectly. Or, as a neural network would put it, that picture of a tiger demonstrates the issue perfectly.


When you lie on a resume and get a job as a tiger.


You can see what happens when that narrow-task condition isn’t met.


The efficiency of DL drops drastically. The human brain, on the other hand, can solve far more complex tasks with ease, relying on intuition, personal experience, and general knowledge.

So let’s explore the areas where Deep Learning proves helpful and where it falls short, all in terms of good old video analysis and, of course, our approach to it.


We kick things off with the use cases where DL shines the brightest.

Deep Learning aye


Just like we said: Deep Learning provides great results when it deals with narrow tasks and works with a set of non-conflicting information.


Applied to content analysis tasks, it has no problem recognizing patterns in video, image, and audio files.


Let’s go over each task separately as we take a closer look at the details.

Mapping the data to a lower-dimensional space

Say you’ve got an image you need to analyze. It has a resolution of 1280x720, which roughly translates to a million pixels. And a million pixels roughly translate to “a ton of pixels”, a volume of data that is hard to deal with directly.


Here’s how DL approaches the situation.


It takes the picture and passes it through the hidden layers to extract its prominent features, then uses those features to build a version that is more digestible for further analysis.
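
For a feel of what that looks like in practice, here is a minimal sketch that reuses a pretrained CNN as a feature extractor, assuming PyTorch and a recent torchvision are installed ("frame.jpg" is a hypothetical input file, not from the article):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a pretrained ResNet and drop its classification head, keeping only
# the layers that compress the image into features.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
extractor.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("frame.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)      # 1 x 3 x 224 x 224 tensor
with torch.no_grad():
    features = extractor(batch).flatten(1)  # 1 x 512 feature vector

print(features.shape)  # roughly a million raw pixel values reduced to 512 numbers
```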


Voice recognition and natural language processing

Voice assistants are a great example of how good Deep Learning is at analyzing audio content.


Hey, they are good enough that companies use them to pick out keywords from people's conversations. Great. Creepy, but great.


I know you can hear me.


Speech-to-text transformation and natural language processing are considered narrow tasks, so it’s no surprise that Deep Learning handles them like nobody’s business.


That is exactly why we implemented it in the first version of our text summarization module based on audio analysis.


So, for fishing out keywords, it works great. But we need it to do better and actually make sense of the content it “hears”. Text summarization depends on that greatly: nobody appreciates missing critical information, right?
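
To make the “fishing out keywords” part concrete, here is a toy sketch that spots keywords in a transcript once speech-to-text has already produced plain text. It is pure Python and bears no relation to the author’s actual summarization module:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it",
             "we", "you", "that", "this", "for", "on", "with", "was"}

def extract_keywords(transcript: str, top_n: int = 5) -> list[str]:
    """Return the most frequent non-stopword tokens in a transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical transcript text, purely for illustration.
print(extract_keywords("We reviewed the quarterly revenue and revenue grew in every region."))
# -> ['revenue', 'reviewed', 'quarterly', 'grew', 'every']
```

Frequency counting finds the words, but it has no idea which of them actually matter to the meaning, which is exactly the gap described above.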


The data diversity calls for something more complex than a neural network to handle the job.

Face recognition

When we are talking about faces, Deep Learning has got your back. Modern neural networks even beat the human eye at recognizing faces, and at scale, humans can’t compete at all. Who remembers hundreds of faces? DL does.


Face recognition has been made famous by Apple and their FaceID tech.


Basically, a neural network describes the recognized faces as vectors and uses that information to identify them in other scenes of the footage.
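
As a rough illustration of the vector idea (not the author’s implementation), two descriptors can be compared with a simple distance threshold, assuming some upstream network has already produced one embedding per detected face:

```python
import numpy as np

MATCH_THRESHOLD = 0.6  # max Euclidean distance to call two faces the same person

def same_person(desc_a: np.ndarray, desc_b: np.ndarray) -> bool:
    """Two descriptors belong to one person if they lie close together."""
    return float(np.linalg.norm(desc_a - desc_b)) < MATCH_THRESHOLD

# Hypothetical 128-dimensional descriptors from two different scenes.
face_in_scene_1 = np.random.rand(128)
face_in_scene_2 = face_in_scene_1 + np.random.normal(0, 0.01, 128)

print(same_person(face_in_scene_1, face_in_scene_2))  # almost certainly True
```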


Based on this, we created a solution with some extra features.


Let’s see how it works.


The tool starts off as a regular neural network, analyzing the footage and representing each face as a vector with a descriptor. The descriptor includes the unique features that the network relies on to recognize a certain face when it appears repeatedly.


And then, the magic happens.


We use mathematical algorithms and Machine Learning to do the following (a toy sketch follows the list):


  1. Assign an ID to each character appearing in the video;
  2. Analyze the footage to distinguish between main and secondary characters;
  3. Tag the frame that represents each main and secondary character;
  4. Find each and every scene where the characters appear.
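
Here is the promised toy sketch of steps 1 and 2, using greedy distance-threshold grouping as a stand-in for whatever the real pipeline does (all thresholds and vectors are illustrative):

```python
import numpy as np

def assign_character_ids(descriptors, threshold=0.6):
    """Group face descriptors into character IDs by proximity."""
    centroids, ids = [], []
    for desc in descriptors:
        dists = [np.linalg.norm(desc - c) for c in centroids]
        if dists and min(dists) < threshold:
            ids.append(int(np.argmin(dists)))   # seen this character before
        else:
            centroids.append(desc)              # first sighting: new character
            ids.append(len(centroids) - 1)
    return ids

def rank_characters(ids, main_share=0.4):
    """Characters seen in a large share of detections count as main."""
    counts = {i: ids.count(i) for i in set(ids)}
    cutoff = main_share * len(ids)
    return {i: ("main" if n >= cutoff else "secondary") for i, n in counts.items()}

# Two close descriptors (one character) plus one far away (another character).
faces = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([5.0, 5.0])]
ids = assign_character_ids(faces)   # -> [0, 0, 1]
print(rank_characters(ids))         # -> {0: 'main', 1: 'secondary'}
```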



And that’s it. In this case, Deep Learning can only handle the narrow portion of face detection and recognition.


But the decision portion — like identifying characters relevant to the story — can only be performed by more complex tech.

Deep Learning naye


Now, let’s explore the areas in which Deep Learning might not be the ideal solution.

Decision-making

That's the big one. Some of the recognition stuff that everyone is so stoked about, like the facial recognition we mentioned before, requires more than the narrow “look at the thing, remember the thing, recognize the thing” approach.


But neural networks lack the ability to understand what they recognize. So, tasks that involve any kind of decision are a no-go.


Don’t believe us? Well, we tried it out.


We used Deep Learning to try and see how it recognizes end credits for our CognitiveSkip™ pipeline. Neural networks can detect fruit, cats, cars, faces, and so on.


What could go wrong? Credits are just plain text, right?


Well, not to a neural network. It thought that the credits were a pipe.


That’s one heck of a pipeline, am I right?


Goes without saying that we were not satisfied with that pipeline.


We tried another Deep Learning model. It told us that the credits were a website.

Brought to you by Deep Learning: the website.


Even though the credits are just text, we used object recognition for the analysis because regular text detection can’t tell the difference between end credits and text that appears within a movie scene. And that difference is what makes safe skipping possible, so that viewers don’t miss scenes important to the plot.


Look, it is easier for humans: we know where to expect end credits, and we know what the post-credit scene is. But the neural network is limited to the dataset that it was trained on.


As a result, you get pipes and websites.


That is why we stay away from using DL for decision-making. Our cognitive solution pipelines are divided into two stages, imitating two cognitive functions (sketched in code after the list):

  • Representations stage — here Deep Learning is used to “see” the required objects;
  • Cognitive decisions stage — here the data extracted by the DL model is analyzed. That’s where the solution makes a decision, imitating the thought process of a human brain.
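
A minimal sketch of that two-stage split, with placeholder stages that are purely hypothetical (the real pipeline’s internals are not public):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CognitivePipeline:
    represent: Callable  # DL stage: frames -> extracted features
    decide: Callable     # cognitive stage: features -> decision

    def run(self, frames):
        return self.decide(self.represent(frames))

# Toy stand-ins: "see" how text-heavy each frame is, then "decide" which
# frames look like end credits.
pipeline = CognitivePipeline(
    represent=lambda frames: [{"frame": i, "text_ratio": 0.9} for i, _ in enumerate(frames)],
    decide=lambda feats: [f["frame"] for f in feats if f["text_ratio"] > 0.8],
)
print(pipeline.run(["frame_0", "frame_1"]))  # -> [0, 1]
```

The point of the split is that the DL stage can be swapped or retrained without touching the decision logic, and vice versa.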


Human-like thought process

Say you sit down and watch a video on YouTube. It may seem like a simple task, but in fact, your brain does a lot of work so you can perceive the content, analyze it, react to it, and memorize parts of it.


Automating content generation to the point where the software can generate sports highlight compilations or movie trailers is tricky because of that background stuff that our brain does. Humans already know what moments in the video can cause different emotions: what scene triggers laughter, what sounds make us uneasy, and so on.


The technology, obviously, does not.


But that information is crucial to creating a pipeline that would select the right parts of the video for content generation. And Deep Learning just can’t offer any sort of solution to that problem.


Neural networks work well with objects but fail at associating those objects with emotion. So, they just can’t tell the difference between a random ball rolling through the grass and a last-second goal.
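
A toy illustration of that exact gap: the “ball” label looks the same in both moments, so a non-visual context signal (here, a made-up crowd-noise level) is what separates them. None of these numbers come from a real system:

```python
moments = [
    {"objects": ["ball", "grass"], "crowd_noise_db": 55},            # ball rolling idly
    {"objects": ["ball", "goal", "players"], "crowd_noise_db": 92},  # last-second goal
]

def is_highlight(moment, noise_threshold=85):
    # Object labels alone can't separate the two moments: both contain "ball".
    # The decision hinges on the extra context signal.
    return "ball" in moment["objects"] and moment["crowd_noise_db"] > noise_threshold

print([is_highlight(m) for m in moments])  # -> [False, True]
```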


So, to create a more sophisticated content analysis system, we had to find a way to describe the emotional aspect of the video content to a machine. Our team looked into using standard computer vision, some machine learning, and other tech that could help them imitate human perception.


At the end of the day, that approach has worked for us. Everything has its place: Deep Learning models capture the things in the footage, and more complex technology makes sure that those are the things we need.

Going back to our human metaphor, DL models are the human “eyes”, while a ton of other tools help us work out the imitation of the human “brain”.

Bottom line


More often than not, Deep Learning is credited with the things that it could not possibly do. Many people see it as the ultimate technological tool that can solve any automation problem imaginable.


The truth is that DL performs well with narrow tasks, but can’t go beyond that.


It falls short of delivering results that require a deeper understanding of the analyzed content.


For now, the AIHunters team applies Deep Learning where its strength is most needed — at the representations stage of our analysis. The rest is handled by cognitive tech that can work with incomplete information, make sense of the context, and make decisions based on that.


We are still on our way to improving the algorithms to match the level of human intelligence and awareness.


The DL and cognitive computing cocktail has worked for us so far, but we are not stopping here: the tech keeps advancing, and so do we.

