Almost all computer vision applications rely on annotated images to train, test, and validate the models that power them. Annotating these images can range in complexity from a simple classification to a sophisticated pixel-by-pixel segmentation, and the tools that support these tasks vary in quality and sophistication as well. Despite this variation, image annotation is widely known to be tedious and expensive.
While many solutions exist to decrease the resources it takes to annotate a single image - assisted annotation tools and pre-annotation, for example - there is another solution that avoids the inevitable hurdles of image annotation entirely: video annotation. Collecting and annotating videos, rather than images, can pay huge dividends not only in the size and robustness of the resulting training dataset, but in the efficiency of the process as well. As long as developers are using the right tool, annotating video has several key advantages over annotating images.
To state the obvious, videos are just collections of images. In more concrete terms, each second of video collected means many individual images to annotate and use for training. Though more data doesn’t necessarily mean better data, videos of entire scenes often contain sufficient variation to train a robust model.
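To make that concrete, here is a minimal sketch of pulling individual frames out of a clip with OpenCV; the file name and sampling interval are placeholders. At 30 frames per second, even a short clip yields hundreds of candidate training images.

```python
import cv2  # OpenCV for video decoding


def extract_frames(video_path, every_n=1):
    """Yield (frame_index, frame) pairs from a video file.

    `every_n` lets you subsample; every_n=1 keeps every frame.
    """
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            yield index, frame
        index += 1
    cap.release()


# Example (hypothetical file): a 30-second clip at 30 fps gives ~900 images.
# frames = list(extract_frames("driving_clip.mp4"))
# print(len(frames), "frames available for annotation")
```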
Even with similar frames from the same video, developers can partition seemingly redundant frames into training, testing, and validation datasets, to ensure the sub-datasets represent the common underlying distribution. By repeating this with many videos, developers can ensure the model is robust to varying noise levels and environmental conditions throughout training, validation, and testing.
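A rough illustration of that partitioning, assuming frames are identified by integer indices and shuffled with a fixed seed (both choices are just for the sketch):

```python
import random


def split_frames(frame_ids, train=0.7, val=0.15, seed=42):
    """Randomly partition one video's frame IDs into train/val/test.

    Shuffling before splitting spreads near-duplicate neighboring frames
    across all three subsets, so each reflects the same underlying scene.
    """
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


# Repeating this per video and merging the per-video splits keeps varying
# noise levels and environmental conditions in all three sub-datasets.
# splits = split_frames(range(900))  # e.g. a 30 s clip at 30 fps
```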
Annotators can take advantage of the entire video when annotating a single frame, using the temporal context of that frame to annotate it with a greater understanding of the scene. This can help annotators identify things like the direction of an object in motion, the class of a partially occluded object, or whether an object has been seen in other frames at all. Videos don’t just provide context to annotators - they provide a type of context to computers as well.
The context of a video can be harnessed to significantly improve the efficiency of an annotation operation. Various video annotation tools and techniques are able to leverage the information in surrounding frames to annotate frames of a video accurately and quickly.
Extending Annotations through Frames
For objects that don’t move from frame to frame, simply extending an annotation through multiple frames is a quick and simple way to apply a single annotation to many frames.
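A minimal sketch of that idea, assuming annotations are stored as plain Python dictionaries keyed by frame number (a representation chosen only for illustration):

```python
from copy import deepcopy


def extend_annotation(annotation, start_frame, end_frame):
    """Copy a single annotation onto every frame in [start_frame, end_frame].

    Suitable for static objects (e.g. signage or parked vehicles) whose
    position does not change between frames.
    """
    return {frame: deepcopy(annotation)
            for frame in range(start_frame, end_frame + 1)}


# Hypothetical example: a box drawn once on frame 100 reused through frame 250.
# boxes = extend_annotation({"label": "stop_sign", "bbox": [412, 88, 460, 140]},
#                           100, 250)
```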
Bounding Box Interpolation
For objects that move from one frame to the next, a simple linear interpolation between the bounding boxes in two annotated frames is enough to capture linear motion or size change and apply it to every frame in between.
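One possible implementation of that interpolation, assuming boxes are stored as (x_min, y_min, x_max, y_max) tuples; the convention is chosen for the sketch rather than prescribed by any particular tool:

```python
def interpolate_boxes(box_a, box_b, frame_a, frame_b):
    """Linearly interpolate a bounding box between two annotated keyframes.

    Returns one box for every frame strictly between frame_a and frame_b,
    capturing linear motion and linear size change.
    """
    boxes = {}
    span = frame_b - frame_a
    for frame in range(frame_a + 1, frame_b):
        t = (frame - frame_a) / span
        boxes[frame] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))
    return boxes


# Annotate frames 10 and 40 by hand; frames 11-39 are filled in automatically.
# filled = interpolate_boxes((100, 100, 150, 160), (220, 130, 290, 210), 10, 40)
```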
Segmentation Tracking
For objects that move from frame to frame, but require the precision of a segmentation task, tools like the Innotescus Tracker help propagate a single seed annotation throughout subsequent frames, saving the annotator significant time in creating complex masks, while still allowing the annotator to supervise the process and edit where necessary.
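The sketch below shows one generic way to propagate a binary mask between frames using dense optical flow in OpenCV. It is only a rough stand-in for a purpose-built tracker like the one mentioned above, not a description of how that tool works, and the function names and parameters are illustrative.

```python
import cv2
import numpy as np


def propagate_mask(prev_frame, next_frame, prev_mask):
    """Warp a binary (0/1) segmentation mask from one frame to the next.

    Uses Farneback dense optical flow; the result is a starting point that
    the annotator still reviews and edits where necessary.
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)

    # Backward flow (next -> prev) so each pixel in the new frame looks up
    # where it came from in the previous frame.
    flow = cv2.calcOpticalFlowFarneback(next_gray, prev_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_mask.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped = cv2.remap(prev_mask.astype(np.float32), map_x, map_y,
                       cv2.INTER_LINEAR)
    return (warped > 0.5).astype(np.uint8)
```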
Tools like those shown above - whether simple or complex - save annotators immense amounts of time while still allowing them to ensure each individual frame is accurately annotated. Though the output will look similar to that of an image annotation task, the speed and efficiency with which it is done is far greater.
In much the same way it provides greater context to annotators, annotated video provides greater context to the model it trains. This added context allows machine learning developers to improve network performance with a wider array of techniques, like temporal and Kalman filters and long short-term memory (LSTM) architectures.
Kalman filters and temporal filters allow the model to incorporate information from nearby frames to assist it in making decisions. Temporal filtering allows models to reject misclassifications based on the presence or absence of certain objects in adjacent frames, while Kalman filters use information from nearby frames to estimate the most probable location of an object in a subsequent frame.
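As a small illustration of the temporal-filtering idea, the sliding-window majority vote below overrules a single-frame misclassification when the surrounding frames agree on a different class; a Kalman filter plays the analogous role for estimating box coordinates. The function name and default window size are assumptions made for the sketch.

```python
from collections import Counter


def smooth_labels(per_frame_labels, window=5):
    """Majority-vote temporal filter over per-frame class predictions."""
    half = window // 2
    smoothed = []
    for i in range(len(per_frame_labels)):
        neighborhood = per_frame_labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed


# ["car", "car", "truck", "car", "car"] -> the lone "truck" is voted out.
```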
Recurrent neural network (RNN) and LSTM architectures, on the other hand, refer to a broader set of network architectures that implement temporal components in order to ingest and interpret time series data, including video. Because these networks operate on data with a temporal component, having accurately annotated video is hugely advantageous for training and deploying an RNN or LSTM network architecture.
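As a hedged sketch of what such an architecture can look like, the toy PyTorch model below extracts per-frame features with a small convolutional encoder and summarizes them over time with an LSTM. The layer sizes, class count, and clip shape are arbitrary placeholders, not a recommended design.

```python
import torch
import torch.nn as nn


class FrameSequenceClassifier(nn.Module):
    """Toy CNN + LSTM that classifies a short clip of frames."""

    def __init__(self, num_classes=10, feature_dim=64, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(            # per-frame feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feature_dim),
        )
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clip):                     # clip: (batch, time, 3, H, W)
        b, t, c, h, w = clip.shape
        feats = self.encoder(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (hidden, _) = self.lstm(feats)        # hidden: (1, batch, hidden_dim)
        return self.head(hidden[-1])             # one prediction per clip


# logits = FrameSequenceClassifier()(torch.randn(2, 8, 3, 64, 64))
```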
While the uses for annotated video vary widely, the benefits of annotating video rather than images are quite consistent across applications. Choosing video as the mode of data collection and annotation pays off not only in the amount and quality of annotated data, but in the efficiency of the annotation operation as well.
Tools like those available on the Innotescus platform help annotators capture these benefits, and turn the data preparation process into a source of strength rather than a necessary burden.