The term ‘curation’ is commonly associated with museums or libraries, not data science. However, much like the work that’s done on rare paintings or books, data curation tools make the most important data easily accessible to engineers as they build complex machine learning models.
Without curation, data is difficult to find, analyze, and interpret. Data curation tools provide meaningful insights and enduring access to all your data in one place. In this article, we’ll dive into the importance of data curation for computer vision specifically, as well as review the top data curation tools on the market today.
What is data curation?
Data curation is the act of organizing, enhancing, and preserving data for future use. In machine learning, data curation describes the management of data throughout its lifecycle: from its collection and initial storage to the time it is archived for future reuse.
This process is all the more important for computer vision engineers, who deal with massive amounts of visual data on a daily basis. Instead of relying on manual methods such as writing ETL jobs to extract insights, engineers can use data curation tools as a streamlined way to access the right data whenever they need it.
The importance of data curation for machine learning
Under the hood, data curation tools directly influence computer vision model performance. Using data curation tools, engineers can get a better understanding of the data they’ve collected, identify the most important subsets and edge cases, and curate custom training datasets to feed back into their models.
The best data curation tools enable you to:
- Visualize large-scale data: Obtain insights on key metrics, as well as the general distribution and diversity of your datasets, regardless of sensor type and format.
- Discover and retrieve data: Quickly search, filter, and sort through the entire data lake by making all features queryable and easily accessible.
- Curate diverse scenarios: Identify the most interesting segments within your dataset, and manipulate them within the tool to create completely customized training sets.
- Integrate seamlessly: Fit the tool into your existing workflows and toolset.
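To make the "search, filter, and curate" idea concrete, here is a minimal Python sketch of querying a small in-memory metadata catalog to assemble a custom subset. It is an illustration only, not the API of any tool reviewed in this article: the `Frame` fields, the `select_frames` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """One image plus metadata a curation tool might index (hypothetical fields)."""
    frame_id: str
    weather: str
    time_of_day: str
    num_pedestrians: int


def select_frames(catalog, **criteria):
    """Return the frames whose attributes satisfy every criterion.

    A criterion is either a plain value (tested for equality)
    or a callable predicate applied to the attribute value.
    """
    def matches(frame):
        for field, wanted in criteria.items():
            value = getattr(frame, field)
            ok = wanted(value) if callable(wanted) else value == wanted
            if not ok:
                return False
        return True

    return [frame for frame in catalog if matches(frame)]


# Hypothetical catalog entries, standing in for a much larger data lake.
catalog = [
    Frame("f001", "rain", "night", 3),
    Frame("f002", "clear", "day", 0),
    Frame("f003", "rain", "day", 5),
    Frame("f004", "rain", "night", 0),
]

# Curate an edge-case subset: rainy night scenes with at least one pedestrian.
subset = select_frames(
    catalog,
    weather="rain",
    time_of_day="night",
    num_pedestrians=lambda n: n >= 1,
)
subset_ids = [frame.frame_id for frame in subset]
print(subset_ids)
```

Real tools run queries like this against an indexed catalog rather than a Python list, but the workflow is the same: describe the scenario in terms of metadata, then export the matching IDs as a training or test set.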
What are the best data curation tools for computer vision?
With an overwhelming number of AI products and platforms popping up year after year, how do you know which will provide the most value? Read on to find out which data curation tool is the best fit for your computer vision project.
1. Aquarium Learning
Aquarium is a data management platform that aims to make it easy to identify labeling errors and model failures. With Aquarium, users can version and combine model predictions with their ground truth. Aquarium is especially focused on curating and maintaining training datasets, catering less to raw data management use cases. This is because data exploration in Aquarium is predominantly tied to model predictions and ground truth labels. Users can access Aquarium via their cloud platform or API. However, they currently do not offer on-premise or VPC deployments, and there are no external integrations.
- Wide range of use cases - Aquarium supports image, 3D, audio, and text data. They also support multiple annotation types, such as classification, detection, and segmentation.
- Interactive model evaluation - Users can adjust evaluation thresholds and use interactive visualizations to find the samples they need quickly.
- Collaborative features - Users can collaborate with each other on the Aquarium platform to build data subsets, associate them with issues, and identify new data for annotation.
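As a rough illustration of why combining model predictions with ground truth helps surface both labeling errors and model failures, the Python sketch below sorts disagreements into two review queues. This is not Aquarium's method or API: the `flag_disagreements` function, the sample records, and the 0.9 confidence threshold are assumptions made purely for the example.

```python
def flag_disagreements(samples, confidence_threshold=0.9):
    """Split label/prediction disagreements into two review queues.

    Disagreements where the model is highly confident are worth checking
    as possible labeling errors; the rest are treated as likely model
    failures. The threshold is an arbitrary illustrative value.
    """
    possible_label_errors = []
    likely_model_failures = []
    for sample in samples:
        if sample["label"] == sample["prediction"]:
            continue  # Label and prediction agree, nothing to review.
        if sample["confidence"] >= confidence_threshold:
            possible_label_errors.append(sample["id"])
        else:
            likely_model_failures.append(sample["id"])
    return possible_label_errors, likely_model_failures


# Hypothetical classification results for three images.
samples = [
    {"id": "img_1", "label": "car", "prediction": "car", "confidence": 0.98},
    {"id": "img_2", "label": "car", "prediction": "truck", "confidence": 0.95},
    {"id": "img_3", "label": "bike", "prediction": "person", "confidence": 0.55},
]

label_errors, model_failures = flag_disagreements(samples)
print("Check labels:", label_errors)
print("Check model:", model_failures)
```

A confident disagreement does not prove the label is wrong, so both queues still need human review. The split only helps prioritize where to look first.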
2. FiftyOne
Developed by Voxel51, FiftyOne is an open-source tool to visualize and interpret computer vision datasets. The tool is made up of three components: the Python library, the web app (GUI), and the Brain. FiftyOne does not contain any auto-tagging capabilities, and therefore works best with datasets that have previously been annotated. Furthermore, the tool only supports image and video data, and does not work for multimodal sensor datasets at this time.
Unlike other tools, FiftyOne is designed to be used by individual developers rather than teams, functioning like a programming IDE. Today, the platform lacks collaborative features; for example, a single instance cannot host multiple user accounts.
- Model & dataset zoo - FiftyOne taps into TensorFlow and PyTorch dataset zoos to provide access to a variety of open datasets and open-source models.
- Advanced data analysis - Via the Brain, a separate closed-source Python package, users can quantitatively assess the uniqueness, mistakenness, and hardness of data.
- External integrations - FiftyOne directly integrates with popular annotation tools. It also has tight integrations with Jupyter and Colab Notebooks, making it easy for users to run FiftyOne through Python notebooks.
3. Scale Nucleus
Launched in late 2020 by Scale AI, Nucleus is one of the newest data curation tools to hit the market. The Nucleus platform allows users to collaboratively search through image data for model failures. As of now, Nucleus only supports image data, with no support for 3D sensor fusion, video, or text data.
Users can access Nucleus via their cloud platform, API, or Python SDK. Currently, Nucleus does not support on-premise deployment.
- Visual similarity - Users can search for visually similar images based on one or multiple base samples and associate custom tags with them.
- Metadata schemas - Using the Nucleus SDK, users can create flexible metadata schemas. Nucleus provides smart methods to detect and create schemas using the annotation format provided.
- Model versioning - Users can create model entities and associate corresponding runs with them. Hence, models can be versioned based on runs (dataset & predictions).
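Visual similarity search of the kind described above is commonly built on image embeddings compared with a distance measure. The Python sketch below shows that general idea using cosine similarity. It makes no claim about how Nucleus implements the feature: the three-dimensional vectors are toy stand-ins for real embeddings, and `most_similar` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat an all-zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, k=2):
    """Return the IDs of the k images most similar to the query image."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vector), image_id)
        for image_id, vector in embeddings.items()
        if image_id != query_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:k]]


# Toy embeddings; real ones would come from a vision model and have
# hundreds of dimensions.
embeddings = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0],
    "img_c": [0.0, 1.0, 0.0],
    "img_d": [0.7, 0.7, 0.0],
}

neighbors = most_similar("img_a", embeddings, k=2)
print(neighbors)
```

Searching from several base samples, as the bullet mentions, could be approximated by averaging their embeddings into a single query vector, though that is one design choice among several.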
4. Clarifai
Clarifai is an end-to-end solution for labeling, searching, and modeling unstructured data. One of the first AI startups, they provide a platform for modeling image, video, and text data. While Clarifai’s original focus was enabling users to build custom models, they’ve recently added several data curation features including auto-tagging, visual search, and annotations. Ultimately, Clarifai is more of a modeling platform and less of a developer tool. They are best suited for relatively inexperienced teams getting started with ML use cases.
- Ready-to-use model gallery - Clarifai offers a broad library of pre-built AI models, including anything from food to facial recognition.
- Wide range of data types - The platform supports image, video, and text data.
- Model customization - With the platform, users can customize or retrain existing models or create new ones from scratch.
- Data annotation - In addition to their modeling platform, Clarifai offers fully managed annotation services through their Scribe LabelForce data labeling service.
5. SiaSearch
SiaSearch is a data management platform for computer vision data. Consisting of a scalable metadata catalog and query engine, SiaSearch enables developers to easily search through visual data, add metadata to frames and sequences, and assemble custom subsets of data for training or testing. With deep roots in autonomous driving, the SiaSearch platform is used by many OEMs, Tier 1s, and tech companies. Aside from autonomous driving, SiaSearch also has solutions for robotics, retail, and more.
- Specialized in sensor data - One of the only tools that can support 3D sensor fusion data, SiaSearch can analyze large volumes of unstructured sensor data, providing insights at the frame and sequence level.
- Auto-tagging capabilities - SiaSearch employs a large catalog of pre-trained extractors to automatically add frame-level, contextual metadata to raw data. Additionally, SiaSearch provides a toolbox for quick extractor development, allowing developers to integrate their own extractors.
- Fast performance - The SiaSearch platform features a unique, proprietary architecture that combines numeric and sequence-based queries to enable noticeably faster performance.
- Flexible workflows & integrations - Users can access SiaSearch via their web-based GUI or programmatic API. SiaSearch also supports cloud or on-premise deployment for enterprise users.
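To show what moving from frame-level metadata to sequence-level results can look like, here is a minimal Python sketch that groups consecutive matching frames into sequences. SiaSearch's actual query engine is proprietary and is not reproduced here: the `find_sequences` helper, the `speed_kmh` and `lane_change` fields, and the minimum length of two frames are all assumptions for illustration.

```python
def find_sequences(frames, predicate, min_length=2):
    """Find runs of consecutive frames that all satisfy the predicate.

    Returns a list of (start_index, end_index) pairs, inclusive on both
    ends, for every run that is at least min_length frames long.
    """
    sequences = []
    start = None  # Index where the current run began, if any.
    for index, frame in enumerate(frames):
        if predicate(frame):
            if start is None:
                start = index
        else:
            if start is not None and index - start >= min_length:
                sequences.append((start, index - 1))
            start = None
    # Close a run that extends to the final frame.
    if start is not None and len(frames) - start >= min_length:
        sequences.append((start, len(frames) - 1))
    return sequences


# Hypothetical per-frame metadata from a single drive, in time order.
frames = [
    {"speed_kmh": 42, "lane_change": False},
    {"speed_kmh": 85, "lane_change": True},
    {"speed_kmh": 88, "lane_change": True},
    {"speed_kmh": 90, "lane_change": True},
    {"speed_kmh": 60, "lane_change": False},
    {"speed_kmh": 95, "lane_change": True},
]

# Numeric condition (speed) combined with a tag (lane change).
runs = find_sequences(
    frames, lambda f: f["speed_kmh"] > 80 and f["lane_change"]
)
print(runs)
```

The minimum-length filter drops isolated single-frame matches, which in sensor data are often noise rather than a real scenario.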
Interested in data curation?
The right data curation tool can dramatically reduce the time spent on manual processes, allowing engineers to focus on what really matters - building great models.
Lead image via Tobias Fischer on Unsplash
Originally published by Clemens Viernickel and reposted with permission.