In the first article of this series, I introduced the lifecycle of AI (which consists of training and inference) and discussed blockchain use cases for AI inferences. If you are not familiar with what these terms mean, I suggest reading my first article before this one.
In this article, I will talk about how we can leverage the properties of blockchain for AI training, specifically for the data collection part. The idea is not to give an exhaustive list of AI + blockchain use cases but rather to show in a no-bullshit way that blockchain has some actual use cases for AI and that this intersection deserves closer attention.
To recap, “training” an AI model is analogous to writing a program in traditional programming. The difference is that instead of writing many lines of code that tell the computer explicitly what to do, you write only a few lines of code and leverage statistics and probability to “teach” the computer what to do. The raw material you use to do so is data.
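To make the recap concrete, here is a minimal sketch of "teaching" a computer a rule from data rather than writing the rule by hand. The rule (y = 2x + 1), the example data, and the names `fit_line` and `predict` are all made up for illustration; real models are far larger, but the idea is the same.

```python
# Toy illustration: the program is never told "multiply by 2 and add 1".
# It recovers that rule from example (x, y) pairs using basic statistics.

def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized,
    # since the 1/n factors cancel in the ratio below).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The "training data": made-up examples that happen to follow y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """Apply the learned rule to an input the program has not seen."""
    return slope * x + intercept


print(slope, intercept, predict(10.0))
```

The few lines in `fit_line` stay the same whatever the data says. Change the examples and the learned rule changes with them, which is why the data matter so much.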
There are many different types of machine learning algorithms (supervised, unsupervised, reinforcement, etc.) whose training processes differ slightly. For now though, just know that you need data to train a machine learning model.
Data are essential to machine learning. Just as you become more knowledgeable about a subject the more books you read on it, your program can get better the more data it has. Decent models with lots of data can often outperform great models with little data. Here is where blockchain comes in, because blockchain, in its essence, is just a distributed database that is open, transparent and immutable.
Use Case #1: Data Common
Nowadays, many people share datasets on the Internet for free in places such as GitHub or Kaggle, in the spirit of the open-source community. This data common has greatly benefited data scientists around the world working on all sorts of projects. Here are two ways in which blockchain can help push this trend even further.
First, sharing data on the blockchain makes them truly public and censorship-resistant. Sure, people can share datasets on GitHub or Kaggle, but if one day they or the platform decide to take those datasets down, the data will no longer be available to the world. Likewise, if a government decides to censor or monopolize certain published datasets, it can force their removal. Data shared on the blockchain, on the other hand, cannot be taken down in this way. Once they are published, no single entity can remove them, since no one owns the blockchain. The only ways to remove a dataset are to destroy every computer in the world running the network or to carry out a 51% attack, which is extremely difficult on a large network. In addition, whistleblowers can remain anonymous when they publish data on the blockchain, which encourages further sharing of data.
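To give a feel for why published records are hard to alter quietly, here is a toy sketch of the hash-linking idea behind a blockchain. It is a single-process illustration, not a real network: there is no consensus, no replication across nodes, and each block stores only a SHA-256 digest of a dataset. The function names and the sample datasets are invented for this example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, dataset_digest, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "dataset": dataset_digest, "prev": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain, dataset_bytes):
    """Record a dataset's digest in a new block linked to the last one."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    index = len(chain)
    digest = hashlib.sha256(dataset_bytes).hexdigest()
    block = {
        "index": index,
        "dataset": digest,
        "prev": prev_hash,
        "hash": block_hash(index, digest, prev_hash),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Check that every block's hash and back-link are still consistent."""
    prev_hash = GENESIS
    for block in chain:
        if block["prev"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["dataset"], block["prev"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


chain = []
append_block(chain, b"city,temp\nOslo,4\n")
append_block(chain, b"city,temp\nCairo,31\n")

# Someone tries to quietly swap the first dataset for a forged one.
tampered = [dict(block) for block in chain]
tampered[0]["dataset"] = hashlib.sha256(b"forged data").hexdigest()

print(is_valid(chain), is_valid(tampered))
```

The check only detects tampering in one copy of the chain. The censorship resistance described above comes from many independent computers each holding and verifying their own copy.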
Second, network rewards can be put in place on the blockchain to encourage people to share their datasets. When people share quality datasets, they are rewarded in cryptocurrency. This incentive mechanism should drive more people to supply more data to the data common, thereby fueling the development of better AI algorithms.
However, one caveat about this use case is that if you are the seller of data, once you sell the data, you no longer control them. Buyers can copy the data and resell them to anyone else. Here is where another innovative use case comes in: blockchain can bring compute to the data instead of the other way around, allowing people to “rent” out their data instead of selling them.
Imagine that you are an AI developer who has a model but lacks the data to train it. Amy comes to you and says that she has the data, and that if you pay her a certain amount of cryptocurrency, you can train the model on her data.
So you make a payment and send the model over; it gets trained on her data, and the trained model is returned to you. The whole process is facilitated and escrowed by smart contracts. Throughout the process, her data stay invisible to you, yet you end up with a better model. (One project known to be working on something like this is Ocean Protocol.)
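The flow above can be sketched as a small simulation. Everything here is hypothetical: `DataOwner`, `Escrow` and `rent_training` are names invented for this example, the `Escrow` class is only a stand-in for a smart contract, and none of this is Ocean Protocol's actual API. Since it all runs in one Python process, the data are hidden only by convention, whereas a real system would have to enforce that.

```python
class DataOwner:
    """Amy: holds private data and trains visiting models on it."""

    def __init__(self, xs, ys, price):
        self._xs = xs  # private by convention; never handed to the developer
        self._ys = ys
        self.price = price
        self.balance = 0

    def train(self, model, epochs=500, lr=0.05):
        """Fit the model w * x + b to the private data with plain SGD."""
        for _ in range(epochs):
            for x, y in zip(self._xs, self._ys):
                error = model["w"] * x + model["b"] - y
                model["w"] -= lr * error * x
                model["b"] -= lr * error
        return model


class Escrow:
    """Stand-in for a smart contract that holds the payment during the job."""

    def __init__(self):
        self.held = 0

    def deposit(self, amount):
        self.held += amount

    def release_to(self, owner):
        owner.balance += self.held
        self.held = 0


def rent_training(owner, model, payment):
    """Pay into escrow, send the model to the data, get it back trained."""
    if payment < owner.price:
        raise ValueError("payment is below the owner's asking price")
    escrow = Escrow()
    escrow.deposit(payment)
    trained = owner.train(dict(model))  # only a copy of the model travels
    escrow.release_to(owner)            # the owner is paid once training is done
    return trained


# Amy's private data (made up; it follows y = 2x + 1).
amy = DataOwner(xs=[0.0, 1.0, 2.0, 3.0, 4.0],
                ys=[1.0, 3.0, 5.0, 7.0, 9.0],
                price=10)

untrained = {"w": 0.0, "b": 0.0}
trained = rent_training(amy, untrained, payment=10)
print(trained, amy.balance)
```

The developer only ever receives the trained parameters, never the rows of data. Note that a trained model can still leak information about the data it saw, so this arrangement limits copying rather than guaranteeing secrecy.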