
Text Embedding Explained: How AI Understands Words

by Louis BouchardDecember 3rd, 2022

Too Long; Didn't Read

Large language models. You must’ve heard these words before. They represent a specific type of machine learning-based algorithm that understands and can generate language, a field often called natural language processing or NLP. You’ve certainly heard of the most known and powerful language model: GPT-3. GPT-3, as I’ve described in the video covering it, is able to take language, understand it, and generate language in return. But be careful here; it doesn’t really understand it. In fact, it’s far from understanding. GPT-3 and other language-based models merely use what we call dictionaries of words to represent them as numbers, remember their positions in the sentence, and that’s it. Let's dive into those powerful machine learning models and try to understand what they see instead of words, called word embeddings, and how to produce them with an example provided by Cohere.
Large language models.

You must’ve heard these words before. They represent a specific type of machine learning-based algorithm that understands and can generate language, a field often called natural language processing or NLP.

You’ve certainly heard of the most known and powerful language model: GPT-3. GPT-3, as I’ve described in the video covering it, is able to take language, understand it, and generate language in return. But be careful here; it doesn’t really understand it. In fact, it’s far from understanding. GPT-3 and other language-based models merely use what we call dictionaries of words to represent them as numbers, remember their positions in the sentence, and that’s it.

Let's dive into those powerful machine learning models and try to understand what they see instead of words, called word embeddings, and how to produce them with an example provided by Cohere. Learn more in the video...
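To make this concrete, here is a minimal sketch of what "seeing embeddings instead of words" looks like in code. It is not the Cohere notebook from the video; it uses the open-source sentence-transformers library, and the model name is just a common choice picked for illustration. The idea is the same: each sentence becomes a vector of numbers, and sentences with similar meanings end up with nearby vectors.

```python
# Minimal sketch (not the Cohere example from the video): embed a few sentences
# and compare them with cosine similarity. "all-MiniLM-L6-v2" is just one
# commonly used small encoder, chosen here for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The bank raised its interest rates.",
    "The central bank increased borrowing costs.",
    "I sat on the river bank and watched the water.",
]
embeddings = model.encode(sentences)  # one vector per sentence

def cosine_similarity(a, b):
    """Higher means the model 'sees' the two sentences as more similar."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings[0], embeddings[1]))  # financial pair: high
print(cosine_similarity(embeddings[0], embeddings[2]))  # other meaning of "bank": lower
```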

References

►Read the full article:
►BERT Word Embeddings Tutorial:
►Cohere's Notebook from the code example:
►Cohere Repos focused on embeddings:
►My Newsletter (A new AI application explained weekly, straight to your inbox!):

Video Transcript

Large language models. You must have heard these words before. They represent a specific type of machine learning algorithm that understands and can generate language, a field often called natural language processing, or NLP.

You've certainly heard of the most known and powerful language models, like GPT-3. GPT-3, as I've described in the video covering it, is able to take language, understand it, and generate language in return. But be careful here; it doesn't really understand it. In fact, it's far from understanding. GPT-3 and other language-based models merely use what we call dictionaries of words to represent them as numbers, remember their positions in the sentence, and that's it. Using a few numbers and positional numbers called embeddings, they are able to regroup similar sentences, which also means that they are able to kind of understand sentences by comparing them to known sentences, like our dataset. It's the same process for text-to-image models that take your sentence to generate an image: they do not really understand it, but they can compare it to similar images, producing some sort of understanding of the concepts in your sentence. In this video, we will have a look at what those powerful machine learning models see instead of words, called word embeddings, and how to produce them with an example provided by the sponsor of this video, a great company in the NLP field, Cohere, which I will talk about at the end of the video, as they have a fantastic platform for NLP.

We've talked about embeddings and GPT-3, but what's the link between the two? Embeddings are what the models see and how they process the words we know. And why use embeddings? Well, because as of now, machines cannot process words, and we need numbers in order to train those large models. Thanks to our carefully built dataset, we can use mathematics to measure the distance between embeddings and correct our network based on this distance, iteratively getting our predictions closer to the real meaning and improving the results. Embeddings are also what models like CLIP, Stable Diffusion, or DALL-E use to understand sentences and generate images. This is done by comparing both images and text in the same embedding space, meaning that the model does not understand either text or images, but it can understand whether an image is similar to a specific text or not. So if we find enough image-caption pairs, we can train a huge and powerful model like DALL-E to take a sentence, embed it, find its nearest image, and generate it in return.

So machine learning with text is all about comparing embeddings, but how do we get those embeddings? We get them using another model trained to find the best way to generate similar embeddings for similar sentences while keeping the differences in meaning for similar words, compared to using a straight one-for-one dictionary. The sentences are usually represented with special tokens marking the beginning and end of our text. Then, as I said, we have our positional embeddings, which indicate the position of each word relative to the others, often using sinusoidal functions. I linked a great article about this in the description if you'd like to learn more.
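As a rough illustration of those sinusoidal positional embeddings, here is a short sketch of the standard Transformer formulation (from "Attention Is All You Need"); real models may use learned positions or other variants. Each position in the sentence gets its own vector of sines and cosines at different frequencies, which is added to the word embeddings so the model knows where each token sits.

```python
# Sketch of sinusoidal positional encodings; not the exact code of any
# particular model, just the classic Transformer recipe.
import numpy as np

def positional_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Return a (num_positions, dim) matrix of sinusoidal position vectors."""
    positions = np.arange(num_positions)[:, None]                    # (num_positions, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)    # one frequency per pair of dims
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * freqs)   # even dimensions use sine
    pe[:, 1::2] = np.cos(positions * freqs)   # odd dimensions use cosine
    return pe

pe = positional_encoding(num_positions=50, dim=16)
print(pe.shape)  # (50, 16): one position vector per token, added to its word embedding
```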
Finally, we have our word embeddings. We start with all our words being split into an array, just like a table of words. From this point on, there are no longer words; they are just tokens, numbers from the whole English dictionary. You can see here that all the words are now represented by a number indicating where they are in the dictionary, thus having the same number for the word "bank" even though its meanings are different in the sentence we have. Now, we need to add a little bit of intelligence to that, but not too much. This is done thanks to a model trained to take this new list of numbers and further encode it into another list of numbers that better represents the sentence. For example, it will no longer have the same embedding for the two words "bank" here. This is possible because the model used to do that has been trained on a lot of annotated text data and learned to encode similar-meaning sentences next to each other and opposite sentences far from each other, thus allowing our embeddings to be less biased by our choice of words than the initial, simple one-for-one word embedding we had.

Here's what using embeddings looks like in a very short NLP example. There are more links below to learn more about embeddings and how to code it yourself. Here, we will take some Hacker News posts and build a model able to retrieve the most similar post for a new input sentence. To start, we need a dataset; in this case, it is a pre-embedded set of 3,000 Hacker News posts that have already been embedded into numbers. Then we build a memory saving all those embeddings for future comparison; we basically just save these embeddings in an efficient way. When a new query is made, for example here asking "What is your most profound life insight?", we can generate its embedding using the same embedding network, usually BERT or a version of it, and we compare the distance in the embedding space to all the other Hacker News posts in our memory. Note that it's really important here to always use the same network, whether for generating your dataset or for querying it. As I said, there is no real intelligence here, nor does it actually understand the words; it has just been trained to embed similar sentences nearby in the embedding space, nothing more. If you send your sentence to a different network to generate an embedding and compare that embedding to the ones you had from another network, nothing will work. It will just be like the nice people who tried to talk to me in Hebrew at ECCV last week: it just wasn't in an embedding space my brain could understand. Fortunately for us, our brain can learn to transfer from one embedding space to another, as I can with French and English, but it requires a lot of work and practice, and it's the same for machines.

Anyway, coming back to our problem, we could find the most similar posts. That's pretty cool, but how could we achieve this? As I mentioned, it's because of the network, BERT in this case. It learns to create similar embeddings for similar sentences. We can even visualize it in two dimensions, like this, where you can see how two nearby points represent similar subjects. You can do many other things once you have those embeddings, like extracting keywords, performing a semantic search, doing sentiment analysis, or even generating images, as we said and demonstrated in previous videos.
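Here is a hedged sketch of that retrieval step. The post texts echo the example above, but the vectors are random stand-ins for embeddings produced by a real model, and this is not Cohere's actual notebook; it only illustrates ranking a pre-embedded archive by cosine similarity to the query, using the same embedding model for both.

```python
# Sketch of the "memory + query" retrieval step. The vectors are random
# stand-ins for embeddings from a real model (BERT, Cohere's API, ...);
# the essential point is that the archive and the query must come from the
# SAME embedding model, otherwise the distances are meaningless.
import numpy as np

posts = [
    "Ask HN: What is your most profound life insight?",
    "Show HN: My weekend side project",
    "Why I switched databases",
]
# In the real example these are pre-computed once and saved ("the memory").
post_embeddings = np.random.randn(len(posts), 384)

def most_similar(query_embedding: np.ndarray, archive: np.ndarray, k: int = 1):
    """Return indices of the k archive rows with highest cosine similarity."""
    archive_norm = archive / np.linalg.norm(archive, axis=1, keepdims=True)
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    scores = archive_norm @ query_norm          # cosine similarity per post
    return np.argsort(-scores)[:k]

# In practice, the query is embedded with the same network as the archive.
query_embedding = np.random.randn(384)
for idx in most_similar(query_embedding, post_embeddings, k=1):
    print("Closest post:", posts[idx])
```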
I have a lot of videos covering those and have listed a few interesting notebooks to learn to play with embeddings, thanks to the Cohere team.

Now let me talk a little bit about Cohere, as they are highly relevant to this video. Cohere provides everything you need if you are working in the NLP field, including a super simple way to use embedding models in your application. Literally with just an API call, you can embed text without knowing anything about how the embedding models work; the API does it for you in the background. Here you can see the semantic search notebook that uses the Cohere API to create embeddings of an archive of questions and of new query questions, to later search for similar questions. Using Cohere, you can easily do anything related to text: generate, categorize, and organize at pretty much any scale. You can integrate large language models trained on billions of words with a few lines of code, and it works in any library. You don't even need machine learning skills to get started. They even have learning resources, like the recent Cohere For AI Scholars Program that I really like. This program is an incredible opportunity for emerging talent in NLP research around the world. If selected, you will work alongside their team and have access to a large-scale experimental framework and Cohere experts, which is pretty cool. I also invite you to join their great Discord community, ingeniously called Co:mmunity.

I hope you've enjoyed this video and will try out Cohere for yourself with the first link below; I am sure you will benefit from it. Thank you very much for watching the whole video, and thanks to anyone supporting my work by leaving a like or a comment, or by trying out our sponsors, which I carefully select for these videos.
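For completeness, here is a rough sketch of what "embedding text with just an API call" can look like with Cohere's Python SDK. The exact client, model name, and response fields below are assumptions and may differ from the current API, so check Cohere's docs and the notebooks linked above for the real interface.

```python
# Hedged sketch of embedding text through an API call. The model name and
# response fields are assumptions for illustration; consult the official
# Cohere documentation for the exact, up-to-date interface.
import cohere

co = cohere.Client("YOUR_API_KEY")             # placeholder API key
response = co.embed(
    texts=["What is your most profound life insight?"],
    model="small",                             # assumed model name
)
vector = response.embeddings[0]                # one list of floats per input text
print(len(vector))                             # dimensionality of the embedding
```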



