Natural Language Processing with Python: A Detailed Overview
by princekumar036, February 21st, 2022
Too Long; Didn't Read
NLP is an emerging field at the intersection of linguistics, computer science and artificial intelligence that makes computers understand and generate human language. Python is the most preferred programming language for NLP, along with libraries like NLTK, spaCy, CoreNLP, TextBlob and Gensim.
Artificial Intelligence is currently one of the most sought-after technologies, and so is NLP, a subset of AI. Recent years have been marked by tremendous research and development in AI, and NLP is an important technology we use every day. This article gives some perspective on this emerging field. We will look at its history, some of its typical applications, and how and where we can learn NLP to build real-life projects. We will also briefly look at Python, arguably the most favoured programming language for NLP, and some of its essential NLP libraries. Along the way, some learning resources are recommended.
What is NLP?
First things first: what exactly is NLP? We all know that computers only understand 0s and 1s, but we humans communicate in thousands of different languages. This creates a deadlock in our interaction with computers. The traditional solution to this deadlock was methodical interaction: clicking through various options to perform the desired task. Nowadays, however, we can simply talk to our computers through Cortana or Siri in our own language, and they can carry out our tasks and even talk back to us in the same language. The underlying technology that makes this happen is Natural Language Processing, or NLP.
NLP is an emerging interdisciplinary field of linguistics, computer science and artificial intelligence that makes computers understand and generate human language. This is done by feeding the computer large amounts of data (called a corpus in the case of language data), which it analyses in order to derive and produce meaning.
History of NLP
Although NLP has become very popular in recent years, its history goes back to the 1950s. The genesis of all AI technologies lies with the computer scientist Alan M. Turing and his seminal 1950 paper "Computing Machinery and Intelligence." The central question of the paper was 'Can machines think?'. In it, Turing proposed a benchmark to test the intelligence of computers against that of humans. He called it the Imitation Game, which later became known as the Turing Test. Among its criteria was the machine's ability to 'understand and speak' natural languages. The Georgetown experiment of 1954 was another early development in machine translation: it demonstrated fully automatic translation from Russian into English. The field of NLP, along with AI, has undergone continuous development since then. Its growth can be divided into three phases based on the underlying method or approach for solving NLP problems, described in the next section.
NLP Methods
Rule-based NLP (1950s — 1990s)
The earliest methodology used to achieve natural language processing by computers was based on a pre-defined set of rules, an approach also called Symbolic NLP. A collection of hand-coded rules was fed into the computer, which yielded results based on them.
The early research in NLP was focused primarily on machine translation. Rule-based machine translation (RBMT) required a thorough linguistic description of both the source and the target languages. The basic approach involved two steps: 1) building a structural representation of the source sentence with a parser and analyser for the source language and mapping it to an equivalent structure with a generator for the target language, and 2) using a bilingual dictionary for word-to-word translation to finally produce the output sentence.
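To make the dictionary-lookup step concrete, here is a deliberately tiny, hypothetical sketch in Python; the vocabulary and the single word-order assumption are invented for illustration, and real RBMT systems relied on full morphological analysers, parsers and transfer rules.

```python
# Toy illustration of rule-based translation: a hand-coded bilingual
# dictionary plus one hard-coded word-order assumption. Real RBMT systems
# used far richer linguistic rules than this.
bilingual_dict = {"i": "je", "love": "aime", "apples": "les pommes"}

def translate(sentence):
    words = sentence.lower().split()
    # Assumption: the source and target sentences share the same word order,
    # so we translate word by word; unknown words are left untouched.
    return " ".join(bilingual_dict.get(w, w) for w in words)

# Prints "je aime les pommes" rather than the correct "j'aime les pommes",
# which already hints at why hand-coded rules break down so quickly.
print(translate("I love apples"))
```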
Given that human language is enormously varied and ambiguous, infinitely many such rules are possible, and hand-coding such a large number of rules is clearly not feasible. These systems were therefore very narrow and could only produce results in certain given scenarios. For example, the much-celebrated Georgetown experiment could only translate around 60 sentences from Russian to English.
Statistical NLP (1990s — 2010s)
With the advent of more powerful computers with higher processing speed, it became possible to process large amounts of data. Statistical NLP took advantage of this, and new algorithms based on machine learning came into being. These algorithms are based on statistical models that make soft, probabilistic decisions to yield outputs. Instead of hard-coding rules as in rule-based NLP, these systems automatically learn such rules by analysing large amounts of real-world data, such as parallel corpora. A statistical system does not generate one final output; instead, it produces several possible answers with relative probabilities. The drawback was that these systems were challenging to build: they required a complex pipeline of separate subtasks like tokenization, parts-of-speech tagging, word sense disambiguation, and many more to finally produce the output.
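As a minimal illustration of learning probabilities from data, the toy sketch below estimates bigram probabilities from a tiny invented corpus and ranks candidate next words instead of committing to a single hard-coded answer; real statistical systems did the same kind of counting over corpora of millions of sentences.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real system would use millions of sentences.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

# Count how often each word follows each other word (bigram counts).
bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def next_word_candidates(word):
    # Instead of one hard-coded answer, rank candidates by relative probability.
    counts = bigrams[word]
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common()]

print(next_word_candidates("cat"))  # e.g. [('sat', 0.5), ('ate', 0.5)]
```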
Neural NLP (2010s — present)
As deep learning became popular during the 2010s, it was applied to NLP as well, and deep neural network-style machine learning methods became widespread in the field. This approach also uses statistical models to predict the likelihood of an output. Unlike Statistical NLP, however, it handles the entire sentence in a single integrated model, removing the need for the complex pipeline of intermediate subtasks required by statistical systems. In this approach, the system uses an artificial neural network, which in theory tries to mimic the neural networks of the human brain.
An artificial neural network consists of three kinds of layers: input, hidden, and output. The input layer receives the input and passes it to the hidden layer, where the computations on the data are performed. The result is then transferred to the output layer. Each connection between neurons carries a weight. The initial values of the weights are set randomly and change as the network learns. These weights are crucial in determining the probability of an output.
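The sketch below is a minimal NumPy illustration of that three-layer picture: randomly initialised weights, a hidden-layer computation, and an output probability distribution. The layer sizes and the input vector are invented, and real models learn the weights from data via backpropagation rather than leaving them random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented sizes: 4 input features, 5 hidden units, 2 output classes.
W_hidden = rng.normal(size=(4, 5))   # weights: input layer -> hidden layer (random at first)
W_output = rng.normal(size=(5, 2))   # weights: hidden layer -> output layer

def forward(x):
    hidden = np.tanh(x @ W_hidden)   # hidden-layer computation
    logits = hidden @ W_output       # output-layer computation
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()           # probabilities over the two outputs

x = np.array([0.5, -1.0, 0.3, 0.8])  # a made-up input vector
print(forward(x))                    # e.g. [0.7..., 0.2...]
```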
Common NLP Tasks
Here is a non-exhaustive list of some of the most common tasks in natural language processing. Note that some of these tasks may not be ends in themselves but serve as subtasks in solving other tasks that have real-world applications. A short code sketch after the list illustrates a few of them.
Tokenisation — separate a continuous text into words
Parts-of-speech tagging — determine the parts of speech of each word in a sentence
Stopword removal — filter out high-frequency words like to, at, the, for, etc.
Lemmatization — remove inflections and return base form of the word (e.g., driving → drive)
Coreference resolution — determine which words in a sentence/text refer to the same entity
Parsing — determine and visualize the parse tree of a sentence
Word sense disambiguation — select contextual meaning of a polysemic word
Named entity recognition — identify named entities such as people, places and organisations in a sentence/text
Relationship extraction — identify relationships among named entities
Optical character recognition (OCR) — determine the text printed in an image
Speech Recognition — convert speech into text
Speech Segmentation — separate a speech into words
Text-to-speech — convert text to speech
Automatic summarisation — produce a summary of a larger text
Grammatical error correction — detect and correct grammatical errors in a text
Machine translation — automatically translate a text from one language to another
Natural language understanding (NLU) — convert text into a machine-readable representation of its meaning
Natural language generation (NLG) — make machine produce natural language
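As a taste of how a few of these tasks look in code, here is a minimal sketch using NLTK (introduced below); it assumes the listed NLTK data packages have been downloaded, and the sample sentence is invented for illustration.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK data packages.
for pkg in ("punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(pkg)

text = "The cars were driven to the old garages."

tokens = nltk.word_tokenize(text)          # tokenisation
tagged = nltk.pos_tag(tokens)              # parts-of-speech tagging

stops = set(stopwords.words("english"))
content_words = [t for t in tokens if t.lower() not in stops]  # stopword removal

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t.lower(), pos="v") for t in content_words]  # lemmatisation

print(tokens)
print(tagged)
print(content_words)   # e.g. ['cars', 'driven', 'old', 'garages', '.']
print(lemmas)          # 'driven' -> 'drive'
```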
Python for NLP
Python is the preferred programming language for NLP. As of February 2022, it is the most popular programming language. Python's ubiquitous nature and its application to a wide array of fields make it so popular.
While programming languages like Java and R are also used for NLP, Python is a clear winner. Python is easy to learn and understand because of its transparent and straightforward syntax. Python offers perhaps the largest community of developers, which can be really helpful in case the code needs some debugging. In addition, Python seamlessly integrates with other programming languages and tools. Most importantly, Python is backed by an extensive collection of libraries that enables developers to quickly solve NLP tasks.
Resources
Start with a five-part Python specialization program on Coursera for a complete overview of Python programming.
Read Al Sweigart's free online book for step-by-step instructions and guided projects.
Work through the official Python tutorial and documentation.
NLTK
NLTK — the Natural Language Toolkit — is a suite of open-source Python modules, data sets, and tutorials supporting research and development in Natural Language Processing.
NLTK is the most popular library for NLP. It has a massive, active community, with over 10.4k stars and 2.5k forks on its GitHub repository. It was developed at the University of Pennsylvania by Steven Bird and Edward Loper and was released in 2001. NLTK is freely available for Windows, Mac OS X, and Linux. It has built-in support for more than 100 corpora and trained models. NLTK comes with a free book written by its creators, a comprehensive guide to writing Python programs and working with NLTK. The project also has an active discussion group on Google Groups.
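As a small, hedged example of working with those built-in corpora, the sketch below loads the Brown corpus bundled with NLTK's data packages and counts its most frequent words; the "news" category is simply one of that corpus's standard sections.

```python
import nltk
from nltk.corpus import brown
from nltk import FreqDist

nltk.download("brown")                  # one-time download of the corpus

words = brown.words(categories="news")  # tokens from the 'news' section
fdist = FreqDist(w.lower() for w in words if w.isalpha())
print(fdist.most_common(10))            # the ten most frequent words
```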
Resources:
Watch the introductory video on freeCodeCamp's YouTube channel to get started.
Read the official NLTK book written by its creators for a deeper understanding of NLP and NLTK.
spaCy
Industrial-strength Natural Language Processing (NLP) in Python.
spaCy is relatively young but very hot right now. Its GitHub repository has more than 22.4k stars and 3.7k forks, much higher than NLTK's. It is written in Python and Cython, making it fast and efficient at handling large corpora. It is an industry-ready library designed for production use.
Some features of spaCy are listed below; a short usage sketch follows the list.
It provides support for linguistically-motivated tokenization in more than 60 languages.
It has 64 pre-trained pipelines in 19 different languages.
It provides pretrained transformers like BERT.
It provides functionalities for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more.
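Here is a minimal sketch of a few of these features, assuming the small English pipeline en_core_web_sm has been installed (for example via python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # load a pretrained pipeline
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    # tokenisation, POS tagging, lemmatisation and dependency parsing in one pass
    print(token.text, token.pos_, token.lemma_, token.dep_)

for ent in doc.ents:                 # named entity recognition
    print(ent.text, ent.label_)
```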
Resources:
Watch freeCodeCamp's video tutorial to get started.
Read the official introduction covering the most essential concepts, explained in simple terms.
CoreNLP
A Java suite of core NLP tools.
CoreNLP was written in Java and developed at Stanford University, but it comes with wrappers for other languages like Python, R and JavaScript, so it can be used from most programming languages. It is a one-stop destination for the core NLP functionalities: linguistic annotations for text, including token and sentence boundaries, parts of speech, named entities, numeric and time values, dependency and constituency parses, coreference, sentiment, quote attributions, and relations. CoreNLP currently supports eight languages: Arabic, Chinese, English, French, German, Hungarian, Italian and Spanish. It has 8.3k stars and 2.6k forks on its GitHub repository.
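One way to call CoreNLP from Python is through the CoreNLPClient wrapper shipped with the stanza package; the sketch below is only an illustrative, assumption-laden example, presuming a local CoreNLP installation with the CORENLP_HOME environment variable pointing at it.

```python
# Assumes: pip install stanza, a local CoreNLP install, and CORENLP_HOME set.
from stanza.server import CoreNLPClient

text = "Stanford University is located in California."

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "ner"],
                   timeout=30000, memory="4G") as client:
    ann = client.annotate(text)          # starts a CoreNLP server and annotates the text
    for sentence in ann.sentence:
        for token in sentence.token:
            print(token.word, token.pos, token.ner)
```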
Resources:
The official CoreNLP documentation and pipelines overview.
TextBlob
Simplified Text Processing
TextBlob is a Python library for processing textual data. It provides a simple API for the most common NLP tasks like POS tagging, tokenization, n-grams and more. It is beginner-friendly, one of the fastest of these libraries, and built on top of NLTK and Pattern. It has 8k stars and 1.1k forks on its GitHub repository at the time of writing.
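A minimal sketch of TextBlob's API is shown below; it assumes TextBlob is installed and its corpora have been fetched (TextBlob documents a python -m textblob.download_corpora helper for this), and the sample text is invented.

```python
from textblob import TextBlob

blob = TextBlob("TextBlob is amazingly simple to use. What great fun!")

print(blob.words)          # tokenisation
print(blob.tags)           # parts-of-speech tagging
print(blob.noun_phrases)   # noun phrase extraction
print(blob.sentiment)      # polarity and subjectivity scores
```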
Resources:
The official TextBlob documentation and quickstart guide.
Gensim
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.
Gensim was created by Radim Řehůřek in 2009. It is implemented in Python and Cython, making it incredibly fast. Plus, all its algorithms are memory-independent, i.e. they can process inputs larger than the available RAM. It is mainly used to identify semantic similarity between documents through vector space modelling and topic modelling. It supports algorithms such as Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) and word2vec deep learning. It works on vast data collections and provides clear insights. Gensim's GitHub repository has 12.9k stars and 4.2k forks.
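A minimal topic-modelling sketch with Gensim's LDA implementation is shown below; the toy corpus of four pre-tokenised "documents" is invented purely for illustration, and real topic models need far more data.

```python
from gensim import corpora, models

# A tiny invented corpus of pre-tokenised documents.
documents = [
    ["human", "computer", "interface", "system"],
    ["graph", "trees", "minors", "survey"],
    ["computer", "system", "user", "interface"],
    ["graph", "minors", "trees", "paths"],
]

dictionary = corpora.Dictionary(documents)                    # map words to integer ids
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]   # bag-of-words vectors

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id, topic in lda.print_topics():
    print(topic_id, topic)   # each topic is a weighted mix of words
```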
Resources:
The official Gensim documentation and tutorials.
Conclusion
Natural language processing is a sub-field of artificial intelligence under active research and development, and we can see its practical applications all around us: automatic captions on YouTube videos, Chrome automatically translating foreign-language webpages for us, assistive writing with Grammarly, our iPhone's keyboard predicting words for us, and so on. The possibilities are limitless. Natural language processing is indispensable to artificial intelligence and to our future technologies.
References
Bird, Steven, et al. Natural Language Processing with Python. 1st ed., O'Reilly, 2009.
Hettige, Budditha. A Computational Grammar of Sinhala for English-Sinhala Machine Translation. 2011.
Heller, Michael. 'Study Claims Siri and Google Assistant Are Equal'. Phone Arena. Accessed 8 Feb. 2022.
Singla, Karan. Methods for Leveraging Lexical Information in SMT. 2015.
Turing, A. M. 'Computing Machinery and Intelligence'. Mind, vol. LIX, no. 236, Oct. 1950, pp. 433-460.