Natural Language Processing: A Short Introduction To Get You Started

Published on May 14, 2019
Nóra Ambróz, Senior Data Engineer

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that aims to improve communication between humans and computers. People speak in languages defined by error-prone rules. They make mistakes and use illogical statements, yet they still understand each other pretty well. Computers, on the other hand, need a perfect structure, preferably already in the form of ones and zeros. Since few of us can use raw binary and machines still struggle with the concept of sarcasm, there’s certainly a gap to bridge.

“Hello!”

“01001000 01101001 00100001”

Success in business depends on data analysis, as it gives the direction for improvement. But unlike spreadsheets and tables, natural language is an unstructured source. The textual and verbal data people generate every day exceeds human processing powers. The solution, therefore, is to automatically extract the information that’s relevant. Natural Language Processing allows machines to figure out the complex meaning in our sentences. It works in the background of many services from chatbots through virtual assistants to social media trends tracking. Read this blog post to learn more about the difficulties and solutions of various NLP-related problems.

The Difficulties of Natural Language Processing

Natural language is astoundingly complex. A human being can still intuitively learn and understand it without knowing anything about grammar or conventions. Machines, however, master their language skills very differently.

“Mommy?”

“A colloquial noun, a synonym of mother”

Before NLP was interconnected with AI, there were rule-based approaches to human-computer communication. These models tried to describe every single rule of a language, an immense task in itself. Then they set the weight or precedence of the principles and applied the formula to the input. The results were decent until the model encountered something outside of its rules. For example, a misspelled word, an unknown name, or a pun.

Speech

Humans usually don’t realize how error-ridden speech is. A speaker continually makes small grammar mistakes, changes their mind mid-sentence, and mixes in terms from other languages. But that’s not all. What about words with inconsistent pronunciation in different dialects, disfluencies, and speech disorders like mumbling and stuttering?

“So yesterday I was getting out of my car when John came around and… do you know John? The tall guy from work? Anyway, he came to me to talk about Stephanie… or was it…? Whatever. So… the point is I didn’t know… do you even pay attention?”

Text

Written text isn’t a lot better. Omitted punctuation, faulty wording, typos, and many other inaccuracies can obfuscate the meaning. Even though it seems more standardized than speech, there are still unruly areas. Emojis are a good example. They exist outside of any grammatical rules but are still important symbols of the natural language.

“🐎%?”

Same word, different meaning

Besides mistakes, contextual meaning also poses a challenge for language processing. Machines are notorious for taking everything a bit too literally. While humans can usually handle a set phrase, a poetic turn, or a metaphor, a computer may get confused. Choosing the right meaning of a homograph, heteronym, or homonym is similarly difficult. These are words that share pronunciation and/or spelling but have different meanings. Not even Google Translate handles them properly; it sometimes picks the wrong word with hilarious results.

Sarcasm, irony, and jokes occupy an even trickier space, since neither people nor rules can describe them with complete certainty. Such ambiguity can cause complications even when grammatical rules are combined with Machine Learning.

“Oh, that is great news indeed!”

NLP is capable of overcoming many obstacles that stem from a language's illogical, error-prone nature. Even though computers don't yet understand English as well as we do, they can already provide useful insights. Different Natural Language Processing methods suit different tasks, and choosing the right one is key to good results.

Natural Language Processing With Text Classification

When it comes to NLP, Text Classification is almost cheating…and with good reason. A well-trained classifier can produce highly accurate analyses and predictions without actual natural language understanding. The training itself is both the solution’s strength and its greatest weakness.

A text classification model compares the input to statistics computed from the training data, then decides on a label. Exactly how the labeling rules work, no one will ever know, since the classifier lays them down itself. This approach is more adaptable than grammar- and rule-based systems. Language changes rapidly and organically, but with fresh statistics the system can stay up-to-date. A new phrase, habit, or expression causes no trouble as long as the classifier can connect it with a label. This flexibility comes at the price of huge amounts of training data.
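To make this concrete, here is a minimal sketch of the statistical approach using scikit-learn (the library choice and the toy reviews are purely illustrative): the pipeline turns each text into word-count statistics, and a Naive Bayes model learns which counts go with which label.

```python
# A minimal text-classification sketch with scikit-learn (hypothetical toy data).
# The pipeline converts each review into word counts, then lets a Naive Bayes
# model learn which counts correlate with which label.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "The staff was polite and the food was great",
    "Horrible place, slow service and cold soup",
    "Best dessert I have ever had",
    "I waited an hour and the table was dirty",
]
train_labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["The waiter was very polite"]))   # likely ['positive']
print(model.predict(["What a horrible experience"]))   # likely ['negative']
```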

Text Classification In Practice

A good example that clearly points out the weakness of text classification is a sentiment analyzer. Let’s assume that the model should predict user ratings from reviews. The end-to-end trained classifier may pair longer sentences with a more negative sentiment, while positive opinions are less wordy. People usually tend to enumerate their problems when they are dissatisfied. On the other hand, happy customers are more likely to leave short, emotionally descriptive comments. The result is that most of the time the classifier will be correct. But what happens with reviews like these?

“Horrible place! :(”

“I absolutely recommend this restaurant to everyone, who wants to have a real culinary experience. Their menu is a complete work of art, while the staff is polite and very professional. The best part is when the dessert comes!”

Of course, there is a chance that the classifier will correctly pair words like horrible, polite, and best with sentiments and give that rule precedence over the sentence-length one. But if it doesn't, it's usually easier to feed the model more training data until it spits out the correct result than to try to retune its code.

The solution can usually successfully handle spam and sensitive content filtering, support ticket routing, language recognition, content categorization, and duplication detection. On the other hand, text classification is not a viable option when a deeper understanding of the meaning is required.

Natural Language Understanding

To squeeze more precise meaning from a sentence, it’s necessary to first break it into smaller, more understandable chunks. After that, the tiny pieces can be connected through layers and examined together up to the top level. A Natural Language Processing pipeline consists of steps providing both syntactic and semantic analysis.

Let’s see how they build on each other.

Syntax Analysis

The first stage of syntax analysis is usually partitioning the data. At a high level, the text corpus can be broken into chapters and paragraphs. This helps identify a larger, more general context for a thought, which is usually expressed in a sentence. A substantial part of Natural Language Understanding (NLU) analysis happens within this tighter environment. Punctuation gives a decent basis for splitting sentences apart, but it is unreliable in spontaneously written text such as chat messages. If the document's formatting is not clean, compound rules tailored to the situation can help.
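As an illustration, here is how sentence segmentation might look with spaCy, one of the libraries discussed at the end of this post (the example assumes the small English model en_core_web_sm is installed):

```python
# Sentence segmentation sketch with spaCy.
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("So yesterday I was getting out of my car when John came around. "
          "Do you know John? The tall guy from work?")

for sent in doc.sents:
    print(sent.text)  # one sentence per line
```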

Words

The next step is to identify the building blocks of sentences: punctuation and words. Spaces or new lines commonly separate them. NLP models usually tokenize these parts into symbols. The bag-of-words model simplifies analysis by extracting word occurrences while disregarding grammar and word order. It is well suited to term frequency and distinct word counts.
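A small sketch of tokenization and a bag-of-words count, again using spaCy plus Python's Counter (an illustrative choice, not the only way to do it):

```python
# Tokenization and a bag-of-words count, sketched with spaCy and Counter.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The staff is polite and the menu is a work of art.")

# Lowercased tokens without punctuation; word order is discarded by the count.
tokens = [token.text.lower() for token in doc if not token.is_punct]
bag_of_words = Counter(tokens)

print(tokens)
print(bag_of_words)  # term frequencies, e.g. 'the': 2, 'is': 2, ...
```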

Words also have an inner structure worth analyzing. Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item, while stemming reduces inflected (or sometimes derived) words to their stem, base, or root form. In this phase there is a large difference between how languages work: some mostly use inflection, others attach suffixes and prefixes to independent word stems to fine-tune meaning. Let's look through a few examples to see why this can be problematic.
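The sketch below contrasts the two: NLTK's Porter stemmer chops words down to crude stems, while spaCy's lemmatizer maps them to dictionary forms (the library choices are illustrative, and both need to be installed separately).

```python
# Stemming vs. lemmatization sketch (NLTK for stemming, spaCy for lemmas).
from nltk.stem import PorterStemmer
import spacy

stemmer = PorterStemmer()
nlp = spacy.load("en_core_web_sm")

words = ["studies", "studying", "better"]

print([stemmer.stem(w) for w in words])          # crude stems, e.g. 'studi'
print([t.lemma_ for t in nlp(" ".join(words))])  # dictionary forms, e.g. 'study'
```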

Isolating languages

Isolating languages avoid inflection. Usually, one word consists of one morpheme, which is a meaningful morphological unit of a language that cannot be further divided. The fusion between morphemes is loose. Besides Mandarin Chinese and Vietnamese, English also has qualities typical of an isolating language. Here is an Indonesian example:

Orang = person

Orang-orang = people

Fusional languages

Inflectional morphemes are the most important characteristic of fusional languages. They can express grammatical relations as well as semantic meaning by changing the stem itself. Among other European languages, German, Italian, Latin, and Spanish all belong to this group. English also has a few inflectional morphemes, such as -ing and -ed.

Agglutinative languages

As the name indicates, this type of language glues its morphemes to the stem. Since the boundaries within the union usually remain clear, stemming and analysis are easier than with fusional languages. Finnish, Hungarian, Estonian, and Turkish are all agglutinative languages. A good example is Hungarian verb conjugation: instead of pronouns like he, she, and we, this language uses affixes.

Látok = I see

Látsz = you see

Látunk = we see

Languages usually exhibit characteristics of more than one type, which complicates things further. Since the grammatical differences are significant, an NLP model's lemmatization and stemming processes should fit the language.

Part of Speech

The next step is to figure out which part of speech (POS) each symbol belongs to. Is this word a noun, an adjective, or a verb? How does it function? Sometimes deep learning handles POS tagging, but statistical and grammar-based solutions are also options.

Around this point, many models detect stopwords and remove them. Typical stopwords are the, an, and a, which are ignored because they don’t carry meaning in themselves. The selection usually happens based on a predefined list that fits the goal of the process.
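A brief sketch of POS tagging and stopword removal with spaCy (illustrative; the exact tags depend on the model used):

```python
# POS tagging and stopword removal sketch with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The staff is polite and very professional.")

for token in doc:
    print(token.text, token.pos_)        # e.g. 'staff NOUN', 'polite ADJ'

# Keep only content words: drop stopwords and punctuation.
content_words = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(content_words)                     # words like 'the' and 'is' are gone
```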

Dependency Parsing

The following layer is dependency parsing, which uncovers the relationships between the words. During this complex process, the model chooses a root word in the sentence and builds a tree of parent and child words around it. The result shows the subject of the sentence, conjunctions, adverbs, determiners, modifiers, and other relations. A parent word and its children represent a phrase, a meaningful, independent structure within the sentence. The situation is not always clear, however, so sometimes the model can only guess based on similar cases. To minimize such uncertain cases, dependency parsing remains an actively researched field.
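A minimal dependency-parsing sketch with spaCy, printing each token's relation label and its parent (head) word (illustrative; label names such as nsubj and dobj follow spaCy's English scheme):

```python
# Dependency parsing sketch with spaCy: each token points to its parent (head).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Happy customers leave short comments.")

for token in doc:
    print(f"{token.text:<10} {token.dep_:<8} head: {token.head.text}")

# The root verb 'leave' typically has 'customers' as its subject (nsubj)
# and 'comments' as its object (dobj).
```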

Semantic Analysis

All the syntax examinations serve to lay the groundwork for semantic analysis. The words, tagged and chained together, are ready for information extraction. The process can continue with Named Entity Recognition (NER), which links the words to the ideas they represent. This step labels nouns with tags such as individual, company, or date. In hard-to-decide cases, word sense disambiguation derives the right choice from the context.

Since there is no way to create a comprehensive list of every entity of a kind, each kind is described by characteristics. So when NER identifies entities, it takes punctuation, environment, POS tags, and other clues into account. For example, an individual's name is recognizable not because every person is listed somewhere; instead, the capital letters and the word's function in the sentence identify it as a name.
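Here is a short NER sketch with spaCy (the sentence and entities are made up for illustration; the labels shown are the typical ones for its English models):

```python
# Named Entity Recognition sketch with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Stephanie joined Google in Dublin on May 14, 2019.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. 'Stephanie PERSON', 'Google ORG',
                                  #      'Dublin GPE', 'May 14, 2019 DATE'
```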

To access even more information, coreference resolution can map pronouns to nouns. The meaning of he, she, and it is purely determined by context, and exceeds the boundaries of a sentence. Finding every word that references the same noun is easy for the human brain, but a great challenge for any NLP model.

“Bill and Tom spent the day playing. It was fun, so they are happy.”
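For illustration, here is a sketch of how NeuralCoref (mentioned in the tooling section below) resolves such references on top of a spaCy 2.x pipeline; the extension attributes shown are the ones the library documents:

```python
# Coreference resolution sketch with NeuralCoref (designed for spaCy 2.x pipelines).
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)   # adds the coreference component to the pipeline

doc = nlp("Bill and Tom spent the day playing. It was fun, so they are happy.")

print(doc._.has_coref)         # True if any coreference was found
print(doc._.coref_clusters)    # e.g. a cluster linking 'they' to 'Bill and Tom'
```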

Using Natural Language Processing

Naturally, some layers are skippable or work well even in a different order. The exact model should always suit the final goal. NLP excels at text summarization, where it can shorten a corpus faster than humans and without bias. Besides general summaries, it's possible to extract certain parts, create an abstract, or index the text for queries. Armed with Natural Language Understanding, a computer can find relevant information in unstructured data. Phone calls, spontaneous chat messages, handwritten notes, and many other sources no longer require human processing. Even translations are getting more and more accurate thanks to NLP, even though they are immensely complex tasks for a machine.

If you wish to try NLP for yourself, it's worth knowing some Python, since it has a wide variety of NLP libraries (many of them, like spaCy, implemented in Cython for speed). spaCy offers deep learning capabilities as well as high performance and a fairly friendly learning curve. It can handle tokenization, NER, dependency parsing, POS tagging, and syntax-based sentence segmentation. Textacy is built on top of spaCy and expands its features with further flexibility: it can clean datasets and create visualized statistics or comparisons. NeuralCoref also integrates into spaCy's pipelines and provides coreference resolution options.

Summary

This post only scratched the surface of NLP. First, we learned about the gap between human and computer communication and the hardships it causes. Then came a look at the incredible complexity and diversity of languages, which creates another set of challenges.

As for the solutions, we mentioned rule-based approaches followed by an overview of the strengths and weaknesses of Text Classification. Finally, we went a bit deeper into Natural Language Understanding. Besides learning about the layers of syntax and semantic analysis, we touched briefly on use cases and NLP technologies.

Hopefully, this short introduction piqued your interest to dive into the adventures of Natural Language Processing.
