# An Introduction to Natural Language in Python using spaCy

## Introduction

This tutorial provides a brief introduction to working with natural language (sometimes called "text analytics") in Python, using [spaCy](https://spacy.io/) and related libraries.
Data science teams in industry must work with lots of text, one of the top four categories of data used in machine learning.
Usually that's human-generated text, although not always.

Think about it: how does the "operating system" for business work? Typically, there are contracts (sales contracts, work agreements, partnerships), there are invoices, there are insurance policies, there are regulations and other laws, and so on.
All of those are represented as text.

You may run across a few acronyms: _natural language processing_ (NLP), _natural language understanding_ (NLU), _natural language generation_ (NLG) — which are roughly speaking "read text", "understand meaning", "write text" respectively.
Increasingly these tasks overlap and it becomes difficult to categorize any given feature.

The _spaCy_ framework, along with a growing set of plug-ins and other integrations, provides features for a wide range of natural language tasks.
It has become one of the most widely used natural language libraries in Python for industry use cases and has a large community, which in turn supports the commercialization of research advances as this area continues to evolve rapidly.

## (If you are not in online colab) Getting Started

Check out the excellent _spaCy_ [installation notes](https://spacy.io/usage) for a "configurator" which generates installation commands based on which platforms and natural languages you need to support.

Some people tend to use `pip` while others use `conda`, and there are instructions for both.  For example, to get started with _spaCy_ working with text in English and installed via `conda` on a Linux system:
```
conda install -c conda-forge spacy
python -m spacy download en_core_web_sm
```

BTW, the second line above is a download for language resources (models, etc.) and the `_sm` at the end of the download's name indicates a "small" model. There's also "medium" and "large", albeit those are quite large. Some of the more advanced features depend on the latter, although we won't quite be diving to the bottom of that ocean in this (brief) tutorial.
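If it is unclear which of these models are already present in an environment, a quick check can save a failed `spacy.load` call. The following is a minimal sketch: `model_available` is a hypothetical helper name, and it relies on `spacy.util.is_package`, which only reports whether a Python package with that name is installed.

```python
import spacy


def model_available(name: str) -> bool:
    """Hypothetical helper: True if a package called `name` is installed."""
    return spacy.util.is_package(name)


# The three English model sizes mentioned above.
for name in ("en_core_web_sm", "en_core_web_md", "en_core_web_lg"):
    status = "installed" if model_available(name) else "not installed"
    print(f"{name}: {status}")
```

Anything reported as "not installed" can be fetched with the same `python -m spacy download` command shown earlier.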

Now let's load _spaCy_ and run some code:

## (If you are in online colab) Start here!

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

That `nlp` variable is now your gateway to all things _spaCy_ and is loaded with the `en_core_web_sm` small model for English. Next, let's run a small "document" through the natural language parser:

In [None]:
text = "The weather is really nice today :)"
doc = nlp(text)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

First we created a [doc](https://spacy.io/api/doc) from the text, which is a container for a document and all of its annotations. Then we iterated through the document to see what _spaCy_ had parsed.

Good, but it's a lot of info and a bit difficult to read. Let's reformat the _spaCy_ parse of that sentence as a [pandas](https://pandas.pydata.org/) dataframe:

In [None]:
import pandas as pd

cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)

df

Much more readable!
In this simple case, the entire document is merely one short sentence.
For each word in that sentence _spaCy_ has created a [token](https://spacy.io/api/token), and we accessed fields in each token to show:

 - raw text
 - [lemma](https://en.wikipedia.org/wiki/Lemma_(morphology)) – a dictionary form of the word
 - [part of speech](https://en.wikipedia.org/wiki/Part_of_speech)
 - a flag for whether the word is a _stopword_ – i.e., a common word that may be filtered out
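The stopword flag in particular lends itself to a quick filter. The sketch below assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only), so it runs without a downloaded model; lemmas and parts of speech are not available in that setup, only lexical flags such as `is_stop` and `is_punct`. The variable names are chosen for illustration.

```python
import spacy

# Assumption: a blank English pipeline, i.e. tokenizer and lexical flags only.
blank_nlp = spacy.blank("en")
blank_doc = blank_nlp("The weather is really nice today :)")

# Keep tokens that are neither stopwords nor pure punctuation.
content_words = [t.text for t in blank_doc if not t.is_stop and not t.is_punct]
print(content_words)
```

Note that the emoticon stays together as a single token and is filtered out here because it consists only of punctuation characters.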

Next let's use the [displaCy](https://ines.io/blog/developing-displacy) library to visualize the parse tree for that sentence:

In [None]:
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True)

Does that bring back memories of grade school? Frankly, for those of us coming from more of a computational linguistics background, that diagram sparks joy.

No? Fair enough.

But let's back up for a moment. How do you handle multiple sentences?

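_spaCy_ provides _sentence boundary detection_ (SBD), also known as _sentence segmentation_. In a trained pipeline such as `en_core_web_sm`, sentence boundaries come from the dependency parser by default; a simpler rule-based [sentencizer](https://spacy.io/api/sentencizer) is also available: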

In [None]:
text = "We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit. I fell in. Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket. The gorillas just went wild."

doc = nlp(text)

for sent in doc.sents:
    print(">", sent)
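The rule-based sentencizer linked above can also be tried in isolation. This is a minimal sketch assuming spaCy v3, where a component is added by its string name; the blank pipeline and the variable names are choices made for illustration.

```python
import spacy

# Assumption: spaCy v3 API, blank English pipeline plus the sentencizer only.
rule_nlp = spacy.blank("en")
rule_nlp.add_pipe("sentencizer")

rule_doc = rule_nlp("Everyone screamed. The gorillas just went wild.")
rule_sents = [sent.text for sent in rule_doc.sents]
print(rule_sents)
```

Because it only looks at punctuation, this component is fast and needs no trained model, but it can be less accurate on text with unusual punctuation.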

When _spaCy_ creates a document, it uses a principle of _non-destructive tokenization_, meaning that the tokens, sentences, etc., are simply indexes into a long array. In other words, _spaCy_ doesn't carve the text stream into little pieces. So each sentence is a [span](https://spacy.io/api/span) with a _start_ and an _end_ index into the document array:

In [None]:
for sent in doc.sents:
    print(">", sent.start, sent.end)

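We can index into the document array to pull out the tokens for one sentence, here the last one (use the `start` and `end` values printed above, which can differ between model versions):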

In [None]:
doc[48:54]

Or simply index into a specific token, such as the verb `went` in the last sentence:

In [None]:
token = doc[51]
print(token.text, token.lemma_, token.pos_)
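The non-destructive property described above can be checked directly. The sketch below assumes a blank English pipeline so that it runs without a trained model; the sample string (with a deliberate double space) and the variable names are made up for illustration.

```python
import spacy

tok_nlp = spacy.blank("en")  # assumption: tokenizer only, no trained model
raw = "The gorillas just  went wild."  # note the double space
tok_doc = tok_nlp(raw)

# Joining every token with its trailing whitespace rebuilds the input exactly.
rebuilt = "".join(t.text_with_ws for t in tok_doc)
print(rebuilt == raw)

# A slice of the Doc is a Span; its character offsets point back into `raw`.
span = tok_doc[1:3]
print(span.start, span.end, repr(raw[span.start_char:span.end_char]))
```

Token indexes (`start`, `end`) address the document array, while `start_char` and `end_char` address the original string, which is handy when annotations have to be mapped back onto the source text.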

At this point we can parse a document, segment that document into sentences, then look at annotations about the tokens in each sentence. That's a good start.

# Side note about making text (more) readable
First, a little housekeeping:

In [None]:
import sys
import warnings

warnings.filterwarnings("ignore")

## Now let's try to understand aliens

In [None]:
x = "Rinôçérôse screams ﬂow not unlike an encyclopædia, \
'TECHNICIÄNS ÖF SPÅCE SHIP EÅRTH THIS IS YÖÜR CÄPTÅIN SPEÄKING YÖÜR ØÅPTÅIN IS DEA̋D' to Spın̈al Tap."

type(x)

In [None]:
repr(x)

In [None]:
ascii(x)

In [None]:
x.encode('utf8')

In [None]:
x.encode('ascii', 'ignore')

In [None]:
import unicodedata

unicodedata.normalize('NFKD', x).encode('ascii','ignore')

Of course, this example is a little contrived: people rarely write `technicians` with an `ä`. In real text, this step can also lose information. For example, the German words `fallen` (to fall) and `fällen` (to fell) become identical after this processing step. Why is such preprocessing useful nonetheless?
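The information loss mentioned above can be reproduced with the standard library alone. In this small sketch, `strip_to_ascii` is a hypothetical helper name that wraps the same `normalize` and `encode` calls used in the cell above.

```python
import unicodedata


def strip_to_ascii(s: str) -> str:
    """NFKD-decompose, then drop every character that is not plain ASCII."""
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", "ignore").decode("ascii")


for word in ("fällen", "fallen", "ﬂow", "encyclopædia"):
    print(repr(word), "->", repr(strip_to_ascii(word)))
```

The umlaut decomposes into a base letter plus a combining mark, and the ligature `ﬂ` decomposes into `fl`, so both survive in a readable form. The letter `æ` has no decomposition and is dropped altogether, which shows that the step is lossy in more than one way.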

## Natural Language Understanding

Now let's dive into some of the _spaCy_ features for NLU.
Given that we have a parse of a document, from a purely grammatical standpoint we can pull the [noun chunks](https://spacy.io/usage/linguistic-features#noun-chunks), i.e., each of the noun phrases:

In [None]:
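# Two sample sentences: `text` is rich in named entities, `text2` is not.
text = "Steve Jobs and Steve Wozniak incorporated Apple Computer on January 3, 1977, in Cupertino, California."
text2 = "We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit."
# Parse `text2` here; swap in `text` to compare the results in the cells below.
doc = nlp(text2)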

for chunk in doc.noun_chunks:
    print(chunk.text)

Not bad. The noun phrases in a sentence generally carry most of its information content, so extracting them works as a simple filter that reduces a long document to a more "distilled" representation.

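We can take this approach further and identify [named entities](https://spacy.io/usage/linguistic-features#named-entities) within the text, i.e., spans that name real-world things such as people, organizations, and places (mostly proper nouns, though the English models also label dates and quantities):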

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
displacy.render(doc, style="ent", jupyter=True)

In [None]:
displacy.render(nlp("Bill Gates founded Microsoft"), style="ent", jupyter=True)
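Entities are spans too, so they carry a label plus token indexes into the document. To show that structure without depending on what a particular model predicts, the sketch below writes the labels by hand in IOB notation. It assumes the spaCy v3 `Doc` constructor with its `ents` argument, and the labels are illustrative rather than model output.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: entity labels are written by hand (IOB tags), not predicted.
words = ["Bill", "Gates", "founded", "Microsoft"]
iob = ["B-PERSON", "I-PERSON", "O", "B-ORG"]
ent_doc = Doc(Vocab(), words=words, ents=iob)

# Each entity is a Span with a label and start/end token indexes.
ent_rows = [(e.text, e.label_, e.start, e.end) for e in ent_doc.ents]
for row in ent_rows:
    print(row)
```

With a trained pipeline, the same loop over `doc.ents` works unchanged; only the source of the labels differs.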

## Using the Parse Tree for Relation Classification
The shortest dependency path (SDP) can be used for relation classification, the task of determining the relation expressed between two entities in a text. [[Xu et al. 2015](https://aclanthology.org/D15-1206/)]

In the first example below, notice how `causes` is the verb on the shortest (undirected) path between A and B in the dependency tree.

In [None]:
displacy.render(nlp("A certainly causes B"), style="dep", jupyter=True)
displacy.render(nlp("He was born before 2000 in the capital of Germany, Berlin."), style="dep", jupyter=True)
displacy.render(nlp("He was born before 2000 in the capital of Germany, Berlin."), style="ent", jupyter=True)
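The path itself can be computed from the `head` links of the tokens. The following is a rough sketch, not the method from the cited paper: `path_to_root` and `shortest_dependency_path` are hypothetical helper names, and the parse of "A certainly causes B" is written by hand (spaCy v3 `Doc` constructor with `heads` and `deps`) so that the example does not depend on a model's prediction.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def path_to_root(token):
    """Tokens from `token` up to the root (the root is its own head)."""
    path = [token]
    while path[-1].head.i != path[-1].i:
        path.append(path[-1].head)
    return path


def shortest_dependency_path(a, b):
    """Undirected path from token `a` to token `b` in one dependency tree."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    pos_on_b = {t.i: k for k, t in enumerate(up_b)}
    for k, t in enumerate(up_a):
        if t.i in pos_on_b:
            # `t` is the lowest common ancestor: climb from a, descend to b.
            return up_a[: k + 1] + up_b[: pos_on_b[t.i]][::-1]
    return []  # no common ancestor, e.g. tokens from different sentences


# Assumption: hand-written parse; heads are absolute token indexes.
sdp_doc = Doc(
    Vocab(),
    words=["A", "certainly", "causes", "B"],
    heads=[2, 2, 2, 2],
    deps=["nsubj", "advmod", "ROOT", "dobj"],
)
sdp = shortest_dependency_path(sdp_doc[0], sdp_doc[3])
print([t.text for t in sdp])
```

With a loaded pipeline, the same function can be applied to tokens of `nlp("...")`, for instance to the root tokens of two entity spans. The adverb `certainly` is not on the path, which is why the SDP is a compact input for relation classification.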