# Keyword Extraction

The simple count-based method for extracting sublanguage-specific vocabulary only supports exploratory analysis. It provides no objective measure of how specific a word is to a sublanguage corpus.
To address this problem, we can use either log-likelihood or tf-idf to extract sublanguage-specific vocabulary.

### TF-IDF

To determine the difference between two or more sources, we have to formulate a weight for
each word with regard to each text source. One possible measure is tf-idf, which weights a term by how exclusively it is used in single documents. The more documents a term is used in,
the less weight it receives under tf-idf. This follows the intuition
that a term which appears in many documents can't be unique to a certain class or domain.
Following Wikipedia, the tf-idf value increases proportionally to the number of times a word
appears in the document, but is often offset by the frequency of the word in the corpus, which
helps to adjust for the fact that some words appear more frequently in general.[1]

For normalized term frequency $tf(t,D)$ there are various options (see lecture, videos in moodle or research).
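
As a small illustration of the weighting described above, the following sketch computes tf-idf on a toy document-term matrix. The counts and terms are made up, and normalizing the term frequency by the most frequent term of each document is only one of the possible options.

```python
import numpy as np

# Toy document-term matrix: rows are documents, columns are terms.
# The counts are invented purely for illustration.
terms = ["motor", "bank", "und"]
dtm = np.array([
    [4, 0, 8],  # document 0
    [0, 3, 6],  # document 1
])

# Term frequency, normalized by the most frequent term of each document.
tf = dtm / dtm.max(axis=1, keepdims=True)

# Inverse document frequency: log(N / df), where df is the number of
# documents that contain the term.
df = (dtm > 0).sum(axis=0)
idf = np.log(dtm.shape[0] / df)

tfidf = tf * idf
for term, weights in zip(terms, tfidf.T):
    print(term, weights.round(3))
```

The term `und` occurs in both documents, so its idf and therefore its tf-idf weight is 0, whereas `motor` and `bank` each receive a positive weight only in the document that contains them.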

### Log Likelihood 
Another way to measure the relative importance of words is log-likelihood.
Here we compare the word counts in each domain with the word counts in
a reference corpus in order to find significant differences.
The significance test used is the log-likelihood ratio test (LL). The LL value measures how strongly a term's observed frequency in the target corpus deviates from the frequency expected if the term were equally likely in the target and the reference
corpus.
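
To make the test concrete, here is a minimal sketch for a single word, assuming the common two-corpus formulation with expected counts `e1` and `e2`. The function name `log_likelihood_word` and the example counts are hypothetical.

```python
import numpy as np

def log_likelihood_word(a, b, c, d):
    """LL value for one word.

    a: count of the word in the target corpus, c: tokens in the target corpus
    b: count of the word in the reference corpus, d: tokens in the reference corpus
    """
    # Expected counts if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    # By convention, a term with an observed count of 0 contributes 0.
    t1 = a * np.log(a / e1) if a > 0 else 0.0
    t2 = b * np.log(b / e2) if b > 0 else 0.0
    return 2 * (t1 + t2)

# Same relative frequency in both corpora: no evidence of a difference.
print(log_likelihood_word(30, 30, 10_000, 10_000))
# Clearly more frequent in the target corpus: large LL value.
print(log_likelihood_word(50, 10, 10_000, 10_000))
```

Note that the LL value is large both for words that are overused and for words that are underused in the target corpus, so comparing `a` with `e1` may be needed to tell the two cases apart.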


### Corpora

We provide text for the three domains `Automobil`, `Wirtschaft` and `Sport`.
If you need a reference corpus, visit the [wortschatz-portal](https://wortschatz.uni-leipzig.de/de/download/German#deu_news_2021) and download a sufficiently large sample; around 4 million sentences should suffice.

### Text Preprocessing

Be aware that text preprocessing has considerable influence on the outcome. Part of this exercise is
to deploy a reasonable preprocessing pipeline. Make use of your knowledge about the Zipf distribution and other text preprocessing techniques.
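
One way to apply the Zipf distribution here: the few highest-ranked word types are mostly function words, while the long tail of rare types offers little statistical evidence. The sketch below prunes both ends of the frequency ranking. The function name `prune_vocab` and the tiny default thresholds are illustrative assumptions chosen for the toy input.

```python
from collections import Counter

def prune_vocab(tokens, cut_first=2, min_freq=2):
    """Drop the `cut_first` most frequent types and all types rarer than `min_freq`."""
    ranked = Counter(tokens).most_common()  # sorted by decreasing frequency
    return [word for word, freq in ranked[cut_first:] if freq >= min_freq]

tokens = "der der der der und und und motor motor tor".split()
print(prune_vocab(tokens))
```

On a real corpus, suitable thresholds depend on the corpus size and should be checked against the resulting keyword lists.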

To analyze differences, we need to build a single "document" for each domain. This
means that if there is more than one document per domain, we concatenate all texts belonging to that domain into a single text source.
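
A minimal sketch of this step, assuming a hypothetical dictionary `docs_by_domain` that maps each domain to a list of raw documents:

```python
# Hypothetical input: several raw documents per domain.
docs_by_domain = {
    "sport": ["Der Trainer lobt die Mannschaft.", "Die Saison beginnt."],
    "automobil": ["Der Motor hat 150 PS."],
}

# Join all documents of a domain into one text source.
domain_texts = {
    domain: " ".join(docs) for domain, docs in docs_by_domain.items()
}
print(domain_texts["sport"])
```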

### Implementation Notes

It makes sense to first introduce a function that transforms a collection of documents into a Document-Term-Matrix (DTM). For that, the numpy library's array class is worth a look. (In practice, data sizes may quickly exceed memory. It is then necessary to consider data structures that accommodate this, e.g. sparse arrays. In this exercise, standard numpy arrays should suffice.)
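
Such a function could look roughly like the following sketch. The name `build_dtm` and the alphabetically sorted vocabulary are assumptions made for illustration, and the input is expected to be already tokenized.

```python
from collections import Counter

import numpy as np

def build_dtm(tokenized_docs):
    """Return (dtm, vocab) for a dict {document name: list of tokens}."""
    counts = {name: Counter(tokens) for name, tokens in tokenized_docs.items()}
    # Global vocabulary over all documents, in a fixed (sorted) column order.
    vocab = sorted(set().union(*(c.keys() for c in counts.values())))
    # One row per document, one column per vocabulary entry.
    # A Counter returns 0 for words that do not occur in a document.
    dtm = np.array([[c[word] for word in vocab] for c in counts.values()])
    return dtm, vocab

docs = {"a": ["motor", "motor", "bank"], "b": ["bank", "tor"]}
dtm, vocab = build_dtm(docs)
print(vocab)
print(dtm)
```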

**Hint 1** If you use numpy, be aware that numpy contains a lot of useful functions like logarithms or sorting.

**Hint 2** Beware of numerical traps like the undefined logarithm of 0.
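
One possible way around the logarithm of 0, sketched with made-up numbers: compute the logarithm only where the argument is positive and leave the remaining entries at 0, using the `where` and `out` arguments of `np.log`.

```python
import numpy as np

# Invented observed and expected counts; the first observed count is 0.
counts = np.array([0.0, 2.0, 5.0])
expected = np.array([1.0, 1.0, 5.0])

ratio = counts / expected
# Evaluate the log only for positive ratios; other entries keep the value 0.
log_ratio = np.zeros_like(ratio)
np.log(ratio, out=log_ratio, where=ratio > 0)

# A count of 0 now contributes 0 instead of producing nan or -inf.
term = counts * log_ratio
print(term)
```

Adding a small constant inside the logarithm is a simpler alternative, at the cost of slightly perturbing every value.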

### Task

Apply the two measures tf-idf and Log-Likelihood to extract the top keywords for the 3 corpora `Automobil`, `Wirtschaft` and `Sport`.

In [1]:
import nltk
import numpy as np
import pandas as pd # Imported for prettier output of keyword lists.

In [2]:
import zipfile

content = {}
with zipfile.ZipFile("keyword.zip") as zfile:
    for f in zfile.namelist():
        if f != "keyword/":
            content[f] = zfile.read(f).decode("utf8")

topics = content.keys()
topics = {t: t.split("/")[-1].split("_")[0] for t in topics}
print(topics)

{'keyword/automobil_50k.txt': 'automobil', 'keyword/wirtschaft_50k.txt': 'wirtschaft', 'keyword/sport_50k.txt': 'sport'}


### Single File Reference

In [3]:
path_single_file = "deu_news_2021_100K-sentences.txt"
with open(path_single_file, encoding="utf8") as fs:
    # Keep only the second tab-separated field of each line (the sentence), lowercased.
    reference = " ".join(line.split("\t")[1].lower() for line in fs)

In [4]:
print(reference[:200])

…(15) aus roitham, nachdem er mit seinem moped in einen traktor gekracht war.
 "15 richter haben ja gesagt, aus unterschiedlichen politischen richtungen.
 "1959 war es noch eine gesamtdeutsche mannsch


### Text Cleaning

In [5]:
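def preprocess(txt):
    # Accept either a raw string or an already tokenized list of words.
    # Alternative tokenizer: txt = nltk.word_tokenize(txt)
    if isinstance(txt, str):
        txt = txt.split(" ")
    # Keep alphabetic tokens longer than one character; normalize case and "ß".
    return [
        x.replace("ß", "ss").lower()
        for x in txt
        if x.isalpha() and len(x) > 1
    ]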

#### Create DTM in Steps

In [6]:
# def create_dtm()

# pruning of vocabulary
cut_first=200 # prune top 200 (stopwords etc.)
min_freq=3 # at least 3 occurrences required else probably not important enough for all texts

# Clean texts
texts = {k: preprocess(v) for k,v in content.items()}
        
# Count texts' global vocabulary
vocab = nltk.FreqDist(sum(texts.values(),[]))
print("vocab size:", len(vocab))

# Prune overall vocabulary
vocab = sorted(list(vocab.items()), key=lambda x: -x[1]) 
vocab = vocab[cut_first:]
vocab = [x[0] for x in vocab if x[1] >= min_freq]
print("vocab size:", len(vocab))

vocab size: 111920
vocab size: 34185


In [7]:
# Count tokens per domain, then restrict the counts to the pruned vocabulary.
# 1. Per-domain term frequencies
term_frequencies = {genre: nltk.FreqDist(text) for genre, text in texts.items()}
# 2. Create the global DTM by iterating over the global (!) vocabulary.
dtm = np.array([[v.get(w, 0) for w in vocab] for k,v in term_frequencies.items()])
print("DTM   shape", dtm.shape)
print(dtm)

# columns
words = np.array([w for w in vocab])
print("\nwords shape", words.shape)
print(words)

# rows
topics = content.keys()
topics = np.array([t.split("/")[-1].split("_")[0] for t in topics])[:,np.newaxis]
print("\ntopics shape", topics.shape)
print(topics)


# 3. compute term frequencies / inverse document frequencies
tf = dtm / dtm.max(-1, keepdims=True)
idf = np.log(dtm.shape[0] / ((dtm>0).sum(0) + 1e-25))
print("\ntf    shape", tf.shape)
print(tf)
print("\nidf   shape", idf.shape)
print(idf)
# 4. TF-IDF
tfidf = tf * idf
print("\ntfidf shape", tfidf.shape)
print(tfidf)

DTM   shape (3, 34185)
[[440 817 333 ...   0   0   0]
 [214  32 222 ...   0   0   0]
 [206  11 301 ...   3   3   3]]

words shape (34185,)
['kaum' 'wagen' 'lange' ... 'nielsen' 'tamara' 'huh']

topics shape (3, 1)
[['automobil']
 ['wirtschaft']
 ['sport']]

tf    shape (3, 34185)
[[0.53855569 1.         0.40758874 ... 0.         0.         0.        ]
 [0.38282648 0.05724508 0.39713775 ... 0.         0.         0.        ]
 [0.27357238 0.01460823 0.3997344  ... 0.00398406 0.00398406 0.00398406]]

idf   shape (34185,)
[0.         0.         0.         ... 1.09861229 1.09861229 1.09861229]

tfidf shape (3, 34185)
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.00437694 0.00437694 0.00437694]]


In [8]:
# extract keywords (sort by decreasing tf-idf, take top-n words)
docnr = 0
n = 10
print("doc:", list(content.keys())[docnr])
words = [vocab[k] for k in tfidf[docnr].argsort(0)[::-1][:n]]
print(words)

doc: keyword/automobil_50k.txt
['hubraum', 'litern', 'diesel', 'autofahrer', 'verbrauch', 'roadster', 'coupé', 'fahrzeugs', 'durchschnittsverbrauch', 'design']


### Keyword Extractor

In [9]:
class TFIDF:
    def create_dtm(self, texts, cut_first=200, min_freq=3):
        
        # Clean texts
        self.texts = {k: preprocess(v) for k,v in texts.items()}
        
        # Count texts' global vocabulary
        self.vocab = nltk.FreqDist(sum(self.texts.values(),[]))
        
        # Prune overall vocabulary
        self.vocab = sorted(list(self.vocab.items()), key=lambda x: -x[1]) 
        self.vocab = [x[0] for x in self.vocab[cut_first:] if x[1] >= min_freq]
        
        # Count tokens per domain, then restrict the counts to the pruned vocabulary.
        # 1. Per-domain term frequencies
        self.term_frequencies = {genre: nltk.FreqDist(text) for genre, text in self.texts.items()}
        # 2. Create the global DTM by iterating over the global (!) vocabulary.
        self.dtm = np.array([[v.get(w,0) for w in self.vocab] for k,v in self.term_frequencies.items()])
    
    def tfidf(self):
        # tf = np.log(self.dtm + 1e-25)  # alternative normalization: logarithmic
        tf = self.dtm / self.dtm.max(-1, keepdims=True)
        # Use the instance's own DTM for the document frequencies, not a global variable.
        idf = np.log(self.dtm.shape[0] / ((self.dtm > 0).sum(0) + 1e-25))
        return tf * idf

    def tfidf_keywords(self, n=10):
        """Return the top-n tf-idf keywords for each corpus."""
        tfidf = self.tfidf()
        return {
            k: [
                self.vocab[j] for j in tfidf[i].argsort(0)[::-1][:n]
            ]
            for i, k in enumerate(self.texts.keys())
        }

In [10]:
da = TFIDF()
da.create_dtm(content)

In [11]:
pd.DataFrame(da.tfidf_keywords(n=25))

|  | keyword/automobil_50k.txt | keyword/wirtschaft_50k.txt | keyword/sport_50k.txt |
|---|---|---|---|
| 0 | hubraum | dax | tsv |
| 1 | litern | gdl | vfb |
| 2 | diesel | verlinken | sg |
| 3 | autofahrer | zentralbank | hertha |
| 4 | verbrauch | aktien | borussia |
| 5 | roadster | commerzbank | kader |
| 6 | coupé | index | torhüter |
| 7 | fahrzeugs | mehdorn | verlinken |
| 8 | durchschnittsverbrauch | ubs | tabellenführer |
| 9 | design | arbeitsmarkt | fsv |


--> https://temir.org/teaching/natural-language-processing-ss24/materials/part-en-nlp-keywords.pdf  
--> `NLP:IX-20 NLP Applications`  

In [12]:
class LogLike:
    def create_dtm(self, texts, cut_first=200, min_freq=3):
        
        # Clean texts
        self.texts = {k:preprocess(v) for k,v in texts.items()}
        
        # Count texts' global vocabulary
        self.vocab = nltk.FreqDist(sum(self.texts.values(),[]))
        
        # Prune overall vocabulary
        self.vocab = sorted(list(self.vocab.items()),key= lambda x: -x[1]) 
        self.vocab = [x[0] for x in self.vocab[cut_first:] if x[1] >=min_freq]
        
        # Count tokens per domain, then restrict the counts to the pruned vocabulary.
        # 1. Per-domain term frequencies
        self.term_frequencies = {genre: nltk.FreqDist(text) for genre, text in self.texts.items()}
        # 2. Create the global DTM by iterating over the global (!) vocabulary.
        self.dtm = np.array([[v.get(w,0) for w in self.vocab] for k,v in self.term_frequencies.items()])

    def log_likelihood(self, corpus, n=None, threshold=None):
        # Use the instance's own state instead of the global variable `da`.
        keys = list(self.texts.keys())

        # counts for the target corpus (a) and the reference corpus (b)
        a = self.dtm[keys.index(corpus)]
        b = self.dtm[keys.index("reference")]

        # total frequency (count tokens)
        c = a.sum()
        d = b.sum()

        # expected counts if a word were equally likely in both corpora
        e1 = c * (a + b) / (c + d) + 1e-25
        e2 = d * (a + b) / (c + d) + 1e-25

        # log-likelihood value for each word in the vocabulary
        ll = 2 * (a * np.log(a / e1 + 1e-25) + b * np.log(b / e2 + 1e-25))

        if threshold is not None:
            return [self.vocab[k] for k in np.where(ll > threshold)[0]]

        if n is not None:
            return [self.vocab[k] for k in ll.argsort(0)[::-1][:n]]


    def ll_keywords(self, n=None, threshold=None):
        return {k: self.log_likelihood(k, n=n, threshold=threshold) for k in self.texts.keys() if k != "reference"}

In [13]:
content["reference"] = reference # Add the reference corpus
da = LogLike()
da.create_dtm(content, cut_first=200)

In [14]:
llk = da.ll_keywords(n=25)
pd.DataFrame(llk)

|  | keyword/automobil_50k.txt | keyword/wirtschaft_50k.txt | keyword/sport_50k.txt |
|---|---|---|---|
| 0 | ps | bank | trainer |
| 1 | liter | dollar | mannschaft |
| 2 | bmw | banken | saison |
| 3 | audi | polizei | sv |
| 4 | wagen | opel | verlinken |
| 5 | vw | verlinken | sieg |
| 6 | mercedes | konzern | kostenfrei |
| 7 | motor | kostenfrei | link |
| 8 | opel | link | spieler |
| 9 | litern | krise | stehenden |
