{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Information Retrieval Lab: Advanced Retrieval"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this exercise, we will implement a more advanced retrieval model. In contrast to the last exercise, where we explored vector space retrieval using TFIDF-vectors, in this notebook, we will take a look at a generative (language) model for retrieval."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lecture Recall\n",
"\n",
"First, recall some important concepts from the lecture:\n",
"\n",
"\n",
"###### Document representations\n",
"> The representation $\\pmb{d}$ of a document $d$ is a probability distribution over the index terms $T$, where the probability of the $i$-th term in $T$ indicates how likely it occurs in documents generated by the topicsâ€™ language model that also generated $\\pmb{d}$.\n",
"\n",
"###### Query representations\n",
"> A query $q$ is represented as a sequence $\\pmb{q}$ of index terms.\n",
"\n",
"###### Relevance Function $\\rho$\n",
"> The relevance function formulated as a query likelihood model: $\\rho(d,q) =P(\\pmb{d}|\\pmb{q})$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"To implement a scoring model, we first import existing code from the previous exercises. We will make use of the Shakespeare document collection, the preprocessing component and the index. We supply a reference solution you can import, but feel free to continue using your own code!\n",
"\n",
"**Exercise**: using the code from previous exercises, construct an index on top of the Shakespeare document collection."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from modules.shakespeare import corpus\n",
"from modules.preprocessing import PreprocessorSpacy as Preprocessor\n",
"from modules.indexing import Index\n",
"\n",
"preprocessor = Preprocessor()\n",
"corpus_preprocessed = list(zip(\n",
" [doc[0] for doc in corpus],\n",
" map(preprocessor.preprocess, [doc[1] for doc in corpus])\n",
"))\n",
"\n",
"index = Index(corpus_preprocessed)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deriving the relevance function $\\rho$\n",
"\n",
"Let $d$ denote a language model for document, and $q$ the sequence of query terms from query. Then the query likelihood model is derived as follows, utilizing Bayes' rule:\n",
"\n",
"$$ \\rho = P(\\pmb{d} | \\pmb{q}) = \\frac{P(\\pmb{q}|\\pmb{d})\\cdot P(\\pmb{d})}{P(\\pmb{q})} $$\n",
"\n",
"Here, $P(\\pmb{q})$ can be omitted without affecting the relative ranking of all documents (i.e. omitting it influences all scores equally), and $P(\\pmb{d})$ is assumed uniform, cancelling its influence.\n",
"\n",
"The query representation $\\pmb{q}$ can be written as sequence of $t_1, ..., t_{|q|}$ of index terms, which (under the assumption that terms are independent of each other) allows us to rewrite the equation as follows:\n",
"\n",
"$$ \\rho = P(q | \\pmb{d}) = P(t_1, ..., t_{|q|} | \\pmb{d}) = \\prod_{i=1}^{|q|} P(t_i | \\pmb{d})$$\n",
"\n",
"This means that we can divide our implementation into four steps:\n",
"1. Write a function to calculate the probability $P(t | d)$ for a single term\n",
"2. Apply this function to all query terms\n",
"3. Multiply the results to get the relevance score of a document\n",
"4. Repeate 1-3 for all documents to rank them w.r.t the query\n",
"\n",
"We can use maximum likelihood estimation to calculate the term probability:\n",
"\n",
"$$ P(t | \\pmb{d}) = \\frac{\\text{tf}(t,d)}{|d|}$$\n",
"\n",
"However, this creates a problem: if a term $t$ is contained in the query, but does not occur in a document (i.e. $\\text{tf}(t,d) = 0$), $P(t|\\pmb{d}) = 0$. Since the individual term probabilities are multiplied, this results in zero relevance for all documents that only partially contain the query!\n",
"\n",
"**Solution**: we also calculate the probability of the term based on the language model of the corpus, $D$, and combine both probabilities."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Document Probability\n",
"\n",
"As first part to our relevance function, we implement the probability of a term based on a document language model $P(t|\\pmb{d})$.\n",
"\n",
"$$ P(t | \\pmb{d}) = \\frac{\\text{tf}(t,d)}{|d|}$$\n",
"\n",
"**Exercise**: implement a function to calculate the term probability give a document."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def document_probability(index, term, doc_id):\n",
" \"\"\"\n",
" Calculates the conditional probability of a term give a document.\n",
" :param index: index to get frequency data from\n",
" :param term: term to calculate the probability for\n",
" :param doc_id: document to calculate the probability for\n",
" \"\"\"\n",
" frequency = index.get_term_frequency(term, doc_id)\n",
" doc_sum = 0\n",
" for t in index.get_index_terms():\n",
" doc_sum += index.get_term_frequency(t, doc_id)\n",
" if doc_sum == 0:\n",
" return 0\n",
" else: \n",
" return frequency/doc_sum\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Collection Probability\n",
"\n",
"As first part to our relevance function, we implement the probability of a term based on a document language model $P(t|\\pmb{D})$.\n",
"\n",
"$$ P(t | \\pmb{D}) = \\frac{\\sum_{d \\in D}\\text{tf}(t,d)}{\\sum_{d \\in D}|d|}$$\n",
"\n",
"**Exercise**: implement a function to calculate the term probability give a document collection."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def collection_probability(index, term):\n",
" \"\"\"\n",
" Calculates the conditional probability of a term give a document collection.\n",
" :param index: index to get frequency data from\n",
" :param term: term to calculate the probability for\n",
" \"\"\"\n",
" frequencies = []\n",
" doc_sums = []\n",
" \n",
" for doc_id in index.get_document_ids():\n",
" frequencies.append(index.get_term_frequency(term, doc_id))\n",
" doc_sum = 0\n",
" for t in index.get_index_terms():\n",
" doc_sum += index.get_term_frequency(t, doc_id)\n",
" doc_sums.append(doc_sum)\n",
" \n",
" return sum(frequencies)/sum(doc_sums)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combining Probabilities\n",
"\n",
"#### Jelinek-Mercer-Smoothing\n",
"\n",
"Now that we can calculate both the document probability and the collection probability of a term $t$, we need a way of combining them into a single value, indicating the overall probability of the term. To start out, we implement *Jelinek-Mercer* smoothing, which is a linear interpolation between both probabilities:\n",
"\n",
"$$P(t|\\pmb{d}) = (1- \\omega) \\cdot P(t | \\pmb{d}) + \\omega\\cdot P(t|\\pmb{D})$$\n",
"\n",
"where $\\omega$ is a paramter that controls the relative influence of both terms on the result.\n",
"\n",
"*Note:* we call the weigthing term $\\omega$ in this exercise. In the lecture slides, this is denoted as $\\lambda$, but since `lambda` is a reserved keyword in the Python language, we opt to rename it to avoid confusion.\n",
"\n",
"**Exercise**: implement the overall term probability with Jelinek-Mercer smoothing."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"def jm_term_probability(index, term, doc_id, omega):\n",
" \"\"\"\n",
" Calculates the conditional probability of a term give a document using Jelinek-Mercer smoothing.\n",
" :param index: index to get frequency data from\n",
" :param term: term to calculate the probability for\n",
" :param doc_id: document to calculate the probability for\n",
" :param omega: weigthing factor for the linear interpolation\n",
" \"\"\"\n",
" p1 = document_probability(index, term, doc_id)\n",
" p2 = collection_probability(index, term)\n",
" return (1-omega) * p1 + omega * p2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Dirichlet-Smoothing\n",
"\n",
"Instead of interpolating linearly between both probabilities, we can take a more fine-grained approach and base the interpolation factor on the length of the document: the longer it is, the more confidence is placed on the document probability, assuming *more data = better prediction*.\n",
"\n",
"This substitutes the term $\\omega$ with the following term:\n",
"\n",
"$$\\omega = \\frac{\\alpha}{|d| + \\alpha}$$\n",
"\n",
"where $\\alpha$ is a parameter that indicates the boundary at which more weight is put towards the document probability or the collection probability.\n",
"\n",
"*Note*: this method is called Dirichlet smoothing, since from a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter $\\alpha$ as a prior.\n",
"\n",
"**Exercise**: implement a weight function to calculate $\\omega$ for a given document and $\\alpha$ value."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def weight(index, doc_id, alpha):\n",
" \"\"\"\n",
" Calculates the dirichlet smoothing weighting factor for a given document and alpha value\n",
" :param index: index to get frequency data from\n",
" :param doc_id: document to calculate the weight factor for\n",
" :param alpha: alpha-prior for the dirichlet smoothing\n",
" \"\"\"\n",
" doc_len = 0\n",
" for term in index.get_index_terms():\n",
" doc_len += index.get_term_frequency(term, doc_id)\n",
" return alpha / (doc_len + alpha)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**: implement the overall term probability with Dirichlet smoothing."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def dirichlet_term_probability(index, term, doc_id, alpha):\n",
" \"\"\"\n",
" Calculates the conditional probability of a term give a document using Dirichlet smoothing.\n",
" :param index: index to get frequency data from\n",
" :param term: term to calculate the probability for\n",
" :param doc_id: document to calculate the probability for\n",
" :param alpha: alpha-prior for the dirichlet interpolation\n",
" \"\"\"\n",
" omega = weight(index, doc_id, alpha)\n",
" p1 = document_probability(index, term, doc_id)\n",
" p2 = collection_probability(index, term)\n",
" return (1-omega) * p1 + omega * p2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementing Relevance Functions \n",
"\n",
"#### Jelinek-Mercer Smoothing\n",
"\n",
"Given the probability function for a single term $t$ using Jelinek-Mercer-Smoothing, we can now implement the full relevance scoring function. We will however utilize a slightly different formulation of the total probability by taking its logarithm:\n",
"\n",
"$$ \\rho = \\log\\prod_{i=1}^{|q|} P(t_i | \\pmb{d}) = \\sum_{i=1}^{|q|} \\log P(t_i | \\pmb{d})$$\n",
"\n",
"The reason behind this is that the individual term probabilities can get extremely small. If we multiply them, we might run into an *underflow*: the number being too small to be represented by a 64-bit float. Therefore, utilizing the logarithm product rule, we can instead calculate the sum of the logarithm of the individual probabilities. Note that logarithmization yields negative relevance scores; recall that only the relative ranking among documents is important, not the scores themselves.\n",
"\n",
"**Exercise**: implement the relevance function $\\rho$ with Jelinek-Mercer smoothing. Use the sum of log probabilities as defined above."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from math import log\n",
"\n",
"def jm_score(index, query, doc_id, omega):\n",
" \"\"\"\n",
" Calculates the relevance of a document given a query using Jelinek-Mercer smoothing.\n",
" :param index: index to get relevance data from\n",
" :param query: query to calculate the relevance for\n",
" :param doc_id: document to calculate the relevance for\n",
" :param omega: omega paramter for Jelinek-Mercer smoothing\n",
" \"\"\"\n",
" rho = 1\n",
" for term in query:\n",
" rho += log(jm_term_probability(index, term, doc_id, omega))\n",
" return rho"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Dirichlet Smoothing\n",
"\n",
"Similarly, we can implement $\\rho$ with Dirichlet smoothing, once again applying the logarithm.\n",
"\n",
"**Exercise**: implement the relevance function $\\rho$ with Dirichlet smoothing. Use the sum of log probabilities as defined above."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from math import log\n",
"\n",
"def dirichlet_score(index, query, doc_id, alpha):\n",
" \"\"\"\n",
" Calculates the relevance of a document given a query using Dirichlet smoothing.\n",
" :param index: index to get relevance data from\n",
" :param query: query to calculate the relevance for\n",
" :param doc_id: document to calculate the relevance for\n",
" :param alpha: alpha paramter for Dirichlet smoothing\n",
" \"\"\"\n",
" rho = 1\n",
" for term in query:\n",
" rho += log(dirichlet_term_probability(index, term, doc_id, alpha))\n",
" return rho"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying the index\n",
"\n",
"We now have all the components needed to answer queries. Write a wrapper function that takes a text and returns the (top $k$) results for the query. It construct a query representation and calculate the pairwise probability between the query and all documents, returning an ordered list of `(doc_id, score)` tuples, descending by score.\n",
"\n",
"#### Jelinek-Mercer Smoothing\n",
"\n",
"**Exercise**: implement a query function for Jelinek-Mercer smoothing. Successful parameters are $\\omega= 0.1$ and $\\omega$= 0.7 for short and long queries, respectively."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def jm_query(index, preprocessor, text, omega=0.1, topK=-1):\n",
" \"\"\"\n",
" Queries a given text against the given index using a Jelinek-Mercer smoothed language model\n",
" :param preprocessor: preprocessor instance to process the query with\n",
" :param index: the index data to query against\n",
" :param text: query text\n",
" :param omega: weight parameter for Jelinek-Mercer smoothing\n",
" :param topK: number of top results to return\n",
" :return: list of (doc_id, score) tuples descending by score for all documents in the vector space\n",
" \"\"\"\n",
" query = preprocessor.preprocess(text)\n",
" scores = {}\n",
" for doc_id in index.get_document_ids():\n",
" scores[doc_id] = jm_score(index, query, doc_id, omega=omega)\n",
" \n",
" return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topK]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Coriolanus', -0.203513772874222),\n",
" ('Comedy of Errors', -1.1835784414003485),\n",
" ('Troilus and Cressida', -7.890135107818841),\n",
" ('Henry IV, Part II', -7.890135107818841),\n",
" ('Much Ado About Nothing', -7.890135107818841)]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"jm_query(index, preprocessor, \"proceed\", topK=5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Dirichlet Smoothing\n",
"\n",
"**Exercise**: implement a query function for Dirichlet smoothing. Typical parameters are $1000 < \\alpha < 2000$."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"def dirichlet_query(index, preprocessor, text, alpha=1000, topK=-1):\n",
" \"\"\"\n",
" Queries a given text against the given index using a Dirichlet smoothed language model\n",
" :param preprocessor: preprocessor instance to process the query with\n",
" :param index: the index data to query against\n",
" :param text: query text\n",
" :param alpha: alpha-parameter for Dirichlet smoothing\n",
" :param topK: number of top results to return\n",
" :return: list of (doc_id, score) tuples descending by score for all documents in the vector space\n",
" \"\"\"\n",
" query = preprocessor.preprocess(text)\n",
" scores = {}\n",
" for doc_id in index.get_document_ids():\n",
" scores[doc_id] = dirichlet_score(index, query, doc_id, alpha=alpha)\n",
" \n",
" return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topK]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('Coriolanus', -5.044738931143362),\n",
" ('Comedy of Errors', -5.049711591812738),\n",
" ('Hamlet', -5.587550014824796),\n",
" ('Measure for Measure', -5.5885495151578795),\n",
" ('Tempest', -5.5885495151578795)]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dirichlet_query(index, preprocessor, \"proceed\", topK=5)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}