# IR Lab WiSe 2023: Topics, Documents, and Relevance Judgments

Information retrieval experiments follow the [Cranfield Paradigm](https://en.wikipedia.org/wiki/Cranfield_experiments), in which retrieval systems are evaluated using a set of information needs (topics), documents, and relevance judgments.

We will use the [ir_datasets](https://ir-datasets.com/) and [tira](https://www.tira.io/task-overview/ir-lab-jena-leipzig-sose-2023) libraries to look at some examples using the retrieval scenario of the [IR Anthology](https://ir.webis.de/).

### Preparation: Install dependencies

First, we install the libraries `tira` and `ir_datasets`.

In [None]:
# Only needed in Google Colab; in a dev container, everything should already be installed
!pip3 install tira ir_datasets

### Step 1: Load the dataset and imports



In [1]:
from tira.third_party_integrations import ir_datasets

dataset = ir_datasets.load('ir-lab-jena-leipzig-sose-2023/iranthology-20230618-training')

Load ir_dataset "ir-lab-jena-leipzig-sose-2023/iranthology-20230618-training" from tira.


### Step 2: View the First Three Topics

The `dataset.queries_iter()` method creates an iterable over all topics in the dataset.
Each topic has a `query_id`, the query string that is submitted to the search engine (accessible via `default_text()`), a `description` that specifies what searchers with this information need are looking for, and a `narrative` that specifies which documents are relevant and which are not.

E.g., Topic `3` tries to identify papers that `help to recognize signs of self-harm in people's social media posts`.

In [2]:
for query in list(dataset.queries_iter())[:3]:
    print('\nQuery: ', query.query_id)
    print('\tText:\t\t' + query.default_text())
    print('\tDescription:\t' + query.description)
    print('\tNarrative:\t' + query.narrative)

No settings given in /Users/gienapp/.tira/.tira-settings.json. I will use defaults.

Query:  1
	Text:		retrieval system improving effectiveness
	Description:	What papers focus on improving the effectiveness of a retrieval system?
	Narrative:	Relevant papers include research on what makes a retrieval system effective and what improves the effectiveness of a retrieval system. Papers that focus on improving something else or improving the effectiveness of a system that is not a retrieval system are not relevant.

Query:  2
	Text:		machine learning language identification
	Description:	What papers are about machine learning for language identification?
	Narrative:	Relevant papers include research on methods of machine learning for language identification or how to improve those methods. Papers that focus on other methods for language identification or the usaged of machine learning not for language identification are not relevant.

Query:  3
	Text:		social media detect self-harm
	Description:

### Step 3: View First Five Relevance Judgments

The `dataset.qrels_iter()` method creates an iterable over all relevance judgments (qrels for query relevance) of the dataset.
Each qrel entry consists of a `query_id` pointing to a topic, a `doc_id` pointing to a document, and a `relevance` label indicating whether the document is relevant (relevance > 0) or not relevant (relevance is 0) to the query. The `iteration` field is "historically unused".

E.g., the first line below indicates that document `2005.ipm_journal-ir0anthology0volumeA41A1.7` is relevant to the query `retrieval system improving effectiveness`.


In [3]:
for qrel in list(dataset.qrels_iter())[:5]:
    # Access via: qrel.query_id, qrel.doc_id, qrel.relevance
    print(qrel)

No settings given in /Users/gienapp/.tira/.tira-settings.json. I will use defaults.
TrecQrel(query_id='1', doc_id='2005.ipm_journal-ir0anthology0volumeA41A1.7', relevance=1, iteration='0')
TrecQrel(query_id='1', doc_id='2019.tois_journal-ir0anthology0volumeA37A1.2', relevance=1, iteration='0')
TrecQrel(query_id='1', doc_id='2008.sigirconf_conference-2008.127', relevance=1, iteration='0')
TrecQrel(query_id='1', doc_id='2015.ipm_journal-ir0anthology0volumeA51A5.7', relevance=0, iteration='0')
TrecQrel(query_id='1', doc_id='2008.tois_journal-ir0anthology0volumeA27A1.1', relevance=0, iteration='0')
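A qrel only stores the `query_id`, so reading it as "document X is relevant to query Y" requires a lookup of the query text. The following is a minimal sketch of that lookup on toy data. The `Query` and `Qrel` tuples and their contents are hypothetical stand-ins for the objects yielded by `dataset.queries_iter()` and `dataset.qrels_iter()`.

```python
from collections import namedtuple

# Hypothetical stand-ins for the topic and qrel objects of the real dataset.
Query = namedtuple('Query', ['query_id', 'text'])
Qrel = namedtuple('Qrel', ['query_id', 'doc_id', 'relevance'])

toy_queries = [
    Query('1', 'retrieval system improving effectiveness'),
    Query('2', 'machine learning language identification'),
]
toy_qrels = [Qrel('1', 'doc-a', 1), Qrel('1', 'doc-b', 0), Qrel('2', 'doc-c', 1)]

# Map each query_id to its query text so that every lookup is a dict access.
query_text = {q.query_id: q.text for q in toy_queries}

# Pair every judgment with the text of the query it refers to.
resolved = [
    (query_text[qrel.query_id], qrel.doc_id, qrel.relevance > 0)
    for qrel in toy_qrels
]

for text, doc_id, is_relevant in resolved:
    label = 'relevant' if is_relevant else 'not relevant'
    print(f'{doc_id} is {label} to "{text}"')
```

On the real data, the same dictionary could presumably be built with `query.default_text()` in place of the toy `text` field.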


### Step 4: Access Documents

The `dataset.docs_store()` method provides random access via the document ID to all documents of a corpus.

For instance, `docs_store.get('2005.ipm_journal-ir0anthology0volumeA41A1.7')` returns the document with id `2005.ipm_journal-ir0anthology0volumeA41A1.7` that has the text `"A probabilistic model for ... linguistic knowledge."`.

The `dataset.docs_iter()` method creates an iterable over all documents in a corpus, which is useful, for example, for building an index.

In [4]:
print('The dataset has', len(list(dataset.docs_iter())), 'documents.')

No settings given in /Users/gienapp/.tira/.tira-settings.json. I will use defaults.
The dataset has 53673 documents.


In [5]:
docs_store = dataset.docs_store()

docs_store.get('2005.ipm_journal-ir0anthology0volumeA41A1.7')

GenericDoc(doc_id='2005.ipm_journal-ir0anthology0volumeA41A1.7', text='A probabilistic model for stemmer generation AbstractIn this paper we will present a language-independent probabilistic model which can automatically generate stemmers. Stemmers can improve the retrieval effectiveness of information retrieval systems, however the designing and the implementation of stemmers requires a laborious amount of effort due to the fact that documents and queries are often written or spoken in several different languages. The probabilistic model proposed in this paper aims at the development of stemmers used for several languages. The proposed model describes the mutual reinforcement relationship between stems and derivations and then provides a probabilistic interpretation. A series of experiments shows that the stemmers generated by the probabilistic model are as effective as the ones based on linguistic knowledge.')
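To illustrate why iterating over all documents is a natural starting point for indexing, here is a minimal sketch of a toy inverted index. It is not how a real search engine indexes documents: the whitespace tokenization, the `build_inverted_index` helper, and the `Doc` tuple with its toy texts are assumptions made for illustration only.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the documents of the real corpus,
# which expose doc_id and text in the same way.
Doc = namedtuple('Doc', ['doc_id', 'text'])

toy_docs = [
    Doc('d1', 'A probabilistic model for stemmer generation'),
    Doc('d2', 'Stemmers can improve retrieval effectiveness'),
    Doc('d3', 'Neural retrieval model'),
]

def build_inverted_index(docs):
    """Map each lowercased, whitespace-separated term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc.text.lower().split():
            index[term].add(doc.doc_id)
    return index

index = build_inverted_index(toy_docs)
print(sorted(index['model']))
print(sorted(index['retrieval']))
```

Because the helper only consumes an iterable once, it should also accept `dataset.docs_iter()` directly, without first materializing all documents in a list.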

### Step 5: Create some Descriptive Statistics for the Relevance Judgments

Next, we want to create a [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) that shows the proportion of relevant documents per topic.
You can imagine a DataFrame as a table.

First, we show how to use the pandas DataFrame API. Afterwards, it is up to you to create this table using the real data.

In [6]:
import pandas as pd

df = pd.DataFrame([
    {'query_id': 'test-1', 'query': 'some test query 1', 'proportion_relevant': 0.3},
    {'query_id': 'test-2', 'query': 'some test query 2', 'proportion_relevant': 0.4},
    {'query_id': 'test-3', 'query': 'some test query 3', 'proportion_relevant': 0.2},
])

df.sort_values('proportion_relevant', ascending=False)

Unnamed: 0,query_id,query,proportion_relevant
1,test-2,some test query 2,0.4
0,test-1,some test query 1,0.3
2,test-3,some test query 3,0.2


E.g., in the hypothetical example above, the query `test-2` has the highest proportion of relevant documents.
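Instead of sorting and reading off the first row, the extreme rows can also be selected directly. The sketch below rebuilds the hypothetical table from above under the name `toy_df` (an illustrative name) and uses the pandas methods `Series.idxmax()` and `Series.idxmin()` to find the row labels of the highest and lowest value.

```python
import pandas as pd

# Same hypothetical rows as in the example above.
toy_df = pd.DataFrame([
    {'query_id': 'test-1', 'query': 'some test query 1', 'proportion_relevant': 0.3},
    {'query_id': 'test-2', 'query': 'some test query 2', 'proportion_relevant': 0.4},
    {'query_id': 'test-3', 'query': 'some test query 3', 'proportion_relevant': 0.2},
])

# idxmax()/idxmin() return the index label of the extreme value;
# .loc then selects the complete row for that label.
highest = toy_df.loc[toy_df['proportion_relevant'].idxmax()]
lowest = toy_df.loc[toy_df['proportion_relevant'].idxmin()]

print('Highest proportion:', highest['query_id'])
print('Lowest proportion:', lowest['query_id'])
```

If several rows share the extreme value, both methods return only the first such row, so ties need to be checked separately.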

Next, please create a pandas DataFrame `df` that reports the proportion of relevant documents per topic on the real data, using `dataset.queries_iter()` and `dataset.qrels_iter()`.

In [7]:
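def proportion_relevant(topic_num):
    """Return the fraction of judged documents that are relevant for one topic."""
    rel, non_rel = 0, 0
    # Scans all qrels once per topic and counts the judgments of this topic.
    for qrel in dataset.qrels_iter():
        if qrel.query_id == str(topic_num):
            rel += 1 if qrel.relevance else 0
            non_rel += 0 if qrel.relevance else 1
    return rel / (rel + non_rel)

# Collect one row per topic, then convert the rows into a DataFrame.
rows = []
for query in dataset.queries_iter():
    rows += [{'qid': query.query_id, 'query': query.title, 'Proportion Relevant': proportion_relevant(query.query_id)}]

df = pd.DataFrame(rows)
df.sort_values('Proportion Relevant')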

Unnamed: 0,qid,query,Proportion Relevant
13,14,German domain,0.000000
7,8,Document scoping formula,0.069767
66,68,filter ad rich documents,0.116279
10,11,Algorithm acceleration with Nvidia CUDA,0.119048
47,49,exhaustivity of index,0.131579
...,...,...,...
15,16,Inclusion of text-mining,0.926829
31,33,fake news detection,0.935484
41,43,Deep Neural Networks,0.971429
39,41,entity recognition,0.975000
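The `proportion_relevant` function above iterates over all qrels once for every topic. As a possible alternative, the proportions of all topics can be computed in a single pass. The sketch below shows this on toy data: the `Qrel` tuple, the toy judgments, and the helper name `proportions_relevant` are hypothetical, and relevance is interpreted as `relevance > 0`, following the definition in Step 3.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the qrel objects of the real dataset.
Qrel = namedtuple('Qrel', ['query_id', 'doc_id', 'relevance'])

toy_qrels = [
    Qrel('1', 'doc-a', 1), Qrel('1', 'doc-b', 1),
    Qrel('1', 'doc-c', 0), Qrel('1', 'doc-d', 0),
    Qrel('2', 'doc-a', 0), Qrel('2', 'doc-e', 0),
    Qrel('3', 'doc-f', 1),
]

def proportions_relevant(qrels):
    """Compute the proportion of relevant judgments per query in one pass."""
    relevant, judged = Counter(), Counter()
    for qrel in qrels:
        judged[qrel.query_id] += 1
        if qrel.relevance > 0:
            relevant[qrel.query_id] += 1
    # Only queries with at least one judgment appear, so no division by zero.
    return {qid: relevant[qid] / judged[qid] for qid in judged}

proportions = proportions_relevant(toy_qrels)
print(proportions)
```

The resulting dictionary can then be looked up per topic while building the rows of the DataFrame.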


### Step 6: Find Difficult Topics

Identify the query with the lowest proportion of relevant documents.

In [None]:
# Your solution here

### Step 7: Find Easy Topics

Identify the query with the highest proportion of relevant documents.

In [None]:
# Your solution here

### Step 8: Find a Topic that is not Suitable to Distinguish Retrieval Systems

The goal of retrieval experiments is to separate effective retrieval systems from ineffective retrieval systems.
Do you have an idea which topics in the dataset are not well suited to distinguish effective from ineffective systems?

Please select a single topic, and describe why it is not suited to distinguish retrieval systems.

In [None]:
# Your solution here

### Step 9: Find a Topic that is Useful to Distinguish Retrieval Systems

The goal of retrieval experiments is to separate effective retrieval systems from ineffective retrieval systems.
In Step 8, you identified a topic that can barely distinguish effective from ineffective retrieval systems.

Now, please identify a single topic that is better suited for separating retrieval systems and explain why the topic is better suited. How did you select your topic, and why?

In [None]:
# Your solution here