TF-IDF, BM25F & LDA algorithms to find duplicate
bug reports filed by the users of Eclipse IDE

Preprocessed four years worth of bug reports and feature requests filed for the Eclipse IDE using NLTK and Gensim; created an online version to find duplicates in real-time using FastAPI web-services.


No software is perfect. Eclipse IDE has an open bug report portal for its 4 million users. There is an extremely high chance that the same bug is viewed by most of the users and everyone will try to report the bug on the portal. Bug reports are reviewed personally by the Eclipse Development Team. A lot of time would be wasted if they see the same bug reported by 10s of users. The same problem is faced by stackoverflow.com where duplicate questions are asked over and over.

The main goal of this project was to learn how to:
- use Natural Language Processing (NLTK & Gensim) to normalize text by tokenization, stop words, stemming, and lemmatization
- understand how "k" parameter affects the results and performance
- what part (title, tags, comments, description) of the bug report to consider?
- which algorithm gives a better recall rate?

The challenge

How to convert a bug report or a feature request into an intelligible format understandable by a machine?

Text Normalization

transforming text into a standard form

Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.
Point 6.7 (01/17/2020): Mr. O'Neill Jr. thinks that the boys' stories about Chile's capital aren't amusing.
The major challenges of tokenization are handling mixed-case tokens (Point, thinks), tokens with punctuations (boys', amusing.), decimal numbers, abbreviations, and dates.

Stop Words
Stop words are words that are so common to languages that removing them doesn’t affect the overall message enough to lose meaning. Removing stop words does not always improve performance or accuracy. Removing stop words can affect meaning: is / was (temporality), not (trueness).
nltk.corpus.stopwords.words(‘english’) # 179
genism.parsing.preprocessing.STOPWORDS # 337

Stemming is the process of reducing a word to its word stem that affixes to suffixes and prefixes or to the roots of words known as a lemma.
ING → ∅
S → ∅
Stemming improves performance in information retrieval, especially with smaller documents.

Lexical Semantics and Lemmatization
Lexemes pair a word form (phonological or orthographic) with it’s meaning:
run /run/ [rvn] n. cause to be in operation, continue to be operative for a period of time.
A lemma is the grammatical form used to represent a lexeme: run is the lemma for run, runs, running, and ran.
Lemmas are part-of-speech specific: a table (noun) / to table (verb)
found → find (to locate)
found → found (to create an institution)

Bag of Words (BoW)
The Bag of Words approach takes a document as input and breaks it into words. These words are also known as tokens and the process is termed as tokenization. Unique tokens collected from all processed documents then constitute to form an ordered vocabulary. Finally, a vector of length equivalent to the size of the vocabulary is created for each document with values representative of the frequency of the tokens appearing in the respective document.
The challenge

How to use bag-of-words to find similar documents?

Text Corpus and Dictionary

Tools for building corpora with dictionaries
Text corpora usually reside on disk, as text files in one format or another. In a common scenario, we need to build a dictionary (a word->integer id mapping), which is then used to construct sparse bag-of-word vectors (= iterable of (word_id, word_weight)).
>>> from gensim.corpora import Dictionary
>>> texts = [['human', 'interface', 'computer']]
>>> dct = Dictionary(texts) # initialize a Dictionary
>>> dct.add_documents([["cat", "say", "meow"], ["dog"]]) # add more document (extend the vocabulary)
>>> dct.doc2bow(["dog", "computer", "non_existent_word"])
[(0, 1), (6, 1)]

TF-IDF Algorithm

TF-IDF (term frequency-inverse document frequency) was invented for document search and information retrieval. It works by increasing proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So, words that are common in every document, such as this, what, and if, rank low even though they may appear many times, since they don’t mean much to that document in particular.

Term Frequency
In vector space models, such as TF-IDF, the term weight is a function of a term’s frequency within a document. Terms that occur more frequently better reflect the document’s meaning. However, terms that appear less frequently may be more discriminating when comparing documents.

The term frequency is the number of occurrences of a term in document divided by the total number of terms in the document.
However, if the word Bug appears many times in a document, while not appearing many times in others, it probably means that it’s very relevant. For example, if what we’re doing is trying to find out which topics some NPS responses belong to, the word Bug would probably end up being tied to the topic Reliability, since most responses containing that word would be about that topic.

Inverse Document Frequency
The inverse document frequency of the word across a set of documents. This means, how common or rare a word is in the entire document set. The closer it is to 0, the more common a word is. This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain a word, and calculating the logarithm. The inverse document frequency assigns higher weights to more discriminative terms.
So, if the word is very common and appears in many documents, this number will approach 0. Otherwise, it will approach 1.

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf–idf closer to 0.
The challenge

How to use vectorised sentences to rank similar documents?

Sparse Matrix Similarity

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1.
We create a corpus in the Vector Space Model and how to transform it between different vector spaces because we want to determine similarity between pairs of documents, or the similarity between a specific document and a set of other documents (such as a user query vs. indexed documents).

Information Retrieval Analysis

Precision: the proportion of true positives to total query responses

Recall: the proportion of true positives to all relevant responses
The Future

How can we increase the recall rate? Can we use synonyms?

Context-ful Synonyms Dictionary

In the context of bug reports, I am working on a list of synonyms. Because we are in the context of software and bugs, I do not need to worry about the alternate meanings (e.g. bugs can also mean insects).
Text Normalization is the only space where there is a room for improvement. With synonyms brought down to the same word, I will rethink the need for lemmatization and stemming. I strongly believe this will improve the recall rate.

How can we find duplicates on StackOverflow?

One of the biggest problems of StackOverflow is duplicate questions. Here is a query for the top 10 most duplicated posts. There are about 21 million questions on stackoverflow. What are the chances of one coming up with a new question?
I am also working on finding the correct duplicate based on the tag, context-ful synonyms dictionary, score of the post, and more parameters.

Check back for updates.

The actual recall rates, technical implementations of TF-IDF, BM25F, and LDA; tweaking the algorithms and my approach towards solving the problems cannot be made public because I am bound by academic integrity. Please contact me to know more. I'd be happy to discuss.

I still work on this project, enhancing whenever I get the time. Currently, I am using FastAPI to search for duplicate in real time. My goal is to create a generalised version of this project.