Tools for building corpora with dictionaries
Text corpora usually reside on disk, as text files in one format or another. In a common scenario, we need to build a dictionary (a word->integer id mapping), which is then used to construct sparse bag-of-words vectors (iterables of (word_id, word_weight) pairs).
>>> from gensim.corpora import Dictionary
>>> texts = [['human', 'interface', 'computer']]
>>> dct = Dictionary(texts) # initialize a Dictionary
>>> dct.add_documents([["cat", "say", "meow"], ["dog"]]) # add more documents (extend the vocabulary)
>>> dct.doc2bow(["dog", "computer", "non_existent_word"])
[(0, 1), (6, 1)]
TF-IDF (term frequency-inverse document frequency) was invented for document search and information retrieval. A word’s TF-IDF weight increases in proportion to the number of times the word appears in a document, but is offset by the number of documents that contain the word. So words that are common in every document, such as “this”, “what”, and “if”, rank low even though they may appear many times, since they don’t mean much to that document in particular.
Term Frequency
In vector space models, such as TF-IDF, the term weight is a function of a term’s frequency within a document. Terms that occur more frequently better reflect the document’s meaning. However, terms that appear less frequently may be more discriminating when comparing documents.
The term frequency is the number of occurrences of a term in a document divided by the total number of terms in that document.
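The definition above can be sketched in a few lines of plain Python. This is only an illustration: the helper name term_frequency and the sample tokens are made up for this example and are not part of gensim.

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: count / total_terms} for one tokenized document.

    Hypothetical helper for illustration only.
    """
    total = len(tokens)
    if total == 0:
        # An empty document has no terms, so there is nothing to weight.
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}


# Made-up sample document: "bug" is 3 of the 5 tokens, "login" is 1 of 5.
doc = ["bug", "report", "bug", "login", "bug"]
tf = term_frequency(doc)
print(tf["bug"], tf["login"])
```

Dividing by the document length keeps long documents from dominating simply because they contain more words.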
If a word such as “Bug” appears many times in one document while appearing rarely in others, it is probably very relevant to that document. For example, if we are trying to find out which topics some NPS responses belong to, the word “Bug” would probably end up tied to the topic Reliability, since most responses containing that word would be about that topic.
Inverse Document Frequency
The inverse document frequency measures how common or rare a word is across the entire document set. The closer it is to 0, the more common the word is. It is calculated by dividing the total number of documents by the number of documents that contain the word, then taking the logarithm of that ratio. The inverse document frequency therefore assigns higher weights to more discriminative terms.
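As a rough sketch of that calculation, the snippet below computes log(N / df) for every term in a small made-up corpus. The function name inverse_document_frequency and the sample documents are assumptions for illustration, and the natural logarithm is an arbitrary choice here, since the text does not fix the base.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: log(N / df)} for a list of tokenized documents.

    N is the number of documents and df is the number of documents that
    contain the term. Hypothetical helper; natural log is an assumption.
    """
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # set() ensures a term is counted at most once per document.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Made-up corpus: "the" is in all 3 documents, "is" in 2, "bug" in 1.
docs = [
    ["the", "app", "has", "a", "bug"],
    ["the", "price", "is", "fair"],
    ["the", "support", "team", "is", "great"],
]
idf = inverse_document_frequency(docs)
print(idf["the"], idf["is"], idf["bug"])
```

In this corpus “the” gets an IDF of exactly 0 because it occurs in every document, while “bug”, which occurs in only one, gets the largest value.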
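So, if the word is very common and appears in many documents, this number will approach 0. Otherwise, it grows as the word becomes rarer, reaching the logarithm of the total number of documents for a word that appears in only one document (or 1, if the value is normalized by that maximum).
TF-IDF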
A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf–idf closer to 0.
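To make the combined weight concrete, here is a minimal pure-Python sketch that multiplies the term frequency by the inverse document frequency for each term. The function name tf_idf and the toy corpus are hypothetical, the natural logarithm is an assumption, and real libraries often apply different smoothing or normalization.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf * idf} dict per tokenized document.

    tf  = term count / document length
    idf = log(N / df), natural log assumed
    Hypothetical helper for illustration only.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up corpus: "bug" is frequent in the first document and absent elsewhere.
docs = [
    ["bug", "bug", "the", "app"],
    ["the", "price", "is", "fair"],
    ["the", "app", "is", "great"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document “bug” receives the highest weight (frequent locally, rare globally), “the” receives 0 because it occurs in every document, and no weight is negative, as the paragraph above argues.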