TF-IDF

  • TF-IDF measures how important a word is to a document within a collection of documents
  • TF = Term Frequency
    • how often the word occurs in a document, relative to the document's length
  • IDF = Inverse Document Frequency
    • how rare the term is across all documents; words that appear in most documents get a low weight
  • TF-IDF is the product of TF and IDF

[!def] TF-IDF
$$
\text{TF-IDF}(w) = \frac{\text{count of word } w \text{ in a doc}}{\text{total number of words in the doc}} \cdot \log \frac{\text{total number of docs}}{\text{number of docs containing word } w}
$$
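
A minimal sketch of this computation in plain Python, assuming whitespace tokenization, the natural logarithm, and IDF defined as log(total docs / docs containing the word). The function names `tf`, `idf`, `tf_idf` and the toy `corpus` are illustrative, not part of any library.

```python
import math
from collections import Counter

def tf(word, doc_tokens):
    """Term frequency: count of `word` divided by the document length."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[word] / len(doc_tokens)

def idf(word, corpus):
    """Inverse document frequency: log(total docs / docs containing `word`)."""
    df = sum(1 for doc in corpus if word in doc)
    # A word that appears in no document gets weight 0 (an assumption of this sketch).
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(word, doc_tokens, corpus):
    """TF-IDF score of `word` for one document of the corpus."""
    return tf(word, doc_tokens) * idf(word, corpus)

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

score_the = tf_idf("the", corpus[0], corpus)  # "the" is in every doc, so IDF = log(1) = 0
score_mat = tf_idf("mat", corpus[0], corpus)  # "mat" is in one doc only: (1/6) * log(3)
```

Here "the" scores 0 because it appears in every document, while "mat" scores about 0.183 because it appears in only one. Real libraries often use smoothed variants of IDF, so their numbers can differ from this sketch.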

  • TF-IDF can also be used as a simple word embedding, by replacing the $1$ in a one-hot vector with the word's TF-IDF score.
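
One way to picture this: give every vocabulary word a fixed position, as in a one-hot encoding, and store the TF-IDF score there instead of a 1. The sketch below assumes the same unsmoothed formula with the natural logarithm. The function name `tfidf_vectors` and the two-document corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """Return the vocabulary and one TF-IDF weighted vector per document."""
    vocab = sorted({w for doc in corpus for w in doc})
    index = {w: i for i, w in enumerate(vocab)}        # one-hot position of each word
    n_docs = len(corpus)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for doc in corpus for w in set(doc))

    vectors = []
    for doc in corpus:
        vec = [0.0] * len(vocab)
        for word, count in Counter(doc).items():
            # Where a one-hot vector would hold 1, store the TF-IDF score instead.
            vec[index[word]] = (count / len(doc)) * math.log(n_docs / df[word])
        vectors.append(vec)
    return vocab, vectors

# Hypothetical toy corpus of pre-tokenized documents.
corpus = [["cat", "sat"], ["cat", "ran"]]
vocab, vectors = tfidf_vectors(corpus)
```

With this corpus the vocabulary is `["cat", "ran", "sat"]`. The position for "cat" is 0 in both vectors, since it appears in every document, while "sat" and "ran" each get a nonzero weight in the one document that contains them.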