BLEU Score

  • BLEU = BiLingual Evaluation Understudy
  • BLEU measures how closely a generated sentence matches the reference (ground-truth) sentences
  • Because it measures how much of the generation appears in the references, BLEU is a precision-oriented metric
  • The BLEU score is computed as the product of a Brevity Penalty and the Geometric Mean of the n-gram Precisions
  • Heavily used in Machine Translation

> [!def] BLEU Score
$$
\text{BLEU-N} = \text{Brevity-Penalty} \times \text{Geometric Mean of Precision}_{1 \ldots N}
$$
$$
\text{Brevity-Penalty} = \min\left(1,\ \exp\left(1 - \frac{\text{reference-length}}{\text{generation-length}}\right)\right)
$$
$$
\text{Geometric Mean of Precision}_{1 \ldots N} = \exp\left(\sum_{n=1}^{N} w_n \log pre_n\right), \qquad w_n = \frac{1}{N}
$$
$$
pre_n = \frac{\sum_{\text{n-gram}} \min\left(\text{count in generation},\ \text{count in reference}\right)}{\text{total \# of n-grams in the generation}}
$$
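
Putting the pieces together, here is a minimal from-scratch sketch of BLEU-N in Python, assuming a single reference and uniform weights $w_n = 1/N$; the names (`bleu`, `modified_precision`, `ngrams`) are illustrative, not a library API. For real evaluations, established implementations such as NLTK's `sentence_bleu` or sacreBLEU handle multiple references and standardized tokenization.

```python
# Minimal BLEU-N sketch following the formulas above
# (single reference, uniform weights w_n = 1/N). Illustrative only.
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(generation, reference, n):
    """pre_n: clipped n-gram matches / total n-grams in the generation."""
    gen_counts = Counter(ngrams(generation, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each generated n-gram's count by its count in the reference
    matched = sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
    return matched / max(sum(gen_counts.values()), 1)

def bleu(generation, reference, max_n=4):
    """BLEU-N = Brevity-Penalty * exp(sum_n w_n * log(pre_n))."""
    # Brevity penalty: min(1, exp(1 - reference_length / generation_length))
    bp = min(1.0, math.exp(1 - len(reference) / max(len(generation), 1)))
    precisions = [modified_precision(generation, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:   # log(0) is undefined; the score is 0
        return 0.0
    w = 1 / max_n              # uniform weights
    return bp * math.exp(sum(w * math.log(p) for p in precisions))

gen = "the quick brown fox jumps over the lazy dog".split()
ref = "the quick brown fox jumped over the lazy dog".split()
print(f"BLEU-4 = {bleu(gen, ref):.3f}")  # ~0.597
```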

Problems with BLEU Score

  1. Doesn't consider semantic meaning
  2. Struggles with non-English languages
  3. Scores computed with different tokenizers are hard to compare
  4. Doesn't consider synonyms (see the sketch after this list)
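
To see problems 1 and 4 concretely, the `bleu` sketch above scores a paraphrase at zero when it shares almost no surface n-grams with the reference, even though the meaning is the same:

```python
# Same meaning, different words: BLEU sees almost no n-gram overlap.
# Reuses the bleu() sketch defined earlier.
gen = "the vehicle halted quickly".split()
ref = "the car stopped fast".split()
print(bleu(gen, ref))  # 0.0 -- only the unigram "the" matches, no bigrams
```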