BLEU Score
- BLEU = BiLingual Evaluation Understudy
- BLEU measures how close a generated sentence is to the reference ground truths
- Because BLEU counts how much of the generation appears in the reference, it is often described as a precision-based metric
- The BLEU score is computed as the product of the Brevity Penalty and the geometric mean of the n-gram precisions
- Heavily used in Machine Translation
[!def] BLEU Score
$$
\text{BLEU-N} = \text{Brevity-Penalty} \times \exp\left(\text{Geometric Mean of Precision}_{1 \dots N}\right)
$$
$$
\text{Brevity-Penalty} = \min\left(1, \exp\left(1 - \frac{\text{reference-length}}{\text{generation-length}}\right)\right)
$$
$$
\text{Geometric Mean of Precision}_{1 \dots N} = \sum_{n=1}^{N} w_n \log \, pre_n
$$
where the weights $w_n$ are typically uniform, i.e. $w_n = 1/N$
$$
pre_n = \frac{\min(\text{\# of matched n-grams}, \text{\# of n-grams in the target})}{\text{total \# of n-grams in the generation}}
$$
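To make the definition concrete, here is a minimal from-scratch sketch of BLEU-N following the formulas above. Whitespace tokenization, a single reference, and uniform weights $w_n = 1/N$ are assumptions made for illustration; production implementations (e.g. sacrebleu) additionally handle multiple references and smoothing.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(generation, reference, n):
    """pre_n: n-gram matches clipped by reference counts,
    divided by the total # of n-grams in the generation."""
    gen_counts = Counter(ngrams(generation, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
    return matched / max(sum(gen_counts.values()), 1)

def bleu(generation, reference, max_n=4):
    """BLEU-N = Brevity-Penalty * exp(sum_n w_n * log pre_n)."""
    if not generation:
        return 0.0
    # Brevity-Penalty = min(1, exp(1 - reference-length / generation-length))
    bp = min(1.0, exp(1 - len(reference) / len(generation)))
    precisions = [clipped_precision(generation, reference, n)
                  for n in range(1, max_n + 1)]
    if any(p == 0 for p in precisions):  # log(0) undefined; unsmoothed BLEU is 0
        return 0.0
    w = 1 / max_n  # uniform weights
    return bp * exp(sum(w * log(p) for p in precisions))

reference = "the cat sat on the mat".split()
generation = "the cat is on the mat".split()
print(round(bleu(generation, reference, max_n=2), 3))  # 0.707
```

Here pre_1 is 5/6, pre_2 is 3/5, the lengths match so the Brevity Penalty is 1, and BLEU-2 is the geometric mean sqrt(5/6 × 3/5) ≈ 0.707.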
Problems with BLEU Score
- Doesn't consider semantic meaning
- Struggles with non-English languages
- Scores are hard to compare across different tokenizers
- Doesn't consider synonyms (see the sketch after this list)
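The synonym problem is easy to reproduce with the `bleu` sketch above (the sentences are made up for illustration): a single synonym swap kills the highest-order n-gram precision, and the unsmoothed score collapses to zero despite identical meaning.

```python
reference = "the cat is quick".split()
generation = "the cat is fast".split()  # same meaning, one synonym swapped
# No 4-gram matches, so pre_4 = 0 and unsmoothed BLEU-4 is 0
print(bleu(generation, reference))  # 0.0
```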