# bleuEvaluationScore

Evaluate translation or summarization with BLEU similarity score

## Syntax

```
score = bleuEvaluationScore(candidate,references)
score = bleuEvaluationScore(candidate,references,Name=Value)
```

## Description

The BiLingual Evaluation Understudy (BLEU) scoring algorithm evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.

`score = bleuEvaluationScore(candidate,references)` returns the BLEU similarity score between the specified candidate document and the reference documents. The function computes n-gram overlaps between `candidate` and `references` for n-gram lengths one through four, with equal weighting. For more information, see BLEU Score.

`score = bleuEvaluationScore(candidate,references,Name=Value)` specifies additional options using one or more name-value arguments.

## Examples

### Evaluate Summary

Create an array of tokenized documents and extract a summary using the `extractSummary` function.

```
str = [
    "The fox jumped over the dog."
    "The fast brown fox jumped over the lazy dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
summary = extractSummary(documents)
```

```
summary = 
  tokenizedDocument:

   10 tokens: The fast brown fox jumped over the lazy dog .
```

Specify the reference documents as a `tokenizedDocument` array.

```
str = [
    "The quick brown animal jumped over the lazy dog."
    "The quick brown fox jumped over the lazy dog."];
references = tokenizedDocument(str);
```

Calculate the BLEU score between the summary and the reference documents using the `bleuEvaluationScore` function.

```
score = bleuEvaluationScore(summary,references)
```

```
score = 0.7825
```

This score indicates fairly good similarity. A BLEU score close to one indicates strong similarity.

### Specify N-Gram Weights

Create an array of tokenized documents and extract a summary using the `extractSummary` function.

```
str = [
    "The fox jumped over the dog."
    "The fast brown fox jumped over the lazy dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
summary = extractSummary(documents)
```

```
summary = 
  tokenizedDocument:

   10 tokens: The fast brown fox jumped over the lazy dog .
```

Specify the reference documents as a `tokenizedDocument` array.

```
str = [
    "The quick brown animal jumped over the lazy dog."
    "The quick brown fox jumped over the lazy dog."];
references = tokenizedDocument(str);
```

Calculate the BLEU score between the candidate document and the reference documents using the default options. By default, the `bleuEvaluationScore` function uses n-grams of length one through four with equal weights.

```
score = bleuEvaluationScore(summary,references)
```

```
score = 0.7825
```

Given that the summary differs from one of the reference documents by only one word, this score might suggest lower similarity than expected. This behavior is due to the function using n-grams that are too long for the short document.

To address this, use shorter n-grams by setting the `'NgramWeights'` option to a shorter vector. Calculate the BLEU score again using only unigrams and bigrams by setting the `'NgramWeights'` option to a two-element vector. Treat unigrams and bigrams equally by specifying equal weights.

```
score = bleuEvaluationScore(summary,references,'NgramWeights',[0.5 0.5])
```

```
score = 0.8367
```

This score suggests better similarity than before.

## Input Arguments

`candidate` — Candidate document
`tokenizedDocument` scalar | string array | cell array of character vectors

Candidate document, specified as a `tokenizedDocument` scalar, a string array, or a cell array of character vectors. If `candidate` is not a `tokenizedDocument` scalar, then it must be a row vector representing a single document, where each element is a word.

`references` — Reference documents
`tokenizedDocument` array | string array | cell array of character vectors

Reference documents, specified as a `tokenizedDocument` array, a string array, or a cell array of character vectors. If `references` is not a `tokenizedDocument` array, then it must be a row vector representing a single document, where each element is a word. To evaluate against multiple reference documents, use a `tokenizedDocument` array.

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

*Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.*

**Example:** `bleuEvaluationScore(candidate,references,IgnoreCase=true)` evaluates the BLEU similarity score, ignoring case.

`NgramWeights` — N-gram weights
`[0.25 0.25 0.25 0.25]` (default) | row vector of finite nonnegative values

N-gram weights, specified as a row vector of finite nonnegative values, where `NgramWeights(i)` corresponds to the weight for n-grams of length `i`. The length of the weight vector determines the range of n-gram lengths to use for the BLEU score evaluation. The function normalizes the n-gram weights to sum to one.

**Tip**

If the number of words in `candidate` is smaller than the number of elements in `NgramWeights`, then the resulting BLEU score is zero. To ensure that `bleuEvaluationScore` returns nonzero scores for very short documents, set `NgramWeights` to a vector with fewer elements than the number of words in `candidate`.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

`IgnoreCase` — Option to ignore case
`0` (`false`) (default) | `1` (`true`)

Option to ignore case, specified as one of these values:

- `0` (`false`) — Use case-sensitive comparisons between candidates and references.
- `1` (`true`) — Compare candidates and references, ignoring case.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `logical`

## Output Arguments

`score` — BLEU score
scalar

BLEU score, returned as a scalar value in the range [0,1] or `NaN`.

A BLEU score close to zero indicates poor similarity between `candidate` and `references`. A BLEU score close to one indicates strong similarity. If `candidate` is identical to one of the reference documents, then `score` is `1`. If `candidate` and `references` are both empty documents, then `score` is `NaN`. For more information, see BLEU Score.

**Tip**

If the number of words in `candidate` is smaller than the number of elements in `NgramWeights`, then the resulting BLEU score is zero. To ensure that `bleuEvaluationScore` returns nonzero scores for very short documents, set `NgramWeights` to a vector with fewer elements than the number of words in `candidate`.

## Algorithms

### BLEU Score

The BiLingual Evaluation Understudy (BLEU) scoring algorithm [1] evaluates the similarity between a candidate document and a collection of reference documents. Use the BLEU score to evaluate the quality of document translation and summarization models.

To compute the BLEU score, the algorithm uses n-gram counts, *clipped n-gram
counts*, *modified n-gram precision scores*, and a
*brevity penalty*.

The clipped n-gram counts function $${\text{Count}}_{\text{clip}}$$, if necessary, truncates the n-gram count for each n-gram so that it does not exceed the largest count observed in any single reference for that n-gram. The clipped counts function is given by

$${\text{Count}}_{\text{clip}}(\text{n-gram})=\text{min}(\text{Count}(\text{n-gram}),\text{MaxRefCount}(\text{n-gram})),$$

where $$\text{Count}(\text{n-gram})$$ denotes the n-gram counts and $$\text{MaxRefCount}(\text{n-gram})$$ is the largest n-gram count observed in a single reference document for that n-gram.

The *modified n-gram precision scores* are given by

$${p}_{n}=\frac{{\displaystyle \sum _{C\in \left\{\text{Candidates}\right\}}{\displaystyle \sum _{\text{n-gram}\in C}{\text{Count}}_{\text{clip}}(\text{n-gram})}}}{{\displaystyle \sum _{C\text{'}\in \left\{\text{Candidates}\right\}}{\displaystyle \sum _{{\text{n-gram}}^{\prime}\in {C}^{\prime}}\text{Count}({\text{n-gram}}^{\prime})}}},$$

where *n* corresponds to the n-gram length and $$\left\{\text{Candidates}\right\}$$ is the set of sentences in the candidate documents.
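The clipped counts and modified precision can be illustrated with a short Python sketch. This is not the MATLAB implementation; the function name and structure are illustrative only, assuming tokenized documents are given as lists of word strings.

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """Modified n-gram precision: clipped n-gram count over total n-gram count."""
    count = Counter(tuple(candidate[i:i + n])
                    for i in range(len(candidate) - n + 1))
    # MaxRefCount: the largest count of each n-gram in any single reference.
    max_ref = Counter()
    for ref in references:
        ref_count = Counter(tuple(ref[i:i + n])
                            for i in range(len(ref) - n + 1))
        for g, c in ref_count.items():
            max_ref[g] = max(max_ref[g], c)
    # Count_clip(n-gram) = min(Count(n-gram), MaxRefCount(n-gram))
    clipped = sum(min(c, max_ref[g]) for g, c in count.items())
    return clipped / sum(count.values())
```

For the summary in the examples above, the unigram precision is 9/10 (only "fast" appears in no reference) and the bigram precision is 7/9.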

Given a vector of n-gram weights *w*, the *BLEU
score* is given by

$$\text{bleuScore}=\text{BP}\xb7\mathrm{exp}\left({\displaystyle \sum _{n=1}^{N}{w}_{n}}\mathrm{log}{\overline{p}}_{n}\right),$$

where *N* is the largest n-gram length, the entries in $$\overline{p}$$ correspond to the geometric averages of the modified n-gram precisions,
and $$\text{BP}$$ is the *brevity penalty* given by

$$\text{BP}=\begin{cases}1 & \text{if } c>r\\ {e}^{1-\frac{r}{c}} & \text{if } c\le r\end{cases}$$

where *c* is the length of the candidate document and
*r* is the length of the reference document with length closest to the
candidate length.
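Putting the pieces together, a minimal Python sketch of the full score (again illustrative, not the MATLAB source) reproduces the values from the examples above, assuming a single candidate document given as a list of word strings.

```python
from collections import Counter
from math import exp, log

def bleu_score(candidate, references, weights=(0.25, 0.25, 0.25, 0.25)):
    """BLEU: brevity penalty times exp of the weighted log modified precisions."""
    weights = [w / sum(weights) for w in weights]  # normalize to sum to one
    total = 0.0
    for n, w in enumerate(weights, start=1):
        count = Counter(tuple(candidate[i:i + n])
                        for i in range(len(candidate) - n + 1))
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(tuple(ref[i:i + n])
                                for i in range(len(ref) - n + 1)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in count.items())
        if clipped == 0 or not count:
            return 0.0  # a zero precision makes the score zero
        total += w * log(clipped / sum(count.values()))
    c = len(candidate)
    # r: length of the reference with length closest to the candidate length
    r = min((len(ref) for ref in references), key=lambda length: abs(length - c))
    bp = 1.0 if c > r else exp(1 - r / c)
    return bp * exp(total)
```

With the default four-gram weights this sketch gives 0.7825 for the example summary, and 0.8367 with unigram and bigram weights `(0.5, 0.5)`, matching the function outputs shown earlier. It also shows why very short candidates score zero: when the candidate has fewer words than the longest weighted n-gram, the clipped count is zero.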

## References

[1] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: A Method for Automatic Evaluation of Machine Translation." In *Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics*, pp. 311-318. Association for Computational Linguistics, 2002.

## Version History

**Introduced in R2020a**
