# editDistance

Find edit distance between two strings or documents

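## Syntax

`d = editDistance(str1,str2)`

`d = editDistance(document1,document2)`

`d = editDistance(___,Name,Value)`

## Description

`d = editDistance(str1,str2)` returns the lowest number of grapheme insertions, deletions, and substitutions required to convert `str1` to `str2`.

`d = editDistance(document1,document2)` returns the lowest number of token insertions, deletions, and substitutions required to convert `document1` to `document2`.

`d = editDistance(___,Name,Value)` specifies additional options using one or more name-value pair arguments.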

## Examples

### Edit Distance Between Two Strings

Find the edit distance between the strings `"Text analytics"` and `"Text analysis"`. The edit distance, by default, is the total number of grapheme insertions, deletions, and substitutions required to change one string to another.

str1 = "Text analytics"; str2 = "Text analysis";

Find the edit distance.

d = editDistance(str1,str2)

d = 2

This means changing the first string to the second requires two edits. For example:

1. Substitution – Substitute the character `"t"` with `"s"`: `"Text analytics"` to `"Text analysics"`.
2. Deletion – Delete the character `"c"`: `"Text analysics"` to `"Text analysis"`.

### Edit Distance Between Two Documents

Find the edit distance between two tokenized documents. For tokenized document input, the edit distance, by default, is the total number of token insertions, deletions, and substitutions required to change one document to another.

str1 = "It's time for breakfast."; document1 = tokenizedDocument(str1); str2 = "It's now time to sleep."; document2 = tokenizedDocument(str2);

Find the edit distance.

d = editDistance(document1,document2)

d = 3

This means changing the first document to the second requires three edits. For example:

1. Insertion – Insert the word `"now"`.
2. Substitution – Substitute the word `"for"` with `"to"`.
3. Substitution – Substitute the word `"breakfast"` with `"sleep"`.

### Specify Cost Values

The `editDistance` function, by default, returns the lowest number of grapheme insertions, deletions, and substitutions required to change one string to another. To also include the swap action in the calculation, use the `'SwapCost'` option.

First, find the edit distance between the strings `"MATALB"` and `"MATLAB"`.

str1 = "MATALB"; str2 = "MATLAB"; d = editDistance(str1,str2)

d = 2

One possible sequence of edits is:

1. Substitute the second `"A"` with `"L"` (`"MATALB"` to `"MATLLB"`).
2. Substitute the second `"L"` with `"A"` (`"MATLLB"` to `"MATLAB"`).

The default value for the swap cost (the cost of swapping two adjacent graphemes) is `Inf`. This means that swaps are effectively excluded from the calculation. To include swaps, set the `'SwapCost'` option to 1.

`d = editDistance(str1,str2,'SwapCost',1)`

d = 1

This means there is one action. For example, swap the adjacent characters `"A"` and `"L"`.

### Specify Custom Cost Function

To compute the edit distance between two words and specify that the edits are case-insensitive, specify a custom substitute cost function.

First, compute the edit distance between the strings `"MATLAB"` and `"MathWorks"`.

d = editDistance("MATLAB","MathWorks")

d = 8

This means changing the first string to the second requires eight edits. For example:

1. Substitution – Substitute the character `"A"` with `"a"` (`"MATLAB"` to `"MaTLAB"`).
2. Substitution – Substitute the character `"T"` with `"t"` (`"MaTLAB"` to `"MatLAB"`).
3. Substitution – Substitute the character `"L"` with `"h"` (`"MatLAB"` to `"MathAB"`).
4. Substitution – Substitute the character `"A"` with `"W"` (`"MathAB"` to `"MathWB"`).
5. Substitution – Substitute the character `"B"` with `"o"` (`"MathWB"` to `"MathWo"`).
6. Insertion – Insert the character `"r"` (`"MathWo"` to `"MathWor"`).
7. Insertion – Insert the character `"k"` (`"MathWor"` to `"MathWork"`).
8. Insertion – Insert the character `"s"` (`"MathWork"` to `"MathWorks"`).

Compute the edit distance and specify the custom substitution cost function `caseInsensitiveSubstituteCost`, listed at the end of the example. The custom function `caseInsensitiveSubstituteCost` returns 0 if the two inputs are the same or differ only by case and returns 1 otherwise.

d = editDistance("MATLAB","MathWorks",'SubstituteCost',@caseInsensitiveSubstituteCost)

d = 6

This means the total cost for changing the first string to the second is 6. For example:

1. Substitution (cost 0) – Substitute the character `"A"` with `"a"` (`"MATLAB"` to `"MaTLAB"`).
2. Substitution (cost 0) – Substitute the character `"T"` with `"t"` (`"MaTLAB"` to `"MatLAB"`).
3. Substitution (cost 1) – Substitute the character `"L"` with `"h"` (`"MatLAB"` to `"MathAB"`).
4. Substitution (cost 1) – Substitute the character `"A"` with `"W"` (`"MathAB"` to `"MathWB"`).
5. Substitution (cost 1) – Substitute the character `"B"` with `"o"` (`"MathWB"` to `"MathWo"`).
6. Insertion (cost 1) – Insert the character `"r"` (`"MathWo"` to `"MathWor"`).
7. Insertion (cost 1) – Insert the character `"k"` (`"MathWor"` to `"MathWork"`).
8. Insertion (cost 1) – Insert the character `"s"` (`"MathWork"` to `"MathWorks"`).

**Custom Cost Function**

The custom function `caseInsensitiveSubstituteCost` returns 0 if the two inputs are the same or differ only by case and returns 1 otherwise.

    function cost = caseInsensitiveSubstituteCost(grapheme1,grapheme2)
        % Substitution is free when the graphemes match ignoring case.
        if lower(grapheme1) == lower(grapheme2)
            cost = 0;
        else
            cost = 1;
        end
    end

## Input Arguments

`str1` – Source string

string array | character vector | cell array of character vectors

Source string, specified as a string array, character vector, or a cell array of character vectors.

If `str1` contains multiple strings, then `str2` must be the same size as `str1` or scalar.

**Data Types:** `char` | `string` | `cell`

`str2` – Target string

string array | character vector | cell array of character vectors

Target string, specified as a string array, character vector, or a cell array of character vectors.

If `str2` contains multiple strings, then `str1` must be the same size as `str2` or scalar.

**Data Types:** `char` | `string` | `cell`
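As a small illustration of the size rule for `str1` and `str2`, the following sketch compares several source strings against a single scalar target. The variable names `words` and `target` are chosen for this example only, and the result is displayed with `d(:)` because the orientation of the returned vector is not stated here.

```matlab
% Compare each element of a string array against one scalar target.
words = ["cat" "dog" "cot"];
target = "cot";

% str2 is scalar, so it is paired with every element of str1.
d = editDistance(words,target);

% Show the distances as a column:
% 1 ("a" -> "o"), 2 ("d" -> "c" and "g" -> "t"), 0 (identical strings).
disp(d(:))
```

Passing two arrays of the same size instead would pair the elements position by position, following the rule stated above.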

`document1` – Source document

`tokenizedDocument`

Source document, specified as a `tokenizedDocument` array.

If `document1` contains multiple documents, then `document2` must be the same size as `document1` or scalar.

`document2` – Target document

`tokenizedDocument`

Target document, specified as a `tokenizedDocument` array.

If `document2` contains multiple documents, then `document1` must be the same size as `document2` or scalar.

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

*Before R2021a, use commas to separate each name and value, and enclose* `Name` *in quotes.*

**Example:** `editDistance("MATALB","MATLAB",'SwapCost',1)` returns the edit distance between the strings `"MATALB"` and `"MATLAB"` and sets the cost to swap two adjacent graphemes to 1.

`InsertCost` – Cost to insert grapheme or token

1 (default) | nonnegative scalar | function handle

Cost to insert a grapheme or token, specified as the comma-separated pair consisting of `'InsertCost'` and a nonnegative scalar or a function handle.

If `'InsertCost'` is a function handle, then the function must accept a single input and return the cost of inserting the input into the source. For example:

- For string input to `editDistance`, the cost function must have the form `cost = func(grapheme)`, where the function returns the cost of inserting `grapheme` into `str1`.
- For document input to `editDistance`, the cost function must have the form `cost = func(token)`, where the function returns the cost of inserting `token` into `document1`.

**Example:** `'InsertCost',2`

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `function_handle`
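To make the function-handle form of `'InsertCost'` concrete, here is a minimal sketch that treats inserted spaces as free and charges 1 for any other inserted grapheme. The handle name `spaceIsFree` and the test strings are hypothetical and chosen only for illustration.

```matlab
% Hypothetical cost function: inserting a space costs 0, anything else costs 1.
% The comparison with the string " " works for character and string input.
spaceIsFree = @(grapheme) double(grapheme ~= " ");

str1 = "textanalytics";
str2 = "text analytics";

% Default costs: the single inserted space counts as one edit.
dDefault = editDistance(str1,str2)

% Custom insertion cost: the inserted space no longer adds to the distance.
dCustom = editDistance(str1,str2,'InsertCost',spaceIsFree)
```

Under these costs the only edit needed is a space insertion, so the custom distance should drop from 1 to 0.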

`DeleteCost` – Cost to delete grapheme or token

1 (default) | nonnegative scalar | function handle

Cost to delete a grapheme or token, specified as the comma-separated pair consisting of `'DeleteCost'` and a nonnegative scalar or a function handle.

If `'DeleteCost'` is a function handle, then the function must accept a single input and return the cost of deleting the input from the source. For example:

- For string input to `editDistance`, the cost function must have the form `cost = func(grapheme)`, where the function returns the cost of deleting `grapheme` from `str1`.
- For document input to `editDistance`, the cost function must have the form `cost = func(token)`, where the function returns the cost of deleting `token` from `document1`.

**Example:** `'DeleteCost',2`

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `function_handle`

`SubstituteCost` – Cost to substitute grapheme or token

1 (default) | nonnegative scalar | function handle

Cost to substitute a grapheme or token, specified as the comma-separated pair consisting of `'SubstituteCost'` and a nonnegative scalar or a function handle.

If `'SubstituteCost'` is a function handle, then the function must accept exactly two inputs and return the cost of substituting the first input with the second in the source. For example:

- For string input to `editDistance`, the cost function must have the form `cost = func(grapheme1,grapheme2)`, where the function returns the cost of substituting `grapheme1` with `grapheme2` in `str1`.
- For document input to `editDistance`, the cost function must have the form `cost = func(token1,token2)`, where the function returns the cost of substituting `token1` with `token2` in `document1`.

**Example:** `'SubstituteCost',2`

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `function_handle`
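The two-input form for document input can be sketched as follows. The handle name `caseInsensitiveTokenCost` and the example sentences are hypothetical. The sketch uses `strcmpi` because it accepts both character vectors and strings, so it does not rely on how tokens are passed to the cost function.

```matlab
% Hypothetical token-level cost: 0 if the tokens match ignoring case, 1 otherwise.
caseInsensitiveTokenCost = @(token1,token2) double(~strcmpi(token1,token2));

% Two documents that differ only in letter case.
document1 = tokenizedDocument("It's Time For Breakfast.");
document2 = tokenizedDocument("it's time for breakfast.");

% Substitutions between tokens that differ only by case cost nothing.
d = editDistance(document1,document2,'SubstituteCost',caseInsensitiveTokenCost)
```

With this cost, tokens that differ only by case contribute nothing to the distance, while any other substitution still costs 1.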

`SwapCost` – Cost to swap two adjacent graphemes or tokens

`Inf` (default) | nonnegative scalar | function handle

Cost to swap two adjacent graphemes or tokens, specified as the comma-separated pair consisting of `'SwapCost'` and a nonnegative scalar or a function handle.

If `'SwapCost'` is a function handle, then the function must accept exactly two inputs and return the cost of swapping the first input with the second in the source. For example:

- For string input to `editDistance`, the cost function must have the form `cost = func(grapheme1,grapheme2)`, where the function returns the cost of swapping the adjacent graphemes `grapheme1` and `grapheme2` in `str1`.
- For document input to `editDistance`, the cost function must have the form `cost = func(token1,token2)`, where the function returns the cost of swapping the adjacent tokens `token1` and `token2` in `document1`.

**Example:** `'SwapCost',2`

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `function_handle`

## Output Arguments

`d` – Edit distance

nonnegative scalar | vector of nonnegative values

## Algorithms

### Edit Distance

The function, by default, uses the Levenshtein distance: the lowest number of insertions, deletions, and substitutions required to convert one string to another.

For other commonly used edit distances, use these options:

Distance | Description | Options |
---|---|---|
Levenshtein (default) | lowest number of insertions, deletions, and substitutions | Default |
Damerau-Levenshtein | lowest number of insertions, deletions, substitutions, and swaps | `'SwapCost',1` |
Hamming | lowest number of substitutions only | `'InsertCost',Inf,'DeleteCost',Inf` |
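The option combinations in the table can be compared side by side. The sketch below reuses the `"MATALB"` and `"MATLAB"` strings from the examples. The variable names `dLevenshtein`, `dDamerau`, and `dHamming` are chosen here for illustration.

```matlab
str1 = "MATALB";
str2 = "MATLAB";

% Levenshtein (default): two substitutions are needed.
dLevenshtein = editDistance(str1,str2)                                    % 2

% Damerau-Levenshtein: one swap of the adjacent "A" and "L" suffices.
dDamerau = editDistance(str1,str2,'SwapCost',1)                           % 1

% Hamming: only substitutions are allowed; the strings differ in two positions.
dHamming = editDistance(str1,str2,'InsertCost',Inf,'DeleteCost',Inf)      % 2
```

Because insertions and deletions are priced at `Inf`, the Hamming setting is only meaningful for inputs with the same number of graphemes or tokens.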

## Version History

**Introduced in R2019a**

## See Also

`correctSpelling` | `editDistanceSearcher` | `knnsearch` | `rangesearch` | `splitGraphemes` | `tokenizedDocument`
