seqpdist

Calculate pairwise distance between sequences

Syntax

``D = seqpdist(Seqs)``
``D = seqpdist(Seqs,Name=Value)``

Description

````D = seqpdist(Seqs)` returns the biological distances between each pair of sequences stored in `Seqs`.```
````D = seqpdist(Seqs,Name=Value)` adjusts aspects of the calculation using name-value arguments.```

example

Examples

collapse all

Read amino acid alignment data into a structure.

`seqs = fastaread("pf00002.fa");`

For every possible pair of sequences in the multiple alignment, ignore sites with gaps and score with the scoring matrix PAM250.

```dist = seqpdist(seqs,Method="alignment-score", ... Indels="pairwise-delete",... ScoringMatrix="pam250");```

Force the realignment of each sequence pair ignoring the provided multiple alignment.

```dist = seqpdist(seqs,Method="alignment-score",... Indels="pairwise-delete", ... ScoringMatrix="pam250", ... PairwiseAlignment=1);```

Measure the Jukes-Cantor pairwise distances after realigning each sequence pair, counting the gaps as point mutations.

```dist = seqpdist(seqs,Method="jukes-cantor", ... Indels="score", ... ScoringMatrix="pam250", ... PairwiseAlignment=true);```

Input Arguments

collapse all

Nucleotide or amino acid sequences, specified as a cell array of character vectors, vector of strings, matrix of characters, or vector of structures.

You can specify:

• Cell array of character vectors or vector of strings containing nucleotide or amino acid sequences.

• Matrix of characters, in which each row corresponds to a nucleotide or amino acid sequence.

• Vector of structures containing a `Sequence` field.

Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: `D = seqpdist(Seqs,Method="p-distance")` calculates the pairwise distance using the p-distance method.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: `D = seqpdist(Seqs,"Method","p-distance")`

Method to calculate pairwise distances, specified by a character vector or string scalar in the table.

Methods for Nucleotides and Amino Acids

MethodDescription
`p-distance`

Proportion of sites at which the two sequences are different. `p` is close to `1` for poorly related sequences, and `p` is close to `0` for similar sequences.

`d = p`

`Jukes-Cantor` (default)

Maximum likelihood estimate of the number of substitutions between two sequences. `p` is described with the method `p-distance`.

For nucleotides:

``d = -3/4 log(1-p * 4/3)``

For amino acids:

``d = -19/20 log(1-p * 20/19)``
`alignment-score`

Distance (`d`) between two sequences (```1, 2```) is computed from the pairwise alignment score between the two sequences (`score12`), and the pairwise alignment score between each sequence and itself (`score11`, `score22`) as follows:

`d = (1-score12/score11)* (1-score12/score22) `
This method does not imply that prealigned input sequences will be realigned, it only scores them. Use with care; this distance method does not comply with the ultrametric condition. In the rare case where the score between sequences is greater than the score when aligning a sequence with itself, then `d = 0`.

Methods with No Scoring of Gaps (Nucleotides Only)

MethodDescription
`Tajima-Nei`

Maximum likelihood estimate considering the background nucleotide frequencies.

The `seqpdist` function can calculate the background nucleotide frequencies from the input sequences. Instead, if you want to specify the nucleotide frequencies, set the `OptArgs` name-value argument to ```[gA gC gG gT]```. The elements `gA`, `gC`, `gG`, `gT` are scalar values for the nucleotide frequencies.

`Kimura`Considers separately the transitional nucleotide substitution and the transversional nucleotide substitution.
`Tamura`

Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the GC content.

The `seqpdist` function can calculate the GC content from the input sequences. Instead, if you want to specify the proportion of GC content, set the `OptArgs` name-value argument to a numeric scalar in the range [0, 1].

`Hasegawa`

Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the background nucleotide frequencies.

The `seqpdist` function can calculate the background nucleotide frequencies from the input sequences. Instead, if you want to specify the nucleotide frequencies, set the `OptArgs` name-value argument to ```[gA gC gG gT]```. The elements `gA`, `gC`, `gG`, `gT` are scalar values for the nucleotide frequencies.

`Nei-Tamura`

Considers separately the transitional nucleotide substitution between purines, the transitional nucleotide substitution between pyrimidines, the transversional nucleotide substitution, and the background nucleotide frequencies.

The `seqpdist` function can calculate the background nucleotide frequencies from the input sequences. Instead, if you want to specify the nucleotide frequencies, set the `OptArgs` name-value argument to ```[gA gC gG gT]```. The elements `gA`, `gC`, `gG`, `gT` are scalar values for the nucleotide frequencies.

Methods with No Scoring of Gaps (Amino Acids Only)

MethodDescription
`Poisson`Assumes that the number of amino acid substitutions at each site has a Poisson distribution.
`Gamma`

Assumes that the number of amino acid substitutions at each site has a Gamma distribution with parameter `a`.

By default, the `seqpdist` function sets the value of `a` as `2`. Instead, if you want to specify the value of `a`, set the `OptArgs` name-value argument to a numeric scalar.

You can also specify a user-defined distance function using `@`, for example, `@distfun`. The distance function must have the form:

`function D = distfun(S1, S2, OptionalArgs)`

The `distfun` function takes the following arguments:

• `S1` , `S2` — Two sequences of the same length (nucleotide or amino acid).

• `OptionalArgs` — Optional problem-dependent arguments.

The `distfun` function returns a scalar that represents the distance between `S1` and `S2`.

Data Types: `char` | `string` | `function_handle`

How to treat sites with gaps, specified as one of these strings or character vectors.

• `score` — Scores these sites either as a point mutation or with the alignment parameters, depending on the method selected.

• `pairwise-del` — For every pairwise comparison, it ignores the sites with gaps.

• `complete-del` — Ignores all the columns in the multiple alignment that contain a gap. This option is available only if you provided a multiple alignment as the input `Seqs`.

Data Types: `char` | `string`

Optional arguments for the distance method, specified as a numeric scalar or a four-element numeric vector.

Methods with No Scoring of Gaps (Nucleotides Only)

MethodDescription
`Tajima-Nei`Background nucleotide frequencies, specified as a 4-element numeric vector of the form `[gA gC gG gT]`.
`Tamura`Proportion of GC content, specified as a number in the range [0, 1].
`Hasegawa`Background nucleotide frequencies, specified as a 4-element numeric vector of the form `[gA gC gG gT]`.
`Nei-Tamura`Background nucleotide frequencies, specified as a 4-element numeric vector of the form `[gA gC gG gT]`.

Methods with No Scoring of Gaps (Amino Acids Only)

MethodDescription
`Gamma`Shape parameter `a` of the Gamma distribution, specified as a numeric scalar.

Controls the global pairwise alignment of input sequences, specified as a numeric or logical `true` (`1`) or `false` (`0`). When `true`, the `seqpdist` function performs global pairwise alignment using the `nwalign` function, while ignoring the multiple alignment of the input sequences (if any).

The default value depends on the length of the sequences:

• `true` — When all input sequences do not have the same length.

• `false` — When all input sequences have the same length.

Tip

If your input sequences are the same length, then `seqpdist` assumes they are aligned. If they are not aligned, do one of the following:

• Align the sequences before passing them to `seqpdist`, for example, using the `multialign` function.

• Set `PairwiseAlignment` to `true` when using `seqpdist`.

Use parallel computation to calculate the distance, specified as a numeric or logical `false` (`0`) or `true` (`1`).

• If `true`, and Parallel Computing Toolbox™ is installed, then computation occurs using `parfor`-loops.

• If a `parpool` is open, then the computation uses the open `parpool` and occurs in parallel.

• If there are no open `parpool`, but automatic creation is enabled in the Parallel Preferences, then the default pool will be automatically opened and computation occurs in parallel.

• If there are no open `parpool` and automatic creation is disabled, then computation uses `parfor`-loops in serial mode.

• If Parallel Computing Toolbox is not installed, then computation uses `parfor`-loops in serial mode.

• If `false`, then the computation uses for-loops in serial mode.

Return the output `D` as a square matrix, specified as a numeric or logical `false` (`0`) or `true` (`1`).

When `true`, the `seqpdist` function converts the output into a square matrix such that `D(I,J)` denotes the distance between the `I`th and `J`th sequences. The square matrix is symmetric and has a zero diagonal. Setting `Squareform` to `true` is the same as using the `squareform` function in Statistics and Machine Learning Toolbox™.

Type of sequence, specified as `"AA"` (amino acid) or `"NT"` (nucleotide).

Data Types: `char` | `string`

Scoring matrix to use for the global pairwise alignment, specified as a character vector, string scalar, or numeric matrix.

You can specify a scoring matrix name. Valid choices are:

• `"BLOSUM50"` (default for amino acid sequences)

• `"NUC44"` (default for nucleotide sequences). This choice is not supported for amino acid sequences.

• `"BLOSUM62"`

• `"BLOSUM30"` increasing by `5` up to `"BLOSUM90"`

• `"BLOSUM100"`

• `"PAM10"` increasing by `10` up to `"PAM500"`

• `"DAYHOFF"`

• `"GONNET"`

Note

The above scoring matrices, provided with the software, also include a scale factor that converts the units of the output score to bits. You can also specify the `Scale` name-value argument to specify an additional scale factor to convert the output score from bits to another unit.

You can also specify a numeric matrix, such as the one returned by the `blosum`, `pam`, `dayhoff`, `gonnet`, or `nuc44` function.

Note

• If you use a scoring matrix that you created or was created by one of these scoring matrix functions, the matrix does not include a scale factor. The output score will be returned in the same units as the scoring matrix. You can use the `Scale` name-value argument to specify a scale factor to convert the output score to another unit.

• If you need to compile `seqpdist` into a standalone application or software component using MATLAB® Compiler™, use a matrix instead of a character vector or string for `ScoringMatrix`.

You can specify this argument only when the `Method` argument is `"alignment-score"` or the `PairwiseAlignment` argument is `true`.

Scale factor applied to the output distance, specified as a positive number. By default, there is no scaling or change in the units of the output distance. If the scoring matrix information also provides a scale factor, then both are used.

Use this argument to control the units of the output distance.

You can specify this argument only when the `Method` argument is `"alignment-score"` or the `PairwiseAlignment` argument is `true`.

Penalty for opening a gap in the alignment, specified as a positive integer.

You can specify this argument only when the `Method` argument is `"alignment-score"` or the `PairwiseAlignment` argument is `true`.

Penalty for extending a gap in the alignment, specified as a positive integer. The default is equal to `GapOpen`.

You can specify this argument only when the `Method` argument is `"alignment-score"` or the `PairwiseAlignment` argument is `true`.

Output Arguments

collapse all

Biological distance between all pairs of sequences stored in the `M` elements of `Seqs`, returned as a numeric row vector or a numeric matrix.

By default, `D` is a row vector of length `1`-by-`(M*(M-1)/2)`. The elements are arranged in the order `((2,1),(3,1),..., (M,1),(3,2),...(M,2),...(M,M-1))`. This is the lower-left triangle of the full `M`-by-`M` distance matrix. To get the distance between the `I`th and the `J`th sequences for `I > J`, use the formula `D((J-1)*(M-J/2)+I-J)`.

When you specify the `SquareForm` name-value argument as `true`, `D` is the full `M`-by-`M` distance matrix.

Data Types: `double`

Version History

Introduced before R2006a