pitchnn

Estimate pitch with deep learning neural network

Syntax

f0 = pitchnn(audioIn,fs)

f0 = pitchnn(audioIn,fs,Name,Value)

[f0,loc] = pitchnn(___)

[f0,loc,activations] = pitchnn(___)

pitchnn(___)

Description

f0 = pitchnn(audioIn,fs) returns estimates of the fundamental frequency over time for audioIn with sample rate fs. Columns of the input are treated as individual channels.

f0 = pitchnn(audioIn,fs,Name,Value) specifies options using one or more Name,Value arguments. For example, f0 = pitchnn(audioIn,fs,'ConfidenceThreshold',0.5) sets the confidence threshold for each value of f0 to 0.5.

[f0,loc] = pitchnn(___) also returns the time values, loc, associated with each fundamental frequency estimate.

[f0,loc,activations] = pitchnn(___) also returns the activations of the pretrained CREPE network.

pitchnn(___) with no output arguments plots the estimated fundamental frequency over time.
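
The following is a minimal sketch of the three-output syntax, assuming the CREPE model is already downloaded and on the MATLAB path. It reuses the audio file from the examples on this page. The variable name confidence is illustrative only and is not part of the pitchnn interface.

```matlab
% Read the example recording used elsewhere on this page.
[audioIn,fs] = audioread('SingingAMajor-16-mono-18secs.ogg');

% Request all three outputs. The 'tiny' capacity keeps the run short.
[f0,loc,activations] = pitchnn(audioIn,fs,'ModelCapacity','tiny');

% The peak activation of each frame is the value that pitchnn compares
% against ConfidenceThreshold (see the ConfidenceThreshold argument).
confidence = max(activations,[],2);

% Plot the pitch estimates and the per-frame peak activation over time.
subplot(2,1,1)
plot(loc,f0,'.')
ylabel('f0 (Hz)')
subplot(2,1,2)
plot(loc,confidence)
xlabel('Time (s)')
ylabel('Peak activation')
```

Plotting the peak activation next to f0 can help when choosing a value for ConfidenceThreshold.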

Examples

Download `pitchnn` Functionality

Download and unzip the Audio Toolbox™ model for CREPE to use pitchnn.

Type pitchnn at the Command Window. If the Audio Toolbox model for CREPE is not installed, then the function provides a link to the location of the network weights. To download the model, click the link and unzip the file to a location on the MATLAB® path.

Alternatively, execute these commands to download and unzip the CREPE model to your temporary directory.

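% Download the CREPE model archive to a file in the temporary directory.
downloadFile = fullfile(tempdir,'crepeDownload');
zipFile = websave(downloadFile,'https://ssd.mathworks.com/supportfiles/audio/crepe.zip');
% Unzip the archive and add the extracted crepe folder to the MATLAB path.
crepeLocation = tempdir;
unzip(zipFile,crepeLocation)
addpath(fullfile(crepeLocation,'crepe'))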

Pitch Estimation with `pitchnn`

The CREPE network requires you to preprocess your audio signals to generate buffered, overlapped, and normalized audio frames that can be used as input to the network. This example shows how the pitchnn function performs all of these steps for you.

Read in an audio signal for pitch estimation. Visualize and listen to the audio. There are nine vocal utterances in the audio clip.

[audioIn,fs] = audioread('SingingAMajor-16-mono-18secs.ogg');
soundsc(audioIn,fs)
T = 1/fs;
t = 0:T:(length(audioIn)*T) - T;
plot(t,audioIn);
grid on
axis tight
xlabel('Time (s)')
ylabel('Amplitude')
title('Singing in A Major')

Use the pitchnn function to produce the pitch estimate using a CREPE network with ModelCapacity set to tiny and ConfidenceThreshold disabled. Calling pitchnn with no output arguments plots the pitch estimation over time. If you call pitchnn before downloading the model, an error is printed to the Command Window with a download link.

pitchnn(audioIn,fs,'ModelCapacity','tiny','ConfidenceThreshold',0)

With confidence thresholding disabled, pitchnn provides a pitch estimate for every frame. Increase the ConfidenceThreshold to 0.8.

pitchnn(audioIn,fs,'ModelCapacity','tiny','ConfidenceThreshold',0.8)

Call pitchnn with ModelCapacity set to full. The plot shows nine primary groupings of pitch estimates, one for each of the nine vocal utterances.

pitchnn(audioIn,fs,'ModelCapacity','full','ConfidenceThreshold',0.8)

Call spectrogram and compare the frequency content of the signal with the pitch estimates from pitchnn. Use a frame size of 250 samples with an overlap of 225 samples (90%) and 4096 DFT points for the transform.

spectrogram(audioIn,250,225,4096,fs,'yaxis')

Input Arguments

`audioIn` — Input signal
column vector | matrix

Input signal, specified as a column vector or matrix. If you specify a matrix, pitchnn treats the columns of the matrix as individual audio channels.

Data Types: single | double
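
As a hedged illustration of the channel handling, the sketch below builds a two-channel signal from the mono example recording and checks the shape of the output. It assumes the CREPE model is installed. The scaled copy used as the second channel is a placeholder, not real stereo content.

```matlab
% Read the mono example recording used on this page.
[audioIn,fs] = audioread('SingingAMajor-16-mono-18secs.ogg');

% Build a placeholder two-channel signal: one column per channel.
twoChannel = [audioIn, 0.5*audioIn];

% pitchnn treats each column as a separate channel.
f0 = pitchnn(twoChannel,fs,'ModelCapacity','tiny');

% f0 has one column of estimates per input channel.
numChannels = size(f0,2);
```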

`fs` — Sample rate (Hz)
positive scalar

Sample rate of the input signal in Hz, specified as a positive scalar.

Data Types: single | double

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: pitchnn(audioIn,fs,'OverlapPercentage',50) sets the percent overlap between consecutive audio frames to 50.

`OverlapPercentage` — Overlap percentage between consecutive audio frames
`85` (default) | nonnegative scalar in the range [0,100)

Percentage overlap between consecutive audio frames, specified as a scalar in the range [0,100).

Data Types: single | double
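
To make the effect of OverlapPercentage concrete, the sketch below frames a placeholder signal by hand. It is not the pitchnn implementation. The 16 kHz rate and the 1024-sample frame length are assumptions taken from the CREPE paper [1], the rounding of the hop length is also an assumption, and all variable names are illustrative.

```matlab
% Hypothetical framing sketch; pitchnn performs the equivalent steps internally.
fs = 16e3;                      % assumed CREPE operating rate, from [1]
x = randn(fs,1);                % one second of placeholder audio
frameLength = 1024;             % assumed samples per frame, from [1]
overlapPercentage = 85;         % pitchnn default

% Number of samples between the starts of consecutive frames.
hopLength = round(frameLength*(1 - overlapPercentage/100));

% Build an index matrix with one column of sample indices per frame.
numFrames = floor((numel(x) - frameLength)/hopLength) + 1;
startIdx = (0:numFrames-1)*hopLength;       % zero-based frame starts
idx = (1:frameLength).' + startIdx;         % frameLength-by-numFrames
frames = x(idx);

% Normalize each frame to zero mean and unit standard deviation.
frames = (frames - mean(frames,1))./max(std(frames,0,1),eps);
```

A higher overlap percentage shortens the hop, so the same signal yields more frames and therefore more pitch estimates, at a higher computational cost.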

`ConfidenceThreshold` — Confidence threshold
`0.5` (default) | nonnegative scalar in the range [0,1)

Confidence threshold for each value of f0, specified as a scalar in the range [0,1).

To disable thresholding, set this argument to 0.

Note

If the maximum value of the activations vector for a frame is less than 'ConfidenceThreshold', the corresponding value of f0 is NaN.

Data Types: single | double
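
The rule in the note can be sketched on synthetic data as follows. The activations and f0 values here are placeholders invented for illustration, and the code is an approximation of the described behavior rather than the internal pitchnn logic.

```matlab
% Placeholder activations for three frames of a single channel.
activations = zeros(3,360,'single');
activations(1,100) = 0.9;       % confident frame
activations(2,200) = 0.3;       % low-confidence frame
activations(3,50)  = 0.6;       % confident frame

% Placeholder pitch estimates for the same three frames, in Hz.
f0 = single([220; 440; 110]);

confidenceThreshold = 0.5;      % pitchnn default

% Compare the peak activation of each frame with the threshold and
% replace the estimates of low-confidence frames with NaN.
confidence = max(activations,[],2);
f0(confidence < confidenceThreshold) = NaN;
```

With a threshold of 0, no peak activation is below the threshold, so every frame keeps its estimate.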

`ModelCapacity` — Model capacity
`'full'` (default) | `'tiny'` | `'small'` | `'medium'` | `'large'`

Model capacity, specified as 'tiny', 'small', 'medium', 'large', or 'full'.

Tip

'ModelCapacity' controls the complexity of the underlying deep learning neural network. The higher the model capacity, the greater the number of nodes and layers in the model.

Data Types: string | char

Output Arguments

`f0` — Estimated fundamental frequency
N-by-C array

Estimated fundamental frequency in Hertz, returned as an N-by-C array, where N is the number of fundamental frequency estimates and C is the number of channels in audioIn.

Data Types: single

`loc` — Time values
`1`-by-N vector

Time values associated with each f0 estimate, returned as a 1-by-N vector, where N is the number of fundamental frequency estimates. The time values correspond to the most recent samples used to compute the estimates.

Data Types: single | double

`activations` — CREPE network activations
N-by-`360`-by-C array

Activations from the CREPE network, returned as an N-by-360-by-C array, where N is the number of frames processed by the network and C is the number of channels in audioIn.

Data Types: single | double
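
Each of the 360 activations corresponds to a candidate pitch. The sketch below maps the strongest bin of each frame to a frequency. The 20-cent bin spacing comes from the CREPE paper [1]. The cents offset is an assumption taken from the open-source reference implementation of CREPE and is not confirmed for pitchnn. The sketch picks the single strongest bin, which is coarser than an estimate that interpolates around the peak.

```matlab
% Assumed bin-to-frequency mapping (not confirmed for pitchnn).
numBins = 360;
centsPerBin = 20;                       % bin spacing, from [1]
centsOffset = 1997.3794084376191;       % assumption: reference CREPE code
binCents = centsOffset + centsPerBin*(0:numBins-1);
binFreq = 10*2.^(binCents/1200);        % cents relative to 10 Hz, in Hz

% Placeholder activations for two frames of a single channel.
activations = zeros(2,numBins,'single');
activations(1,120) = 0.9;
activations(2,200) = 0.7;

% Pick the strongest bin in each frame and look up its frequency.
[~,peakBin] = max(activations,[],2);
f0Coarse = reshape(binFreq(peakBin),[],1);
```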

References

[1] Kim, Jong Wook, Justin Salamon, Peter Li, and Juan Pablo Bello. “Crepe: A Convolutional Representation for Pitch Estimation.” In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 161–65. Calgary, AB: IEEE, 2018. https://doi.org/10.1109/ICASSP.2018.8461329.

Extended Capabilities

GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.

This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
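
A minimal sketch of GPU usage follows, assuming Parallel Computing Toolbox, a supported GPU, and an installed CREPE model. The fallback branch is included only so that the sketch also runs on machines without a GPU.

```matlab
[audioIn,fs] = audioread('SingingAMajor-16-mono-18secs.ogg');

if canUseGPU
    % Move the audio to the GPU, estimate the pitch there, and bring
    % the result back to host memory.
    audioInGPU = gpuArray(audioIn);
    f0 = gather(pitchnn(audioInGPU,fs,'ModelCapacity','tiny'));
else
    % No usable GPU: run on the CPU instead.
    f0 = pitchnn(audioIn,fs,'ModelCapacity','tiny');
end
```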

Version History

Introduced in R2021a
