

Extract OpenL3 features



    embeddings = openl3Features(audioIn,fs) returns OpenL3 feature embeddings over time for audio input audioIn with sample rate fs. Columns of the input are treated as individual channels.


    embeddings = openl3Features(audioIn,fs,Name,Value) specifies options using one or more Name,Value arguments. For example, embeddings = openl3Features(audioIn,fs,'OverlapPercentage',75) applies a 75% overlap between consecutive frames used to create the audio embeddings.

    This function requires both Audio Toolbox™ and Deep Learning Toolbox™.
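
    As a minimal sketch of the Name,Value syntax, the following call combines two of the options described on this page. The signal length and the option values are arbitrary choices for illustration, and the code assumes the OpenL3 model is already installed.

```matlab
fs = 16e3;
% Five seconds of single-channel pink noise as a test signal (illustrative input).
audioIn = pinknoise(5*fs,1,'single');

% Combine several options in one call; the order of the pairs does not matter.
embeddings = openl3Features(audioIn,fs, ...
    'OverlapPercentage',50, ...
    'SpectrumType','mel256');

% Rows are frames over time; columns are embedding elements.
[numFrames,embeddingLength] = size(embeddings)
```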



    Download and unzip the Audio Toolbox™ model for OpenL3.

    Type openl3Features at the command line. If the Audio Toolbox model for OpenL3 is not installed, the function provides a link to the location of the network weights. To download the model, click the link. Unzip the file to a location on the MATLAB path.

    Alternatively, execute the following commands to download and unzip the OpenL3 model to your temporary directory.

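    downloadFolder = fullfile(tempdir,'OpenL3Download');
    % The URL argument is empty here; use the link that openl3Features reports.
    loc = websave(downloadFolder,'');
    OpenL3Location = tempdir;
    % Unzip the downloaded archive into the temporary directory.
    unzip(loc,OpenL3Location)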

    Read in an audio file.

    [audioIn,fs] = audioread('MainStreetOne-16-16-mono-12secs.wav');

    Call the openl3Features function with the audio and sample rate to extract OpenL3 feature embeddings from the audio.

    featureVectors = openl3Features(audioIn,fs);

    The openl3Features function returns a matrix of 512-element feature vectors over time.

    [numHops,numElementsPerHop,numChannels] = size(featureVectors)
    numHops = 111
    numElementsPerHop = 512
    numChannels = 1

    Create a 10-second pink noise signal and then extract OpenL3 features. By default, the openl3Features function extracts features from mel spectrograms with 90% overlap.

    fs = 16e3;
    dur = 10;
    audioIn = pinknoise(dur*fs,1,'single');
    features = openl3Features(audioIn,fs);

    Plot the OpenL3 features over time.

    surf(features,'EdgeColor','none')
    view([30 65])
    axis tight
    xlabel('Feature Index')
    ylabel('Frame')
    zlabel('Feature Value')
    title('OpenL3 Features')

    To decrease the resolution of OpenL3 features over time, specify a lower percent overlap between consecutive mel spectrograms. Plot the results.

    overlapPercentage = 10;
    features = openl3Features(audioIn,fs,'OverlapPercentage',overlapPercentage);
    surf(features,'EdgeColor','none')
    view([30 65])
    axis tight
    xlabel('Feature Index')
    ylabel('Frame')
    zlabel('Feature Value')
    title('OpenL3 Features')
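
    To see how the overlap setting changes the time resolution, one option is to count the frames returned for a few overlap values. This is a sketch that assumes the OpenL3 model is installed; the three overlap values are arbitrary, and no frame counts are listed because they depend on the framing used internally.

```matlab
fs = 16e3;
dur = 10;
audioIn = pinknoise(dur*fs,1,'single');

% Higher overlap means a smaller hop between frames, so more frames are returned.
overlaps = [10 50 90];
numFrames = zeros(size(overlaps));
for k = 1:numel(overlaps)
    features = openl3Features(audioIn,fs,'OverlapPercentage',overlaps(k));
    numFrames(k) = size(features,1);
    fprintf('Overlap %2d%%: %d frames\n',overlaps(k),numFrames(k));
end
```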

    Input Arguments


    audioIn: Input signal, specified as a column vector or matrix. If you specify a matrix, openl3Features treats the columns of the matrix as individual audio channels.

    Data Types: single | double
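
    The channel handling can be illustrated with a two-column input. This sketch assumes the OpenL3 model is installed and uses synthetic pink noise as a stand-in for a real stereo recording.

```matlab
fs = 16e3;
% Two-channel pink noise; each column is treated as one audio channel.
audioIn = pinknoise(5*fs,2,'single');

embeddings = openl3Features(audioIn,fs);

% The third dimension of the output indexes the input channels.
numChannels = size(embeddings,3)
```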

    fs: Sample rate of the input signal in Hz, specified as a positive scalar.

    Data Types: single | double

    Name-Value Pair Arguments

    Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

    Example: openl3Features(audioIn,fs,'SpectrumType','mel256')

    'OverlapPercentage': Percentage overlap between consecutive spectrograms, specified as a scalar in the range [0,100).

    Data Types: single | double

    'SpectrumType': Spectrum type generated from audio and used as input to the neural network, specified as 'mel128', 'mel256', or 'linear'.


    The SpectrumType that you select controls the spectrogram used in the network. See openl3 or openl3Preprocess for more details.

    Data Types: char | string

    'EmbeddingLength': Length of the output audio embedding, specified as 512 or 6144.

    Data Types: single | double

    'ContentType': Audio content type the neural network is trained on, specified as 'env' or 'music'.

    Set ContentType to:

    • 'env' when you want to use a model trained on environmental data.

    • 'music' when you want to use a model trained on musical data.

    Data Types: char | string
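
    The following sketch selects the model trained on musical data and requests the longer embedding. It assumes the OpenL3 model is installed and that the embedding-length option is named 'EmbeddingLength'; that name follows the toolbox naming convention and is an assumption here, since it is not spelled out in the text above.

```matlab
[audioIn,fs] = audioread('MainStreetOne-16-16-mono-12secs.wav');

% Use the music-trained model and the 6144-element embedding.
% 'EmbeddingLength' is the assumed name of the embedding-length option.
musicFeatures = openl3Features(audioIn,fs, ...
    'ContentType','music', ...
    'EmbeddingLength',6144);

% Each row is one frame; the number of columns is the embedding length.
embeddingLength = size(musicFeatures,2)
```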

    Output Arguments


    embeddings: Compact representation of audio data, returned as an N-by-L-by-C array, where:

    • N is the number of buffered frames the audio signal is partitioned into. It depends on the length of audioIn and the 'OverlapPercentage'.

    • L is the audio embedding length.

    • C is the number of input channels.

    Data Types: single
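
    As a sketch of how the N-by-L-by-C layout can be indexed, the code below uses a random array as a stand-in for the function output, so it runs without the model. The sizes are illustrative, and averaging over frames is one common way to summarize a clip, not something this function does for you.

```matlab
% Stand-in for the output of openl3Features (N-by-L-by-C); sizes are illustrative.
embeddings = rand(91,512,2,'single');

[numFrames,embeddingLength,numChannels] = size(embeddings);

% Embeddings for the first channel only, as an N-by-L matrix.
channel1 = embeddings(:,:,1);

% One summary vector per channel: average over the frame dimension,
% then drop the singleton dimension to get an L-by-C matrix.
clipLevel = squeeze(mean(embeddings,1));
```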


    [1] Cramer, Jason, et al. "Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings." In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 3852-56. doi:10.1109/ICASSP.2019.8682475.

    Introduced in R2021a