Sound analysis is a challenging task, associated with various modern applications, such as speech analytics, music information retrieval, speaker recognition, behavioral analytics and auditory scene analysis for security, health and environmental monitoring. This article provides a brief introduction to basic concepts of audio feature extraction, sound classification and segmentation, with demo examples in applications such as musical genre classification, speaker clustering, audio event classification and voice activity detection.
Python examples are provided in all cases, mostly through the pyAudioAnalysis library. All examples are also available in the accompanying GitHub repo.
With regard to the ML methodologies involved, this article focuses on hand-crafted audio features and traditional statistical classifiers such as SVMs. Deep audio methods are to follow in a future article, as the present article is more about learning to extract audio features that make sense to your classifiers even when you only have some tens of training samples.
Before proceeding deeper into audio recognition, the reader needs to know the basics of audio handling and signal representation: sound definition, sampling, quantization, sampling frequency, sample resolution and the basics of frequency representation. These topics are covered in a separate article.
So you should already know that an audio signal is represented by a sequence of samples at a given "sample resolution" (usually 16 bits = 2 bytes per sample) and with a particular sampling frequency (e.g. 16 kHz = 16000 samples per second).
We can now proceed to the next step: use these samples to analyze the corresponding sounds. By "analyze" we can mean anything from distinguishing between different types of sounds, to segmenting an audio signal into homogeneous parts (e.g. splitting voiced from unvoiced segments in a speech signal), to grouping sound files based on their content similarity.
In all cases, we first need to find a way to go from the low-level and voluminous audio data samples to a higher-level representation of the audio content. This is the purpose of feature extraction (FE), one of the most common and important tasks in machine learning and pattern recognition applications.
FE is about extracting a set of features that are informative with respect to the desired properties of the original data. In our case, we are interested in extracting audio features that are capable of discriminating between different audio classes, i.e. different speakers, events, emotions or musical genres, depending on the application subdomain.
The most important concept of audio feature extraction is short-term windowing (or framing): this simply means that the audio signal is split into short-term windows (or frames). The frames can optionally overlap.
The length of the frames usually ranges from 10 to 100 msecs depending on the application and types of signals. For the non-overlapping case, the step of the windowing procedure is equal to the window's length (also called "size").
If, on the other hand, step < size, then the frames are overlapping: e.g., a 10 msec step for a 40 msec window size means a 75% overlap. Usually, a window function (such as Hamming) is also applied to each frame. For each frame (let N be the total number of frames), we extract a set of (short-term) audio features. When the features are directly extracted from the audio sample values, they are called time-domain features. If the features are calculated on the FFT values, they are called frequency-domain features. Finally, cepstral features (such as MFCCs) are features that are based on the cepstrum.
As an example, let's assume we only extract the signal's energy (mean of squares of the audio samples) and spectral centroid (the centroid of the FFT's magnitude). This means that, after this framing procedure, the signal is represented by a sequence of 2-D short-term feature vectors (or two feature sequences of equal length, if you like).
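The framing procedure described above can be sketched with plain NumPy. This is only an illustration and not the library's implementation: the helper names `frame_signal` and `energy_and_centroid` are hypothetical, the input is a synthetic 1 kHz tone, and the centroid is reported in Hz rather than in normalized form.

```python
import numpy as np

def frame_signal(signal, fs, win_sec, step_sec):
    """Split a 1-D signal into (possibly overlapping) frames."""
    win, step = int(round(fs * win_sec)), int(round(fs * step_sec))
    n_frames = 1 + (len(signal) - win) // step
    return np.stack([signal[i * step: i * step + win] for i in range(n_frames)])

def energy_and_centroid(frames, fs):
    """Return a 2 x N matrix: frame energy and spectral centroid (in Hz)."""
    # time-domain feature: mean of squares of the samples of each frame
    energy = np.mean(frames ** 2, axis=1)
    # frequency-domain feature: centroid of the FFT magnitude of the
    # Hamming-windowed frame
    windowed = frames * np.hamming(frames.shape[1])
    mag = np.abs(np.fft.rfft(windowed, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    centroid = (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-10)
    return np.vstack([energy, centroid])

fs = 16000
t = np.arange(fs) / fs                     # 1 second of audio
signal = np.sin(2 * np.pi * 1000 * t)      # synthetic 1 kHz tone
# 40 msec window with a 10 msec step, i.e. 75% overlap
frames = frame_signal(signal, fs, 0.040, 0.010)
features = energy_and_centroid(frames, fs)
print(features.shape)                      # (n_features, n_frames)
```

For the pure tone used here, every frame should have an energy close to 0.5 and a centroid close to 1000 Hz, which is a quick sanity check for the two features.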
So how can we use these arbitrary-size sequences to analyze the respective signal? Imagine you want to build a classifier to discriminate between two audio classes, say speech and silence. Your initial training data are audio files and corresponding class labels (one class label per whole audio file). If these files have the same duration, the corresponding short-term feature vector sequences will have the same length. What happens, though, in the general case of arbitrary durations?
A more common approach followed in traditional audio analysis is to extract a set of feature statistics per fixed-size segment. The segment-level statistics extracted over the short-term feature sequences are the representations for each fixed-size segment. The final signal representation can be the long-term average of the segment statistics.
... segment feature statistics is the simplest way to go
As an example, consider an audio signal of 2.5 seconds. We select a short-term window of 50 msecs and a 1-sec segment. According to the above, the energy and spectral centroid sequences will be extracted for each 1-sec segment. The length N of the sequences will be equal to 1 / 0.050 = 20. Then, the μ and σ of each sequence are extracted for each 1-sec segment, as the segment feature statistics. These are finally long-term averaged, resulting in the final signal representation. (Note that the last segment is 0.5 sec long, so its statistics are extracted on a shorter segment.)
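The 2.5-second example above can be sketched as follows. This is a minimal illustration on random numbers standing in for the energy and spectral centroid sequences; `segment_statistics` is a hypothetical helper, not a library function.

```python
import numpy as np

def segment_statistics(st_features, frames_per_segment):
    """Mean and std of each short-term feature sequence, per segment.

    st_features: (n_features x n_frames) matrix.
    Returns a (2 * n_features x n_segments) matrix: all means, then all stds.
    """
    stats = []
    for start in range(0, st_features.shape[1], frames_per_segment):
        seg = st_features[:, start:start + frames_per_segment]
        stats.append(np.concatenate([seg.mean(axis=1), seg.std(axis=1)]))
    return np.array(stats).T

rng = np.random.default_rng(0)
# 2.5 sec signal with 50 msec frames: 50 frames of 2 features
st = rng.random((2, 50))
# 1-sec segments hold 20 frames each (the last segment only has 10)
mt = segment_statistics(st, frames_per_segment=20)
# long-term average of the segment statistics: final signal representation
signal_repr = mt.mean(axis=1)
print(mt.shape, signal_repr.shape)
```

The result is one 4-D vector (mean and std of two features) per segment, and a single 4-D vector for the whole signal regardless of its duration.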
Example1 uses pyAudioAnalysis to read a WAV audio file, extract short-term feature sequences and plot the energy sequence (just one of the features). Please see the inline comments for an explanation, along with these two notes:
read_audio_file() returns the sampling rate (Fs) of the audio file and a NumPy array of the raw audio samples. To get the duration in seconds, one simply needs to divide the number of samples by Fs.
ShortTermFeatures.feature_extraction() returns (a) a 68 x 20 short-term feature matrix, where 68 is the number of short-term features implemented in the library and 20 is the number of frames that fit into the roughly 1-sec recording, and (b) a 68-length list of strings that contain the names of each feature implemented in the library.
# Example 1: short-term feature extraction
from pyAudioAnalysis import ShortTermFeatures as aF
from pyAudioAnalysis import audioBasicIO as aIO
import numpy as np
import plotly.graph_objs as go
import plotly
import IPython
# read audio data from file
# (returns sampling freq and signal as a numpy array)
fs, s = aIO.read_audio_file("data/object.wav")
# play the initial and the generated files in notebook:
IPython.display.display(IPython.display.Audio("data/object.wav"))
# print duration in seconds:
duration = len(s) / float(fs)
print(f'duration = {duration} seconds')
# extract short-term features using 50 msec non-overlapping windows
win, step = 0.050, 0.050
[f, fn] = aF.feature_extraction(s, fs, int(fs * win),
                                int(fs * step))
print(f'{f.shape[1]} frames, {f.shape[0]} short-term features')
print('Feature names:')
for i, nam in enumerate(fn):
    print(f'{i}:{nam}')
# plot short-term energy
# create time axis in seconds (start time of each frame);
# built from the frame count so it always matches the feature length
time = np.arange(f.shape[1]) * step
# get the feature whose name is 'energy'
energy = f[fn.index('energy'), :]
mylayout = go.Layout(yaxis=dict(title="frame energy value"),
                     xaxis=dict(title="time (sec)"))
plotly.offline.iplot(go.Figure(data=[go.Scatter(x=time,
                                                y=energy)],
                               layout=mylayout))
duration = 1.03 seconds
20 frames, 68 short-term features
Feature names:
0:zcr
1:energy
2:energy_entropy
3:spectral_centroid
4:spectral_spread
5:spectral_entropy
6:spectral_flux
7:spectral_rolloff
8:mfcc_1
...
31:chroma_11
32:chroma_12
33:chroma_std
34:delta zcr
35:delta energy
...
66:delta chroma_12
67:delta chroma_std
Example2 demonstrates the spectral centroid short-term feature. The spectral centroid is simply the centroid of the FFT magnitude, normalized to [0, 1] over the [0, Fs/2] frequency range (e.g., a spectral centroid of 0.5 corresponds to Fs/4 when measured in Hz).
# Example 2: short-term feature extraction:
# spectral centroid of two speakers
from pyAudioAnalysis import ShortTermFeatures as aF
from pyAudioAnalysis import audioBasicIO as aIO
import numpy as np
import plotly.graph_objs as go
import plotly
import IPython
# read audio data from file
# (returns sampling freq and signal as a numpy array)
fs, s = aIO.read_audio_file("data/trump_bugs.wav")
# play the initial and the generated files in notebook:
IPython.display.display(IPython.display.Audio("data/trump_bugs.wav"))
# print duration in seconds:
duration = len(s) / float(fs)
print(f'duration = {duration} seconds')
# extract short-term features using 50 msec non-overlapping windows
win, step = 0.050, 0.050
[f, fn] = aF.feature_extraction(s, fs, int(fs * win),
                                int(fs * step))
print(f'{f.shape[1]} frames, {f.shape[0]} short-term features')
# plot short-term spectral centroid
# create time axis in seconds (start time of each frame)
time = np.arange(f.shape[1]) * step
# get the feature whose name is 'spectral_centroid'
centroid = f[fn.index('spectral_centroid'), :]
mylayout = go.Layout(yaxis=dict(title="spectral_centroid value"),
                     xaxis=dict(title="time (sec)"))
plotly.offline.iplot(go.Figure(data=[go.Scatter(x=time,
                                                y=centroid)],
                               layout=mylayout))
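Since the spectral centroid of Example2 is normalized by Fs/2, it can be mapped back to Hz with a one-line conversion. The helper name `centroid_to_hz` below is hypothetical and only illustrates the arithmetic.

```python
def centroid_to_hz(normalized_centroid, fs):
    """Map a centroid normalized to [0, 1] back to Hz (1.0 corresponds to fs / 2)."""
    return normalized_centroid * fs / 2.0

# a normalized centroid of 0.5 at fs = 16 kHz corresponds to fs / 4
print(centroid_to_hz(0.5, 16000))
```

The same function works elementwise on a NumPy array, so it can be applied to a whole spectral centroid sequence before plotting.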
In total, 34 short-term features are extracted in pyAudioAnalysis for each frame, and the ShortTermFeatures.feature_extraction() function also (optionally) extracts the respective delta features. In that case, the total number of features extracted for each short-term frame is 68. The complete list and description of the short-term features can be found in the library's documentation.
The first two examples used the ShortTermFeatures.feature_extraction() function to extract 68 features per short-term frame. As described in the previous section, in many cases, such as segment-level classification, we also extract segment-level statistics. This is achieved through the MidTermFeatures.mid_feature_extraction() function, as shown in Example3:
# Example 3: segment-level feature extraction
from pyAudioAnalysis import MidTermFeatures as aF
from pyAudioAnalysis import audioBasicIO as aIO
# read audio data from file
# (returns sampling freq and signal as a numpy array)
fs, s = aIO.read_audio_file("data/trump_bugs.wav")
# get mid-term (segment) feature statistics
# and respective short-term features:
mt, st, mt_n = aF.mid_feature_extraction(s, fs, 1 * fs, 1 * fs,
                                         0.05 * fs, 0.05 * fs)
print(f'signal duration {len(s)/fs} seconds')
print(f'{st.shape[1]} {st.shape[0]}-D short-term feature vectors extracted')
print(f'{mt.shape[1]} {mt.shape[0]}-D segment feature statistic vectors extracted')
print('mid-term feature names')
for i, mi in enumerate(mt_n):
    print(f'{i}:{mi}')
signal duration 3.812625 seconds
76 68-D short-term feature vectors extracted
4 136-D segment feature statistic vectors extracted
mid-term feature names
0:zcr_mean
1:energy_mean
2:energy_entropy_mean
3:spectral_centroid_mean
4:spectral_spread_mean
5:spectral_entropy_mean
6:spectral_flux_mean
7:spectral_rolloff_mean
8:mfcc_1_mean
...
131:delta chroma_9_std
132:delta chroma_10_std
133:delta chroma_11_std
134:delta chroma_12_std
135:delta chroma_std_std
MidTermFeatures.mid_feature_extraction() extracts 2 statistics, namely the mean and std of each short-term feature sequence, using the provided "mid-term" (segment) window size of 1 sec for the example above. Since the duration of the signal is 3.8 sec, and the mid-term window step and size are both 1 sec, we expect 4 mid-term segments to be created, with a feature statistics vector calculated for each of them. These segment statistics are computed on the short-term feature sequences of 3.8 / 0.05 = 76 short-term frames. Note that the mid-term feature names also contain the segment statistic, e.g. zcr_mean is the mean of the zero-crossing-rate short-term feature.
The first 3 examples showed how we can extract short-term features and mid-term (segment) feature statistics. The MidTermFeatures.directory_feature_extraction() function extracts audio features for all files in the provided folder, so that these data can be used for training a classifier etc. It calls MidTermFeatures.mid_feature_extraction() for each WAV file and performs long-term averaging to go from segment feature statistic vectors to a single feature vector. This function is also capable of extracting two music beat-related features that are appended to the averaged segment statistics. As an example, let's suppose we want to analyze a song of 120 seconds, with a short-term window (and step) of 50 msecs and a mid-term (segment) window and step of 1 second. The following steps will occur during the MidTermFeatures.directory_feature_extraction() call:
(a) the short-term feature sequence is extracted: 120 / 0.05 = 2400 68-D vectors
(b) 120 136-D feature statistics vectors (mean and std of the 68-D vector sequences) are extracted, one for each 1-sec segment
(c) the 120 136-D vectors are long-term averaged for the whole song and (optionally) the two beat features are appended, resulting in the final 138-D feature vector
Example4 demonstrates the usage of MidTermFeatures.directory_feature_extraction() to extract file-level features (averages of segment feature statistics) for 20 2-sec music samples (separate WAV files) from two musical genre categories, namely classical and heavy metal. For each of the 2-sec song samples, MidTermFeatures.directory_feature_extraction() extracts the 138-D feature vector, as described above. Then we select 2 of these features to plot, using different colors for the two audio classes (classical and metal):
# Example4: plot 2 features for 10 2-second samples
# from classical and 10 from metal music
from pyAudioAnalysis import MidTermFeatures as aF
import os
import numpy as np
import plotly.graph_objs as go
import plotly
dirs = ["data/music/classical", "data/music/metal"]
class_names = [os.path.basename(d) for d in dirs]
m_win, m_step, s_win, s_step = 1, 1, 0.1, 0.05
# segment-level feature extraction:
features = []
for d in dirs:  # get feature matrix for each directory (class)
    f, files, fn = aF.directory_feature_extraction(d, m_win, m_step,
                                                   s_win, s_step)
    features.append(f)
# (each element of the features list contains a
# (samples x segment features) = (10 x 138) feature matrix)
print(features[0].shape, features[1].shape)
# select 2 features and create feature matrices for the two classes:
f1 = np.array([features[0][:, fn.index('spectral_centroid_mean')],
               features[0][:, fn.index('energy_entropy_mean')]])
f2 = np.array([features[1][:, fn.index('spectral_centroid_mean')],
               features[1][:, fn.index('energy_entropy_mean')]])
# plot 2D features
plots = [go.Scatter(x=f1[0, :], y=f1[1, :],
                    name=class_names[0], mode='markers'),
         go.Scatter(x=f2[0, :], y=f2[1, :],
                    name=class_names[1], mode='markers')]
mylayout = go.Layout(xaxis=dict(title="spectral_centroid_mean"),
                     yaxis=dict(title="energy_entropy_mean"))
plotly.offline.iplot(go.Figure(data=plots, layout=mylayout))
In the example of the previous section, we saw how two features differ between two musical genre classes, using respective WAV files organized in two folders. Example5 shows how the same features can be used to train a simple SVM classifier: each point of a grid in the 2-D feature space is then classified to either of the two classes. This is a way of visualizing the decision surface of the classifier.
# Example5: plot 2 features for 10 2-second samples
# from classical and 10 from metal music.
# also train an SVM classifier and draw the respective
# decision surfaces
from pyAudioAnalysis import MidTermFeatures as aF
import os
import numpy as np
from sklearn.svm import SVC
import plotly.graph_objs as go
import plotly
dirs = ["data/music/classical", "data/music/metal"]
class_names = [os.path.basename(d) for d in dirs]
m_win, m_step, s_win, s_step = 1, 1, 0.1, 0.05
# segment-level feature extraction:
features = []
for d in dirs:  # get feature matrix for each directory (class)
    f, files, fn = aF.directory_feature_extraction(d, m_win, m_step,
                                                   s_win, s_step)
    features.append(f)
# select 2 features and create feature matrices for the two classes:
f1 = np.array([features[0][:, fn.index('spectral_centroid_mean')],
               features[0][:, fn.index('energy_entropy_mean')]])
f2 = np.array([features[1][:, fn.index('spectral_centroid_mean')],
               features[1][:, fn.index('energy_entropy_mean')]])
# plot 2D features
p1 = go.Scatter(x=f1[0, :], y=f1[1, :], name=class_names[0],
                marker=dict(size=10, color='rgba(255, 182, 193, .9)'),
                mode='markers')
p2 = go.Scatter(x=f2[0, :], y=f2[1, :], name=class_names[1],
                marker=dict(size=10, color='rgba(100, 100, 220, .9)'),
                mode='markers')
mylayout = go.Layout(xaxis=dict(title="spectral_centroid_mean"),
                     yaxis=dict(title="energy_entropy_mean"))
# class labels (0: first folder, 1: second folder) and feature matrix
y = np.concatenate((np.zeros(f1.shape[1]), np.ones(f2.shape[1])))
f = np.concatenate((f1.T, f2.T), axis=0)
# train the svm classifier
cl = SVC(kernel='rbf', C=20)
cl.fit(f, y)
# apply the trained model on the points of a grid
x_ = np.arange(f[:, 0].min(), f[:, 0].max(), 0.002)
y_ = np.arange(f[:, 1].min(), f[:, 1].max(), 0.002)
xx, yy = np.meshgrid(x_, y_)
Z = cl.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape) / 2
# and visualize the grid on the same plot (decision surfaces)
cs = go.Heatmap(x=x_, y=y_, z=Z, showscale=False,
                colorscale=[[0, 'rgba(255, 182, 193, .3)'],
                            [1, 'rgba(100, 100, 220, .3)']])
plotly.offline.iplot(go.Figure(data=[p1, p2, cs], layout=mylayout))
Then, as with any other classification task, the feature matrix X (along with its class labels y) can be split into train and test sets (using either random subsampling, k-fold cross-validation or leave-one-out), and for each train/test split an evaluation metric (such as F1, Recall, Precision, Accuracy or even the whole confusion matrix) is computed. Finally, we report the overall evaluation metric as the average over all train/test splits (depending on the method). Once we have the audio features, as described in the previous examples, this procedure can be followed as described in many tutorials (IMO the best ones are on the scikit-learn website).
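A minimal sketch of such an evaluation with scikit-learn is shown below. It assumes a synthetic, well-separated feature matrix as a stand-in for real file-level audio features, and the choice of 5 folds, feature standardization and C=1.0 is arbitrary rather than taken from this article.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# synthetic stand-in for audio features: 2 classes x 20 samples x 2 features
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)),
               rng.normal(3.0, 1.0, (20, 2))])
y = np.concatenate([np.zeros(20), np.ones(20)])

# normalize the features inside each training fold, then fit an RBF SVM
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["accuracy", "f1_macro"])

# report the metrics averaged over all train/test splits
print(f"accuracy = {scores['test_accuracy'].mean():.3f}")
print(f"macro F1 = {scores['test_f1_macro'].mean():.3f}")
```

Putting the scaler inside the pipeline keeps the normalization parameters from being estimated on the test fold, which matters most when only a few tens of samples are available.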
Alternatively, pyAudioAnalysis provides a wrapped functionality that includes both feature extraction and classifier training. This is done in the audioTrainTest.extract_features_and_train() function, which:
(a) first calls MidTermFeatures.multi_directory_feature_extraction(), which in turn calls MidTermFeatures.directory_feature_extraction(), to extract feature matrices for all given folders of audio files (assuming each folder corresponds to a class)
(b) generates the X and y matrices, i.e. the feature matrix for the classification task and the respective class labels
(c) evaluates the classifier for different parameters (e.g. C if SVM classifiers are selected)
(d) prints the evaluation results and saves the best model to a binary file (to be used by another function for testing, as shown later)
The following example trains an SVM classifier for the classical/metal music classification task:
# Example6: use pyAudioAnalysis wrapper
# to extract feature and train SVM classifier
# for 20 music (10 classical/10 metal) song samples
from pyAudioAnalysis.audioTrainTest import extract_features_and_train
mt, st = 1.0, 0.05
dirs = ["data/music/classical", "data/music/metal"]
extract_features_and_train(dirs, mt, mt, st, st, "svm_rbf", "svm_classical_metal")
          classical               metal                   OVERALL
C         PRE     REC     f1      PRE     REC     f1      ACC     f1
0.001     79.4    81.0    80.2    80.6    79.0    79.8    80.0    80.0
0.010     77.2    78.0    77.6    77.8    77.0    77.4    77.5    77.5
0.500     76.5    75.0    75.8    75.5    77.0    76.2    76.0    76.0
1.000     88.2    75.0    81.1    78.3    90.0    83.7    82.5    82.4
5.000     100.0   83.0    90.7    85.5    100.0   92.2    91.5    91.4    best f1  best Acc
10.000    100.0   78.0    87.6    82.0    100.0   90.1    89.0    88.9
20.000    100.0   75.0    85.7    80.0    100.0   88.9    87.5    87.3
Confusion Matrix:
        cla     met
cla     41.50   8.50
met     0.00    50.00
Selected params: 5.00000
audioTrainTest.extract_features_and_train() from the pyAudioAnalysis library is a wrapper that (a) reads all audio files organized in a list of folders and extracts long-term averaged feature statistics and (b) then trains a classifier, assuming that folder names represent audio classes.
The trained model will be saved in svm_classical_metal (last argument of the function), along with the feature extraction parameters (short-term and segment window sizes and steps). Note that another file, called svm_classical_metalMEANS, is also created; it stores the normalization parameters, i.e. the mean and std used to normalize the audio features before training and testing. Finally, apart from SVMs, the wrapper supports most scikit-learn classifiers such as decision trees and gradient boosting.
So we have trained an audio classifier to distinguish between two audio classes (classical and metal) based on averages of feature statistics as described before. Now let's see how we can use the trained model to predict the class of an unknown audio file. Towards this end, we are going to use pyAudioAnalysis' audioTrainTest.file_classification(), as shown in Example7:
# Example7: use trained model from Example6
# to classify an unknown sample (song)
from pyAudioAnalysis import audioTrainTest as aT
files_to_test = ["data/music/test/classical.00095.au.wav",
                 "data/music/test/metal.00004.au.wav",
                 "data/music/test/rock.00037.au.wav"]
for f in files_to_test:
    print(f'{f}:')
    c, p, p_nam = aT.file_classification(f, "svm_classical_metal", "svm_rbf")
    print(f'P({p_nam[0]}={p[0]})')
    print(f'P({p_nam[1]}={p[1]})')
    print()
data/music/test/classical.00095.au.wav:
P(classical=0.63654)
P(metal=0.3634828441433037)
data/music/test/metal.00004.au.wav:
P(classical=0.15576)
P(metal=0.82566)
data/music/test/rock.00037.au.wav:
P(classical=0.2757302369241449)
P(metal=0.7242697630758552)
audioTrainTest.file_classification() gets the trained model and the path of an unknown audio file, and performs feature extraction and classifier prediction for the unknown file, returning the (predicted) winner class, the class posteriors and the respective class names.
The previous example showed how we can apply the trained audio classifier to an unknown audio file to predict its audio label. In addition to that, pyAudioAnalysis provides the audioTrainTest.evaluate_model_for_folders() function, which accepts a list of folders, assuming their basenames are class names (as we do during training), and repeatedly applies a pre-trained classifier to the audio files of each folder. At the end it outputs performance metrics such as the confusion matrix and ROC curves:
# Example8: use trained model from Example6
# to classify audio files organized in folders
# and evaluate the predictions, assuming that
# foldernames = classes names as during training
from pyAudioAnalysis import audioTrainTest as aT
aT.evaluate_model_for_folders(["data/music/test/classical",
                               "data/music/test/metal"],
                              "svm_classical_metal",
                              "svm_rbf",
                              "classical")
This results in figures showing the confusion matrix, precision/recall/F1 per class, and the Precision/Recall curve and ROC curve for a "class of interest" (here we have provided classical). Note that the 3rd and 4th subplots evaluate the classifier as a detector of the class "classical" (last argument). For example, the last plot shows the true positive rate vs the false positive rate, and this is achieved by simulating thresholding of the posterior of the class of interest (classical): as the probability threshold is lowered, both the true positive and false positive rates rise, so the question is: how "steep" is the true positive rate's increase? The precision/recall curve is analogous to the ROC curve, but it shows both metrics on the same graph on the y-axis for different thresholds of the posterior, which is shown on the x-axis.
In our example, we can see that for a probability threshold of, say, 0.6 we can have 100% Precision with around 80% Recall for classical: this means that all files detected will indeed be classical, while we will be "losing" almost 1 out of 5 "classical" songs as metal. Another important note here is that there is no "best" operating point for the classifier; that depends on the individual application.
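The effect of the probability threshold on precision and recall can be reproduced with a few lines of NumPy. The posteriors below are made-up numbers chosen for illustration (they are not the outputs of the trained model), and `precision_recall_at` is a hypothetical helper.

```python
import numpy as np

def precision_recall_at(posteriors, is_positive, threshold):
    """Precision and recall when detecting the class of interest
    by thresholding its posterior probability."""
    detected = posteriors >= threshold
    tp = np.sum(detected & is_positive)
    fp = np.sum(detected & ~is_positive)
    fn = np.sum(~detected & is_positive)
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# hypothetical posteriors P(classical) for 5 classical and 5 metal files
p_classical = np.array([0.9, 0.8, 0.7, 0.65, 0.5, 0.55, 0.3, 0.2, 0.1, 0.05])
is_classical = np.array([True] * 5 + [False] * 5)

for thr in (0.4, 0.6):
    pre, rec = precision_recall_at(p_classical, is_classical, thr)
    print(f"threshold={thr}: precision={pre:.2f}, recall={rec:.2f}")
```

With these toy numbers, raising the threshold from 0.4 to 0.6 removes the single false detection but also misses one classical file, which is exactly the trade-off that makes the choice of operating point application dependent.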
In this section, we demonstrate how regression can be used to detect a choral singing segment's pitch, without using any signal processing approach (e.g. the autocorrelation method). Towards this end, we have used part of the Choral Singing Dataset, which is a set of a cappella recordings with respective pitch annotations. Here, we have chosen to use this dataset to produce segment-level pitch annotations: we split the singing recordings into small (0.5 sec) segments and, for each segment, we calculate the mean and standard deviation of the pitch (which is provided by the dataset). These two metrics are f0_mean and f0_std respectively and are the two target regression values demonstrated in the following code.
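Producing the two segment-level targets from a frame-level pitch annotation can be sketched as follows. This is an illustration under stated assumptions: the pitch track is synthetic, the 10 msec annotation hop is hypothetical, `segment_pitch_targets` is not part of any library, and unvoiced frames are not handled.

```python
import numpy as np

def segment_pitch_targets(f0_track, hop_sec, seg_sec=0.5):
    """Return one (f0_mean, f0_std) pair per fixed-size segment.

    f0_track: frame-level pitch values in Hz, one value every hop_sec.
    A trailing partial segment is dropped.
    """
    frames_per_seg = int(round(seg_sec / hop_sec))
    targets = []
    for start in range(0, len(f0_track) - frames_per_seg + 1, frames_per_seg):
        seg = f0_track[start:start + frames_per_seg]
        targets.append((float(np.mean(seg)), float(np.std(seg))))
    return targets

# synthetic pitch track: 1 sec of steady 220 Hz,
# followed by 1 sec alternating between 430 and 450 Hz
hop = 0.01
f0 = np.concatenate([np.full(100, 220.0), np.tile([430.0, 450.0], 50)])
targets = segment_pitch_targets(f0, hop)
for i, (m, s) in enumerate(targets):
    print(f"segment {i}: f0_mean={m:.1f} Hz, f0_std={s:.1f} Hz")
```

The first two segments have a low pitch with zero deviation, while the last two have a higher mean and a non-zero deviation, which is the kind of variation the two regression targets are meant to capture.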
Getting 0.5-sec segments from this dataset leads to thousands of samples, but for the demonstration purposes of this article we have used around 120 training and 120 testing samples, available under the data/regression/f0/segments_train and data/regression/f0/segments_test folders of the GitHub repo. Again, to demonstrate training and testing of the regression model we are using pyAudioAnalysis, which treats audio regression in the following way: the training folder contains the WAV files along with one CSV file per regression task, and a separate regression model is trained for each CSV. In our example, data/regression/f0/segments_train contains 120 WAV files and two 120-line CSV files named f0.csv and f0_std.csv. Each CSV corresponds to a separate regression task and each line of the CSV corresponds to the ground truth of the respective audio file. To train, evaluate, and save the two regression models for our example, the following code is used:
# Example9:
# Train two linear SVM regression models
# that map song segments to pitch and pitch deviation
# The following function searches for .csv files in the
# input folder. For each csv of the format <filename>,<value>
# a separate regression model is trained
from pyAudioAnalysis import audioTrainTest as aT
aT.feature_extraction_train_regression("data/regression/f0/segments_train",
                                       0.5, 0.5, 0.05, 0.05,
                                       "svm", "singing", False)
Since data/regression/f0/segments_train contains two CSVs, namely f0.csv and f0_std.csv, the above code results in two models: singing_f0 and singing_f0_std (the singing prefix is provided as the 7th argument of the function above and is used for all trained models). Also, this is the result of the evaluation process executed internally by the audioTrainTest.feature_extraction_train_regression() function:
Analyzing file 1 of 120: data/regression/f0/segments_train/CSD_ER_alto_1.wav_segments_263.1595.wav
Analyzing file 2 of 120: data/regression/f0/segments_train/CSD_ER_alto_1.wav_segments_264.957.wav
Analyzing file 3 of 120: data/regression/f0/segments_train/CSD_ER_alto_1.wav_segments_301.632.wav
Analyzing file 4 of 120: data/regression/f0/segments_train/CSD_ER_alto_1.wav_segments_328.748.wav
Analyzing file 5 of 120: data/regression/f0/segments_train/CSD_ER_alto_1.wav_segments_331.2835.wav
...
...
...
Analyzing file 119 of 120: data/regression/f0/segments_train/CSD_ND_alto_4.wav_segments_383.483.wav
Analyzing file 120 of 120: data/regression/f0/segments_train/CSD_ND_alto_4.wav_segments_394.315.wav
Feature extraction complexity ratio: 44.7 x realtime
Regression task f0_std
Param MSE T-MSE R-MSE
0.0010 736.98 10.46 661.43
0.0050 585.38 9.64 573.52
0.0100 522.73 9.17 539.87
0.0500 529.10 7.41 657.36
0.1000 379.13 6.73 541.03
0.2500 361.75 5.09 585.60
0.5000 323.20 3.88 522.12 best
1.0000 386.30 2.58 590.08
5.0000 782.14 0.99 548.65
10.0000 1140.95 0.47 529.20
Selected params: 0.50000
Regression task f0
Param MSE T-MSE R-MSE
0.0010 3103.83 44.65 3121.97
0.0050 2772.07 41.38 3098.40
0.0100 2293.79 37.57 2935.42
0.0500 1206.49 19.69 2999.49
0.1000 1012.29 13.94 3115.49
0.2500 839.82 8.64 3147.30
0.5000 758.04 5.62 2917.62
1.0000 689.12 3.53 3087.71 best
5.0000 892.52 1.07 3061.10
10.0000 1158.60 0.47 2889.27
Selected params: 1.00000
Now, once the two regression models are trained, evaluated and saved, we can use them to map any audio segment to either f0 or f0_std. Example10 demonstrates how to do this using audioTrainTest.file_regression():
# Example10
# load trained regression model for f0 and apply it to a folder
# of WAV files and evaluate (use csv file with ground truths)
import glob
import csv
import os
import numpy as np
import plotly.graph_objs as go
import plotly
from pyAudioAnalysis import audioTrainTest as aT
# read all files in testing folder:
wav_files_to_test = glob.glob("data/regression/f0/segments_test/*.wav")
ground_truths = {}
with open('data/regression/f0/segments_test/f0.csv', 'r') as file:
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        ground_truths[row[0]] = float(row[1])
estimated_val, gt_val = [], []
for w in wav_files_to_test:  # for each audio file
    # get the estimates for all regression models starting with "singing"
    values, tasks = aT.file_regression(w, "singing", "svm")
    # check if there is ground truth available for the current file
    if os.path.basename(w) in ground_truths:
        # ... and append ground truth and estimated values
        # for the f0 task
        estimated_val.append(values[tasks.index('f0')])
        gt_val.append(ground_truths[os.path.basename(w)])
# compute mean square error:
mse = ((np.array(estimated_val) - np.array(gt_val))**2).mean()
print(f'Testing MSE={mse}')
# plot real vs predicted results
p = go.Scatter(x=gt_val, y=estimated_val, mode='markers')
mylayout = go.Layout(xaxis=dict(title="f0 real"),
yaxis=dict(title="f0 predicted"),
showlegend=False)
# diagonal line (predicted = real) spanning the full range of values
lims = [min(gt_val + estimated_val), max(gt_val + estimated_val)]
plotly.offline.iplot(go.Figure(data=[p, go.Scatter(x=lims, y=lims)],
                               layout=mylayout))
In this example, we demonstrate how audioTrainTest.file_regression() can be used for a set of files from a test dataset. Note that this function returns decisions and task names for all available regression tasks that start with the provided prefix (in our case "singing"). The results are shown below: we can see that the real and predicted values are pretty close for the f0 task.
Testing MSE=492.74
audioTrainTest.feature_extraction_train_regression() reads a folder of WAV files and assumes that each CSV of (<path>,<value>) format is a regression ground-truth file. It then extracts audio features, trains and saves the respective number of models using a prefix (also provided as an argument). audioTrainTest.file_regression() reads the saved models and returns predicted regression outputs for all tasks.
Until now we have seen how to train supervised models that map segment-level audio feature statistics to either class labels (audio classification) or real-valued targets (audio regression). Also, we have seen how to use these models to predict the label of an unknown audio file, e.g. a speech utterance or a whole song or a song's segment. In all these cases, the assumption followed was that the unknown audio signals belonged to a single label. For example, a song belongs to a particular genre, a singing segment has a particular pitch value and a speech utterance has a particular emotion. However, in real-world applications, there are many cases in which audio signals are not segments of homogeneous content, but complex audio streams that contain many successive segments of different content labels. A recording of a real-world dialog, for instance, is a sequence of labels of speaker identities or emotions.
Real-world recordings are not segments of homogeneous content but sequences of segments of different labels
For that reason, audio segmentation is an important step of audio analysis: it is about segmenting a long audio recording into a sequence of segments that are of homogeneous content. The definition of homogeneity is relative to the application domain: if, for example, we are interested in speaker recognition, a segment is considered homogeneous if it belongs to the same speaker.
In Example6 we trained a model that classifies unknown music segments as "metal" or "classical" (the model was saved in the file svm_classical_metal). Let's use this model to segment a 30-sec recording that contains both metal and classical (non-overlapping) parts. This recording is stored in data/music/metal_classical_mix.wav of the article's code. Also, data/music/metal_classical_mix.segments contains the respective ground-truth annotation file of the format <start_segment_sec>\t<end_segment_sec>\t<segment_label>. This is the ground truth file:
0 7.5 classical
7.5 15 metal
15 19 classical
19 29 metal
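As a minimal sketch, an annotation file in this format could be read into a list of (start, end, label) tuples as follows. The helper names parse_segments and read_segments_file are hypothetical (they are not part of pyAudioAnalysis), and the parser splits on any whitespace so that it accepts both tab- and space-separated columns, which is an assumption about how strictly the format is followed.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]


def parse_segments(text: str) -> List[Segment]:
    """Parse lines of '<start_sec> <end_sec> <label>' into tuples."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split into at most 3 fields so that labels may contain spaces
        start, end, label = line.split(None, 2)
        segments.append((float(start), float(end), label))
    return segments


def read_segments_file(path: str) -> List[Segment]:
    """Hypothetical helper: read a ground-truth annotation file from disk."""
    with open(path, "r") as f:
        return parse_segments(f.read())


# the ground-truth content shown above, written with tab separators
ground_truth = parse_segments(
    "0\t7.5\tclassical\n7.5\t15\tmetal\n15\t19\tclassical\n19\t29\tmetal\n"
)
for start, end, label in ground_truth:
    print(f"{start:5.1f} - {end:5.1f} sec: {label}")
```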
The fix-window supervised segmentation functionality is implemented in function audioSegmentation.mid_term_file_classification(), as shown in Example11:
# Example 11
# Supervised audio segmentation example:
# - Apply model "svm_classical_metal" to achieve fix-sized, supervised audio segmentation
# on file data/music/metal_classical_mix.wav
# - Function audioSegmentation.mid_term_file_classification() uses pretrained model and applies
# the mid-term step that has been used when training the model (1 sec in our case as shown in Example6)
# - data/music/metal_classical_mix.segments contains the ground truth of the audio file
from pyAudioAnalysis.audioSegmentation import mid_term_file_classification, labels_to_segments
from pyAudioAnalysis.audioTrainTest import load_model
labels, class_names, _, _ = mid_term_file_classification("data/music/metal_classical_mix.wav",
                                                         "svm_classical_metal", "svm_rbf", True,
                                                         "data/music/metal_classical_mix.segments")
print("\nFix-sized segments:")
for il, l in enumerate(labels):
    print(f'fix-sized segment {il}: {class_names[int(l)]}')
# load the parameters of the model (actually we just want the mt_step here):
cl, m, s, m_classes, mt_win, mt_step, s_win, s_step, c_beat = load_model("svm_classical_metal")
# print "merged" segments (use labels_to_segments())
print("\nSegments:")
segs, c = labels_to_segments(labels, mt_step)
for iS, seg in enumerate(segs):
    print(f'segment {iS} {seg[0]} sec - {seg[1]} sec: {class_names[int(c[iS])]}')
audioSegmentation.mid_term_file_classification() returns a list of label ids (one for each fix-sized segment window), a list of class names, and the accuracy and confusion matrix (if ground truth is also provided, as in the example above). The labels list corresponds to fix-sized segments of length equal to the segment step used during training of the model (1 second in the above example, according to Example6). That's why we use audioTrainTest.load_model(): to load the segment step directly from the model file. Also, we use audioSegmentation.labels_to_segments() to generate the list of final segments, based on a simple merging rule (i.e. concatenate successive 1-sec segments that have the same label). The output of the above code is the following (in the plot, red corresponds to ground truth and blue to predicted segment labels):
Overall Accuracy: 0.79
Fix-sized segments:
fix-sized segment 0: classical
fix-sized segment 1: classical
fix-sized segment 2: classical
fix-sized segment 3: classical
fix-sized segment 4: classical
fix-sized segment 5: classical
fix-sized segment 6: classical
fix-sized segment 7: metal
fix-sized segment 8: metal
fix-sized segment 9: metal
fix-sized segment 10: metal
fix-sized segment 11: metal
fix-sized segment 12: classical
fix-sized segment 13: metal
fix-sized segment 14: metal
fix-sized segment 15: classical
fix-sized segment 16: classical
fix-sized segment 17: classical
fix-sized segment 18: metal
fix-sized segment 19: metal
fix-sized segment 20: classical
fix-sized segment 21: metal
fix-sized segment 22: classical
fix-sized segment 23: classical
fix-sized segment 24: metal
fix-sized segment 25: metal
fix-sized segment 26: metal
fix-sized segment 27: metal
fix-sized segment 28: metal
fix-sized segment 29: metal
Segments:
segment 0 0.0 sec - 7.0 sec: classical
segment 1 7.0 sec - 12.0 sec: metal
segment 2 12.0 sec - 13.0 sec: classical
segment 3 13.0 sec - 15.0 sec: metal
segment 4 15.0 sec - 18.0 sec: classical
segment 5 18.0 sec - 20.0 sec: metal
segment 6 20.0 sec - 21.0 sec: classical
segment 7 21.0 sec - 22.0 sec: metal
segment 8 22.0 sec - 24.0 sec: classical
segment 9 24.0 sec - 29.0 sec: metal
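The merging rule described above can be sketched in a few lines. This is an illustrative re-implementation, not the library's code: the function name merge_labels is hypothetical, and the mapping 0 = classical, 1 = metal is assumed only for this example.

```python
from typing import List, Sequence, Tuple


def merge_labels(labels: Sequence[int], step: float) -> List[Tuple[float, float, int]]:
    """Merge runs of identical successive labels into (start, end, label)."""
    segments = []
    if len(labels) == 0:
        return segments
    run_start = 0
    for i in range(1, len(labels) + 1):
        # close the current run at the end of the list or when the label changes
        if i == len(labels) or labels[i] != labels[run_start]:
            segments.append((run_start * step, i * step, labels[run_start]))
            run_start = i
    return segments


# per-window predictions copied from the output above
# (assumed mapping for this sketch: 0 = classical, 1 = metal)
predicted = [0] * 7 + [1] * 5 + [0] + [1] * 2 + [0] * 3 + [1] * 2 + [0, 1, 0, 0] + [1] * 6
names = ["classical", "metal"]
merged = merge_labels(predicted, 1.0)
for start, end, label in merged:
    print(f"{start} sec - {end} sec: {names[label]}")
```

This sketch treats every label as covering one full step, so its last segment ends at 30.0 sec. The library output above ends the last segment at 29.0 sec, so its handling of the final boundary evidently differs slightly from this sketch.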
audioSegmentation.labels_to_segments() returns the more "compact" and useful information for the end "user". Also, note that the segmentation errors are either:
(a) due to misclassifications of the segment classifier (e.g. fix-sized segments 22 and 23 are misclassified as classical when their true label is metal), or
(b) due to time resolution issues: e.g. according to the ground truth, the first classical music segment ends at 7.5 sec, while our model is applied every 1 second, so the best this fix-window methodology can achieve is to recognize classical until either 7 or 8 sec. Obviously, this can be handled through a smaller step in the segment window (i.e. by introducing a segment overlap); however, this comes with a significant increase in computational demands (more segment-level predictions will take place).
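To make the two error sources concrete, the following sketch assigns a ground-truth label to each 1-sec window and compares it with the predictions listed above. The helper names and the alignment convention are assumptions made for illustration: each window takes the ground-truth label found at its start time, and windows outside the annotation are ignored. pyAudioAnalysis may align windows differently.

```python
from typing import List, Optional, Sequence, Tuple


def label_at(t: float, segments: Sequence[Tuple[float, float, str]]) -> Optional[str]:
    """Return the ground-truth label at time t, or None if t is not annotated."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return None


def window_accuracy(predicted: Sequence[str],
                    segments: Sequence[Tuple[float, float, str]],
                    step: float) -> Tuple[float, List[int]]:
    """Compare per-window predictions with ground truth sampled at window starts."""
    correct, evaluated, errors = 0, 0, []
    for i, pred in enumerate(predicted):
        truth = label_at(i * step, segments)
        if truth is None:
            continue  # window lies outside the annotated part of the file
        evaluated += 1
        if pred == truth:
            correct += 1
        else:
            errors.append(i)
    accuracy = correct / evaluated if evaluated else 0.0
    return accuracy, errors


ground_truth = [(0, 7.5, "classical"), (7.5, 15, "metal"),
                (15, 19, "classical"), (19, 29, "metal")]
c, m = "classical", "metal"
# per-window predictions copied from the output above
predicted = [c] * 7 + [m] * 5 + [c] + [m] * 2 + [c] * 3 + [m] * 2 + [c, m, c, c] + [m] * 6

accuracy, wrong_windows = window_accuracy(predicted, ground_truth, step=1.0)
print(f"accuracy: {accuracy:.2f}, misclassified windows: {wrong_windows}")
```

Under this convention, window 7 straddles the 7.5 sec boundary and is counted as an error whichever label it receives at a 1-sec step, which is the time-resolution effect of case (b). Windows 22 and 23 lie entirely inside a metal part and are plain misclassifications, as in case (a). The resulting 23/29 happens to agree with the reported overall accuracy, although that does not confirm that the library uses the same convention.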
Music segmentation
Extracting structural parts from a music track is a typical use-case where unsupervised audio analysis can be used. Since it is obviously rather difficult to have a classifier that distinguishes between song parts, we can instead ask the question: can you group song segments so that segments of the same group sound like they belong to the same song part? In the following example, Michael Jackson's "Billie Jean" is used as input to the previously described segment-level feature extraction process, and a simple k-means clustering is applied on the resulting feature vector sequences. Then, the segments of each cluster are concatenated into an artificial recording and saved to audio files. Each artificial "cluster recording" shows how song parts can be grouped and whether this grouping makes some sense in terms of music structure. The code is shown in Example12:
# Example 12: Unsupervised Music Segmentation
#
# This example groups song segments into clusters of similar content
import os, sklearn.cluster
from pyAudioAnalysis.MidTermFeatures import mid_feature_extraction as mT
from pyAudioAnalysis.audioBasicIO import read_audio_file, stereo_to_mono
from pyAudioAnalysis.audioSegmentation import labels_to_segments
from pyAudioAnalysis.audioTrainTest import normalize_features
import numpy as np
import scipy.io.wavfile as wavfile
import IPython
# read signal and get normalized segment feature statistics:
input_file = "data/music/billie_jean.wav"
fs, x = read_audio_file(input_file)
mt_size, mt_step, st_win = 5, 0.5, 0.1
[mt_feats, st_feats, _] = mT(x, fs, mt_size * fs, mt_step * fs,
round(fs * st_win), round(fs * st_win * 0.5))
(mt_feats_norm, MEAN, STD) = normalize_features([mt_feats.T])
mt_feats_norm = mt_feats_norm[0].T
# perform clustering
n_clusters = 5
x_clusters = [np.zeros((fs, )) for i in range(n_clusters)]
k_means = sklearn.cluster.KMeans(n_clusters=n_clusters)
k_means.fit(mt_feats_norm.T)
cls = k_means.labels_
# save clusters to concatenated wav files
segs, c = labels_to_segments(cls, mt_step) # convert flags to segment limits
for sp in range(n_clusters):
    count_cl = 0
    for i in range(len(c)):  # for each segment in each cluster (>2 secs long)
        if c[i] == sp and segs[i, 1]-segs[i, 0] > 2:
            count_cl += 1
            # get the signal and append it to the cluster's signal (followed by some silence)
            cur_x = x[int(segs[i, 0] * fs): int(segs[i, 1] * fs)]
            x_clusters[sp] = np.append(x_clusters[sp], cur_x)
            x_clusters[sp] = np.append(x_clusters[sp], np.zeros((fs,)))
    # write cluster's signal into a WAV file
    print(f'cluster {sp}: {count_cl} segments {len(x_clusters[sp])/float(fs)} sec total dur')
    wavfile.write(f'cluster_{sp}.wav', fs, np.int16(x_clusters[sp]))
    IPython.display.display(IPython.display.Audio(f'cluster_{sp}.wav'))
The first cluster is clearly the chorus of the song, repeated twice (the second time is much longer, though, as it includes more successive repetitions and a small solo).
Cluster 2 has a single segment that corresponds to the song's intro.
Cluster 3 is the pre-chorus of the song.
The 4th cluster contains segments from the verses of the song (if you exclude the small segment in the beginning). The 5th cluster is not shown, as it just included a very short, almost-silent segment at the beginning of the song. In all cases, clusters represented (with some errors, of course) structural song components, even using this very simple approach and without making use of any "external" supervised knowledge, other than the assumption that similar features may mean similar music content.
Clusters of song segments may correspond to structural song elements if appropriate audio features are used
Finally, note that executing the code above may result in the same clustering but with a different ordering of cluster IDs (and therefore a different order of the resulting audio files). This is probably due to the k-means random seed.
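One way to check whether two runs produced the same grouping despite different cluster IDs is to test whether the two label sequences are identical up to a renaming of IDs. The sketch below is illustrative only; the function name same_clustering and the toy label lists are hypothetical and not taken from the article's code.

```python
from typing import Sequence


def same_clustering(a: Sequence[int], b: Sequence[int]) -> bool:
    """True if a and b define the same grouping up to a renaming of cluster IDs."""
    if len(a) != len(b):
        return False
    a_to_b, b_to_a = {}, {}
    for x, y in zip(a, b):
        # every ID in a must map to exactly one ID in b, and vice versa
        if a_to_b.setdefault(x, y) != y or b_to_a.setdefault(y, x) != x:
            return False
    return True


run_1 = [0, 0, 1, 1, 2, 2, 0]
run_2 = [2, 2, 0, 0, 1, 1, 2]   # same groups as run_1, different IDs
run_3 = [0, 0, 1, 2, 2, 2, 0]   # a genuinely different grouping

print(same_clustering(run_1, run_2))
print(same_clustering(run_1, run_3))
```

If reproducible IDs are needed, passing a fixed random_state to sklearn.cluster.KMeans is the usual way to get the same initialization on every run.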
Speaker diarization
This is the task that, given an unknown speech recording, answers the question: "who speaks when?". For the sake of simplicity, let's assume that we already know the number of speakers in the recording. What is the most straightforward way to solve this task? Obviously, first extract segment-level audio features and then perform some type of clustering, hoping that the resulting clusters will correspond to speaker IDs. In the following example (Example13), we use the exact same pipeline as the one followed in Example12, where we clustered a song into its structural parts. We have only changed the segment window size to 2 sec with a step of 0.1 sec, and we use a smaller short-term window (50 msec), since speech signals are, in general, characterized by faster changes in their main attributes, due to the existence of very different phonemes, some of which last just a few tens of milliseconds (musical notes, on the other hand, usually last longer, even in the fastest types of music). So Example13 uses the same rationale of clustering audio feature vectors. This time the input signal is a speech signal with 4 speakers (this is known beforehand), so we set the number of k-means clusters to 4:
import os, sklearn.cluster
from pyAudioAnalysis.MidTermFeatures import mid_feature_extraction as mT
from pyAudioAnalysis.audioBasicIO import read_audio_file, stereo_to_mono
from pyAudioAnalysis.audioSegmentation import labels_to_segments
from pyAudioAnalysis.audioTrainTest import normalize_features
import numpy as np
import scipy.io.wavfile as wavfile
import IPython
# read signal and get normalized segment feature statistics:
input_file = "data/diarization_example.wav"
fs, x = read_audio_file(input_file)
mt_size, mt_step, st_win = 2, 0.1, 0.05
[mt_feats, st_feats, _] = mT(x, fs, mt_size * fs, mt_step * fs,
round(fs * st_win), round(fs * st_win * 0.5))
(mt_feats_norm, MEAN, STD) = normalize_features([mt_feats.T])
mt_feats_norm = mt_feats_norm[0].T
# perform clustering
n_clusters = 4
x_clusters = [np.zeros((fs, )) for i in range(n_clusters)]
k_means = sklearn.cluster.KMeans(n_clusters=n_clusters)
k_means.fit(mt_feats_norm.T)
cls = k_means.labels_
# save clusters to concatenated wav files
segs, c = labels_to_segments(cls, mt_step) # convert flags to segment limits
for sp in range(n_clusters):
    count_cl = 0
    for i in range(len(c)):  # for each segment in each cluster (>2 secs long)
        if c[i] == sp and segs[i, 1]-segs[i, 0] > 2:
            count_cl += 1
            # get the signal and append it to the cluster's signal (followed by some silence)
            cur_x = x[int(segs[i, 0] * fs): int(segs[i, 1] * fs)]
            x_clusters[sp] = np.append(x_clusters[sp], cur_x)
            x_clusters[sp] = np.append(x_clusters[sp], np.zeros((fs,)))
    # write cluster's signal into a WAV file
    print(f'speaker {sp}: {count_cl} segments {len(x_clusters[sp])/float(fs)} sec total dur')
    wavfile.write(f'diarization_cluster_{sp}.wav', fs, np.int16(x_clusters[sp]))
    IPython.display.display(IPython.display.Audio(f'diarization_cluster_{sp}.wav'))
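The window settings of Example12 and Example13 can be compared with a little arithmetic. The sketch below is only an approximation under stated assumptions: the helper window_summary is hypothetical, the short-term step is taken as half the short-term window (as in the calls above), and the library's exact rounding of frame counts may differ.

```python
def window_summary(mt_size: float, mt_step: float, st_win: float,
                   st_overlap: float = 0.5) -> dict:
    """Summarise a segment/short-term window configuration (values in seconds)."""
    st_step = st_win * (1.0 - st_overlap)
    return {
        "short_term_step_sec": st_step,
        # approximate number of short-term frames summarised by one segment vector
        "frames_per_segment": round(mt_size / st_step),
        # number of segment-level feature vectors (and cluster labels) per second
        "segments_per_sec": round(1.0 / mt_step, 6),
    }


music = window_summary(mt_size=5, mt_step=0.5, st_win=0.1)     # Example12 settings
speech = window_summary(mt_size=2, mt_step=0.1, st_win=0.05)   # Example13 settings
print("music :", music)
print("speech:", speech)
```

With a 0.1 sec step, the speech example yields about ten cluster labels per second of audio, so the "> 2 secs" filter in the loop above keeps only runs of roughly more than 20 identical successive labels.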