Tutorial 02 - Loading Data from an Unstructured Directory

In Tutorial 01 we assumed a specific folder structure to load the audio files and create a PyTorch Dataset. This is restrictive: in most cases a dataset ships as a folder containing all audio files, and the individual splits are defined by some other structure (e.g., CSV or JSON files). In this tutorial we demonstrate an alternative, more Pythonic way to load your data and create the audio classification dataset.

1. Dataset Downloading & Inspection

For this tutorial we use a small version of the SpeechCommands dataset. It consists of 12 classes: 10 spoken English commands (e.g., “down”, “go”, “left”) from various speakers, plus the special classes _silence_ and _unknown_. More information about the dataset can be found in the HEAR evaluation benchmark.

[1]:

# We download the dataset from Zenodo using wget

!wget https://zenodo.org/records/5887964/files/hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1

--2026-04-17 15:24:25--  https://zenodo.org/records/5887964/files/hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1
Resolving zenodo.org (zenodo.org)... 188.185.43.153, 188.185.48.75, 188.184.103.118, ...
Connecting to zenodo.org (zenodo.org)|188.185.43.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1430299345 (1.3G) [application/octet-stream]
Saving to: ‘hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1’

hear2021-speech_com 100%[===================>]   1.33G  6.69MB/s    in 4m 15s

2026-04-17 15:28:40 (5.35 MB/s) - ‘hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1’ saved [1430299345/1430299345]

[ ]:

# We extract the downloaded tar.gz file and move the contents to the /data directory (folder should exist)
!tar -zxf ./hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1 -C /data
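
If you would rather stay in Python, the same extraction step can be sketched with the standard-library tarfile module. The helper name extract_archive is an illustrative assumption and not part of the tutorial's setup; unlike the shell command, it creates the target folder when it is missing.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir, creating the folder if needed."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        archive.extractall(path=target)
    return target


# Example call (archive name as saved by wget above):
# extract_archive("hear2021-speech_commands-v0.0.2-5h-48000.tar.gz?download=1", "/data")
```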

Now the dataset is available at /data/hear-2021.0.6/tasks/speech_commands-v0.0.2-5h. The folder contains the following files:

labelvocabulary.csv: The mapping between class names and integer values.
task_metadata.json: Metadata of the dataset.
train.json: The audio filenames corresponding to the training set.
valid.json: The audio filenames corresponding to the validation set.
test.json: The audio filenames corresponding to the test set.

The folder 48000 contains three subfolders (train, test, and valid), each holding the audio files of the respective split at a 48 kHz sampling rate.

[5]:

# We inspect the contents of the metadata file
import json
from pathlib import Path

DATA_PATH = Path("/data/hear-2021.0.6/tasks/speech_commands-v0.0.2-5h/")
TRAIN_PATH = DATA_PATH / "48000" / "train"
TEST_PATH = DATA_PATH / "48000" / "test"
VALID_PATH = DATA_PATH / "48000" / "valid"

with open(DATA_PATH / "task_metadata.json", "r") as f:
    metadata = json.load(f)

metadata

[5]:

{'task_name': 'speech_commands',
 'version': 'v0.0.2',
 'embedding_type': 'scene',
 'prediction_type': 'multiclass',
 'split_mode': 'trainvaltest',
 'sample_duration': 1.0,
 'evaluation': ['top1_acc'],
 'download_urls': [{'split': 'train',
   'url': 'http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz',
   'md5': '6b74f3901214cb2c2934e98196829835'},
  {'split': 'test',
   'url': 'http://download.tensorflow.org/data/speech_commands_test_set_v0.02.tar.gz',
   'md5': '854c580ee90bff80c516491c84544e32'}],
 'default_mode': '5h',
 'max_task_duration_by_split': {'train': 16000.0,
  'valid': 2000.0,
  'test': None},
 'tmp_dir': '_workdir',
 'mode': '5h',
 'splits': ['train', 'valid', 'test']}

The metadata shows that each audio clip is 1 second long (sample_duration). We will therefore set segment_duration=1.0 when creating the PyTorch dataset. Below we inspect the format of the JSON split files.

[7]:

with open(DATA_PATH / "train.json", "r") as f:
    train_json = json.load(f)

# Inspect the first entry in the train.json file
key, value = next(iter(train_json.items()))

print(key, value)

_silence__doing_the_dishes-1048000.wav ['_silence_']

We see that the JSON maps each filename to a list of class labels (here a single label). We parse the JSON files for the validation and test splits in a similar manner.

[8]:

with open(DATA_PATH / "test.json", "r") as f:
    test_json = json.load(f)

with open(DATA_PATH / "valid.json", "r") as f:
    valid_json = json.load(f)
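
Because each split file maps a filename to a one-element list of labels, a quick class-balance check is straightforward. The sketch below uses a small hand-written dictionary in the same format as a stand-in (only the first filename comes from the output above; the other two are made up for illustration). With the real data you would pass train_json, valid_json, or test_json instead.

```python
from collections import Counter


def count_labels(split_json):
    """Count how many files carry each label in a {filename: [label]} mapping."""
    return Counter(labels[0] for labels in split_json.values())


# Toy stand-in for a split file; the last two entries are hypothetical.
toy_split = {
    "_silence__doing_the_dishes-1048000.wav": ["_silence_"],
    "example_a.wav": ["down"],
    "example_b.wav": ["down"],
}

print(count_labels(toy_split).most_common())
```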

2. Dataset Creation using Python Dictionaries

Now that we understand the structure of the dataset, we can easily create the datasets. We first define the class_mapping from the provided labelvocabulary.csv file.

[12]:

import csv

with open(DATA_PATH / "labelvocabulary.csv", "r") as f:
    reader = csv.reader(f)
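    next(reader)  # Skip the header row
    # Each row holds the integer index followed by the class name
    label_mapping = {rows[0]: rows[1] for rows in reader}

# Invert to {class_name: integer index}
class_mapping = {v: int(k) for k, v in label_mapping.items()}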

class_mapping

[12]:

{'_silence_': 0,
 '_unknown_': 1,
 'down': 2,
 'go': 3,
 'left': 4,
 'no': 5,
 'off': 6,
 'on': 7,
 'right': 8,
 'stop': 9,
 'up': 10,
 'yes': 11}
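
Before building the datasets, it can be worth checking that every label used in a split file actually appears in class_mapping, since an unmapped label would otherwise only surface later. A minimal sketch, where find_unmapped_labels is a hypothetical helper and the toy inputs stand in for the real variables:

```python
def find_unmapped_labels(split_json, class_mapping):
    """Return the sorted labels of a {filename: [label, ...]} mapping that have no class index."""
    used = {label for labels in split_json.values() for label in labels}
    return sorted(used - set(class_mapping))


# Hypothetical toy inputs; "sideways" is deliberately absent from the mapping.
toy_class_mapping = {"_silence_": 0, "_unknown_": 1, "down": 2}
toy_labelled_split = {"a.wav": ["down"], "b.wav": ["_silence_"], "c.wav": ["sideways"]}

print(find_unmapped_labels(toy_labelled_split, toy_class_mapping))
```

An empty list means every label in the split has an integer index.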

To instantiate a PyTorch Dataset for audio classification we use the function audio_classification_dataset_from_dictionary. It expects the same arguments as audio_classification_dataset_from_dir, except that instead of a path we provide a Python dictionary of the form {"<abs_path_to_file>": "class_name"} through the file_to_class_mapping argument. Conveniently, this information is already contained in the train_json, valid_json, and test_json variables defined previously.

[15]:

from deepaudiox import audio_classification_dataset_from_dictionary

# We only need to prepend the split folder path to each filename and take the single class label from its list
train_json = {str(TRAIN_PATH / key): value[0] for key, value in train_json.items()}
valid_json = {str(VALID_PATH / key): value[0] for key, value in valid_json.items()}
test_json = {str(TEST_PATH / key): value[0] for key, value in test_json.items()}

train_dset = audio_classification_dataset_from_dictionary(file_to_class_mapping=train_json,
                                                          class_mapping=class_mapping,
                                                          sample_rate=32000,
                                                          segment_duration=1.0)

valid_dset = audio_classification_dataset_from_dictionary(file_to_class_mapping=valid_json,
                                                          class_mapping=class_mapping,
                                                          sample_rate=32000,
                                                          segment_duration=1.0)

test_dset = audio_classification_dataset_from_dictionary(file_to_class_mapping=test_json,
                                                          class_mapping=class_mapping,
                                                          sample_rate=32000,
                                                          segment_duration=1.0)
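
Since the dictionary keys must point to real files, a small pre-check can catch a wrong root folder before dataset creation. The sketch below is an illustrative helper (to_file_class_mapping is an assumed name, not part of deepaudiox); it performs the same path-prepending and label-indexing as the cell above and raises if any file is missing.

```python
from pathlib import Path


def to_file_class_mapping(split_json, audio_dir):
    """Turn {filename: [label]} into {path: label}, failing on missing files."""
    audio_dir = Path(audio_dir)
    mapping = {}
    missing = []
    for filename, labels in split_json.items():
        path = audio_dir / filename
        if not path.is_file():
            missing.append(filename)
            continue
        mapping[str(path)] = labels[0]
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} file(s) not found in {audio_dir}, e.g. {missing[0]}"
        )
    return mapping


# Example call with the variables defined earlier:
# train_mapping = to_file_class_mapping(train_json, TRAIN_PATH)
```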

[17]:

# Check the first entry in the training dataset
print(train_dset[0])

{'path': '/data/hear-2021.0.6/tasks/speech_commands-v0.0.2-5h/48000/train/_silence__doing_the_dishes-1048000.wav', 'y_true': 0, 'class_name': '_silence_', 'segment_idx': 0, 'feature': array([ 0.01144081,  0.00943983,  0.00135719, ..., -0.01853629,
       -0.0183027 , -0.0120908 ], shape=(32000,), dtype=float32)}
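
The feature shape of (32000,) follows from the settings used above: at sample_rate=32000, a segment of segment_duration=1.0 seconds holds 32000 samples, even though the files on disk are stored at 48 kHz. This would be consistent with the audio being resampled to the requested rate on load, which is an inference from the output rather than something stated in this tutorial. The arithmetic as a short sketch (samples_per_segment is a hypothetical helper):

```python
def samples_per_segment(sample_rate, segment_duration):
    """Number of audio samples in one segment of the given duration (in seconds)."""
    return int(round(sample_rate * segment_duration))


print(samples_per_segment(32_000, 1.0))  # 32000, matching shape=(32000,)
print(samples_per_segment(48_000, 1.0))  # 48000 samples per 1-second clip at the on-disk rate
```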

[18]:

# Check the lengths of the datasets
print(f"Number of training samples: {len(train_dset)}")
print(f"Number of validation samples: {len(valid_dset)}")
print(f"Number of test samples: {len(test_dset)}")

Number of training samples: 16000
Number of validation samples: 2000
Number of test samples: 4890

3. Initializing the AudioClassifier

Now the rest is easy. The steps are classifier initialization -> Trainer -> Evaluator. We instantiate a simple audio classifier using MobileNet as the backbone feature extractor, a lightweight CNN-based architecture that enables fast training. Since the backbone is lightweight, we do not freeze it (freeze_backbone=False): all of its weights are updated during training, starting from the pretrained weights (pretrained=True).

[20]:

from deepaudiox import AudioClassifier

model = AudioClassifier(backbone="mobilenet_10_as",
                        num_classes=len(class_mapping),
                        freeze_backbone=False,
                        pretrained=True,
                        sample_rate=32_000)

To see all the backbones available in the library, use the AVAILABLE_BACKBONES variable, which lists them.

[21]:

from deepaudiox import AVAILABLE_BACKBONES

[24]:

# Model Inspection
model

[24]:

AudioClassifierConstructor(
  (backbone_constructor): BackboneConstructor(
    (backbone): MobileNet(
      (features): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(1, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (1): BatchNorm2d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
          (2): Hardswish()
        )
        (1): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=16, bias=False)
              (1): BatchNorm2d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(16, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (2): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(16, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=64, bias=False)
              (1): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (2): Conv2dNormActivation(
              (0): Conv2d(64, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(24, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (3): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(24, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(72, 72, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=72, bias=False)
              (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (2): Conv2dNormActivation(
              (0): Conv2d(72, 24, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(24, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (4): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(24, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(72, 72, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=72, bias=False)
              (1): BatchNorm2d(72, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (2): ConcurrentSEBlock(
              (conc_se_layers): ModuleList(
                (0): SqueezeExcitation(
                  (fc1): Linear(in_features=72, out_features=24, bias=True)
                  (fc2): Linear(in_features=24, out_features=72, bias=True)
                  (activation): ReLU()
                  (scale_activation): Sigmoid()
                )
              )
            )
            (3): Conv2dNormActivation(
              (0): Conv2d(72, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(40, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (5): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(40, 120, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(120, 120, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=120, bias=False)
              (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (2): ConcurrentSEBlock(
              (conc_se_layers): ModuleList(
                (0): SqueezeExcitation(
                  (fc1): Linear(in_features=120, out_features=32, bias=True)
                  (fc2): Linear(in_features=32, out_features=120, bias=True)
                  (activation): ReLU()
                  (scale_activation): Sigmoid()
                )
              )
            )
            (3): Conv2dNormActivation(
              (0): Conv2d(120, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(40, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (6): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(40, 120, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(120, 120, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=120, bias=False)
              (1): BatchNorm2d(120, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (2): ConcurrentSEBlock(
              (conc_se_layers): ModuleList(
                (0): SqueezeExcitation(
                  (fc1): Linear(in_features=120, out_features=32, bias=True)
                  (fc2): Linear(in_features=32, out_features=120, bias=True)
                  (activation): ReLU()
                  (scale_activation): Sigmoid()
                )
              )
            )
            (3): Conv2dNormActivation(
              (0): Conv2d(120, 40, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(40, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (7): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(40, 240, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(240, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(240, 240, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), groups=240, bias=False)
              (1): BatchNorm2d(240, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (2): Conv2dNormActivation(
              (0): Conv2d(240, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (8): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(80, 200, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(200, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(200, 200, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=200, bias=False)
              (1): BatchNorm2d(200, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (2): Conv2dNormActivation(
              (0): Conv2d(200, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (9): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(80, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(184, 184, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=184, bias=False)
              (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (2): Conv2dNormActivation(
              (0): Conv2d(184, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (10): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(80, 184, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(184, 184, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=184, bias=False)
              (1): BatchNorm2d(184, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (2): Conv2dNormActivation(
              (0): Conv2d(184, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(80, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (11): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(80, 480, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(480, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(480, 480, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=480, bias=False)
              (1): BatchNorm2d(480, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (2): ConcurrentSEBlock(
              (conc_se_layers): ModuleList(
                (0): SqueezeExcitation(
                  (fc1): Linear(in_features=480, out_features=120, bias=True)
                  (fc2): Linear(in_features=120, out_features=480, bias=True)
                  (activation): ReLU()
                  (scale_activation): Sigmoid()
                )
              )
            )
            (3): Conv2dNormActivation(
              (0): Conv2d(480, 112, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(112, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (12): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(112, 672, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(672, 672, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=672, bias=False)
              (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (2): ConcurrentSEBlock(
              (conc_se_layers): ModuleList(
                (0): SqueezeExcitation(
                  (fc1): Linear(in_features=672, out_features=168, bias=True)
                  (fc2): Linear(in_features=168, out_features=672, bias=True)
                  (activation): ReLU()
                  (scale_activation): Sigmoid()
                )
              )
            )
            (3): Conv2dNormActivation(
              (0): Conv2d(672, 112, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(112, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (13): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(112, 672, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(672, 672, kernel_size=(5, 5), stride=(2, 2), padding=(2, 2), groups=672, bias=False)
              (1): BatchNorm2d(672, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (2): ConcurrentSEBlock(
              (conc_se_layers): ModuleList(
                (0): SqueezeExcitation(
                  (fc1): Linear(in_features=672, out_features=168, bias=True)
                  (fc2): Linear(in_features=168, out_features=672, bias=True)
                  (activation): ReLU()
                  (scale_activation): Sigmoid()
                )
              )
            )
            (3): Conv2dNormActivation(
              (0): Conv2d(672, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(160, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (14): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(960, 960, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=960, bias=False)
              (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (2): ConcurrentSEBlock(
              (conc_se_layers): ModuleList(
                (0): SqueezeExcitation(
                  (fc1): Linear(in_features=960, out_features=240, bias=True)
                  (fc2): Linear(in_features=240, out_features=960, bias=True)
                  (activation): ReLU()
                  (scale_activation): Sigmoid()
                )
              )
            )
            (3): Conv2dNormActivation(
              (0): Conv2d(960, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(160, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (15): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(960, 960, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), groups=960, bias=False)
              (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
              (2): Hardswish()
            )
            (2): ConcurrentSEBlock(
              (conc_se_layers): ModuleList(
                (0): SqueezeExcitation(
                  (fc1): Linear(in_features=960, out_features=240, bias=True)
                  (fc2): Linear(in_features=240, out_features=960, bias=True)
                  (activation): ReLU()
                  (scale_activation): Sigmoid()
                )
              )
            )
            (3): Conv2dNormActivation(
              (0): Conv2d(960, 160, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(160, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
            )
          )
        )
        (16): Conv2dNormActivation(
          (0): Conv2d(160, 960, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (1): BatchNorm2d(960, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
          (2): Hardswish()
        )
      )
      (feature_extractor): AugmentMelSTFT(
        (freqm): FrequencyMasking()
        (timem): TimeMasking()
      )
    )
    (pooling): GAP()
  )
  (classifier): MLPHead(
    (model): Sequential(
      (0): Linear(in_features=960, out_features=12, bias=True)
    )
  )
)

4. Training

Now we are ready to train our model for speech command classification. Note that in this case the dataset comes with a predetermined validation set, which we can use during training.

[25]:

from deepaudiox import Trainer

trainer = Trainer(model=model,
                  train_dset=train_dset,
                  validation_dset=valid_dset,
                  epochs=50,
                  batch_size=128,
                  patience=10)

trainer.train()

[Epoch 1/50]

Using GPU: NVIDIA GeForce RTX 4090

Epoch 1 | Train Loss: 1.5644 | Val. Loss: 1.5606 | Time: 3.32s
[CHECKPOINTER] Validation loss decreased: (inf --> 1.560594), (-nan%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 2/50]
Epoch 2 | Train Loss: 1.4823 | Val. Loss: 0.9147 | Time: 2.62s
[CHECKPOINTER] Validation loss decreased: (1.560594 --> 0.914667), (-41.39%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 3/50]
Epoch 3 | Train Loss: 1.3784 | Val. Loss: 1.4767 | Time: 2.66s
[Epoch 4/50]
Epoch 4 | Train Loss: 1.3140 | Val. Loss: 0.4436 | Time: 2.60s
[CHECKPOINTER] Validation loss decreased: (0.914667 --> 0.443601), (-51.50%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 5/50]
Epoch 5 | Train Loss: 1.3024 | Val. Loss: 0.3455 | Time: 2.65s
[CHECKPOINTER] Validation loss decreased: (0.443601 --> 0.345517), (-22.11%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 6/50]
Epoch 6 | Train Loss: 1.2830 | Val. Loss: 0.4031 | Time: 2.65s
[Epoch 7/50]
Epoch 7 | Train Loss: 1.2611 | Val. Loss: 0.2845 | Time: 2.62s
[CHECKPOINTER] Validation loss decreased: (0.345517 --> 0.284467), (-17.67%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 8/50]
Epoch 8 | Train Loss: 1.2377 | Val. Loss: 0.2525 | Time: 2.67s
[CHECKPOINTER] Validation loss decreased: (0.284467 --> 0.252507), (-11.23%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 9/50]
Epoch 9 | Train Loss: 1.2382 | Val. Loss: 0.2880 | Time: 2.68s
[Epoch 10/50]
Epoch 10 | Train Loss: 1.2242 | Val. Loss: 0.2928 | Time: 2.65s
[Epoch 11/50]
Epoch 11 | Train Loss: 1.2149 | Val. Loss: 0.2199 | Time: 2.61s
[CHECKPOINTER] Validation loss decreased: (0.252507 --> 0.219867), (-12.93%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 12/50]
Epoch 12 | Train Loss: 1.2129 | Val. Loss: 0.2184 | Time: 2.68s
[CHECKPOINTER] Validation loss decreased: (0.219867 --> 0.218392), (-0.67%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 13/50]
Epoch 13 | Train Loss: 1.2014 | Val. Loss: 0.1860 | Time: 2.70s
[CHECKPOINTER] Validation loss decreased: (0.218392 --> 0.185957), (-14.85%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 14/50]
Epoch 14 | Train Loss: 1.2113 | Val. Loss: 0.1765 | Time: 2.67s
[CHECKPOINTER] Validation loss decreased: (0.185957 --> 0.176475), (-5.10%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 15/50]
Epoch 15 | Train Loss: 1.1968 | Val. Loss: 0.1685 | Time: 2.74s
[CHECKPOINTER] Validation loss decreased: (0.176475 --> 0.168494), (-4.52%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 16/50]
Epoch 16 | Train Loss: 1.2047 | Val. Loss: 0.2046 | Time: 2.72s
[Epoch 17/50]
Epoch 17 | Train Loss: 1.1981 | Val. Loss: 0.1594 | Time: 2.68s
[CHECKPOINTER] Validation loss decreased: (0.168494 --> 0.159420), (-5.39%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 18/50]
Epoch 18 | Train Loss: 1.1948 | Val. Loss: 0.1701 | Time: 2.69s
[Epoch 19/50]
Epoch 19 | Train Loss: 1.1924 | Val. Loss: 0.1541 | Time: 2.63s
[CHECKPOINTER] Validation loss decreased: (0.159420 --> 0.154148), (-3.31%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 20/50]
Epoch 20 | Train Loss: 1.1851 | Val. Loss: 0.1478 | Time: 2.68s
[CHECKPOINTER] Validation loss decreased: (0.154148 --> 0.147844), (-4.09%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 21/50]
Epoch 21 | Train Loss: 1.1814 | Val. Loss: 0.1905 | Time: 2.67s
[Epoch 22/50]
Epoch 22 | Train Loss: 1.1673 | Val. Loss: 0.1482 | Time: 2.60s
[Epoch 23/50]
Epoch 23 | Train Loss: 1.1719 | Val. Loss: 0.1611 | Time: 2.66s
[Epoch 24/50]
Epoch 24 | Train Loss: 1.1771 | Val. Loss: 0.1800 | Time: 2.65s
[Epoch 25/50]
Epoch 25 | Train Loss: 1.1650 | Val. Loss: 0.1583 | Time: 2.67s
[Epoch 26/50]
Epoch 26 | Train Loss: 1.1611 | Val. Loss: 0.1416 | Time: 2.63s
[CHECKPOINTER] Validation loss decreased: (0.147844 --> 0.141557), (-4.25%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 27/50]
Epoch 27 | Train Loss: 1.1593 | Val. Loss: 0.1402 | Time: 2.68s
[CHECKPOINTER] Validation loss decreased: (0.141557 --> 0.140151), (-0.99%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 28/50]
Epoch 28 | Train Loss: 1.1600 | Val. Loss: 0.1407 | Time: 2.70s
[Epoch 29/50]
Epoch 29 | Train Loss: 1.1700 | Val. Loss: 0.1624 | Time: 2.67s
[Epoch 30/50]
Epoch 30 | Train Loss: 1.1484 | Val. Loss: 0.1412 | Time: 2.63s
[Epoch 31/50]
Epoch 31 | Train Loss: 1.1673 | Val. Loss: 0.1415 | Time: 2.63s
[Epoch 32/50]
Epoch 32 | Train Loss: 1.1712 | Val. Loss: 0.1409 | Time: 2.63s
[Epoch 33/50]
Epoch 33 | Train Loss: 1.1398 | Val. Loss: 0.1438 | Time: 2.68s
[Epoch 34/50]
Epoch 34 | Train Loss: 1.1542 | Val. Loss: 0.1260 | Time: 2.64s
[CHECKPOINTER] Validation loss decreased: (0.140151 --> 0.126029), (-10.08%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 35/50]
Epoch 35 | Train Loss: 1.1653 | Val. Loss: 0.1367 | Time: 2.70s
[Epoch 36/50]
Epoch 36 | Train Loss: 1.1419 | Val. Loss: 0.1463 | Time: 2.66s
[Epoch 37/50]
Epoch 37 | Train Loss: 1.1602 | Val. Loss: 0.1179 | Time: 2.64s
[CHECKPOINTER] Validation loss decreased: (0.126029 --> 0.117883), (-6.46%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 38/50]
Epoch 38 | Train Loss: 1.1494 | Val. Loss: 0.1095 | Time: 2.69s
[CHECKPOINTER] Validation loss decreased: (0.117883 --> 0.109453), (-7.15%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 39/50]
Epoch 39 | Train Loss: 1.1425 | Val. Loss: 0.1517 | Time: 2.70s
[Epoch 40/50]
Epoch 40 | Train Loss: 1.1475 | Val. Loss: 0.1156 | Time: 2.64s
[Epoch 41/50]
Epoch 41 | Train Loss: 1.1372 | Val. Loss: 0.2005 | Time: 2.67s
[Epoch 42/50]
Epoch 42 | Train Loss: 1.1491 | Val. Loss: 0.1219 | Time: 2.64s
[Epoch 43/50]
Epoch 43 | Train Loss: 1.1490 | Val. Loss: 0.1205 | Time: 2.67s
[Epoch 44/50]
Epoch 44 | Train Loss: 1.1481 | Val. Loss: 0.1335 | Time: 2.68s
[Epoch 45/50]
Epoch 45 | Train Loss: 1.1458 | Val. Loss: 0.1184 | Time: 2.66s
[Epoch 46/50]
Epoch 46 | Train Loss: 1.1441 | Val. Loss: 0.1129 | Time: 2.68s
[EARLY STOPPING] Elapsed epochs: 8 out of 10
[Epoch 47/50]
Epoch 47 | Train Loss: 1.1367 | Val. Loss: 0.1086 | Time: 2.66s
[CHECKPOINTER] Validation loss decreased: (0.109453 --> 0.108631), (-0.75%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 48/50]
Epoch 48 | Train Loss: 1.1329 | Val. Loss: 0.1046 | Time: 2.69s
[CHECKPOINTER] Validation loss decreased: (0.108631 --> 0.104579), (-3.73%).
[CHECKPOINTER] Checkpoint saved successfully at: checkpoint.pt
[Epoch 49/50]
Epoch 49 | Train Loss: 1.1461 | Val. Loss: 0.1272 | Time: 2.71s
[Epoch 50/50]
Epoch 50 | Train Loss: 1.1353 | Val. Loss: 0.1148 | Time: 2.65s
Training has finished.
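
The log reflects two mechanisms driven by the validation loss: a checkpoint is written whenever the loss reaches a new minimum, and a patience counter tracks the epochs without improvement. The following is a generic sketch of that pattern, not the deepaudiox implementation; the class name PatienceTracker and its interface are assumptions made for illustration.

```python
class PatienceTracker:
    """Track the best validation loss and count epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def update(self, val_loss):
        """Record one epoch; return True if val_loss is a new best (checkpoint-worthy)."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
            return True
        self.epochs_without_improvement += 1
        return False

    @property
    def should_stop(self):
        """True once the counter has reached the patience limit."""
        return self.epochs_without_improvement >= self.patience


# Validation losses for epochs 38-47, copied from the log above.
tracker = PatienceTracker(patience=10)
for loss in [0.1095, 0.1517, 0.1156, 0.2005, 0.1219, 0.1205, 0.1335, 0.1184, 0.1129, 0.1086]:
    improved = tracker.update(loss)

print(tracker.epochs_without_improvement, tracker.should_stop)
```

Under this reading, the counter stands at 8 after epoch 46, which is consistent with the "8 out of 10" message in the log. The improvement at epoch 47 resets it, so training runs through all 50 epochs.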

5. Evaluation

As in the first tutorial, we use the Evaluator to check the performance on the held-out test set.

[26]:

from deepaudiox import Evaluator

# First load the best model checkpoint
model = AudioClassifier.from_checkpoint("checkpoint.pt")

evaluator = Evaluator(model=model, test_dset=test_dset, class_mapping=class_mapping)

evaluator.evaluate()

Using GPU: NVIDIA GeForce RTX 4090

Testing has finished.
[REPORTER] Class mapping: {'_silence_': 0, '_unknown_': 1, 'down': 2, 'go': 3, 'left': 4, 'no': 5, 'off': 6, 'on': 7, 'right': 8, 'stop': 9, 'up': 10, 'yes': 11}

[REPORTER] Classification Report:

              precision    recall  f1-score   support

   _silence_       1.00      0.94      0.97       408
   _unknown_       0.60      0.99      0.75       408
        down       0.96      0.85      0.90       406
          go       0.98      0.80      0.88       402
        left       0.98      0.92      0.95       412
          no       0.91      0.93      0.92       405
         off       0.97      0.92      0.95       402
          on       0.99      0.87      0.93       396
       right       1.00      0.93      0.96       396
        stop       1.00      0.99      0.99       411
          up       0.97      0.96      0.97       425
         yes       0.99      0.98      0.99       419

    accuracy                           0.92      4890
   macro avg       0.95      0.92      0.93      4890
weighted avg       0.95      0.92      0.93      4890

[REPORTER] Confusion Matrix:

[[385  23   0   0   0   0   0   0   0   0   0   0]
 [  0 405   0   1   0   2   0   0   0   0   0   0]
 [  0  42 347   1   1  15   0   0   0   0   0   0]
 [  0  44  13 320   2  21   0   0   0   0   2   0]
 [  0  29   0   0 379   0   0   0   0   0   1   3]
 [  0  21   3   3   0 378   0   0   0   0   0   0]
 [  0  20   0   0   0   0 371   3   0   1   7   0]
 [  0  42   0   0   0   0   8 345   0   0   1   0]
 [  0  27   0   0   2   0   0   0 367   0   0   0]
 [  0   4   0   0   0   0   0   0   0 407   0   0]
 [  0  15   0   0   0   0   3   0   0   0 407   0]
 [  0   5   0   0   1   0   1   0   0   0   0 412]]
[REPORTER] Average Posteriors:

_silence_           : 0.987
_unknown_           : 0.989
down                : 0.967
go                  : 0.926
left                : 0.977
no                  : 0.951
off                 : 0.978
on                  : 0.961
right               : 0.975
stop                : 0.997
up                  : 0.982
yes                 : 0.994
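
The per-class precision and recall in the report can be recomputed from the confusion matrix, assuming rows are true classes and columns are predicted classes (the row sums match the support column, which supports this reading). A minimal sketch in plain Python, using the matrix printed above; precision_recall is a hypothetical helper:

```python
confusion = [
    [385, 23, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 405, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0],
    [0, 42, 347, 1, 1, 15, 0, 0, 0, 0, 0, 0],
    [0, 44, 13, 320, 2, 21, 0, 0, 0, 0, 2, 0],
    [0, 29, 0, 0, 379, 0, 0, 0, 0, 0, 1, 3],
    [0, 21, 3, 3, 0, 378, 0, 0, 0, 0, 0, 0],
    [0, 20, 0, 0, 0, 0, 371, 3, 0, 1, 7, 0],
    [0, 42, 0, 0, 0, 0, 8, 345, 0, 0, 1, 0],
    [0, 27, 0, 0, 2, 0, 0, 0, 367, 0, 0, 0],
    [0, 4, 0, 0, 0, 0, 0, 0, 0, 407, 0, 0],
    [0, 15, 0, 0, 0, 0, 3, 0, 0, 0, 407, 0],
    [0, 5, 0, 0, 1, 0, 1, 0, 0, 0, 0, 412],
]


def precision_recall(matrix, class_idx):
    """Precision and recall for one class; rows are true labels, columns are predictions."""
    true_positives = matrix[class_idx][class_idx]
    predicted_total = sum(row[class_idx] for row in matrix)  # column sum
    actual_total = sum(matrix[class_idx])  # row sum (support)
    return true_positives / predicted_total, true_positives / actual_total


precision, recall = precision_recall(confusion, 1)  # index 1 is '_unknown_'
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / sum(map(sum, confusion))
print(f"_unknown_ precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

The low precision of _unknown_ comes from the second column of the matrix: most misclassified commands are predicted as _unknown_, while almost no _unknown_ samples are assigned to another class.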
