deepaudiox

This page provides the core API reference for DeepAudioX.

deepaudiox.AVAILABLE_BACKBONES: tuple[str, ...] = ('beats', 'passt', 'mobilenet_05_as', 'mobilenet_10_as', 'mobilenet_40_as'): Supported pretrained backbone names available at runtime.

deepaudiox.AVAILABLE_POOLING: tuple[str, ...] = ('gap', 'simpool', 'ep'): Supported pooling layer names available at runtime.
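
A minimal sketch of how these two tuples might be used to validate user input before a model is built. The tuple values are copied by hand from the entries above so the snippet runs without the package installed, and `validate_choice` is a hypothetical helper, not part of DeepAudioX.

```python
# Values copied from the reference above; in real code, import them instead:
# from deepaudiox import AVAILABLE_BACKBONES, AVAILABLE_POOLING
AVAILABLE_BACKBONES = ("beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as")
AVAILABLE_POOLING = ("gap", "simpool", "ep")


def validate_choice(name: str, choices: tuple[str, ...], kind: str) -> str:
    """Return `name` unchanged if it is a supported option, else raise ValueError."""
    if name not in choices:
        raise ValueError(f"Unknown {kind} {name!r}; expected one of {choices}")
    return name


backbone = validate_choice("beats", AVAILABLE_BACKBONES, "backbone")
pooling = validate_choice("gap", AVAILABLE_POOLING, "pooling")
```

Checking names up front gives a readable error before any pretrained weights are loaded.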

class deepaudiox.AudioClassificationDataset(file_to_class_mapping, sample_rate, class_mapping, segment_duration=None)[source]

Bases: Dataset

PyTorch Dataset for audio classification tasks.

This dataset loads audio files and returns a dictionary containing the raw waveform (under the key "feature"), the corresponding class name, and the integer class ID defined in class_mapping. The file_to_class_mapping argument must be a dictionary of the form:

{"abs/path/to/audio.wav": "class_name"}

Optionally, the dataset can segment each audio file into fixed-duration chunks using segment_duration. When enabled, each segment becomes an individual dataset sample.

file_to_class_mapping

Mapping from file paths to class names.

Type: dict

sample_rate

Target sampling rate for audio loading.

Type: int

class_mapping

Mapping from string class labels to integer IDs.

Type: dict

Initialize the dataset.

Parameters:

file_to_class_mapping (dict[str | PathLike, str]) – Mapping from file paths to class names.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.

Example

>>> from deepaudiox import AudioClassificationDataset
>>> file_to_class_mapping = {
...     "path/to/audio1.wav": "speech",
...     "path/to/audio2.wav": "music",
... }
>>> class_mapping = {"speech": 0, "music": 1}
>>> dataset = AudioClassificationDataset(
...     file_to_class_mapping=file_to_class_mapping,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
... )
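
The effect of segment_duration on dataset length can be estimated with simple arithmetic. The sketch below assumes segments are non-overlapping and laid back to back, which this reference does not state explicitly; `count_segments` is a hypothetical helper, not a DeepAudioX function.

```python
from typing import Optional


def count_segments(num_samples: int, sample_rate: int, segment_duration: Optional[float]) -> int:
    """Number of dataset samples one audio file contributes."""
    if segment_duration is None:
        return 1  # the full audio is a single sample
    segment_samples = int(round(segment_duration * sample_rate))
    # Floor division drops the trailing partial segment.
    return num_samples // segment_samples


# A 5-second clip at 16 kHz split into 2-second segments.
n = count_segments(num_samples=5 * 16_000, sample_rate=16_000, segment_duration=2.0)
```

Under this assumption, a clip shorter than segment_duration would contribute no samples at all, which is worth checking when a dataset turns out smaller than expected.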

deepaudiox.AudioClassifier: alias of AudioClassifierConstructor

deepaudiox.Backbone: alias of BackboneConstructor

class deepaudiox.Evaluator(test_dset, model, class_mapping, batch_size=16, num_workers=4, device='cpu', device_index=None, verbose=True)[source]

Bases: object

The core SDK module for testing a model.

The Evaluator assembles all modules required for testing and performs the testing process.

state

Stores testing variables.

Type: State

verbose

Whether to log the evaluation report after testing.

Type: bool

device

The device used for testing.

Type: str

class_mapping

A mapping between class names and IDs.

Type: dict

logger

A module used for logging messages.

Type: logging.Logger

test_dloader

The DataLoader of the testing set.

Type: torch.utils.data.DataLoader

model

An AudioClassifier module inheriting from BaseAudioClassifier.

Type: BaseAudioClassifier

callbacks

A list of callbacks used throughout the testing lifecycle.

Type: list

Initialize the Evaluator.

Parameters:

test_dset (AudioClassificationDataset) – The testing dataset.
model (BaseAudioClassifier) – An AudioClassifier module inheriting from BaseAudioClassifier.
class_mapping (dict) – A mapping between class names and IDs.
batch_size (int) – The batch size for the PyTorch DataLoader. Defaults to 16.
num_workers (int) – The number of worker processes for the PyTorch DataLoader. Defaults to 4.
device (Literal['cuda', 'mps', 'cpu']) – The device to use for evaluation. One of "cuda", "mps", or "cpu". Defaults to "cpu".
device_index (int | None) – The GPU device index. Only applicable when device="cuda". If None, uses the default CUDA device.
verbose (bool) – If True, prints the classification report, confusion matrix, and average posteriors after evaluation. “Evaluation has finished.” is always printed. Defaults to True.
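
A minimal sketch of how device and device_index could combine into a PyTorch-style device string. The library's internal logic is not shown in this reference, so `resolve_device` is a hypothetical helper that only illustrates the documented rule that device_index applies to "cuda" alone.

```python
from typing import Optional


def resolve_device(device: str = "cpu", device_index: Optional[int] = None) -> str:
    """Build a PyTorch-style device string such as "cuda:1"."""
    if device not in ("cuda", "mps", "cpu"):
        raise ValueError(f"Unsupported device {device!r}")
    if device == "cuda" and device_index is not None:
        return f"cuda:{device_index}"
    # device_index is ignored for "mps" and "cpu"; plain "cuda" means the default GPU.
    return device


second_gpu = resolve_device("cuda", 1)
default_gpu = resolve_device("cuda")
```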

Example

>>> import torch
>>> from deepaudiox import AudioClassifier, Evaluator
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> test_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> model.load_state_dict(torch.load("checkpoint.pt"))
>>> evaluator = Evaluator(test_dset=test_dataset, model=model, class_mapping=class_mapping)
>>> evaluator.evaluate()

evaluate()[source]

Run the full evaluation loop over the test set.

Iterates over all batches in test_dloader, accumulates true labels, predicted labels, and posterior probabilities into self.state, then triggers the registered callbacks via on_testing_end.

Always prints “Evaluation has finished.” regardless of verbose. The Reporter callback (classification report, confusion matrix, average posteriors) is only executed when verbose=True.

Return type: None

After this method returns, self.state holds:

y_true (np.ndarray): Ground-truth class indices, shape (N,).
y_pred (np.ndarray): Predicted class indices, shape (N,).
posteriors (np.ndarray): Max posterior probability per sample, shape (N,).
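
The stored labels can be turned into custom metrics after evaluation. The sketch below uses plain Python lists so it runs without NumPy; with a real Evaluator, the inputs would come from something like `evaluator.state.y_true.tolist()`. The helper names `accuracy` and `confusion_matrix` are illustrative and not part of DeepAudioX.

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of samples whose predicted index matches the ground truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true: list[int], y_pred: list[int], num_classes: int) -> list[list[int]]:
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Stand-in values for the arrays held in evaluator.state.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, num_classes=2)
```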

Note

The model is expected to already be in eval mode (set in __init__). The loop runs under torch.inference_mode(), so gradient tracking is disabled.

class deepaudiox.Trainer(train_dset, model, validation_dset=None, optimizer=None, learning_rate=0.001, lr_scheduler=None, loss_function=None, train_ratio=0.8, epochs=100, patience=None, num_workers=4, batch_size=16, path_to_checkpoint='checkpoint.pt', device='cpu', device_index=None, verbose=True)[source]

Bases: object

The core SDK module for training a model.

The Trainer assembles all modules required for training and performs the training process.

state

Stores training variables.

Type: State

epochs

The maximum number of training epochs.

Type: int

verbose

Whether to log epoch-level artifacts.

Type: bool

device

The device used for training.

Type: str

logger

A module used for logging messages.

Type: logging.Logger

train_dloader

The DataLoader of the training set.

Type: torch.utils.data.DataLoader

validation_dloader

The DataLoader of the validation set.

Type: torch.utils.data.DataLoader

model

The BaseAudioClassifier to be trained.

Type: BaseAudioClassifier

optimizer

The optimizer of the training process.

Type: torch.optim.Optimizer

scheduler

The learning rate scheduler of the training process.

Type: LRScheduler

loss_function

The loss function used for optimization.

Type: nn.Module

callbacks

A list of callbacks used throughout the training lifecycle.

Type: list

Initialize the Trainer.

Parameters:

train_dset (AudioClassificationDataset) – The training dataset.
model (BaseAudioClassifier) – The model to be trained.
validation_dset (AudioClassificationDataset | None) – The validation dataset. If None, a split is created from train_dset using train_ratio.
optimizer (Optimizer | None) – The optimizer used for training. Adam if None.
learning_rate (float) – Learning rate used when optimizer is None. Defaults to 1e-3.
lr_scheduler (LRScheduler | None) – The scheduler used for training. ReduceLROnPlateau if None.
loss_function (Module | None) – The loss function used for training. Uses CrossEntropy if None.
train_ratio (float) – The ratio of the train split when validation_dset is None. Defaults to 0.8.
epochs (int) – The maximum number of training epochs. Defaults to 100.
patience (int | None) – Epochs to wait without loss improvement before stopping. Disabled if None.
num_workers (int) – The number of worker processes for the PyTorch DataLoader. Defaults to 4.
batch_size (int) – The batch size for the PyTorch DataLoader. Defaults to 16.
path_to_checkpoint (str) – The path where the model checkpoint is saved. Defaults to "checkpoint.pt".
device (Literal['cuda', 'mps', 'cpu']) – The device to use for training. One of "cuda", "mps", or "cpu". Defaults to "cpu".
device_index (int | None) – The GPU device index. Only applicable when device="cuda". If None, uses the default CUDA device.
verbose (bool) – If True, logs epoch-level artifacts (loss, time). If False, only start/end messages and the final training summary are printed. Defaults to True.
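
When validation_dset is None, train_ratio decides how many samples stay in the training split. The sketch below shows the arithmetic on index lists; the shuffling, the seed, and the rounding rule are assumptions, since this reference does not describe how the Trainer performs the split. `split_indices` is a hypothetical helper.

```python
import random


def split_indices(num_samples: int, train_ratio: float = 0.8, seed: int = 0) -> tuple[list[int], list[int]]:
    """Shuffle sample indices and cut them into train and validation parts."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    num_train = int(num_samples * train_ratio)
    return indices[:num_train], indices[num_train:]


train_idx, val_idx = split_indices(num_samples=10, train_ratio=0.8)
```

With segment_duration set, the number of samples is the number of segments, so a split of this kind would operate on segments and not on files.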

Example

>>> from deepaudiox import AudioClassifier, Trainer
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> train_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> trainer = Trainer(train_dset=train_dataset, model=model, epochs=10)
>>> trainer.train()

epoch_step()[source]

Run one complete training epoch.

Logs the epoch header and metrics when verbose=True, calls train_step() and val_step(), updates the LR scheduler and self.state, then executes on_epoch_end callbacks (which may trigger early stopping or checkpointing).

Note

self.state.current_epoch must be set by the caller before invoking this method. train() does this automatically; when calling epoch_step() directly, set it yourself with trainer.state.current_epoch = epoch.

Returns: (train_loss, val_loss) for the epoch.
Return type: tuple[float, float]

Example

>>> from deepaudiox import AudioClassifier, Trainer
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> train_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> trainer = Trainer(train_dset=train_dataset, model=model, epochs=10)
>>> for epoch in range(1, trainer.epochs + 1):
...     trainer.state.current_epoch = epoch
...     train_loss, val_loss = trainer.epoch_step()
...     print(f"Epoch {epoch} — train: {train_loss:.4f}, val: {val_loss:.4f}")
...     if trainer.state.early_stop:
...         break
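
The early-stop flag checked in the loop above is driven by the patience setting. A minimal sketch of patience-based early stopping follows, assuming the monitored quantity is the validation loss and that any strict decrease counts as an improvement; neither detail is confirmed by this reference. `EarlyStopper` is a hypothetical class, not the callback DeepAudioX uses.

```python
from typing import Optional


class EarlyStopper:
    """Tracks the best loss seen so far and flags when patience runs out."""

    def __init__(self, patience: Optional[int]) -> None:
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def update(self, val_loss: float) -> bool:
        """Record one epoch's loss; return True if training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
            return False
        self.epochs_without_improvement += 1
        if self.patience is None:
            return False  # early stopping disabled
        return self.epochs_without_improvement >= self.patience


stopper = EarlyStopper(patience=2)
losses = [0.9, 0.7, 0.8, 0.75, 0.6]
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.update(loss):
        stopped_at = epoch
        break
```

In this run the loss stops improving after epoch 2, so two further epochs without improvement end training before the fifth epoch is reached.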

train()[source]

Perform the full training process.

Epoch-level output is controlled by verbose. The training summary (best epoch, losses, checkpoint path) is always printed on completion.

Return type: None

train_step()[source]

Run one pass over the training set.

Sets the model to train mode, iterates over train_dloader, and performs a forward pass, a backward pass, and an optimizer step for each batch.

Returns: Average training loss over the epoch.
Return type: float

val_step()[source]

Run one pass over the validation set.

Sets the model to eval mode and iterates over validation_dloader under torch.no_grad().

Returns: Average validation loss over the epoch.
Return type: float

deepaudiox.audio_classification_dataset_from_dictionary(file_to_class_mapping, sample_rate, class_mapping, segment_duration=None)[source]

Create an AudioClassificationDataset from a file-to-class mapping dictionary.

Parameters:

file_to_class_mapping (dict[str | PathLike, str]) – Mapping from file paths to class names.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.

Returns:

The constructed dataset.

Return type:

AudioClassificationDataset

Example

>>> from deepaudiox import audio_classification_dataset_from_dictionary
>>> file_to_class_mapping = {
...     "path/to/audio1.wav": "speech",
...     "path/to/audio2.wav": "music",
... }
>>> class_mapping = {"speech": 0, "music": 1}
>>> dataset = audio_classification_dataset_from_dictionary(
...     file_to_class_mapping=file_to_class_mapping,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=None,
... )

deepaudiox.audio_classification_dataset_from_dir(root_dir, sample_rate, class_mapping, segment_duration=None)[source]

Create an AudioClassificationDataset from a directory structure.

Parameters:

root_dir (str) – Root directory containing class sub-folders. Only .wav and .mp3 files are used.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.

Returns:

The constructed dataset.

Return type:

AudioClassificationDataset

Example

>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
... )
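
A directory scan of this kind can be sketched with pathlib. The code below builds the file-to-class dictionary that AudioClassificationDataset expects, keeping only .wav and .mp3 files. Case-insensitive extension matching and scanning one level deep are assumptions, and `scan_audio_dir` is a hypothetical helper, not the library's implementation.

```python
import tempfile
from pathlib import Path

AUDIO_EXTENSIONS = {".wav", ".mp3"}


def scan_audio_dir(root_dir: str) -> dict[str, str]:
    """Map each audio file under root_dir/<class>/ to its class folder name."""
    mapping: dict[str, str] = {}
    for class_dir in sorted(Path(root_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        for path in sorted(class_dir.iterdir()):
            if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
                mapping[str(path)] = class_dir.name
    return mapping


# Build a throwaway directory tree to exercise the helper.
with tempfile.TemporaryDirectory() as root:
    for rel in ("speech/a.wav", "speech/notes.txt", "music/b.mp3"):
        target = Path(root, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    found = scan_audio_dir(root)
    labels = sorted(found.values())
```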

deepaudiox.get_class_mapping_from_dir(root_dir)[source]

Build the class mapping from a root folder that contains one sub-folder per class.

Expected directory structure:

root_dir/
├── class_a/
│   ├── audio1.wav
│   └── audio2.wav
└── class_b/
    ├── audio3.wav
    └── audio4.wav

Parameters: root_dir (str) – The path to the root folder.
Returns: The class mapping dictionary, ordered alphabetically by folder name.
Return type: dict[str, int]

Example

>>> from deepaudiox import get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> # Example output:
>>> # {'class_a': 0, 'class_b': 1}
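
The documented behavior (sub-folder names sorted alphabetically and numbered from 0) can be reproduced in a few lines. This is a sketch of that behavior under the stated directory layout, not the library's implementation; `class_mapping_from_dir` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def class_mapping_from_dir(root_dir: str) -> dict[str, int]:
    """Assign IDs 0..K-1 to the sub-folder names of root_dir in alphabetical order."""
    class_names = sorted(p.name for p in Path(root_dir).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(class_names)}


# Create the folders in reverse order to show that sorting decides the IDs.
with tempfile.TemporaryDirectory() as root:
    for name in ("class_b", "class_a"):
        Path(root, name).mkdir()
    mapping = class_mapping_from_dir(root)
```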

deepaudiox.get_class_mapping_from_list(labels, sort_alphabetically=True)[source]

Get a class mapping dictionary given a list of class names.

Parameters:

labels (list[str]) – List of class names
sort_alphabetically (bool) – Whether to sort the class names alphabetically before assigning IDs. Defaults to True.

Returns:

The class mapping dictionary

Return type:

dict[str, int]

Example

>>> from deepaudiox import get_class_mapping_from_list
>>> labels = ["speech", "music", "noise"]
>>> class_mapping = get_class_mapping_from_list(labels, sort_alphabetically=True)
>>> # Example output:
>>> # {'music': 0, 'noise': 1, 'speech': 2}
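
A minimal sketch of the mapping logic, together with the inverse mapping that is handy for turning predicted IDs back into names. It assumes the labels are unique and that sort_alphabetically=False keeps the input order; neither point is stated in this reference. `class_mapping_from_list` is a hypothetical stand-in for the library function.

```python
def class_mapping_from_list(labels: list[str], sort_alphabetically: bool = True) -> dict[str, int]:
    """Assign consecutive integer IDs to class names."""
    names = sorted(labels) if sort_alphabetically else list(labels)
    return {name: idx for idx, name in enumerate(names)}


labels = ["speech", "music", "noise"]
sorted_mapping = class_mapping_from_list(labels)
ordered_mapping = class_mapping_from_list(labels, sort_alphabetically=False)

# Inverse lookup: integer ID back to class name.
id_to_name = {idx: name for name, idx in sorted_mapping.items()}
```

Sorting makes the IDs independent of the order in which labels were collected, which helps keep a saved checkpoint and a later evaluation run consistent.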