API Reference

Dataset Construction

Methods for building datasets and class mappings from directories or label lists.

deepaudiox.get_class_mapping_from_dir(root_dir)[source]

Load the class mapping given a folder of class sub-folders.

Expected directory structure:

root_dir/
├── class_a/
│   ├── audio1.wav
│   └── audio2.wav
└── class_b/
    ├── audio3.wav
    └── audio4.wav

Parameters:: root_dir (str) – Path to the root folder that contains one sub-folder per class.
Returns:: The class mapping dictionary, ordered alphabetically by folder name.
Return type:: dict[str, int]

Example

>>> from deepaudiox import get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> # Example output:
>>> # {'class_a': 0, 'class_b': 1}
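
The mapping described above can be approximated with the standard library alone. This is a minimal sketch, not the library's implementation: build_class_mapping is a hypothetical helper, and it assumes that only immediate sub-folders count as classes and that plain string sorting matches the library's alphabetical order.

```python
from pathlib import Path
import tempfile


def build_class_mapping(root_dir: str) -> dict[str, int]:
    """Hypothetical helper: map each sub-folder name to an integer ID."""
    # Assumption: only immediate sub-directories are treated as classes.
    class_names = sorted(p.name for p in Path(root_dir).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(class_names)}


# Build a throwaway directory tree that mirrors the expected layout.
with tempfile.TemporaryDirectory() as tmp:
    for class_name in ("class_b", "class_a"):
        (Path(tmp) / class_name).mkdir()
    demo_mapping = build_class_mapping(tmp)
```

Folders are created in reverse order on purpose, so the result shows that IDs follow the sorted names rather than creation order.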

deepaudiox.get_class_mapping_from_list(labels, sort_alphabetically=True)[source]

Get a class mapping dictionary given a list of class names.

Parameters:

labels (list[str]) – List of class names
sort_alphabetically (bool) – Determines if alphabetical sorting should be applied to class names.

Returns:

The class mapping dictionary

Return type:

dict[str, int]

Example

>>> from deepaudiox import get_class_mapping_from_list
>>> labels = ["speech", "music", "noise"]
>>> class_mapping = get_class_mapping_from_list(labels, sort_alphabetically=True)
>>> # Example output:
>>> # {'music': 0, 'noise': 1, 'speech': 2}
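
The effect of sort_alphabetically can be illustrated with a small stand-in. The helper class_mapping_from_labels below is hypothetical. Its behaviour for sort_alphabetically=False (IDs follow input order) and its handling of duplicate labels are assumptions, since neither is stated in this reference.

```python
def class_mapping_from_labels(labels: list[str], sort_alphabetically: bool = True) -> dict[str, int]:
    """Hypothetical helper: assign consecutive integer IDs to class names."""
    # Assumption: duplicates are dropped, keeping first-appearance order.
    unique = list(dict.fromkeys(labels))
    if sort_alphabetically:
        unique = sorted(unique)
    return {name: idx for idx, name in enumerate(unique)}


labels = ["speech", "music", "noise"]
sorted_mapping = class_mapping_from_labels(labels, sort_alphabetically=True)
unsorted_mapping = class_mapping_from_labels(labels, sort_alphabetically=False)
```

Sorting gives IDs that stay stable when the same label set arrives in a different order, which matters when a mapping is rebuilt at evaluation time.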

class deepaudiox.AudioClassificationDataset(file_to_class_mapping, sample_rate, class_mapping, segment_duration=None)[source]

Bases: Dataset

PyTorch Dataset for audio classification tasks.

This dataset loads audio files and returns a dictionary containing the raw waveform (under the key "feature"), the corresponding class name, and the integer class ID defined in class_mapping. The file_to_class_mapping argument must be a dictionary of the form:

{"abs/path/to/audio.wav": "class_name"}

Optionally, the dataset can segment each audio file into fixed-duration chunks using segment_duration. When enabled, each segment becomes an individual dataset sample.

file_to_class_mapping

Mapping from file paths to class names.

Type:: dict

sample_rate

Target sampling rate for audio loading.

Type:: int

class_mapping

Mapping from string class labels to integer IDs.

Type:: dict

Initialize the dataset.

Parameters:

file_to_class_mapping (dict[str | PathLike, str]) – Mapping from file paths to class names.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.

Example

>>> from deepaudiox import AudioClassificationDataset
>>> file_to_class_mapping = {
...     "path/to/audio1.wav": "speech",
...     "path/to/audio2.wav": "music",
... }
>>> class_mapping = {"speech": 0, "music": 1}
>>> dataset = AudioClassificationDataset(
...     file_to_class_mapping=file_to_class_mapping,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
... )
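
To see how many samples a file contributes when segment_duration is set, the arithmetic can be sketched as follows. segment_bounds is a hypothetical helper, and truncating segment_duration * sample_rate to an integer is an assumption about how the segment length is derived.

```python
def segment_bounds(num_samples: int, sample_rate: int, segment_duration: float) -> list[tuple[int, int]]:
    """Hypothetical helper: (start, end) sample indices of each full segment."""
    segment_len = int(segment_duration * sample_rate)
    if segment_len <= 0:
        raise ValueError("segment_duration must cover at least one sample")
    num_full = num_samples // segment_len  # the trailing partial segment is dropped
    return [(i * segment_len, (i + 1) * segment_len) for i in range(num_full)]


# A 5-second clip at 16 kHz split into 2-second segments.
bounds = segment_bounds(num_samples=5 * 16_000, sample_rate=16_000, segment_duration=2.0)
```

Under these assumptions the clip yields two dataset samples, and the final second of audio is discarded.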

deepaudiox.audio_classification_dataset_from_dir(root_dir, sample_rate, class_mapping, segment_duration=None)[source]

Create an AudioClassificationDataset from a directory structure.

Parameters:

root_dir (str) – Root directory containing class sub-folders. Only .wav and .mp3 files are used.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.

Returns:

The constructed dataset.

Return type:

AudioClassificationDataset

Example

>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
... )
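
Conceptually, building a dataset from a directory amounts to collecting a file-to-class dictionary and filtering by extension. The sketch below shows that step with the standard library. collect_audio_files is a hypothetical helper, and the case-insensitive, non-recursive matching is an assumption rather than documented behaviour.

```python
from pathlib import Path
import tempfile

AUDIO_EXTENSIONS = {".wav", ".mp3"}  # extensions named in the parameter description


def collect_audio_files(root_dir: str) -> dict[str, str]:
    """Hypothetical helper: map each audio file path to its parent folder name."""
    mapping = {}
    for class_dir in sorted(Path(root_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        for path in sorted(class_dir.iterdir()):
            # Assumption: extension matching is case-insensitive and non-recursive.
            if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
                mapping[str(path)] = class_dir.name
    return mapping


with tempfile.TemporaryDirectory() as tmp:
    for class_name, file_name in [("speech", "a.wav"), ("music", "b.mp3"), ("music", "notes.txt")]:
        class_dir = Path(tmp) / class_name
        class_dir.mkdir(exist_ok=True)
        (class_dir / file_name).touch()
    found = collect_audio_files(tmp)
    found_labels = sorted(found.values())
    found_names = sorted(Path(p).name for p in found)
```

A dictionary of this shape matches the file_to_class_mapping format described for AudioClassificationDataset.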

deepaudiox.audio_classification_dataset_from_dictionary(file_to_class_mapping, sample_rate, class_mapping, segment_duration=None)[source]

Create an AudioClassificationDataset from a file-to-class mapping dictionary.

Parameters:

file_to_class_mapping (dict[str | PathLike, str]) – Mapping from file paths to class names.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.

Returns:

The constructed dataset.

Return type:

AudioClassificationDataset

Example

>>> from deepaudiox import audio_classification_dataset_from_dictionary
>>> file_to_class_mapping = {
...     "path/to/audio1.wav": "speech",
...     "path/to/audio2.wav": "music",
... }
>>> class_mapping = {"speech": 0, "music": 1}
>>> dataset = audio_classification_dataset_from_dictionary(
...     file_to_class_mapping=file_to_class_mapping,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=None,
... )

Models & Backbones

Constructors for initializing classifiers and backbones.

class deepaudiox.modules.constructors.AudioClassifierConstructor(num_classes, backbone, pooling=None, freeze_backbone=False, sample_rate=16000, classifier_hidden_layers=None, activation='relu', apply_batch_norm=True, pretrained=False)[source]

Bases: BaseAudioClassifier, BackbonePoolingResolverMixin

Classifier model using a backbone for feature extraction.

backbone_constructor

Backbone model with optional pooling method.

Type:: BackboneConstructor

classifier

Classifier head for final predictions.

Type:: MLPHead

config

Constructor arguments used to build this model. Used by from_checkpoint to reconstruct the model from a saved checkpoint.

Type:: dict

Initialize the AudioClassifierConstructor.

Parameters:

num_classes (int) – Number of output classes.
backbone (Union[Literal['beats', 'passt', 'mobilenet_05_as', 'mobilenet_10_as', 'mobilenet_40_as'], BaseBackbone]) – Backbone model to use for feature extraction. Valid names are: “beats”, “passt”, “mobilenet_05_as”, “mobilenet_10_as”, “mobilenet_40_as”.
pooling (Union[Literal['gap', 'simpool', 'ep'], BasePooling, None]) – Optional pooling layer to aggregate features. If None, GAP is used by default.
freeze_backbone (bool) – Whether to freeze the backbone weights during training.
sample_rate (int) – Sampling rate of the audio input.
classifier_hidden_layers (list[int] | None) – Hidden layer sizes for the classifier head.
activation (Literal['relu', 'gelu', 'tanh', 'leakyrelu']) – Activation function for the classifier head.
apply_batch_norm (bool) – Whether to apply batch normalization in the classifier head.
pretrained (bool) – Whether to load pretrained weights for the backbone.

Example

>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(
...     num_classes=10,
...     backbone="beats",
...     pooling=None,
...     freeze_backbone=True,
...     sample_rate=16000,
...     classifier_hidden_layers=[512, 256],
...     activation="relu",
...     apply_batch_norm=True,
...     pretrained=True,
... )

Note

Available as deepaudiox.AudioClassifier.

forward(x)[source]

Forward pass through the classifier.

Parameters:: x (torch.Tensor) – Input waveforms of shape (B, T)
Returns:: Logits of shape (B, num_classes)
Return type:: Tensor

Example

>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> logits = model.forward(waveforms)
>>> # logits shape: (B, num_classes)

forward_backbone(x)[source]

Extract feature map from the backbone.

Parameters:: x (torch.Tensor) – Input waveforms of shape (B, T).
Returns:: Backbone feature map of shape (B, N, D) or (B, D, H, W).
Return type:: Tensor

Example

>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> features = model.forward_backbone(waveforms)
>>> # features shape: (B, N, D) for Transformer or (B, D, H, W) for CNN backbones

forward_with_pooling(x)[source]

Forward pass through backbone and pooling.

Parameters:: x (Tensor) – Input waveforms of shape (B, T).
Returns:: Pooled tensor of shape (B, D).
Return type:: Tensor

Example

>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> embeddings = model.forward_with_pooling(waveforms)
>>> # embeddings shape: (B, D)

classmethod from_checkpoint(path)[source]

Load an AudioClassifierConstructor from a checkpoint saved by the Checkpointer.

Parameters:: path (str) – Path to the checkpoint file.
Returns:: Model with weights and config restored.
Return type:: AudioClassifierConstructor

Example

>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier.from_checkpoint("checkpoint.pt")
>>> print(model.config)

class deepaudiox.modules.constructors.BackboneConstructor(backbone, pretrained=False, freeze_backbone=False, pooling=None, sample_rate=16000, norm_p=None)[source]

Bases: Module, BackbonePoolingResolverMixin

Backbone model wrapper with optional pooling and normalization.

backbone

Backbone model for feature extraction.

Type:: BaseBackbone

pooling

Pooling layer applied to the backbone feature map.

Type:: BasePooling

norm_p

Optional Lp normalization applied after pooling.

Type:: float or None

out_dim

Dimension of the backbone model feature map.

Type:: int

config

Constructor arguments used to build this model. Used by from_checkpoint to reconstruct the model from a saved checkpoint.

Type:: dict

Initialize the BackboneConstructor.

Parameters:

backbone (Union[Literal['beats', 'passt', 'mobilenet_05_as', 'mobilenet_10_as', 'mobilenet_40_as'], BaseBackbone]) – Backbone name or instance. Valid names are: “beats”, “passt”, “mobilenet_05_as”, “mobilenet_10_as”, “mobilenet_40_as”.
pretrained (bool) – Whether to load pretrained weights for the backbone.
freeze_backbone (bool) – Whether to freeze the backbone weights during training.
pooling (Union[Literal['gap', 'simpool', 'ep'], BasePooling, None]) – Optional pooling layer for aggregation. If None, GAP is used.
sample_rate (int) – Sampling rate of the audio input.
norm_p (float | None) – Optional Lp norm applied after pooling.

Example

>>> from deepaudiox import Backbone
>>> backbone = Backbone(
...     backbone="beats",
...     pretrained=True,
...     freeze_backbone=True,
...     pooling="gap",
...     sample_rate=16000,
...     norm_p=2.0,
... )

Note

Available as deepaudiox.Backbone.

extract_features(waveforms)[source]

Extract backbone-specific features from raw waveforms.

Parameters:: waveforms (Tensor) – Input waveforms of shape (B, T).
Returns:: Model-specific input features.
Return type:: Tensor

forward(x)[source]

Forward pass through the backbone.

Parameters:: x (Tensor) – Input waveforms of shape (B, T).
Returns:: Backbone feature map of shape (B, N, D) or (B, D, H, W).
Return type:: Tensor

Example

>>> import torch
>>> from deepaudiox import Backbone
>>> backbone = Backbone(backbone="beats", pretrained=True, sample_rate=16_000)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> features = backbone.forward(waveforms)
>>> # features shape: (B, N, D) for Transformer or (B, D, H, W) for CNN backbones

forward_with_pooling(x)[source]

Forward pass through backbone and pooling (with optional normalization).

Parameters:: x (Tensor) – Input waveforms of shape (B, T).
Returns:: Pooled tensor of shape (B, D).
Return type:: Tensor

Example

>>> import torch
>>> from deepaudiox import Backbone
>>> backbone = Backbone(backbone="beats", pretrained=True, pooling="gap", sample_rate=16_000)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> embeddings = backbone.forward_with_pooling(waveforms)
>>> # embeddings shape: (B, D)
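
The reduction from a (B, N, D) feature map to a (B, D) embedding can be illustrated for a single item in plain Python. This sketch assumes that "gap" denotes global average pooling (a mean over the N positions) and that norm_p divides the pooled vector by its Lp norm. Both are common conventions, but neither is confirmed by this reference, and the helper names are hypothetical.

```python
def global_average_pool(tokens: list[list[float]]) -> list[float]:
    """Average an (N, D) feature map over its N positions, giving a D-vector."""
    num_tokens = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / num_tokens for d in range(dim)]


def lp_normalize(vec: list[float], p: float = 2.0, eps: float = 1e-12) -> list[float]:
    """Divide a vector by its Lp norm (eps guards against division by zero)."""
    norm = sum(abs(v) ** p for v in vec) ** (1.0 / p)
    return [v / max(norm, eps) for v in vec]


# One item with N=2 positions and D=2 channels.
feature_map = [[3.0, 0.0], [3.0, 8.0]]
pooled = global_average_pool(feature_map)      # [3.0, 4.0]
embedding = lp_normalize(pooled, p=2.0)        # [0.6, 0.8]
```

With p=2 the resulting embedding has unit Euclidean length, which is convenient when embeddings are compared by cosine similarity.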

classmethod from_checkpoint(path)[source]

Load a BackboneConstructor from a checkpoint saved by the Checkpointer.

Parameters:: path (str) – Path to the checkpoint file.
Returns:: Model with weights and config restored.
Return type:: BackboneConstructor

Example

>>> from deepaudiox import Backbone
>>> backbone = Backbone.from_checkpoint("checkpoint.pt")
>>> print(backbone.config)

Supported Backbones & Pooling

Type aliases and runtime constants for valid backbone and pooling names.

deepaudiox.AVAILABLE_BACKBONES = ("beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as"): Supported pretrained backbone names available at runtime.

deepaudiox.AVAILABLE_POOLING = ("gap", "simpool", "ep"): Supported pooling layer names available at runtime.

deepaudiox.BackboneName: Type alias: Literal["beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as"]. Use for type-annotated code.

deepaudiox.PoolingName: Type alias: Literal["gap", "simpool", "ep"]. Use for type-annotated code.
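
The runtime constants are useful for validating user-supplied names before a model is constructed. The sketch below copies the tuple values listed above so that it runs without the package installed. check_choice is a hypothetical helper, not part of deepaudiox.

```python
# Values copied from the constants listed above.
AVAILABLE_BACKBONES = ("beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as")
AVAILABLE_POOLING = ("gap", "simpool", "ep")


def check_choice(value: str, allowed: tuple[str, ...], kind: str) -> str:
    """Hypothetical helper: return value unchanged or raise a descriptive error."""
    if value not in allowed:
        raise ValueError(f"Unknown {kind} {value!r}; expected one of {allowed}")
    return value


backbone_name = check_choice("passt", AVAILABLE_BACKBONES, "backbone")

try:
    check_choice("max", AVAILABLE_POOLING, "pooling")
    rejected = False
except ValueError:
    rejected = True
```

In real code the tuples would be imported from deepaudiox instead of being redefined.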

Training & Evaluation

Interfaces for training models and evaluating performance on held-out data.

class deepaudiox.Trainer(train_dset, model, validation_dset=None, optimizer=None, learning_rate=0.001, lr_scheduler=None, loss_function=None, train_ratio=0.8, epochs=100, patience=None, num_workers=4, batch_size=16, path_to_checkpoint='checkpoint.pt', device='cpu', device_index=None, verbose=True)[source]

Bases: object

The core SDK module for training a model.

The Trainer assembles all modules required for training and performs the training process.

state

Stores training variables.

Type:: State

epochs

The maximum number of training epochs.

Type:: int

verbose

Whether to log epoch-level artifacts.

Type:: bool

device

The device used for training.

Type:: str

logger

A module used for logging messages.

Type:: logging.Logger

train_dloader

The DataLoader of the training set.

Type:: torch.utils.data.DataLoader

validation_dloader

The DataLoader of the validation set.

Type:: torch.utils.data.DataLoader

model

The BaseAudioClassifier to be trained.

Type:: BaseAudioClassifier

optimizer

The optimizer of the training process.

Type:: torch.optim.Optimizer

scheduler

The learning rate scheduler of the training process.

Type:: LRScheduler

loss_function

The loss function used for optimization.

Type:: nn.Module

callbacks

A list of callbacks used throughout the training lifecycle.

Type:: list

Initialize the Trainer.

Parameters:

train_dset (AudioClassificationDataset) – The training dataset.
model (BaseAudioClassifier) – The model to be trained.
validation_dset (AudioClassificationDataset | None) – The validation dataset. If None, a split is created from train_dset using train_ratio.
optimizer (Optimizer | None) – The optimizer used for training. Defaults to Adam if None.
learning_rate (float) – Learning rate used when optimizer is None. Defaults to 1e-3.
lr_scheduler (LRScheduler | None) – The learning rate scheduler used for training. Defaults to ReduceLROnPlateau if None.
loss_function (Module | None) – The loss function used for training. Defaults to cross-entropy loss if None.
train_ratio (float) – The fraction of train_dset used for training when validation_dset is None. Defaults to 0.8.
epochs (int) – The maximum number of training epochs. Defaults to 100.
patience (int | None) – Number of epochs to wait without loss improvement before stopping early. Early stopping is disabled if None.
num_workers (int) – The number of workers for the PyTorch DataLoaders. Defaults to 4.
batch_size (int) – The batch size for the PyTorch DataLoaders. Defaults to 16.
path_to_checkpoint (str) – The path where the model checkpoint is saved. Defaults to “checkpoint.pt”.
device (Literal['cuda', 'mps', 'cpu']) – The device to use for training. One of "cuda", "mps", or "cpu". Defaults to "cpu".
device_index (int | None) – The GPU device index. Only applicable when device="cuda". If None, uses the default CUDA device.
verbose (bool) – If True, logs epoch-level artifacts (loss, time). If False, only start/end messages and the final training summary are printed. Defaults to True.

Example

>>> from deepaudiox import AudioClassifier, Trainer
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> train_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> trainer = Trainer(train_dset=train_dataset, model=model, epochs=10)
>>> trainer.train()
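
When validation_dset is None, the Trainer derives a validation split from train_dset using train_ratio. The sketch below shows one way such a split could be computed on sample indices. split_indices is a hypothetical helper, and the shuffling, the fixed seed, and rounding the training size down are all assumptions about behaviour this reference does not specify.

```python
import random


def split_indices(num_samples: int, train_ratio: float = 0.8, seed: int = 0) -> tuple[list[int], list[int]]:
    """Hypothetical helper: shuffle indices and split them by train_ratio."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must lie strictly between 0 and 1")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    num_train = int(train_ratio * num_samples)  # assumption: round down
    return indices[:num_train], indices[num_train:]


train_idx, val_idx = split_indices(num_samples=10, train_ratio=0.8)
```

Note that when segment_duration is set, a random split over samples can place segments of the same file on both sides. Passing an explicit validation_dset avoids that.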

epoch_step()[source]

Run one complete training epoch.

Logs the epoch header and metrics when verbose=True, calls train_step() and val_step(), updates the LR scheduler and self.state, then executes on_epoch_end callbacks (which may trigger early stopping or checkpointing).

Note

self.state.current_epoch must be set by the caller before invoking this method — train() does this automatically. When calling epoch_step() directly, set it yourself: trainer.state.current_epoch = epoch.

Returns:: (train_loss, val_loss) for the epoch.
Return type:: tuple[float, float]

Example

>>> from deepaudiox import AudioClassifier, Trainer
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> train_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> trainer = Trainer(train_dset=train_dataset, model=model, epochs=10)
>>> for epoch in range(1, trainer.epochs + 1):
...     trainer.state.current_epoch = epoch
...     train_loss, val_loss = trainer.epoch_step()
...     print(f"Epoch {epoch} — train: {train_loss:.4f}, val: {val_loss:.4f}")
...     if trainer.state.early_stop:
...         break
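
The early_stop flag checked in the loop above is governed by the patience parameter. The following sketch shows the usual patience logic in isolation. PatienceTracker is a hypothetical class, not the library's callback, and it takes whichever loss the caller passes in, because this reference does not state which loss is monitored.

```python
class PatienceTracker:
    """Hypothetical early-stopping tracker driven by a monitored loss."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def update(self, loss: float) -> bool:
        """Record one epoch's loss and return True when training should stop."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


tracker = PatienceTracker(patience=2)
losses = [0.90, 0.70, 0.75, 0.72, 0.60]
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if tracker.update(loss):
        stopped_at = epoch
        break
```

With patience=2 the loop stops after two consecutive epochs without improvement, so the later, better loss in the list is never reached. That is the trade-off a small patience value makes.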

train()[source]

Perform the full training process.

Epoch-level output is controlled by verbose. The training summary (best epoch, losses, checkpoint path) is always printed on completion.

Return type:: None

train_step()[source]

Run one pass over the training set.

Sets the model to train mode, iterates over train_dloader, performs forward + backward + optimizer step per batch.

Returns:: Average training loss over the epoch.
Return type:: float

val_step()[source]

Run one pass over the validation set.

Sets the model to eval mode, iterates over validation_dloader under torch.no_grad().

Returns:: Average validation loss over the epoch.
Return type:: float

class deepaudiox.Evaluator(test_dset, model, class_mapping, batch_size=16, num_workers=4, device='cpu', device_index=None, verbose=True)[source]

Bases: object

The core SDK module for testing a model.

The Evaluator assembles all modules required for testing and performs the testing process.

state

Stores testing variables.

Type:: State

verbose

Whether to log the evaluation report after testing.

Type:: bool

device

The device used for testing.

Type:: str

class_mapping

A mapping between class names and IDs.

Type:: dict

logger

A module used for logging messages.

Type:: logging.Logger

test_dloader

The DataLoader of the testing set.

Type:: torch.utils.data.DataLoader

model

An AudioClassifier module inheriting from BaseAudioClassifier.

Type:: BaseAudioClassifier

callbacks

A list of callbacks used throughout the testing lifecycle.

Type:: list

Initialize the Evaluator.

Parameters:

test_dset (AudioClassificationDataset) – The testing dataset.
model (BaseAudioClassifier) – An AudioClassifier module inheriting from BaseAudioClassifier.
class_mapping (dict) – A mapping between class names and IDs.
batch_size (int) – The batch size for the PyTorch DataLoader. Defaults to 16.
num_workers (int) – The number of workers for the PyTorch DataLoader. Defaults to 4.
device (Literal['cuda', 'mps', 'cpu']) – The device to use for evaluation. One of "cuda", "mps", or "cpu". Defaults to "cpu".
device_index (int | None) – The GPU device index. Only applicable when device="cuda". If None, uses the default CUDA device.
verbose (bool) – If True, prints the classification report, confusion matrix, and average posteriors after evaluation. “Evaluation has finished.” is always printed. Defaults to True.

Example

>>> from deepaudiox import AudioClassifier, Evaluator
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> test_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> # Restore the trained weights and config from the checkpoint written during training.
>>> model = AudioClassifier.from_checkpoint("checkpoint.pt")
>>> evaluator = Evaluator(test_dset=test_dataset, model=model, class_mapping=class_mapping)
>>> evaluator.evaluate()

evaluate()[source]

Run the full evaluation loop over the test set.

Iterates over all batches in test_dloader, accumulates true labels, predicted labels, and posterior probabilities into self.state, then triggers the registered callbacks via on_testing_end.

Always prints “Evaluation has finished.” regardless of verbose. The Reporter callback (classification report, confusion matrix, average posteriors) is only executed when verbose=True.

Return type:: None

After this method returns, self.state holds:

y_true (np.ndarray): Ground-truth class indices, shape (N,).
y_pred (np.ndarray): Predicted class indices, shape (N,).
posteriors (np.ndarray): Max posterior probability per sample, shape (N,).

Note

The model is expected to already be in eval mode (set in __init__). Runs under torch.inference_mode() — gradients are fully disabled.
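
Because the accumulated arrays remain available on self.state, custom metrics can be computed after evaluate() returns. The sketch below uses plain lists as stand-ins for state.y_true and state.y_pred. The helper functions are illustrative and are not part of the Reporter callback.

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of samples whose predicted index matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true: list[int], y_pred: list[int], num_classes: int) -> list[list[int]]:
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Stand-ins for list(evaluator.state.y_true) and list(evaluator.state.y_pred).
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, num_classes=2)
```

Here three of five predictions are correct, and each row of cm sums to the number of samples in that true class.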

Base Classes & Inference

Base interfaces and inference helpers used across models.

Base classes for abstracting nn modules (e.g., backbones, pooling layers, classifiers).

class deepaudiox.modules.baseclasses.BaseAudioClassifier(*args, **kwargs)[source]

Bases: Module, ABC

Base class for creating custom audio classifiers.

This class defines the standard interface for audio classification models. Subclasses must implement the core initialization and forward pass. The built-in predict method provides a convenience wrapper to obtain predicted labels, posterior probabilities, and raw logits.

__init__()[source]: Initialize the classifier and its components.

forward()[source]: Process input waveforms and return logits.

predict()[source]: Compute predicted classes, posterior probabilities, and logits.

Initialize the audio classifier.

abstractmethod __init__(*args, **kwargs)[source]: Initialize the audio classifier.

abstractmethod forward(x)[source]

Pass the input through the model and return logits.

Parameters:: x (Tensor) – The input tensor.

inference_on_file(path, sample_rate, class_mapping, segment_duration=None, batch_size=4)[source]

Get prediction for an audio sample from a file path.

Parameters:

path (str | Path) – Path to an audio file supported by librosa (e.g., WAV or MP3).
sample_rate (int) – Sampling rate of audio sample.
class_mapping (dict[str, int]) – Class-to-index mapping as it is used by the model.
segment_duration (float | None) – Optional segment duration in seconds for segment-level inference. If provided, the last remainder is right-padded to a full segment.
batch_size (int) – Optional batch size for segment-level inference. Default is 4.

Returns:

A dictionary with keys:

final_label (str): Predicted class label.
final_posterior (float): Posterior probability for the predicted class.
segment_labels (list[str] | None): Per-segment labels when segmenting is used.
segment_posteriors (list[float] | None): Per-segment posteriors aligned with segment_labels when segmenting is used.

Return type:

dict

Example

>>> from deepaudiox import AudioClassifier
>>> class_mapping = {"speech": 0, "music": 1}
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> prediction = model.inference_on_file(
...     "path/to/audio.wav",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
...     batch_size=4,
... )

inference_on_waveform(x, sample_rate, class_mapping, segment_duration=None, batch_size=4)[source]

Get prediction on a waveform.

Parameters:

x (Tensor | ndarray) – Input waveform to be used for inference. Accepts shape (T,).
sample_rate (int) – Sampling rate of audio sample.
class_mapping (dict[str, int]) – Class-to-index mapping that is used by the model.
segment_duration (float | None) – Optional segment duration in seconds for segment-level inference. If provided, the last remainder is right-padded to a full segment.
batch_size (int) – Optional batch size for segment-level inference. Default is 4.

Returns:

A dictionary with keys:

final_label (str): Predicted class label.
final_posterior (float): Posterior probability for the predicted class.
segment_labels (list[str] | None): Per-segment labels when segmenting is used.
segment_posteriors (list[float] | None): Per-segment posteriors aligned with segment_labels when segmenting is used.

Return type:

dict

Example

>>> import torch
>>> from deepaudiox import AudioClassifier
>>> class_mapping = {"speech": 0, "music": 1}
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> waveform = torch.randn(5 * 16_000)
>>> prediction = model.inference_on_waveform(
...     waveform,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=1.0,
...     batch_size=4,
... )
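
When segment_duration is set, segment_labels and segment_posteriors can be mapped back to time ranges. The sketch below uses a hand-written dictionary in place of a real prediction, and the helper segment_timeline is hypothetical. It assumes that segments are consecutive and non-overlapping, which is not stated explicitly above.

```python
def segment_timeline(prediction, segment_duration):
    """Pair each segment label and posterior with its start/end time in seconds."""
    labels = prediction.get("segment_labels")
    posteriors = prediction.get("segment_posteriors")
    if labels is None or posteriors is None:
        # No segmenting was used, so there is no per-segment information.
        return []
    return [
        (i * segment_duration, (i + 1) * segment_duration, label, posterior)
        for i, (label, posterior) in enumerate(zip(labels, posteriors))
    ]


# Hand-written stand-in for the dictionary returned by inference_on_waveform;
# the values are illustrative only.
example_prediction = {
    "final_label": "speech",
    "final_posterior": 0.81,
    "segment_labels": ["speech", "speech", "music"],
    "segment_posteriors": [0.93, 0.88, 0.62],
}

for start, end, label, posterior in segment_timeline(example_prediction, segment_duration=1.0):
    print(f"{start:.1f}-{end:.1f}s  {label}  {posterior:.2f}")
```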

predict(x)[source]

Compute predicted class and posterior probabilities.

This is a low-level method that does not manage model mode or gradient context. The caller is responsible for calling model.eval() and wrapping the call in torch.no_grad() or torch.inference_mode() as needed. For end-to-end inference with automatic mode management, use inference_on_waveform or inference_on_file instead.

Parameters:: x (Tensor) – Input waveforms of shape (B, T), where T is the number of audio samples.
Returns:: A dictionary with the keys y_preds, posteriors, and logits.
Return type:: dict[str, ndarray]

Example

>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> model.eval()
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> with torch.no_grad():
...     outputs = model.predict(waveforms)
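
The three outputs are typically related as shown below: posteriors are derived from logits, and y_preds holds the index of the largest posterior per input. This sketch assumes a softmax over the class axis, which is a common choice but is not confirmed by the description above. The logits and the class mapping are made up for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last (class) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits for a batch of 2 waveforms and 3 classes.
logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
posteriors = softmax(logits)          # shape (2, 3), each row sums to 1
y_preds = posteriors.argmax(axis=-1)  # shape (2,), predicted class indices

# Map indices back to class names by inverting a class mapping.
class_mapping = {"speech": 0, "music": 1, "noise": 2}
index_to_class = {index: name for name, index in class_mapping.items()}
labels = [index_to_class[int(i)] for i in y_preds]
print(labels)
```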

class deepaudiox.modules.baseclasses.BaseBackbone(out_dim, sample_rate)[source]

Bases: Module, ABC

Abstract base class for all audio backbone models.

This class defines the common interface for backbone architectures that convert raw waveforms into fixed-dimensional embeddings. Subclasses must implement the core feature extraction and forward-processing logic.

__init__()[source]: Initializes the embedding dimension and the sample_rate of the audios.

forward()[source]: Computes embeddings from pre-extracted audio features.

extract_features()[source]: Converts raw waveforms into model-specific features.

forward_pipeline()[source]: Extracts features and then applies forward().

__init__(out_dim, sample_rate)[source]

Initialize the BaseBackbone.

Parameters:

out_dim (int) – Output dimension of the backbone feature map. For CNNs the embeddings are of shape (B, C, H, W).
sample_rate (int) – Sample rate for audio input.

abstractmethod extract_features(waveforms)[source]

Convert raw waveforms into internal acoustic features.

Parameters:: waveforms (Tensor) – Tensor of shape (B, T).
Returns:: Model-specific feature representation before final forward().
Return type:: Tensor

abstractmethod forward(x, padding_mask=None)[source]

Compute embeddings from input features.

Parameters:

x (Tensor) – Input audio-specific features of shape (B, 1, F, T) or (B, 1, T, F).
padding_mask (Tensor | None) – Optional padding mask.

Returns:

Embeddings of shape (B, N, D) or (B, D, H, W), where D is the embedding dimension.

Return type:

Tensor

forward_pipeline(x)[source]

Standard processing pipeline:

1. Extract features from raw audio.
2. Pass the features through forward().

Parameters:: x (Tensor) – Input waveforms of shape (B, T), where T is the number of audio samples.
Returns:: Final model output of shape (B, D, H, W) for CNNs or (B, N, D) for Transformers.
Return type:: Tensor
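
The relationship between extract_features(), forward(), and forward_pipeline() can be sketched with a dependency-free stand-in. The classes SketchBackbone and FrameEnergyBackbone below are hypothetical and use nested lists instead of tensors. A real backbone would subclass BaseBackbone and operate on torch tensors.

```python
from abc import ABC, abstractmethod


class SketchBackbone(ABC):
    """Stand-in that mirrors the documented BaseBackbone interface."""

    def __init__(self, out_dim, sample_rate):
        self.out_dim = out_dim
        self.sample_rate = sample_rate

    @abstractmethod
    def extract_features(self, waveforms):
        """Convert raw waveforms into model-specific features."""

    @abstractmethod
    def forward(self, x, padding_mask=None):
        """Compute embeddings from features."""

    def forward_pipeline(self, x):
        # Step 1: raw audio -> features. Step 2: features -> embeddings.
        return self.forward(self.extract_features(x))


class FrameEnergyBackbone(SketchBackbone):
    """Toy subclass: features are per-frame energies, embeddings repeat them."""

    def __init__(self, out_dim, sample_rate, frame_len=4):
        super().__init__(out_dim, sample_rate)
        self.frame_len = frame_len

    def extract_features(self, waveforms):
        # For each waveform, sum of squares over non-overlapping frames.
        return [
            [
                sum(s * s for s in wav[i : i + self.frame_len])
                for i in range(0, len(wav), self.frame_len)
            ]
            for wav in waveforms
        ]

    def forward(self, x, padding_mask=None):
        # One out_dim-sized vector per frame, giving a (B, N, D)-like nesting.
        return [[[energy] * self.out_dim for energy in frames] for frames in x]


backbone = FrameEnergyBackbone(out_dim=2, sample_rate=16_000)
embeddings = backbone.forward_pipeline([[1.0, 1.0, 1.0, 1.0, 2.0, 0.0, 0.0, 0.0]])
print(embeddings)
```

Subclasses only implement the two abstract methods. forward_pipeline() is inherited and chains them in a fixed order.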

class deepaudiox.modules.baseclasses.BasePooling(in_dim=None)[source]

Bases: Module, ABC

Abstract base class for all pooling modules.

This class defines the interface for pooling modules that operate on an input feature map obtained from a CNN or a Transformer BaseBackbone. Subclasses must implement the forward-processing logic. The input is expected to be a feature map of shape (B, D, H, W) for CNNs or (B, T, D) for Transformers.

__init__()[source]: Store input dimensionality.

forward()[source]: Apply the pooling module to an input tensor and return the result.

__init__(in_dim=None)[source]

Initialize the BasePooling.

Parameters:: in_dim (int | None) – Input dimension. This is D for both CNNs and Transformers.

abstractmethod forward(x)[source]

Compute forward pass returning a projected tensor.

Return type:: Tensor
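
To illustrate what a pooling module does with the two input layouts, here is a minimal mean-pooling sketch in NumPy. It is not one of the library's pooling modules: the function name mean_pool is hypothetical, and reducing to shape (B, D) is an assumption about the typical output.

```python
import numpy as np


def mean_pool(feature_map):
    """Average over all axes except batch and embedding, returning shape (B, D)."""
    if feature_map.ndim == 4:
        # CNN layout (B, D, H, W): average over the spatial axes H and W.
        return feature_map.mean(axis=(2, 3))
    if feature_map.ndim == 3:
        # Transformer layout (B, T, D): average over the token axis T.
        return feature_map.mean(axis=1)
    raise ValueError(f"expected a 3-D or 4-D feature map, got {feature_map.ndim}-D")


cnn_features = np.random.rand(2, 8, 4, 5)        # (B, D, H, W)
transformer_features = np.random.rand(2, 10, 8)  # (B, T, D)
print(mean_pool(cnn_features).shape, mean_pool(transformer_features).shape)
```

In both layouts the embedding axis D is preserved, which matches in_dim being D for CNNs and Transformers alike.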

Full Paths

The API re-exports the following symbols. If you prefer importing from the original modules, use these paths:

AudioClassifier -> deepaudiox.modules.constructors.AudioClassifierConstructor
Backbone -> deepaudiox.modules.constructors.BackboneConstructor
AudioClassificationDataset -> deepaudiox.datasets.audio_classification_dataset.AudioClassificationDataset
audio_classification_dataset_from_dir -> deepaudiox.datasets.audio_classification_dataset.audio_classification_dataset_from_dir
audio_classification_dataset_from_dictionary -> deepaudiox.datasets.audio_classification_dataset.audio_classification_dataset_from_dictionary
get_class_mapping_from_dir -> deepaudiox.utils.training_utils.get_class_mapping_from_dir
get_class_mapping_from_list -> deepaudiox.utils.training_utils.get_class_mapping_from_list
Trainer -> deepaudiox.loops.trainer.Trainer
Evaluator -> deepaudiox.loops.evaluator.Evaluator
BackboneName -> deepaudiox.schemas.types.BackboneName
PoolingName -> deepaudiox.schemas.types.PoolingName
AVAILABLE_BACKBONES -> deepaudiox.__init__.AVAILABLE_BACKBONES
AVAILABLE_POOLING -> deepaudiox.__init__.AVAILABLE_POOLING