API Reference

Dataset Construction

Methods for building datasets and class mappings from directories or label lists.

deepaudiox.get_class_mapping_from_dir(root_dir)[source]

Load the class mapping given a folder of class sub-folders.

Expected directory structure:

root_dir/
├── class_a/
│   ├── audio1.wav
│   └── audio2.wav
└── class_b/
    ├── audio3.wav
    └── audio4.wav
Parameters:

root_dir (str) – Path to the root folder that contains one sub-folder per class.

Returns:

The class mapping dictionary, ordered alphabetically by folder name.

Return type:

dict[str, int]

Example

>>> from deepaudiox import get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> # Example output:
>>> # {'class_a': 0, 'class_b': 1}
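
The behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: class_mapping_from_dir is a hypothetical helper that lists the immediate sub-folders of a root directory, sorts them by name, and assigns consecutive integer IDs. It builds a throwaway directory tree so that it runs without any audio data.

```python
import tempfile
from pathlib import Path


def class_mapping_from_dir(root_dir: str) -> dict[str, int]:
    """Hypothetical stand-in: map each sub-folder name to an integer ID."""
    # Only immediate sub-directories count as classes; loose files are ignored.
    class_names = sorted(p.name for p in Path(root_dir).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(class_names)}


# Build a small directory tree that mirrors the expected layout.
with tempfile.TemporaryDirectory() as root:
    for class_name, file_name in [("class_b", "audio3.wav"), ("class_a", "audio1.wav")]:
        class_dir = Path(root) / class_name
        class_dir.mkdir()
        (class_dir / file_name).touch()
    mapping = class_mapping_from_dir(root)

print(mapping)
```

Sorting before enumerating keeps the IDs stable across runs and machines, which matters when a trained model is reused with a freshly built mapping.
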
deepaudiox.get_class_mapping_from_list(labels, sort_alphabetically=True)[source]

Get a class mapping dictionary given a list of class names.

Parameters:
  • labels (list[str]) – List of class names

  • sort_alphabetically (bool) – Determines if alphabetical sorting should be applied to class names.

Returns:

The class mapping dictionary

Return type:

dict[str, int]

Example

>>> from deepaudiox import get_class_mapping_from_list
>>> labels = ["speech", "music", "noise"]
>>> class_mapping = get_class_mapping_from_list(labels, sort_alphabetically=True)
>>> # Example output:
>>> # {'music': 0, 'noise': 1, 'speech': 2}
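
To illustrate the effect of sort_alphabetically, here is a small sketch under stated assumptions. class_mapping_from_list is a hypothetical stand-in, and two of its behaviours are assumptions rather than documented facts: duplicates are dropped, and with sort_alphabetically=False the IDs follow the order of the input list.

```python
def class_mapping_from_list(labels: list[str], sort_alphabetically: bool = True) -> dict[str, int]:
    """Hypothetical stand-in: assign consecutive integer IDs to class names."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    # (an assumption of this sketch, not documented library behaviour).
    unique = list(dict.fromkeys(labels))
    if sort_alphabetically:
        unique = sorted(unique)
    return {name: idx for idx, name in enumerate(unique)}


labels = ["speech", "music", "noise"]
sorted_mapping = class_mapping_from_list(labels, sort_alphabetically=True)
unsorted_mapping = class_mapping_from_list(labels, sort_alphabetically=False)
print(sorted_mapping)
print(unsorted_mapping)
```
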
class deepaudiox.AudioClassificationDataset(file_to_class_mapping, sample_rate, class_mapping, segment_duration=None)[source]

Bases: Dataset

PyTorch Dataset for audio classification tasks.

This dataset loads audio files and returns a dictionary containing the raw waveform (under the key "feature"), the corresponding class name, and the integer class ID defined in class_mapping. The file_to_class_mapping argument must be a dictionary of the form:

{"abs/path/to/audio.wav": "class_name"}

Optionally, the dataset can segment each audio file into fixed-duration chunks using segment_duration. When enabled, each segment becomes an individual dataset sample.

file_to_class_mapping

Mapping from file paths to class names.

Type:

dict

sample_rate

Target sampling rate for audio loading.

Type:

int

class_mapping

Mapping from string class labels to integer IDs.

Type:

dict

Initialize the dataset.

Parameters:
  • file_to_class_mapping (dict[str | PathLike, str]) – Mapping from file paths to class names.

  • sample_rate (int) – Target sampling rate for audio loading.

  • class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.

  • segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.

Example

>>> from deepaudiox import AudioClassificationDataset
>>> file_to_class_mapping = {
...     "path/to/audio1.wav": "speech",
...     "path/to/audio2.wav": "music",
... }
>>> class_mapping = {"speech": 0, "music": 1}
>>> dataset = AudioClassificationDataset(
...     file_to_class_mapping=file_to_class_mapping,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
... )
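
The segmentation rule (fixed-duration chunks, last partial segment dropped) can be made concrete with a short sketch. segment_bounds is a hypothetical helper written for illustration; how the library computes the boundaries internally is not shown in this reference.

```python
def segment_bounds(num_samples: int, sample_rate: int, segment_duration: float) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of every full segment."""
    segment_len = int(round(segment_duration * sample_rate))
    if segment_len <= 0:
        raise ValueError("segment_duration must cover at least one sample")
    # Integer division discards the trailing partial segment.
    num_segments = num_samples // segment_len
    return [(i * segment_len, (i + 1) * segment_len) for i in range(num_segments)]


# A 5-second file at 16 kHz with 2-second segments yields two dataset items;
# the final 1-second remainder is dropped.
bounds = segment_bounds(num_samples=5 * 16_000, sample_rate=16_000, segment_duration=2.0)
print(bounds)
```
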
deepaudiox.audio_classification_dataset_from_dir(root_dir, sample_rate, class_mapping, segment_duration=None)[source]

Create an AudioClassificationDataset from a directory structure.

Parameters:
  • root_dir (str) – Root directory containing class sub-folders. Only .wav and .mp3 files are used.

  • sample_rate (int) – Target sampling rate for audio loading.

  • class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.

  • segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.

Returns:

The constructed dataset.

Return type:

AudioClassificationDataset

Example

>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
... )
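
As a rough illustration of what building a dataset from a directory involves, the sketch below collects a file-to-class dictionary from class sub-folders and keeps only .wav and .mp3 files. file_to_class_mapping_from_dir is a hypothetical helper, and the case-insensitive extension match is an assumption.

```python
import tempfile
from pathlib import Path

AUDIO_EXTENSIONS = {".wav", ".mp3"}  # per the root_dir description above


def file_to_class_mapping_from_dir(root_dir: str) -> dict[str, str]:
    """Hypothetical helper: map each audio file path to its parent folder name."""
    mapping = {}
    for class_dir in sorted(Path(root_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        for audio_path in sorted(class_dir.iterdir()):
            # Case-insensitive extension match is an assumption of this sketch.
            if audio_path.suffix.lower() in AUDIO_EXTENSIONS:
                mapping[str(audio_path)] = class_dir.name
    return mapping


with tempfile.TemporaryDirectory() as root:
    for rel in ["speech/a.wav", "speech/notes.txt", "music/b.mp3"]:
        path = Path(root) / rel
        path.parent.mkdir(exist_ok=True)
        path.touch()
    file_to_class = file_to_class_mapping_from_dir(root)
    class_names = sorted(file_to_class.values())

print(class_names)
```

The resulting dictionary has the same shape as the file_to_class_mapping argument accepted by AudioClassificationDataset.
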
deepaudiox.audio_classification_dataset_from_dictionary(file_to_class_mapping, sample_rate, class_mapping, segment_duration=None)[source]

Create an AudioClassificationDataset from a file-to-class mapping dictionary.

Parameters:
  • file_to_class_mapping (dict[str | PathLike, str]) – Mapping from file paths to class names.

  • sample_rate (int) – Target sampling rate for audio loading.

  • class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.

  • segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.

Returns:

The constructed dataset.

Return type:

AudioClassificationDataset

Example

>>> from deepaudiox import audio_classification_dataset_from_dictionary
>>> file_to_class_mapping = {
...     "path/to/audio1.wav": "speech",
...     "path/to/audio2.wav": "music",
... }
>>> class_mapping = {"speech": 0, "music": 1}
>>> dataset = audio_classification_dataset_from_dictionary(
...     file_to_class_mapping=file_to_class_mapping,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=None,
... )

Models & Backbones

Constructors for initializing classifiers and backbones.

class deepaudiox.modules.constructors.AudioClassifierConstructor(num_classes, backbone, pooling=None, freeze_backbone=False, sample_rate=16000, classifier_hidden_layers=None, activation='relu', apply_batch_norm=True, pretrained=False)[source]

Bases: BaseAudioClassifier, BackbonePoolingResolverMixin

Classifier model using a backbone for feature extraction.

backbone_constructor

Backbone model with optional pooling method.

Type:

BackboneConstructor

classifier

Classifier head for final predictions.

Type:

MLPHead

config

Constructor arguments used to build this model. Used by from_checkpoint to reconstruct the model from a saved checkpoint.

Type:

dict

Initialize the AudioClassifierConstructor.

Parameters:
  • num_classes (int) – Number of output classes.

  • backbone (Union[Literal['beats', 'passt', 'mobilenet_05_as', 'mobilenet_10_as', 'mobilenet_40_as'], BaseBackbone]) – Backbone model to use for feature extraction. Valid names are: “beats”, “passt”, “mobilenet_05_as”, “mobilenet_10_as”, “mobilenet_40_as”.

  • pooling (Union[Literal['gap', 'simpool', 'ep'], BasePooling, None]) – Optional pooling layer to aggregate features. If None, global average pooling (GAP) is used by default.

  • freeze_backbone (bool) – Whether to freeze the backbone weights during training.

  • sample_rate (int) – Sampling rate of the input audio, in Hz.

  • classifier_hidden_layers (list[int] | None) – Hidden layer sizes for the classifier head.

  • activation (Literal['relu', 'gelu', 'tanh', 'leakyrelu']) – Activation function for the classifier head.

  • apply_batch_norm (bool) – Whether to apply batch normalization in the classifier head.

  • pretrained (bool) – Whether to load pretrained weights for the backbone.

Example

>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(
...     num_classes=10,
...     backbone="beats",
...     pooling=None,
...     freeze_backbone=True,
...     sample_rate=16000,
...     classifier_hidden_layers=[512, 256],
...     activation="relu",
...     apply_batch_norm=True,
...     pretrained=True,
... )

Note

Available as deepaudiox.AudioClassifier.

forward(x)[source]

Forward pass through the classifier.

Parameters:

x (torch.Tensor) – Input waveforms of shape (B, T)

Returns:

Logits of shape (B, num_classes)

Return type:

Tensor

Example

>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> logits = model.forward(waveforms)
>>> # logits shape: (B, num_classes)
forward_backbone(x)[source]

Extract feature map from the backbone.

Parameters:

x (torch.Tensor) – Input waveforms of shape (B, T).

Returns:

Backbone feature map of shape (B, N, D) or (B, D, H, W).

Return type:

Tensor

Example

>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> features = model.forward_backbone(waveforms)
>>> # features shape: (B, N, D) for Transformer or (B, D, H, W) for CNN backbones
forward_with_pooling(x)[source]

Forward pass through backbone and pooling.

Parameters:

x (torch.Tensor) – Input waveforms of shape (B, T).

Returns:

Pooled tensor of shape (B, D).

Return type:

Tensor

Example

>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> embeddings = model.forward_with_pooling(waveforms)
>>> # embeddings shape: (B, D)
classmethod from_checkpoint(path)[source]

Load an AudioClassifierConstructor from a checkpoint saved by the Checkpointer.

Parameters:

path (str) – Path to the checkpoint file.

Returns:

Model with weights and config restored.

Return type:

AudioClassifierConstructor

Example

>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier.from_checkpoint("checkpoint.pt")
>>> print(model.config)
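
The role of config can be illustrated with a toy example: if the constructor arguments are stored next to the weights, a classmethod can rebuild the architecture before restoring the weights. This is only a sketch of the pattern. TinyModel, its save method, and the JSON file format are all hypothetical and unrelated to the library's Checkpointer format.

```python
import json
import tempfile
from pathlib import Path


class TinyModel:
    """Hypothetical model that remembers its constructor arguments."""

    def __init__(self, num_classes: int, hidden: int = 8):
        # Keep the exact constructor arguments so the model can be rebuilt later.
        self.config = {"num_classes": num_classes, "hidden": hidden}
        self.weights = [0.0] * (num_classes * hidden)

    def save(self, path: str) -> None:
        payload = {"config": self.config, "weights": self.weights}
        Path(path).write_text(json.dumps(payload))

    @classmethod
    def from_checkpoint(cls, path: str) -> "TinyModel":
        payload = json.loads(Path(path).read_text())
        model = cls(**payload["config"])    # rebuild the architecture first
        model.weights = payload["weights"]  # then restore the weights
        return model


with tempfile.TemporaryDirectory() as tmp:
    ckpt = str(Path(tmp) / "checkpoint.json")
    original = TinyModel(num_classes=3, hidden=4)
    original.weights[0] = 1.5
    original.save(ckpt)
    restored = TinyModel.from_checkpoint(ckpt)

print(restored.config)
```
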
class deepaudiox.modules.constructors.BackboneConstructor(backbone, pretrained=False, freeze_backbone=False, pooling=None, sample_rate=16000, norm_p=None)[source]

Bases: Module, BackbonePoolingResolverMixin

Backbone model wrapper with optional pooling and normalization.

backbone

Backbone model for feature extraction.

Type:

BaseBackbone

pooling

Pooling layer applied to the backbone feature map.

Type:

BasePooling

norm_p

Optional Lp normalization applied after pooling.

Type:

float or None

out_dim

Dimension of the backbone model feature map.

Type:

int

config

Constructor arguments used to build this model. Used by from_checkpoint to reconstruct the model from a saved checkpoint.

Type:

dict

Initialize the BackboneConstructor.

Parameters:
  • backbone (Union[Literal['beats', 'passt', 'mobilenet_05_as', 'mobilenet_10_as', 'mobilenet_40_as'], BaseBackbone]) – Backbone name or instance. Valid names are: “beats”, “passt”, “mobilenet_05_as”, “mobilenet_10_as”, “mobilenet_40_as”.

  • pretrained (bool) – Whether to load pretrained weights for the backbone.

  • freeze_backbone (bool) – Whether to freeze the backbone weights during training.

  • pooling (Union[Literal['gap', 'simpool', 'ep'], BasePooling, None]) – Optional pooling layer for aggregation. If None, global average pooling (GAP) is used.

  • sample_rate (int) – Sampling rate of the input audio, in Hz.

  • norm_p (float | None) – Optional Lp norm applied after pooling.

Example

>>> from deepaudiox import Backbone
>>> backbone = Backbone(
...     backbone="beats",
...     pretrained=True,
...     freeze_backbone=True,
...     pooling="gap",
...     sample_rate=16000,
...     norm_p=2.0,
... )

Note

Available as deepaudiox.Backbone.

extract_features(waveforms)[source]

Extract backbone-specific features from raw waveforms.

Parameters:

waveforms (Tensor) – Input waveforms of shape (B, T).

Returns:

Model-specific input features.

Return type:

Tensor

forward(x)[source]

Forward pass through the backbone.

Parameters:

x (Tensor) – Input waveforms of shape (B, T).

Returns:

Backbone feature map of shape (B, N, D) or (B, D, H, W).

Return type:

Tensor

Example

>>> import torch
>>> from deepaudiox import Backbone
>>> backbone = Backbone(backbone="beats", pretrained=True, sample_rate=16_000)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> features = backbone.forward(waveforms)
>>> # features shape: (B, N, D) for Transformer or (B, D, H, W) for CNN backbones
forward_with_pooling(x)[source]

Forward pass through backbone and pooling (with optional normalization).

Parameters:

x (Tensor) – Input waveforms of shape (B, T).

Returns:

Pooled tensor of shape (B, D).

Return type:

Tensor

Example

>>> import torch
>>> from deepaudiox import Backbone
>>> backbone = Backbone(backbone="beats", pretrained=True, pooling="gap", sample_rate=16_000)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> embeddings = backbone.forward_with_pooling(waveforms)
>>> # embeddings shape: (B, D)
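
To make the pooling and normalization steps concrete, the sketch below applies global average pooling to a single (N, D) token sequence and then Lp-normalizes the result. It uses plain Python lists instead of tensors, and both helper functions are hypothetical illustrations rather than the library's pooling layers.

```python
def global_average_pool(tokens: list[list[float]]) -> list[float]:
    """Average an (N, D) token sequence over N, giving a D-dimensional vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def lp_normalize(vector: list[float], p: float = 2.0, eps: float = 1e-12) -> list[float]:
    """Scale a vector so that its Lp norm is 1 (eps guards against division by zero)."""
    norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    return [v / max(norm, eps) for v in vector]


# One item of a (B, N, D) feature map with N=2 tokens and D=2 channels.
feature_map = [[3.0, 0.0], [3.0, 8.0]]
pooled = global_average_pool(feature_map)  # column means: 3.0 and 4.0
embedding = lp_normalize(pooled, p=2.0)    # L2 norm of (3, 4) is 5
print(pooled, embedding)
```

With norm_p=2.0 the embeddings lie on the unit sphere, so dot products between them equal cosine similarities.
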
classmethod from_checkpoint(path)[source]

Load a BackboneConstructor from a checkpoint saved by the Checkpointer.

Parameters:

path (str) – Path to the checkpoint file.

Returns:

Model with weights and config restored.

Return type:

BackboneConstructor

Example

>>> from deepaudiox import Backbone
>>> backbone = Backbone.from_checkpoint("checkpoint.pt")
>>> print(backbone.config)

Supported Backbones & Pooling

Type aliases and runtime constants for valid backbone and pooling names.

deepaudiox.AVAILABLE_BACKBONES = ("beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as")

Supported pretrained backbone names available at runtime.

deepaudiox.AVAILABLE_POOLING = ("gap", "simpool", "ep")

Supported pooling layer names available at runtime.

deepaudiox.BackboneName

Type alias: Literal["beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as"]. Use for type-annotated code.

deepaudiox.PoolingName

Type alias: Literal["gap", "simpool", "ep"]. Use for type-annotated code.
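
One way the type alias and the runtime constant can work together is sketched below. The names are redefined locally so that the snippet runs without deepaudiox installed, and check_backbone_name is a hypothetical helper, not part of the library.

```python
from typing import Literal, get_args

# Local copies of the documented names, so the sketch runs without deepaudiox.
BackboneName = Literal["beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as"]
AVAILABLE_BACKBONES = get_args(BackboneName)


def check_backbone_name(name: str) -> str:
    """Hypothetical helper: reject names outside the supported set."""
    if name not in AVAILABLE_BACKBONES:
        raise ValueError(f"Unknown backbone {name!r}; expected one of {AVAILABLE_BACKBONES}")
    return name


print(check_backbone_name("beats"))
try:
    check_backbone_name("resnet")
except ValueError as err:
    rejected = str(err)
    print(rejected)
```

The Literal alias helps static type checkers, while the tuple supports checks on values that only become known at runtime, such as command-line arguments.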

Training & Evaluation

Interfaces for training models and evaluating performance on held-out data.

class deepaudiox.Trainer(train_dset, model, validation_dset=None, optimizer=None, learning_rate=0.001, lr_scheduler=None, loss_function=None, train_ratio=0.8, epochs=100, patience=None, num_workers=4, batch_size=16, path_to_checkpoint='checkpoint.pt', device='cpu', device_index=None, verbose=True)[source]

Bases: object

The core SDK module for training a model.

The Trainer assembles all modules required for training and performs the training process.

state

Stores training variables.

Type:

State

epochs

The maximum number of training epochs.

Type:

int

verbose

Whether to log epoch-level artifacts.

Type:

bool

device

The device used for training.

Type:

str

logger

A module used for logging messages.

Type:

logging.Logger

train_dloader

The DataLoader of the training set.

Type:

torch.utils.data.DataLoader

validation_dloader

The DataLoader of the validation set.

Type:

torch.utils.data.DataLoader

model

The BaseAudioClassifier to be trained.

Type:

BaseAudioClassifier

optimizer

The optimizer of the training process.

Type:

torch.optim.Optimizer

scheduler

The learning rate scheduler of the training process.

Type:

LRScheduler

loss_function

The loss function used for optimization.

Type:

nn.Module

callbacks

A list of callbacks used throughout the training lifecycle.

Type:

list

Initialize the Trainer.

Parameters:
  • train_dset (AudioClassificationDataset) – The training dataset.

  • model (BaseAudioClassifier) – The model to be trained.

  • validation_dset (AudioClassificationDataset | None) – The validation dataset. If None, a split is created from train_dset using train_ratio.

  • optimizer (Optimizer | None) – The optimizer used for training. Adam if None.

  • learning_rate (float) – Learning rate used when optimizer is None. Defaults to 1e-3.

  • lr_scheduler (LRScheduler | None) – The scheduler used for training. ReduceLROnPlateau if None.

  • loss_function (Module | None) – The loss function used for training. Uses CrossEntropy if None.

  • train_ratio (float) – The ratio of the train split when validation_dset is None. Defaults to 0.8.

  • epochs (int) – The maximum number of training epochs. Defaults to 100.

  • patience (int | None) – Number of epochs to wait without loss improvement before stopping early. Early stopping is disabled if None.

  • num_workers (int) – The number of worker processes for the PyTorch DataLoaders. Defaults to 4.

  • batch_size (int) – The batch size for the PyTorch DataLoaders. Defaults to 16.

  • path_to_checkpoint (str) – The path where the model checkpoint is saved. Defaults to “checkpoint.pt”.

  • device (Literal['cuda', 'mps', 'cpu']) – The device to use for training. One of "cuda", "mps", or "cpu". Defaults to "cpu".

  • device_index (int | None) – The GPU device index. Only applicable when device="cuda". If None, uses the default CUDA device.

  • verbose (bool) – If True, logs epoch-level artifacts (loss, time). If False, only start/end messages and the final training summary are printed. Defaults to True.

Example

>>> from deepaudiox import AudioClassifier, Trainer
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> train_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> trainer = Trainer(train_dset=train_dataset, model=model, epochs=10)
>>> trainer.train()
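
When validation_dset is None, a validation split is carved out of the training set according to train_ratio. The sketch below shows one plausible way to do that on item indices. split_indices is hypothetical, and the seeded shuffle is an assumption; the Trainer's actual splitting strategy is not described in this reference.

```python
import random


def split_indices(num_items: int, train_ratio: float = 0.8, seed: int = 0) -> tuple[list[int], list[int]]:
    """Shuffle item indices and cut them into train and validation parts."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must lie strictly between 0 and 1")
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (sketch assumption)
    num_train = int(num_items * train_ratio)
    return indices[:num_train], indices[num_train:]


train_idx, val_idx = split_indices(num_items=50, train_ratio=0.8)
print(len(train_idx), len(val_idx))
```
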
epoch_step()[source]

Run one complete training epoch.

Logs the epoch header and metrics when verbose=True, calls train_step() and val_step(), updates the LR scheduler and self.state, then executes on_epoch_end callbacks (which may trigger early stopping or checkpointing).

Note

self.state.current_epoch must be set by the caller before invoking this method — train() does this automatically. When calling epoch_step() directly, set it yourself: trainer.state.current_epoch = epoch.

Returns:

(train_loss, val_loss) for the epoch.

Return type:

tuple[float, float]

Example

>>> from deepaudiox import AudioClassifier, Trainer
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> train_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> trainer = Trainer(train_dset=train_dataset, model=model, epochs=10)
>>> for epoch in range(1, trainer.epochs + 1):
...     trainer.state.current_epoch = epoch
...     train_loss, val_loss = trainer.epoch_step()
...     print(f"Epoch {epoch} — train: {train_loss:.4f}, val: {val_loss:.4f}")
...     if trainer.state.early_stop:
...         break
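
The interaction between patience and the early_stop flag can be sketched with a small counter. EarlyStopper is a hypothetical class, not the library's callback, and it assumes that the monitored quantity is the validation loss.

```python
class EarlyStopper:
    """Hypothetical patience counter; not the library's callback."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0
        self.early_stop = False

    def update(self, val_loss: float) -> None:
        if val_loss < self.best_loss:
            # A new best loss resets the counter.
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
            self.early_stop = self.epochs_without_improvement >= self.patience


stopper = EarlyStopper(patience=2)
stopped_at = None
for epoch, val_loss in enumerate([0.90, 0.70, 0.75, 0.72, 0.60], start=1):
    stopper.update(val_loss)
    if stopper.early_stop:
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)
```

In this run the loss fails to improve at epochs 3 and 4, so training stops at epoch 4 and never reaches the better loss at epoch 5. This is the usual trade-off of a small patience value.
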
train()[source]

Perform the full training process.

Epoch-level output is controlled by verbose. The training summary (best epoch, losses, checkpoint path) is always printed on completion.

Return type:

None

train_step()[source]

Run one pass over the training set.

Sets the model to train mode, iterates over train_dloader, performs forward + backward + optimizer step per batch.

Returns:

Average training loss over the epoch.

Return type:

float

val_step()[source]

Run one pass over the validation set.

Sets the model to eval mode, iterates over validation_dloader under torch.no_grad().

Returns:

Average validation loss over the epoch.

Return type:

float

class deepaudiox.Evaluator(test_dset, model, class_mapping, batch_size=16, num_workers=4, device='cpu', device_index=None, verbose=True)[source]

Bases: object

The core SDK module for testing a model.

The Evaluator assembles all modules required for testing and performs the testing process.

state

Stores testing variables.

Type:

State

verbose

Whether to log the evaluation report after testing.

Type:

bool

device

The device used for testing.

Type:

str

class_mapping

A mapping between class names and IDs.

Type:

dict

logger

A module used for logging messages.

Type:

logging.Logger

test_dloader

The DataLoader of the testing set.

Type:

torch.utils.data.DataLoader

model

An AudioClassifier module inheriting from BaseAudioClassifier.

Type:

BaseAudioClassifier

callbacks

A list of callbacks used throughout the testing lifecycle.

Type:

list

Initialize the Evaluator.

Parameters:
  • test_dset (AudioClassificationDataset) – The testing dataset.

  • model (BaseAudioClassifier) – An AudioClassifier module inheriting from BaseAudioClassifier.

  • class_mapping (dict) – A mapping between class names and IDs.

  • batch_size (int) – The batch size for the PyTorch DataLoaders. Defaults to 16.

  • num_workers (int) – The number of worker processes for the PyTorch DataLoaders. Defaults to 4.

  • device (Literal['cuda', 'mps', 'cpu']) – The device to use for evaluation. One of "cuda", "mps", or "cpu". Defaults to "cpu".

  • device_index (int | None) – The GPU device index. Only applicable when device="cuda". If None, uses the default CUDA device.

  • verbose (bool) – If True, prints the classification report, confusion matrix, and average posteriors after evaluation. “Evaluation has finished.” is always printed. Defaults to True.

Example

>>> from deepaudiox import AudioClassifier, Evaluator
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> test_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier.from_checkpoint("checkpoint.pt")
>>> evaluator = Evaluator(test_dset=test_dataset, model=model, class_mapping=class_mapping)
>>> evaluator.evaluate()
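
The relationship between device and device_index can be sketched as follows. resolve_device is a hypothetical helper that builds a PyTorch-style device string; silently ignoring the index for non-CUDA devices is an assumption about how the two arguments are combined.

```python
from typing import Optional


def resolve_device(device: str = "cpu", device_index: Optional[int] = None) -> str:
    """Hypothetical helper: build a PyTorch-style device string."""
    if device not in ("cuda", "mps", "cpu"):
        raise ValueError(f"Unsupported device: {device!r}")
    # The index only matters for CUDA; it is ignored for the other devices.
    if device == "cuda" and device_index is not None:
        return f"cuda:{device_index}"
    return device


print(resolve_device("cuda", 1))
print(resolve_device("cuda"))
print(resolve_device("cpu", 1))
```
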
evaluate()[source]

Run the full evaluation loop over the test set.

Iterates over all batches in test_dloader, accumulates true labels, predicted labels, and posterior probabilities into self.state, then triggers the registered callbacks via on_testing_end.

Always prints “Evaluation has finished.” regardless of verbose. The Reporter callback (classification report, confusion matrix, average posteriors) is only executed when verbose=True.

Return type:

None

After this method returns, self.state holds:
  • y_true (np.ndarray): Ground-truth class indices, shape (N,).

  • y_pred (np.ndarray): Predicted class indices, shape (N,).

  • posteriors (np.ndarray): Max posterior probability per sample, shape (N,).

Note

The model is expected to already be in eval mode (set in __init__). Runs under torch.inference_mode() — gradients are fully disabled.
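
Once y_true and y_pred are available, custom metrics can be computed from them directly. The sketch below uses plain lists with made-up values in place of the NumPy arrays held in self.state, and shows how an accuracy score and a confusion matrix might be derived.

```python
def confusion_matrix(y_true: list[int], y_pred: list[int], num_classes: int) -> list[list[int]]:
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_id, pred_id in zip(y_true, y_pred):
        matrix[true_id][pred_id] += 1
    return matrix


# Illustrative values only; real values come from an evaluation run.
class_mapping = {"music": 0, "speech": 1}
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred, num_classes=len(class_mapping))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(matrix, accuracy)
```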

Base Classes & Inference

Base interfaces and inference helpers used across models.

Base classes for abstracting nn modules (e.g., backbones, pooling layers, classifiers).

class deepaudiox.modules.baseclasses.BaseAudioClassifier(*args, **kwargs)[source]

Bases: Module, ABC

Base class for creating custom audio classifiers.

This class defines the standard interface for audio classification models. Subclasses must implement the core initialization and forward pass. The built-in predict method provides a convenience wrapper to obtain predicted labels, posterior probabilities, and raw logits.

__init__()[source]

Initialize the classifier and its components.

forward()[source]

Process input waveforms and return logits.

predict()[source]

Compute predicted classes, posterior probabilities, and logits.

abstractmethod __init__(*args, **kwargs)[source]

Initialize the audio classifier.

abstractmethod forward(x)[source]

Pass the input through the model and return logits.

Parameters:

x (Tensor) – The input tensor.

inference_on_file(path, sample_rate, class_mapping, segment_duration=None, batch_size=4)[source]

Get prediction for an audio sample from a file path.

Parameters:
  • path (str | Path) – Path to an audio file supported by librosa (e.g., WAV or MP3).

  • sample_rate (int) – Sampling rate of audio sample.

  • class_mapping (dict[str, int]) – Class-to-index mapping as it is used by the model.

  • segment_duration (float | None) – Optional segment duration in seconds for segment-level inference. If provided, the last remainder is right-padded to a full segment.

  • batch_size (int) – Optional batch size for segment-level inference. Default is 4.

Returns:

A dictionary with keys:
  • final_label (str): Predicted class label.

  • final_posterior (float): Posterior probability for the predicted class.

  • segment_labels (list[str] | None): Per-segment labels when segmenting is used.

  • segment_posteriors (list[float] | None): Per-segment posteriors aligned with segment_labels when segmenting is used.

Return type:

dict

Example

>>> from deepaudiox import AudioClassifier
>>> class_mapping = {"speech": 0, "music": 1}
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> prediction = model.inference_on_file(
...     "path/to/audio.wav",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
...     batch_size=4,
... )
inference_on_waveform(x, sample_rate, class_mapping, segment_duration=None, batch_size=4)[source]

Get prediction on a waveform.

Parameters:
  • x (Tensor | ndarray) – Input waveform to be used for inference. Accepts shape (T,).

  • sample_rate (int) – Sampling rate of audio sample.

  • class_mapping (dict[str, int]) – Class-to-index mapping that is used by the model.

  • segment_duration (float | None) – Optional segment duration in seconds for segment-level inference. If provided, the last remainder is right-padded to a full segment.

  • batch_size (int) – Optional batch size for segment-level inference. Default is 4.

Returns:

A dictionary with keys:
  • final_label (str): Predicted class label.

  • final_posterior (float): Posterior probability for the predicted class.

  • segment_labels (list[str] | None): Per-segment labels when segmenting is used.

  • segment_posteriors (list[float] | None): Per-segment posteriors aligned with segment_labels when segmenting is used.

Return type:

dict

Example

>>> import torch
>>> from deepaudiox import AudioClassifier
>>> class_mapping = {"speech": 0, "music": 1}
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> waveform = torch.randn(5 * 16_000)
>>> prediction = model.inference_on_waveform(
...     waveform,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=1.0,
...     batch_size=4,
... )
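
A short sketch of how the returned dictionary might be consumed. The keys match those documented above; summarize_prediction is a hypothetical helper, and the example dictionary is hand-written with made-up values rather than real model output.

```python
def summarize_prediction(prediction, segment_duration=None):
    """Return human-readable lines for a prediction dictionary."""
    lines = [f"final: {prediction['final_label']} ({prediction['final_posterior']:.2f})"]
    labels = prediction.get("segment_labels")
    posteriors = prediction.get("segment_posteriors")
    # Segment fields are None when no segmenting was used.
    if labels is None or posteriors is None or segment_duration is None:
        return lines
    for index, (label, posterior) in enumerate(zip(labels, posteriors)):
        start = index * segment_duration
        end = start + segment_duration
        lines.append(f"{start:.1f}-{end:.1f}s: {label} ({posterior:.2f})")
    return lines


# Hand-written stand-in for a real model output (values are made up).
example_prediction = {
    "final_label": "speech",
    "final_posterior": 0.81,
    "segment_labels": ["speech", "music"],
    "segment_posteriors": [0.92, 0.55],
}
for line in summarize_prediction(example_prediction, segment_duration=1.0):
    print(line)
```

Checking segment_labels for None before iterating keeps the same code path usable for calls made without segment_duration.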
predict(x)[source]

Compute predicted classes, posterior probabilities, and logits.

This is a low-level method that does not manage model mode or gradient context. The caller is responsible for calling model.eval() and wrapping with torch.no_grad() or torch.inference_mode() as needed. For end-to-end inference with automatic mode management, use inference_on_waveform or inference_on_file instead.

Parameters:

x (Tensor) – Input waveforms of shape (B, T), where T is the number of audio samples.

Returns:

A dictionary with the keys y_preds, posteriors, and logits.

Return type:

dict[str, ndarray]

Example

>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> model.eval()
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> with torch.no_grad():
...     outputs = model.predict(waveforms)
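
The relationship between the three outputs can be sketched without the library. The sketch assumes single-label classification in which posteriors are the softmax of the logits and y_preds is the arg-max index; the reference does not confirm this, and decode_predictions is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decode_predictions(batch_logits, class_mapping):
    """Map each row of logits to (class_name, posterior_of_that_class)."""
    # Invert the class-to-index mapping so indices can be turned into names.
    index_to_class = {index: name for name, index in class_mapping.items()}
    decoded = []
    for logits in batch_logits:
        posteriors = softmax(logits)
        best = max(range(len(posteriors)), key=posteriors.__getitem__)
        decoded.append((index_to_class[best], posteriors[best]))
    return decoded


class_mapping = {"speech": 0, "music": 1}
decoded = decode_predictions([[2.0, 0.0], [0.0, 0.0]], class_mapping)
for name, posterior in decoded:
    print(f"{name}: {posterior:.3f}")
```

Inverting class_mapping is the step that turns integer predictions back into labels, which is why the inference helpers take the mapping as an argument.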
class deepaudiox.modules.baseclasses.BaseBackbone(out_dim, sample_rate)[source]

Bases: Module, ABC

Abstract base class for all audio backbone models.

This class defines the common interface for backbone architectures that convert raw waveforms into fixed-dimensional embeddings. Subclasses must implement the core feature extraction and forward-processing logic.

__init__()[source]

Initializes the embedding dimension and the audio sample rate.

forward()[source]

Computes embeddings from pre-extracted audio features.

extract_features()[source]

Converts raw waveforms into model-specific features.

forward_pipeline()[source]

Extracts features and then applies forward().

__init__(out_dim, sample_rate)[source]

Initialize the BaseBackbone.

Parameters:
  • out_dim (int) – Output dimension D of the backbone feature map. Embeddings have shape (B, D, H, W) for CNNs or (B, N, D) for Transformers.

  • sample_rate (int) – Sample rate of the audio input.

abstractmethod extract_features(waveforms)[source]

Convert raw waveforms into internal acoustic features.

Parameters:

waveforms (Tensor) – Tensor of shape (B, T).

Returns:

Model-specific feature representation before final forward().

Return type:

Tensor

abstractmethod forward(x, padding_mask=None)[source]

Compute embeddings from input features.

Parameters:
  • x (Tensor) – Input audio-specific features of shape (B, 1, F, T) or (B, 1, T, F).

  • padding_mask (Tensor | None) – Optional padding mask.

Returns:

Embeddings of shape (B, N, D) or (B, D, H, W), where D is the embedding dimension.

Return type:

Tensor

forward_pipeline(x)[source]

Standard processing pipeline:

  1. Extract features from raw audio

  2. Pass features through forward()

Parameters:

x (Tensor) – Input waveforms of shape (B, T), where T is the length of waveforms.

Returns:

Final model output of shape (B, D, H, W) for CNNs or (B, N, D) for Transformers.

Return type:

Tensor
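
The two-step pipeline above follows a template-method layout: subclasses supply extract_features() and forward(), and forward_pipeline() chains them. The sketch below mirrors that layout with a torch-free stand-in so it runs anywhere. ToyBackbone and EnergyBackbone are hypothetical names, not deepaudiox classes, and nested lists stand in for tensors.

```python
from abc import ABC, abstractmethod


class ToyBackbone(ABC):
    """Torch-free stand-in mirroring the BaseBackbone method layout."""

    def __init__(self, out_dim, sample_rate):
        self.out_dim = out_dim
        self.sample_rate = sample_rate

    @abstractmethod
    def extract_features(self, waveforms):
        """Turn raw waveforms into model-specific features."""

    @abstractmethod
    def forward(self, x, padding_mask=None):
        """Turn features into embeddings."""

    def forward_pipeline(self, x):
        # Step 1: features from raw audio; step 2: embeddings from features.
        return self.forward(self.extract_features(x))


class EnergyBackbone(ToyBackbone):
    """Hypothetical subclass: one energy feature, one-dimensional embedding."""

    def extract_features(self, waveforms):
        # Mean squared amplitude per waveform.
        return [sum(s * s for s in wave) / len(wave) for wave in waveforms]

    def forward(self, x, padding_mask=None):
        return [[energy] for energy in x]


backbone = EnergyBackbone(out_dim=1, sample_rate=16_000)
embeddings = backbone.forward_pipeline([[1.0, -1.0], [0.0, 2.0]])
print(embeddings)  # [[1.0], [2.0]]
```

Keeping feature extraction separate from forward() lets callers who already hold precomputed features skip the first step and call forward() directly.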

class deepaudiox.modules.baseclasses.BasePooling(in_dim=None)[source]

Bases: Module, ABC

Abstract base class for all pooling modules.

This class defines the interface for pooling modules that operate on an input feature map produced by a CNN or Transformer BaseBackbone. Subclasses must implement the forward-processing logic. The input is expected to be a feature map of shape (B, D, H, W) for CNNs or (B, T, D) for Transformers.

__init__()[source]

Store input dimensionality.

forward()[source]

Apply the pooling module to an input tensor and return the result.

__init__(in_dim=None)[source]

Initialize the BasePooling.

Parameters:

in_dim (int | None) – Input dimension. This is D for both CNNs and Transformers.

abstractmethod forward(x)[source]

Compute forward pass returning a projected tensor.

Return type:

Tensor
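
As an illustration of what a pooling module does with the two input layouts, the sketch below applies mean pooling to nested lists standing in for tensors. Mean pooling is only one possible strategy, the (B, D) output shape is an assumption, and both function names are hypothetical.

```python
def mean_pool_tokens(x):
    """Average a (B, N, D) nested list over the token axis N, giving (B, D)."""
    pooled = []
    for tokens in x:
        count = len(tokens)
        dim = len(tokens[0])
        pooled.append([sum(token[d] for token in tokens) / count for d in range(dim)])
    return pooled


def mean_pool_map(x):
    """Average a (B, D, H, W) nested list over H and W, giving (B, D)."""
    pooled = []
    for channels in x:
        row = []
        for plane in channels:
            values = [value for line in plane for value in line]
            row.append(sum(values) / len(values))
        pooled.append(row)
    return pooled


tokens = [[[1.0, 2.0], [3.0, 4.0]]]           # shape (1, 2, 2): B=1, N=2, D=2
feature_map = [[[[1.0, 3.0]], [[2.0, 6.0]]]]  # shape (1, 2, 1, 2): B=1, D=2, H=1, W=2
print(mean_pool_tokens(tokens))    # [[2.0, 3.0]]
print(mean_pool_map(feature_map))  # [[2.0, 4.0]]
```

In both layouts the pooled vector has length D, which matches the note that in_dim is D for CNNs and Transformers alike.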

Full Paths

The API re-exports the following symbols. If you prefer importing from the original modules, use these paths:

  • AudioClassifier -> deepaudiox.modules.constructors.AudioClassifierConstructor

  • Backbone -> deepaudiox.modules.constructors.BackboneConstructor

  • AudioClassificationDataset -> deepaudiox.datasets.audio_classification_dataset.AudioClassificationDataset

  • audio_classification_dataset_from_dir -> deepaudiox.datasets.audio_classification_dataset.audio_classification_dataset_from_dir

  • audio_classification_dataset_from_dictionary -> deepaudiox.datasets.audio_classification_dataset.audio_classification_dataset_from_dictionary

  • get_class_mapping_from_dir -> deepaudiox.utils.training_utils.get_class_mapping_from_dir

  • get_class_mapping_from_list -> deepaudiox.utils.training_utils.get_class_mapping_from_list

  • Trainer -> deepaudiox.loops.trainer.Trainer

  • Evaluator -> deepaudiox.loops.evaluator.Evaluator

  • BackboneName -> deepaudiox.schemas.types.BackboneName

  • PoolingName -> deepaudiox.schemas.types.PoolingName

  • AVAILABLE_BACKBONES -> deepaudiox.__init__.AVAILABLE_BACKBONES

  • AVAILABLE_POOLING -> deepaudiox.__init__.AVAILABLE_POOLING