API Reference
Dataset Construction
Methods for building datasets and class mappings from directories or label lists.
- deepaudiox.get_class_mapping_from_dir(root_dir)[source]
Load the class mapping given a folder of class sub-folders.
Expected directory structure:
root_dir/
├── class_a/
│   ├── audio1.wav
│   └── audio2.wav
└── class_b/
    ├── audio3.wav
    └── audio4.wav
- Parameters:
root_dir (str) – The path to the root folder.
- Returns:
The class mapping dictionary, ordered alphabetically by folder name.
- Return type:
dict[str, int]
Example
>>> from deepaudiox import get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> # Example output:
>>> # {'class_a': 0, 'class_b': 1}
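The mapping rule described above (one class per sub-folder, IDs assigned in alphabetical order) can be sketched with the standard library alone. `sketch_class_mapping_from_dir` is a hypothetical stand-in, not the library's implementation; the real function may treat hidden folders or stray files differently.

```python
import tempfile
from pathlib import Path


def sketch_class_mapping_from_dir(root_dir):
    """Hypothetical helper: map each sub-folder name to an integer ID."""
    # Keep directories only, then sort the names so the IDs are deterministic.
    class_names = sorted(p.name for p in Path(root_dir).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(class_names)}


# Build a throwaway directory tree that mirrors the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    for class_name in ("class_b", "class_a"):
        (Path(tmp) / class_name).mkdir()
        (Path(tmp) / class_name / "audio.wav").touch()
    demo_mapping = sketch_class_mapping_from_dir(tmp)

print(demo_mapping)  # {'class_a': 0, 'class_b': 1}
```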
- deepaudiox.get_class_mapping_from_list(labels, sort_alphabetically=True)[source]
Get a class mapping dictionary given a list of class names.
- Parameters:
labels (list[str]) – List of class names.
sort_alphabetically (bool) – Whether to sort the class names alphabetically before assigning IDs.
- Returns:
The class mapping dictionary
- Return type:
dict[str,int]
Example
>>> from deepaudiox import get_class_mapping_from_list
>>> labels = ["speech", "music", "noise"]
>>> class_mapping = get_class_mapping_from_list(labels, sort_alphabetically=True)
>>> # Example output:
>>> # {'music': 0, 'noise': 1, 'speech': 2}
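A rough sketch of what the `sort_alphabetically` flag changes is given below. The helper name is hypothetical, and two behaviours are assumptions rather than documented facts: that duplicates are collapsed, and that the input order is preserved when sorting is disabled.

```python
def sketch_class_mapping_from_list(labels, sort_alphabetically=True):
    """Hypothetical helper: assign consecutive integer IDs to class names."""
    # Assumption: duplicates are dropped, keeping first-seen order.
    unique = list(dict.fromkeys(labels))
    if sort_alphabetically:
        unique = sorted(unique)
    return {name: idx for idx, name in enumerate(unique)}


labels = ["speech", "music", "noise"]
sorted_mapping = sketch_class_mapping_from_list(labels)
unsorted_mapping = sketch_class_mapping_from_list(labels, sort_alphabetically=False)
print(sorted_mapping)    # {'music': 0, 'noise': 1, 'speech': 2}
print(unsorted_mapping)  # {'speech': 0, 'music': 1, 'noise': 2}
```

Whichever ordering is chosen, the same mapping should be reused for training, evaluation, and inference so that integer IDs keep their meaning.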
- class deepaudiox.AudioClassificationDataset(file_to_class_mapping, sample_rate, class_mapping, segment_duration=None)[source]
Bases: Dataset
PyTorch Dataset for audio classification tasks.
This dataset loads audio files and returns a dictionary containing the raw waveform (under the key "feature"), the corresponding class name, and the integer class ID defined in class_mapping. The file_to_class_mapping argument must be a dictionary of the form: {"abs/path/to/audio.wav": "class_name"}
Optionally, the dataset can segment each audio file into fixed-duration chunks using segment_duration. When enabled, each segment becomes an individual dataset sample.
- file_to_class_mapping
Mapping from file paths to class names.
- Type:
dict
- sample_rate
Target sampling rate for audio loading.
- Type:
int
- class_mapping
Mapping from string class labels to integer IDs.
- Type:
dict
Initialize the dataset.
- Parameters:
file_to_class_mapping (dict[str | PathLike, str]) – Mapping from file paths to class names.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, the full audio is loaded. When set, the last partial segment is dropped.
Example
>>> from deepaudiox import AudioClassificationDataset
>>> file_to_class_mapping = {
...     "path/to/audio1.wav": "speech",
...     "path/to/audio2.wav": "music",
... }
>>> class_mapping = {"speech": 0, "music": 1}
>>> dataset = AudioClassificationDataset(
...     file_to_class_mapping=file_to_class_mapping,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
... )
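To make the effect of segment_duration concrete, the sketch below computes which sample ranges would become dataset items when the last partial segment is dropped. `sketch_segment_bounds` is a hypothetical helper, and rounding the segment length to the nearest sample is an assumption.

```python
def sketch_segment_bounds(num_samples, sample_rate, segment_duration=None):
    """Hypothetical helper: return (start, end) sample indices per dataset item."""
    if segment_duration is None:
        # No segmentation: the whole file is a single sample.
        return [(0, num_samples)]
    seg_len = int(round(segment_duration * sample_rate))
    if seg_len <= 0:
        raise ValueError("segment_duration is too short for this sample rate")
    # Integer division discards the trailing partial segment.
    n_full = num_samples // seg_len
    return [(i * seg_len, (i + 1) * seg_len) for i in range(n_full)]


# A 5-second file at 16 kHz with 2-second segments: the last second is dropped.
bounds = sketch_segment_bounds(num_samples=5 * 16_000, sample_rate=16_000, segment_duration=2.0)
print(bounds)  # [(0, 32000), (32000, 64000)]
```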
- deepaudiox.audio_classification_dataset_from_dir(root_dir, sample_rate, class_mapping, segment_duration=None)[source]
Create an AudioClassificationDataset from a directory structure.
- Parameters:
root_dir (str) – Root directory containing class sub-folders. Only .wav and .mp3 files are used.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, the full audio is loaded. When set, the last partial segment is dropped.
- Returns:
The constructed dataset.
- Return type:
AudioClassificationDataset
Example
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
... )
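The directory scan implied above can be approximated as follows: walk each class sub-folder, keep only .wav and .mp3 files, and record the folder name as the label. This is a sketch under assumptions, not the library's code; the helper name is hypothetical, and the non-recursive scan and case-insensitive suffix match are guesses.

```python
import tempfile
from pathlib import Path

AUDIO_SUFFIXES = {".wav", ".mp3"}


def sketch_file_to_class_mapping(root_dir):
    """Hypothetical helper: build {file path: class name} from class sub-folders."""
    mapping = {}
    for class_dir in sorted(p for p in Path(root_dir).iterdir() if p.is_dir()):
        for file_path in sorted(class_dir.iterdir()):
            # Assumption: suffix matching ignores case and does not recurse.
            if file_path.is_file() and file_path.suffix.lower() in AUDIO_SUFFIXES:
                mapping[str(file_path)] = class_dir.name
    return mapping


with tempfile.TemporaryDirectory() as tmp:
    for rel in ("speech/a.wav", "speech/notes.txt", "music/b.mp3"):
        path = Path(tmp) / rel
        path.parent.mkdir(exist_ok=True)
        path.touch()
    demo_files = sketch_file_to_class_mapping(tmp)
    demo_labels = sorted((Path(k).name, v) for k, v in demo_files.items())

print(demo_labels)  # [('a.wav', 'speech'), ('b.mp3', 'music')]
```

A dictionary of this shape is what audio_classification_dataset_from_dictionary expects as file_to_class_mapping.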
- deepaudiox.audio_classification_dataset_from_dictionary(file_to_class_mapping, sample_rate, class_mapping, segment_duration=None)[source]
Create an AudioClassificationDataset from a file-to-class mapping dictionary.
- Parameters:
file_to_class_mapping (dict[str | PathLike, str]) – Mapping from file paths to class names.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, the full audio is loaded. When set, the last partial segment is dropped.
- Returns:
The constructed dataset.
- Return type:
AudioClassificationDataset
Example
>>> from deepaudiox import audio_classification_dataset_from_dictionary
>>> file_to_class_mapping = {
...     "path/to/audio1.wav": "speech",
...     "path/to/audio2.wav": "music",
... }
>>> class_mapping = {"speech": 0, "music": 1}
>>> dataset = audio_classification_dataset_from_dictionary(
...     file_to_class_mapping=file_to_class_mapping,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=None,
... )
Models & Backbones
Constructors for initializing classifiers and backbones.
- class deepaudiox.modules.constructors.AudioClassifierConstructor(num_classes, backbone, pooling=None, freeze_backbone=False, sample_rate=16000, classifier_hidden_layers=None, activation='relu', apply_batch_norm=True, pretrained=False)[source]
Bases: BaseAudioClassifier, BackbonePoolingResolverMixin
Classifier model using a backbone for feature extraction.
- backbone_constructor
Backbone model with optional pooling method.
- Type:
- classifier
Classifier head for final predictions.
- Type:
MLPHead
- config
Constructor arguments used to build this model. Used by from_checkpoint to reconstruct the model from a saved checkpoint.
- Type:
dict
Initialize the AudioClassifierConstructor.
- Parameters:
num_classes (int) – Number of output classes.
backbone (Union[Literal['beats', 'passt', 'mobilenet_05_as', 'mobilenet_10_as', 'mobilenet_40_as'], BaseBackbone]) – Backbone model to use for feature extraction. Valid names are: "beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as".
pooling (Union[Literal['gap', 'simpool', 'ep'], BasePooling, None]) – Optional pooling layer to aggregate features. If None, GAP is used by default.
freeze_backbone (bool) – Whether to freeze the backbone weights during training.
sample_rate (int) – Sampling rate of the audio input.
classifier_hidden_layers (list[int] | None) – Hidden layer sizes for the classifier head.
activation (Literal['relu', 'gelu', 'tanh', 'leakyrelu']) – Activation function for the classifier head.
apply_batch_norm (bool) – Whether to apply batch normalization in the classifier head.
pretrained (bool) – Whether to load pretrained weights for the backbone.
Example
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(
...     num_classes=10,
...     backbone="beats",
...     pooling=None,
...     freeze_backbone=True,
...     sample_rate=16000,
...     classifier_hidden_layers=[512, 256],
...     activation="relu",
...     apply_batch_norm=True,
...     pretrained=True,
... )
Note
Available as deepaudiox.AudioClassifier.
- __init__(num_classes, backbone, pooling=None, freeze_backbone=False, sample_rate=16000, classifier_hidden_layers=None, activation='relu', apply_batch_norm=True, pretrained=False)[source]
Initialize the AudioClassifierConstructor.
- forward(x)[source]
Forward pass through the classifier.
- Parameters:
x (torch.Tensor) – Input waveforms of shape (B, T)
- Returns:
Logits of shape (B, num_classes)
- Return type:
Tensor
Example
>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> logits = model.forward(waveforms)
>>> # logits shape: (B, num_classes)
- forward_backbone(x)[source]
Extract feature map from the backbone.
- Parameters:
x (torch.Tensor) – Input waveforms of shape (B, T).
- Returns:
The backbone feature map, of shape (B, N, D) for Transformer backbones or (B, D, H, W) for CNN backbones.
- Return type:
Tensor
Example
>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> features = model.forward_backbone(waveforms)
>>> # features shape: (B, N, D) for Transformer or (B, D, H, W) for CNN backbones
- forward_with_pooling(x)[source]
Forward pass through backbone and pooling.
- Parameters:
x (Tensor) – Input waveforms of shape (B, T).
- Returns:
Pooled tensor of shape (B, D).
- Return type:
Tensor
Example
>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> embeddings = model.forward_with_pooling(waveforms)
>>> # embeddings shape: (B, D)
- classmethod from_checkpoint(path)[source]
Load an AudioClassifierConstructor from a checkpoint saved by the Checkpointer.
- Parameters:
path (str) – Path to the checkpoint file.
- Returns:
Model with weights and config restored.
- Return type:
AudioClassifierConstructor
Example
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier.from_checkpoint("checkpoint.pt")
>>> print(model.config)
- class deepaudiox.modules.constructors.BackboneConstructor(backbone, pretrained=False, freeze_backbone=False, pooling=None, sample_rate=16000, norm_p=None)[source]
Bases: Module, BackbonePoolingResolverMixin
Backbone model wrapper with optional pooling and normalization.
- backbone
Backbone model for feature extraction.
- Type:
- pooling
Pooling layer applied to the backbone feature map.
- Type:
- norm_p
Optional Lp normalization applied after pooling.
- Type:
float or None
- out_dim
Dimension of the backbone model feature map.
- Type:
int
- config
Constructor arguments used to build this model. Used by from_checkpoint to reconstruct the model from a saved checkpoint.
- Type:
dict
Initialize the BackboneConstructor.
- Parameters:
backbone (Union[Literal['beats', 'passt', 'mobilenet_05_as', 'mobilenet_10_as', 'mobilenet_40_as'], BaseBackbone]) – Backbone name or instance. Valid names are: "beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as".
pretrained (bool) – Whether to load pretrained weights for the backbone.
freeze_backbone (bool) – Whether to freeze the backbone weights during training.
pooling (Union[Literal['gap', 'simpool', 'ep'], BasePooling, None]) – Optional pooling layer for aggregation. If None, GAP is used.
sample_rate (int) – Sampling rate of the audio input.
norm_p (float | None) – Optional Lp norm applied after pooling.
Example
>>> from deepaudiox import Backbone
>>> backbone = Backbone(
...     backbone="beats",
...     pretrained=True,
...     freeze_backbone=True,
...     pooling="gap",
...     sample_rate=16000,
...     norm_p=2.0,
... )
Note
Available as deepaudiox.Backbone.
- __init__(backbone, pretrained=False, freeze_backbone=False, pooling=None, sample_rate=16000, norm_p=None)[source]
Initialize the BackboneConstructor.
- extract_features(waveforms)[source]
Extract backbone-specific features from raw waveforms.
- Parameters:
waveforms (Tensor) – Input waveforms of shape (B, T).
- Returns:
Model-specific input features.
- Return type:
Tensor
- forward(x)[source]
Forward pass through the backbone.
- Parameters:
x (Tensor) – Input waveforms of shape (B, T).
- Returns:
Backbone feature map of shape (B, N, D) or (B, D, H, W).
- Return type:
Tensor
Example
>>> import torch
>>> from deepaudiox import Backbone
>>> backbone = Backbone(backbone="beats", pretrained=True, sample_rate=16_000)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> features = backbone.forward(waveforms)
>>> # features shape: (B, N, D) for Transformer or (B, D, H, W) for CNN backbones
- forward_with_pooling(x)[source]
Forward pass through backbone and pooling (with optional normalization).
- Parameters:
x (Tensor) – Input waveforms of shape (B, T).
- Returns:
Pooled tensor of shape (B, D).
- Return type:
Tensor
Example
>>> import torch
>>> from deepaudiox import Backbone
>>> backbone = Backbone(backbone="beats", pretrained=True, pooling="gap", sample_rate=16_000)
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> embeddings = backbone.forward_with_pooling(waveforms)
>>> # embeddings shape: (B, D)
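For intuition, the pooling and normalization steps can be written out on plain Python lists for a single (N, D) feature map. Both helpers are hypothetical. The sketch assumes "gap" means averaging over the token axis and that norm_p divides the pooled vector by its Lp norm with a small epsilon guard; the library's exact formulation is not stated here.

```python
def sketch_gap(feature_map):
    """Average an (N, D) token sequence over N, giving a D-dimensional vector."""
    n_tokens = len(feature_map)
    dim = len(feature_map[0])
    return [sum(token[d] for token in feature_map) / n_tokens for d in range(dim)]


def sketch_lp_normalize(vector, p=2.0, eps=1e-12):
    """Divide a vector by its Lp norm (assumed behaviour of norm_p)."""
    norm = sum(abs(v) ** p for v in vector) ** (1.0 / p)
    # The epsilon avoids division by zero for an all-zero vector.
    return [v / max(norm, eps) for v in vector]


tokens = [[1.0, 2.0], [5.0, 6.0]]           # N = 2 tokens, D = 2 features
pooled = sketch_gap(tokens)                  # [3.0, 4.0]
embedding = sketch_lp_normalize(pooled, p=2.0)
print(pooled, embedding)                     # embedding is approximately [0.6, 0.8]
```

With p=2.0 every embedding has unit Euclidean length, which is convenient when embeddings are compared by dot product.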
- classmethod from_checkpoint(path)[source]
Load a BackboneConstructor from a checkpoint saved by the Checkpointer.
- Parameters:
path (str) – Path to the checkpoint file.
- Returns:
Model with weights and config restored.
- Return type:
BackboneConstructor
Example
>>> from deepaudiox import Backbone
>>> backbone = Backbone.from_checkpoint("checkpoint.pt")
>>> print(backbone.config)
Supported Backbones & Pooling
Type aliases and runtime constants for valid backbone and pooling names.
- deepaudiox.AVAILABLE_BACKBONES = ("beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as")
Supported pretrained backbone names available at runtime.
- deepaudiox.AVAILABLE_POOLING = ("gap", "simpool", "ep")
Supported pooling layer names available at runtime.
- deepaudiox.BackboneName
Type alias:
Literal["beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as"]. Use for type-annotated code.
- deepaudiox.PoolingName
Type alias:
Literal["gap", "simpool", "ep"]. Use for type-annotated code.
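One way to use the aliases and constants together is to validate user-supplied names before building a model. The sketch below redefines the documented values locally so it runs without the package; in real code they would be imported from deepaudiox. Deriving the tuples with `typing.get_args` and the `check_names` helper are illustrative assumptions.

```python
from typing import Literal, get_args

# Values copied from the reference above; normally imported from deepaudiox.
BackboneName = Literal["beats", "passt", "mobilenet_05_as", "mobilenet_10_as", "mobilenet_40_as"]
PoolingName = Literal["gap", "simpool", "ep"]
AVAILABLE_BACKBONES = get_args(BackboneName)
AVAILABLE_POOLING = get_args(PoolingName)


def check_names(backbone, pooling=None):
    """Hypothetical helper: fail early on an unsupported backbone or pooling name."""
    if backbone not in AVAILABLE_BACKBONES:
        raise ValueError(f"Unknown backbone {backbone!r}; expected one of {AVAILABLE_BACKBONES}")
    if pooling is not None and pooling not in AVAILABLE_POOLING:
        raise ValueError(f"Unknown pooling {pooling!r}; expected one of {AVAILABLE_POOLING}")


check_names("beats", "gap")  # passes silently
try:
    check_names("resnet")
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

The Literal aliases serve static type checkers, while the tuples support runtime checks such as the one above.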
Training & Evaluation
Interfaces for training models and evaluating performance on held-out data.
- class deepaudiox.Trainer(train_dset, model, validation_dset=None, optimizer=None, learning_rate=0.001, lr_scheduler=None, loss_function=None, train_ratio=0.8, epochs=100, patience=None, num_workers=4, batch_size=16, path_to_checkpoint='checkpoint.pt', device='cpu', device_index=None, verbose=True)[source]
Bases: object
The core SDK module for training a model.
The Trainer assembles all modules required for training and performs the training process.
- state
Stores training variables.
- Type:
State
- epochs
The maximum number of training epochs.
- Type:
int
- verbose
Whether to log epoch-level artifacts.
- Type:
bool
- device
The device used for training.
- Type:
str
- logger
A module used for logging messages.
- Type:
logging.Logger
- train_dloader
The DataLoader of the training set.
- Type:
torch.utils.data.DataLoader
- validation_dloader
The DataLoader of the validation set.
- Type:
torch.utils.data.DataLoader
- model
The BaseAudioClassifier to be trained.
- Type:
- optimizer
The optimizer of the training process.
- Type:
torch.optim.Optimizer
- scheduler
The learning rate scheduler of the training process.
- Type:
LRScheduler
- loss_function
The loss function used for optimization.
- Type:
nn.Module
- callbacks
A list of callbacks used throughout the training lifecycle.
- Type:
list
Initialize the Trainer.
- Parameters:
train_dset (AudioClassificationDataset) – The training dataset.
model (BaseAudioClassifier) – The model to be trained.
validation_dset (AudioClassificationDataset | None) – The validation dataset. If None, a split is created from train_dset using train_ratio.
optimizer (Optimizer | None) – The optimizer used for training. Adam is used if None.
learning_rate (float) – Learning rate used when optimizer is None. Defaults to 1e-3.
lr_scheduler (LRScheduler | None) – The scheduler used for training. ReduceLROnPlateau is used if None.
loss_function (Module | None) – The loss function used for training. Cross-entropy is used if None.
train_ratio (float) – The fraction of train_dset used for training when validation_dset is None. Defaults to 0.8.
epochs (int) – The maximum number of training epochs. Defaults to 100.
patience (int | None) – Epochs to wait without loss improvement before stopping. Disabled if None.
num_workers (int) – The number of workers for the DataLoaders. Defaults to 4.
batch_size (int) – The batch size for the DataLoaders. Defaults to 16.
path_to_checkpoint (str) – The path where the model checkpoint is saved. Defaults to "checkpoint.pt".
device (Literal['cuda', 'mps', 'cpu']) – The device to use for training. One of "cuda", "mps", or "cpu". Defaults to "cpu".
device_index (int | None) – The GPU device index. Only applicable when device="cuda". If None, uses the default CUDA device.
verbose (bool) – If True, logs epoch-level artifacts (loss, time). If False, only start/end messages and the final training summary are printed. Defaults to True.
Example
>>> from deepaudiox import AudioClassifier, Trainer
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> train_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> trainer = Trainer(train_dset=train_dataset, model=model, epochs=10)
>>> trainer.train()
- __init__(train_dset, model, validation_dset=None, optimizer=None, learning_rate=0.001, lr_scheduler=None, loss_function=None, train_ratio=0.8, epochs=100, patience=None, num_workers=4, batch_size=16, path_to_checkpoint='checkpoint.pt', device='cpu', device_index=None, verbose=True)[source]
Initialize the Trainer.
- epoch_step()[source]
Run one complete training epoch.
Logs the epoch header and metrics when verbose=True, calls train_step() and val_step(), updates the LR scheduler and self.state, then executes on_epoch_end callbacks (which may trigger early stopping or checkpointing).
Note
self.state.current_epoch must be set by the caller before invoking this method; train() does this automatically. When calling epoch_step() directly, set it yourself: trainer.state.current_epoch = epoch.
- Returns:
(train_loss, val_loss) for the epoch.
- Return type:
tuple[float, float]
Example
>>> from deepaudiox import AudioClassifier, Trainer
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> train_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> trainer = Trainer(train_dset=train_dataset, model=model, epochs=10)
>>> for epoch in range(1, trainer.epochs + 1):
...     trainer.state.current_epoch = epoch
...     train_loss, val_loss = trainer.epoch_step()
...     print(f"Epoch {epoch} - train: {train_loss:.4f}, val: {val_loss:.4f}")
...     if trainer.state.early_stop:
...         break
- train()[source]
Perform the full training process.
Epoch-level output is controlled by verbose. The training summary (best epoch, losses, checkpoint path) is always printed on completion.
- Return type:
None
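The patience parameter describes a standard early-stopping rule, which the sketch below illustrates on a list of per-epoch losses. It is not the Trainer's callback code: the helper is hypothetical, and it assumes that the monitored quantity is the validation loss and that any strict decrease counts as an improvement.

```python
def sketch_early_stopping(val_losses, patience=None):
    """Hypothetical helper: return (best_epoch, last_epoch_run), both 1-indexed."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if patience is not None and epochs_without_improvement >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)


losses = [0.9, 0.7, 0.72, 0.71, 0.75, 0.6]
stopped = sketch_early_stopping(losses, patience=3)
full_run = sketch_early_stopping(losses, patience=None)
print(stopped)   # (2, 5)
print(full_run)  # (6, 6)
```

The example shows the trade-off: with patience=3 training stops at epoch 5 and never reaches the better loss at epoch 6.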
- class deepaudiox.Evaluator(test_dset, model, class_mapping, batch_size=16, num_workers=4, device='cpu', device_index=None, verbose=True)[source]
Bases: object
The core SDK module for testing a model.
The Evaluator assembles all modules required for testing and performs the testing process.
- state
Stores testing variables.
- Type:
State
- verbose
Whether to log the evaluation report after testing.
- Type:
bool
- device
The device used for testing.
- Type:
str
- class_mapping
A mapping between class names and IDs.
- Type:
dict
- logger
A module used for logging messages.
- Type:
logging.Logger
- test_dloader
The DataLoader of the testing set.
- Type:
torch.utils.data.DataLoader
- model
An AudioClassifier module inheriting from BaseAudioClassifier.
- Type:
- callbacks
A list of callbacks used throughout the testing lifecycle.
- Type:
list
Initialize the Evaluator.
- Parameters:
test_dset (AudioClassificationDataset) – The testing dataset.
model (BaseAudioClassifier) – An AudioClassifier module inheriting from BaseAudioClassifier.
class_mapping (dict) – A mapping between class names and IDs.
batch_size (int) – The batch size for the DataLoader. Defaults to 16.
num_workers (int) – The number of workers for the DataLoader. Defaults to 4.
device (Literal['cuda', 'mps', 'cpu']) – The device to use for evaluation. One of "cuda", "mps", or "cpu". Defaults to "cpu".
device_index (int | None) – The GPU device index. Only applicable when device="cuda". If None, uses the default CUDA device.
verbose (bool) – If True, prints the classification report, confusion matrix, and average posteriors after evaluation. "Evaluation has finished." is always printed. Defaults to True.
Example
>>> import torch
>>> from deepaudiox import AudioClassifier, Evaluator
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> test_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> model.load_state_dict(torch.load("checkpoint.pt"))
>>> evaluator = Evaluator(test_dset=test_dataset, model=model, class_mapping=class_mapping)
>>> evaluator.evaluate()
- __init__(test_dset, model, class_mapping, batch_size=16, num_workers=4, device='cpu', device_index=None, verbose=True)[source]
Initialize the Evaluator.
- evaluate()[source]
Run the full evaluation loop over the test set.
Iterates over all batches in test_dloader, accumulates true labels, predicted labels, and posterior probabilities into self.state, then triggers the registered callbacks via on_testing_end.
Always prints "Evaluation has finished." regardless of verbose. The Reporter callback (classification report, confusion matrix, average posteriors) is only executed when verbose=True.
- Return type:
None
After this method returns, self.state holds:
y_true (np.ndarray): Ground-truth class indices, shape (N,).
y_pred (np.ndarray): Predicted class indices, shape (N,).
posteriors (np.ndarray): Max posterior probability per sample, shape (N,).
Note
The model is expected to already be in eval mode (set in __init__). Runs under torch.inference_mode(), so gradients are fully disabled.
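Because evaluate() returns None, custom metrics have to be computed from the arrays left in the state. The sketch below uses plain lists as stand-ins for y_true and y_pred to build a confusion matrix and an accuracy score. The helper is hypothetical and is not the Reporter callback's implementation.

```python
def sketch_confusion_matrix(y_true, y_pred, num_classes):
    """Hypothetical helper: rows are ground-truth classes, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_idx, pred_idx in zip(y_true, y_pred):
        matrix[true_idx][pred_idx] += 1
    return matrix


# Stand-ins for evaluator.state.y_true and evaluator.state.y_pred.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = sketch_confusion_matrix(y_true, y_pred, num_classes=2)
# Correct predictions sit on the diagonal of the matrix.
accuracy = sum(cm[i][i] for i in range(2)) / len(y_true)
print(cm)        # [[1, 1], [1, 2]]
print(accuracy)  # 0.6
```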
Base Classes & Inference
Base interfaces and inference helpers used across models.
Base classes for abstracting nn modules (e.g., backbones, pooling layers, classifiers).
- class deepaudiox.modules.baseclasses.BaseAudioClassifier(*args, **kwargs)[source]
Bases: Module, ABC
Base class for creating custom audio classifiers.
This class defines the standard interface for audio classification models. Subclasses must implement the core initialization and forward pass. The built-in predict method provides a convenience wrapper to obtain predicted labels, posterior probabilities, and raw logits.
Initialize the audio classifier.
- abstractmethod forward(x)[source]
Pass the input through the model and return logits.
- Parameters:
x (Tensor) – The input tensor.
- inference_on_file(path, sample_rate, class_mapping, segment_duration=None, batch_size=4)[source]
Get prediction for an audio sample from a file path.
- Parameters:
path (str | Path) – Path to an audio file supported by librosa (e.g., WAV or MP3).
sample_rate (int) – Sampling rate of the audio sample.
class_mapping (dict[str, int]) – Class-to-index mapping as used by the model.
segment_duration (float | None) – Optional segment duration in seconds for segment-level inference. If provided, the last remainder is right-padded to a full segment.
batch_size (int) – Optional batch size for segment-level inference. Default is 4.
- Returns:
A dictionary with keys:
final_label (str): Predicted class label.
final_posterior (float): Posterior probability for the predicted class.
segment_labels (list[str] | None): Per-segment labels when segmenting is used.
segment_posteriors (list[float] | None): Per-segment posteriors aligned with segment_labels when segmenting is used.
- Return type:
dict
Example
>>> from deepaudiox import AudioClassifier
>>> class_mapping = {"speech": 0, "music": 1}
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> prediction = model.inference_on_file(
...     "path/to/audio.wav",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
...     batch_size=4,
... )
- inference_on_waveform(x, sample_rate, class_mapping, segment_duration=None, batch_size=4)[source]
Get prediction on a waveform.
- Parameters:
x (Tensor | ndarray) – Input waveform to be used for inference. Accepts shape (T,).
sample_rate (int) – Sampling rate of the audio sample.
class_mapping (dict[str, int]) – Class-to-index mapping as used by the model.
segment_duration (float | None) – Optional segment duration in seconds for segment-level inference. If provided, the last remainder is right-padded to a full segment.
batch_size (int) – Optional batch size for segment-level inference. Default is 4.
- Returns:
A dictionary with keys:
final_label (str): Predicted class label.
final_posterior (float): Posterior probability for the predicted class.
segment_labels (list[str] | None): Per-segment labels when segmenting is used.
segment_posteriors (list[float] | None): Per-segment posteriors aligned with segment_labels when segmenting is used.
- Return type:
dict
Example
>>> import torch
>>> from deepaudiox import AudioClassifier
>>> class_mapping = {"speech": 0, "music": 1}
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> waveform = torch.randn(5 * 16_000)
>>> prediction = model.inference_on_waveform(
...     waveform,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=1.0,
...     batch_size=4,
... )
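Unlike the dataset, which drops a trailing partial segment, the inference helpers right-pad the remainder to a full segment. The sketch below shows that difference on a short list of samples. The helper is hypothetical, and padding with zeros is an assumption, since the padding value is not specified above.

```python
def sketch_pad_and_segment(waveform, sample_rate, segment_duration):
    """Hypothetical helper: split a 1-D waveform into equal segments, padding the tail."""
    seg_len = int(round(segment_duration * sample_rate))
    if seg_len <= 0:
        raise ValueError("segment_duration is too short for this sample rate")
    samples = list(waveform)
    remainder = len(samples) % seg_len
    if remainder:
        # Assumption: the remainder is right-padded with zeros (silence).
        samples += [0.0] * (seg_len - remainder)
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]


# Toy numbers: 10 samples at 4 Hz with 1-second segments gives 3 segments of 4 samples.
segments = sketch_pad_and_segment([0.1] * 10, sample_rate=4, segment_duration=1.0)
print([len(s) for s in segments])  # [4, 4, 4]
print(segments[-1])                # [0.1, 0.1, 0.0, 0.0]
```

Each segment is then classified separately, which is why segment_labels and segment_posteriors have one entry per segment.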
- predict(x)[source]
Compute predicted class and posterior probabilities.
This is a low-level method that does not manage model mode or gradient context. The caller is responsible for calling model.eval() and wrapping with torch.no_grad() or torch.inference_mode() as needed. For end-to-end inference with automatic mode management, use inference_on_waveform or inference_on_file instead.
- Parameters:
x (Tensor) – Input waveforms of shape (B, T), where T is the number of audio samples.
- Returns:
A dictionary with keys y_preds, posteriors, and logits.
- Return type:
dict[str, ndarray]
Example
>>> import torch
>>> from deepaudiox import AudioClassifier
>>> model = AudioClassifier(num_classes=10, backbone="beats", sample_rate=16_000, pretrained=True)
>>> model.eval()
>>> waveforms = torch.randn(2, 5 * 16_000)
>>> with torch.no_grad():
...     outputs = model.predict(waveforms)
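The relationship between the three returned entries can be illustrated without a model: logits are turned into probabilities with a softmax, and the prediction is the index of the largest probability. The sketch below works on nested lists and is not the library's code. It assumes that posteriors holds only the winning probability per sample; the actual method may return the full distribution.

```python
import math


def sketch_predict(logits_batch):
    """Hypothetical helper: derive predictions and posteriors from (B, num_classes) logits."""
    y_preds, posteriors = [], []
    for logits in logits_batch:
        # Subtract the max before exponentiating for numerical stability.
        shift = max(logits)
        exps = [math.exp(v - shift) for v in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        y_preds.append(best)
        posteriors.append(probs[best])
    return {"y_preds": y_preds, "posteriors": posteriors, "logits": logits_batch}


outputs = sketch_predict([[2.0, 0.0, 0.0], [0.0, 1.0, 3.0]])
print(outputs["y_preds"])     # [0, 2]
print(outputs["posteriors"])  # one probability in (0, 1] per sample
```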
- class deepaudiox.modules.baseclasses.BaseBackbone(out_dim, sample_rate)[source]
Bases: Module, ABC
Abstract base class for all audio backbone models.
This class defines the common interface for backbone architectures that convert raw waveforms into fixed-dimensional embeddings. Subclasses must implement the core feature extraction and forward-processing logic.
Initialize the BaseBackbone.
- Parameters:
out_dim (int) – Output dimension of the backbone feature map. For CNNs, the embeddings are of shape (B, C, H, W).
sample_rate (int) – Sample rate for audio input.
- __init__(out_dim, sample_rate)[source]
Initialize the BaseBackbone.
- abstractmethod extract_features(waveforms)[source]
Convert raw waveforms into internal acoustic features.
- Parameters:
waveforms (Tensor) – Tensor of shape (B, T).
- Returns:
Model-specific feature representation before final forward().
- Return type:
Tensor
- abstractmethod forward(x, padding_mask=None)[source]
Compute embeddings from input features.
- Parameters:
x (Tensor) – Input audio-specific features of shape (B, 1, F, T) or (B, 1, T, F).
padding_mask (Tensor | None) – Optional padding mask.
- Returns:
Embeddings of shape (B, N, D) or (B, D, H, W), where D is the embedding dimension.
- Return type:
Tensor
- forward_pipeline(x)[source]
Standard processing pipeline:
1. Extract features from raw audio.
2. Pass features through forward().
- Parameters:
x (Tensor) – Input waveforms of shape (B, T), where T is the length of waveforms.
- Returns:
Final model output of shape (B, D, H, W) for CNNs or (B, N, D) for Transformers.
- Return type:
Tensor
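The two-step pipeline is a template-method pattern: subclasses supply extract_features() and forward(), and forward_pipeline() chains them. The sketch below mirrors that structure with a toy subclass on plain lists. It is only an illustration; the real base class also inherits from torch's Module and works on tensors, and `SketchBackbone` and `FrameEnergyBackbone` are hypothetical names.

```python
from abc import ABC, abstractmethod


class SketchBackbone(ABC):
    """Hypothetical stand-in for BaseBackbone (without the nn.Module parent)."""

    def __init__(self, out_dim, sample_rate):
        self.out_dim = out_dim
        self.sample_rate = sample_rate

    @abstractmethod
    def extract_features(self, waveforms):
        """Turn raw waveforms into model-specific input features."""

    @abstractmethod
    def forward(self, x, padding_mask=None):
        """Turn input features into embeddings."""

    def forward_pipeline(self, x):
        # Step 1: feature extraction. Step 2: the backbone forward pass.
        return self.forward(self.extract_features(x))


class FrameEnergyBackbone(SketchBackbone):
    """Toy subclass: non-overlapping frames of 4 samples, one energy value per frame."""

    def extract_features(self, waveforms):
        return [[w[i:i + 4] for i in range(0, len(w) - 3, 4)] for w in waveforms]

    def forward(self, x, padding_mask=None):
        return [[[sum(s * s for s in frame)] for frame in frames] for frames in x]


toy = FrameEnergyBackbone(out_dim=1, sample_rate=16_000)
out = toy.forward_pipeline([[1.0] * 8, [2.0] * 8])
print(out)  # [[[4.0], [4.0]], [[16.0], [16.0]]], i.e. shape (B=2, N=2, D=1)
```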
- class deepaudiox.modules.baseclasses.BasePooling(in_dim=None)[source]
Bases: Module, ABC
Abstract base class for all pooling modules.
This class defines the interface for pooling modules that operate on an input feature map obtained from a CNN or a Transformer BaseBackbone. Subclasses must implement the forward-processing logic. The input is expected to be a feature map of shape (B, D, H, W) for CNNs or (B, N, D) for Transformers.
Initialize the BasePooling.
- Parameters:
in_dim (int | None) – Input dimension. This is D for both CNNs and Transformers.
Full Paths
The API re-exports the following symbols. If you prefer importing from the original modules, use these paths:
AudioClassifier -> deepaudiox.modules.constructors.AudioClassifierConstructor
Backbone -> deepaudiox.modules.constructors.BackboneConstructor
AudioClassificationDataset -> deepaudiox.datasets.audio_classification_dataset.AudioClassificationDataset
audio_classification_dataset_from_dir -> deepaudiox.datasets.audio_classification_dataset.audio_classification_dataset_from_dir
audio_classification_dataset_from_dictionary -> deepaudiox.datasets.audio_classification_dataset.audio_classification_dataset_from_dictionary
get_class_mapping_from_dir -> deepaudiox.utils.training_utils.get_class_mapping_from_dir
get_class_mapping_from_list -> deepaudiox.utils.training_utils.get_class_mapping_from_list
Trainer -> deepaudiox.loops.trainer.Trainer
Evaluator -> deepaudiox.loops.evaluator.Evaluator
BackboneName -> deepaudiox.schemas.types.BackboneName
PoolingName -> deepaudiox.schemas.types.PoolingName
AVAILABLE_BACKBONES -> deepaudiox.__init__.AVAILABLE_BACKBONES
AVAILABLE_POOLING -> deepaudiox.__init__.AVAILABLE_POOLING