deepaudiox
This page provides the core API reference for DeepAudioX.
- deepaudiox.AVAILABLE_BACKBONES: tuple[str, ...] = ('beats', 'passt', 'mobilenet_05_as', 'mobilenet_10_as', 'mobilenet_40_as')
Supported pretrained backbone names available at runtime.
- deepaudiox.AVAILABLE_POOLING: tuple[str, ...] = ('gap', 'simpool', 'ep')
Supported pooling layer names available at runtime.
- class deepaudiox.AudioClassificationDataset(file_to_class_mapping, sample_rate, class_mapping, segment_duration=None)[source]
Bases: Dataset
PyTorch Dataset for audio classification tasks.
This dataset loads audio files and returns a dictionary containing the raw waveform (under the key "feature"), the corresponding class name, and the integer class ID defined in class_mapping. The file_to_class_mapping argument must be a dictionary of the form: {"abs/path/to/audio.wav": "class_name"}
Optionally, the dataset can segment each audio file into fixed-duration chunks using segment_duration. When enabled, each segment becomes an individual dataset sample.
- file_to_class_mapping
Mapping from file paths to class names.
- Type:
dict
- sample_rate
Target sampling rate for audio loading.
- Type:
int
- class_mapping
Mapping from string class labels to integer IDs.
- Type:
dict
Initialize the dataset.
- Parameters:
file_to_class_mapping (dict[str | PathLike, str]) – Mapping from file paths to class names.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.
Example
>>> from deepaudiox import AudioClassificationDataset
>>> file_to_class_mapping = {
...     "path/to/audio1.wav": "speech",
...     "path/to/audio2.wav": "music",
... }
>>> class_mapping = {"speech": 0, "music": 1}
>>> dataset = AudioClassificationDataset(
...     file_to_class_mapping=file_to_class_mapping,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
... )
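The segmentation rule described above (fixed-duration chunks, with the last partial segment dropped) can be illustrated with a small standalone sketch. The helper name `segment_bounds` and the rounding of the segment length to whole samples are assumptions made for illustration; the library's internal implementation may differ.

```python
from typing import List, Optional, Tuple


def segment_bounds(
    num_samples: int, sample_rate: int, segment_duration: Optional[float]
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for each segment of one audio file.

    Hypothetical helper: it mirrors the documented behaviour, not the
    library's actual code.
    """
    if segment_duration is None:
        # No segmentation requested: the whole file is a single sample.
        return [(0, num_samples)]
    segment_length = int(round(segment_duration * sample_rate))
    if segment_length <= 0:
        raise ValueError("segment_duration must be positive")
    # Floor division drops the trailing partial segment.
    num_segments = num_samples // segment_length
    return [(i * segment_length, (i + 1) * segment_length) for i in range(num_segments)]


# A 5-second file at 16 kHz split into 2-second segments: the final
# 1-second remainder is discarded.
bounds = segment_bounds(num_samples=5 * 16_000, sample_rate=16_000, segment_duration=2.0)
print(bounds)  # [(0, 32000), (32000, 64000)]
```

Under these assumptions, a file shorter than one segment contributes no samples at all, which is worth keeping in mind when choosing segment_duration for datasets with short clips.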
- deepaudiox.AudioClassifier
alias of AudioClassifierConstructor
- deepaudiox.Backbone
alias of BackboneConstructor
- class deepaudiox.Evaluator(test_dset, model, class_mapping, batch_size=16, num_workers=4, device='cpu', device_index=None, verbose=True)[source]
Bases: object
The core SDK module for testing a model.
The Evaluator assembles all modules required for testing and performs the testing process.
- state
Stores testing variables.
- Type:
State
- verbose
Whether to log the evaluation report after testing.
- Type:
bool
- device
The device used for testing.
- Type:
str
- class_mapping
A mapping between class names and IDs.
- Type:
dict
- logger
A module used for logging messages.
- Type:
logging.Logger
- test_dloader
The DataLoader of the testing set.
- Type:
torch.utils.data.DataLoader
- model
An AudioClassifier module inheriting from BaseAudioClassifier.
- Type:
BaseAudioClassifier
- callbacks
A list of callbacks used throughout the testing lifecycle.
- Type:
list
Initialize the Evaluator.
- Parameters:
test_dset (AudioClassificationDataset) – The testing dataset.
model (BaseAudioClassifier) – An AudioClassifier module inheriting from BaseAudioClassifier.
class_mapping (dict) – A mapping between class names and IDs.
batch_size (int) – The batch size for the PyTorch DataLoader. Defaults to 16.
num_workers (int) – The number of workers for the PyTorch DataLoader. Defaults to 4.
device (Literal['cuda', 'mps', 'cpu']) – The device to use for evaluation. One of "cuda", "mps", or "cpu". Defaults to "cpu".
device_index (int | None) – The GPU device index. Only applicable when device="cuda". If None, uses the default CUDA device.
verbose (bool) – If True, prints the classification report, confusion matrix, and average posteriors after evaluation. “Evaluation has finished.” is always printed. Defaults to True.
Example
>>> import torch
>>> from deepaudiox import AudioClassifier, Evaluator
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> test_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> model.load_state_dict(torch.load("checkpoint.pt"))
>>> evaluator = Evaluator(test_dset=test_dataset, model=model, class_mapping=class_mapping)
>>> evaluator.evaluate()
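The interaction between device and device_index can be pictured as building a device string of the kind PyTorch accepts (for example "cuda:1"). The following is only a sketch of that documented rule: the helper `resolve_device` is hypothetical, and how the library actually resolves or validates the device is not stated here.

```python
from typing import Optional


def resolve_device(device: str = "cpu", device_index: Optional[int] = None) -> str:
    """Build a device string from the two documented arguments.

    Hypothetical helper: device_index is only honoured for "cuda", as the
    parameter description states.
    """
    if device not in ("cuda", "mps", "cpu"):
        raise ValueError(f"Unsupported device: {device!r}")
    if device == "cuda" and device_index is not None:
        return f"cuda:{device_index}"
    # For "mps" and "cpu" (or CUDA without an index) the name is used as-is,
    # which for CUDA means the default device.
    return device


print(resolve_device("cuda", 1))  # cuda:1
print(resolve_device("cpu", 1))   # cpu
```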
- evaluate()[source]
Run the full evaluation loop over the test set.
Iterates over all batches in test_dloader, accumulates true labels, predicted labels, and posterior probabilities into self.state, then triggers the registered callbacks via on_testing_end.
Always prints “Evaluation has finished.” regardless of verbose. The Reporter callback (classification report, confusion matrix, average posteriors) is only executed when verbose=True.
- Return type:
None
- After this method returns, self.state holds:
y_true (np.ndarray): Ground-truth class indices, shape (N,).
y_pred (np.ndarray): Predicted class indices, shape (N,).
posteriors (np.ndarray): Max posterior probability per sample, shape (N,).
Note
The model is expected to already be in eval mode (set in __init__). Runs under torch.inference_mode(), so gradients are fully disabled.
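To make the three state arrays concrete, the sketch below derives predicted indices and max posteriors from per-class probability rows and then computes a confusion matrix and accuracy. It uses plain Python lists instead of np.ndarray to stay dependency-free, and the function `summarize_predictions` is a hypothetical illustration, not part of the library or its Reporter callback.

```python
from typing import List, Tuple


def summarize_predictions(
    probabilities: List[List[float]], y_true: List[int], num_classes: int
) -> Tuple[List[int], List[float], List[List[int]], float]:
    """Derive y_pred, max posteriors, a confusion matrix, and accuracy."""
    # Predicted class = index of the largest probability in each row.
    y_pred = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
    # Max posterior per sample, analogous to the documented `posteriors` array.
    posteriors = [max(row) for row in probabilities]
    # Rows are true classes, columns are predicted classes.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for true_idx, pred_idx in zip(y_true, y_pred):
        confusion[true_idx][pred_idx] += 1
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return y_pred, posteriors, confusion, accuracy


probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
y_pred, posteriors, confusion, accuracy = summarize_predictions(probs, [0, 1, 1], 2)
print(y_pred)      # [0, 1, 0]
print(posteriors)  # [0.9, 0.8, 0.6]
print(confusion)   # [[1, 0], [1, 1]]
```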
- class deepaudiox.Trainer(train_dset, model, validation_dset=None, optimizer=None, learning_rate=0.001, lr_scheduler=None, loss_function=None, train_ratio=0.8, epochs=100, patience=None, num_workers=4, batch_size=16, path_to_checkpoint='checkpoint.pt', device='cpu', device_index=None, verbose=True)[source]
Bases: object
The core SDK module for training a model.
The Trainer assembles all modules required for training and performs the training process.
- state
Stores training variables.
- Type:
State
- epochs
The maximum number of training epochs.
- Type:
int
- verbose
Whether to log epoch-level artifacts.
- Type:
bool
- device
The device used for training.
- Type:
str
- logger
A module used for logging messages.
- Type:
logging.Logger
- train_dloader
The DataLoader of the training set.
- Type:
torch.utils.data.DataLoader
- validation_dloader
The DataLoader of the validation set.
- Type:
torch.utils.data.DataLoader
- model
The BaseAudioClassifier to be trained.
- Type:
BaseAudioClassifier
- optimizer
The optimizer of the training process.
- Type:
torch.optim.Optimizer
- scheduler
The learning rate scheduler of the training process.
- Type:
LRScheduler
- loss_function
The loss function used for optimization.
- Type:
nn.Module
- callbacks
A list of callbacks used throughout the training lifecycle.
- Type:
list
Initialize the Trainer.
- Parameters:
train_dset (AudioClassificationDataset) – The training dataset.
model (BaseAudioClassifier) – The model to be trained.
validation_dset (AudioClassificationDataset | None) – The validation dataset. If None, a split is created from train_dset using train_ratio.
optimizer (Optimizer | None) – The optimizer used for training. Uses Adam if None.
learning_rate (float) – Learning rate used when optimizer is None. Defaults to 1e-3.
lr_scheduler (LRScheduler | None) – The scheduler used for training. Uses ReduceLROnPlateau if None.
loss_function (Module | None) – The loss function used for training. Uses CrossEntropy if None.
train_ratio (float) – The ratio of the train split when validation_dset is None. Defaults to 0.8.
epochs (int) – The maximum number of training epochs. Defaults to 100.
patience (int | None) – Epochs to wait without loss improvement before stopping. Disabled if None.
num_workers (int) – The number of workers for the PyTorch DataLoaders. Defaults to 4.
batch_size (int) – The batch size for the PyTorch DataLoaders. Defaults to 16.
path_to_checkpoint (str) – The path to the saved model checkpoint. Defaults to "checkpoint.pt".
device (Literal['cuda', 'mps', 'cpu']) – The device to use for training. One of "cuda", "mps", or "cpu". Defaults to "cpu".
device_index (int | None) – The GPU device index. Only applicable when device="cuda". If None, uses the default CUDA device.
verbose (bool) – If True, logs epoch-level artifacts (loss, time). If False, only start/end messages and the final training summary are printed. Defaults to True.
Example
>>> from deepaudiox import AudioClassifier, Trainer
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> train_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> trainer = Trainer(train_dset=train_dataset, model=model, epochs=10)
>>> trainer.train()
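When validation_dset is None, a split is created from train_dset using train_ratio. One plausible way to picture this is a shuffled index split, sketched below. The helper `split_indices`, the fixed seed, and the absence of stratification are all assumptions for illustration; the documentation does not say how the library shuffles or seeds the split.

```python
import random
from typing import List, Tuple


def split_indices(
    num_samples: int, train_ratio: float = 0.8, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices into train and validation subsets.

    Hypothetical helper: a plain random split without class stratification.
    """
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    indices = list(range(num_samples))
    # Seeded shuffle so the split is reproducible across runs.
    random.Random(seed).shuffle(indices)
    num_train = int(num_samples * train_ratio)
    return indices[:num_train], indices[num_train:]


train_idx, val_idx = split_indices(num_samples=10, train_ratio=0.8)
print(len(train_idx), len(val_idx))  # 8 2
```

If class balance matters for a small dataset, passing an explicit validation_dset avoids relying on whatever split strategy is applied internally.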
- epoch_step()[source]
Run one complete training epoch.
Logs the epoch header and metrics when verbose=True, calls train_step() and val_step(), updates the LR scheduler and self.state, then executes on_epoch_end callbacks (which may trigger early stopping or checkpointing).
Note
self.state.current_epoch must be set by the caller before invoking this method; train() does this automatically. When calling epoch_step() directly, set it yourself: trainer.state.current_epoch = epoch.
- Returns:
(train_loss, val_loss) for the epoch.
- Return type:
tuple[float, float]
Example
>>> from deepaudiox import AudioClassifier, Trainer
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> train_dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
... )
>>> model = AudioClassifier(num_classes=len(class_mapping), backbone="beats", sample_rate=16_000)
>>> trainer = Trainer(train_dset=train_dataset, model=model, epochs=10)
>>> for epoch in range(1, trainer.epochs + 1):
...     trainer.state.current_epoch = epoch
...     train_loss, val_loss = trainer.epoch_step()
...     print(f"Epoch {epoch}: train={train_loss:.4f}, val={val_loss:.4f}")
...     if trainer.state.early_stop:
...         break
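The patience-based early stopping that the on_epoch_end callbacks may trigger can be sketched independently of the library. The function `stopping_epoch` below is hypothetical, and it assumes that the monitored quantity is the validation loss and that "improvement" means a strictly lower value than the best seen so far; the actual callback may use a different criterion.

```python
from typing import List, Optional


def stopping_epoch(val_losses: List[float], patience: Optional[int]) -> int:
    """Return the 1-based epoch at which training would stop.

    Hypothetical helper: stops once `patience` consecutive epochs pass
    without a new best loss. With patience=None, early stopping is disabled.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: reset the counter (a checkpoint would be saved here).
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if patience is not None and epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
print(stopping_epoch(losses, patience=3))     # 5
print(stopping_epoch(losses, patience=None))  # 6
```

In this example the best loss of 0.8 at epoch 2 is not beaten in epochs 3 to 5, so training halts at epoch 5 and never reaches the better loss at epoch 6, which illustrates the trade-off of a small patience value.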
- train()[source]
Perform the full training process.
Epoch-level output is controlled by verbose. The training summary (best epoch, losses, checkpoint path) is always printed on completion.
- Return type:
None
- deepaudiox.audio_classification_dataset_from_dictionary(file_to_class_mapping, sample_rate, class_mapping, segment_duration=None)[source]
Create an AudioClassificationDataset from a file-to-class mapping dictionary.
- Parameters:
file_to_class_mapping (dict[str | PathLike, str]) – Mapping from file paths to class names.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.
- Returns:
The constructed dataset.
- Return type:
AudioClassificationDataset
Example
>>> from deepaudiox import audio_classification_dataset_from_dictionary
>>> file_to_class_mapping = {
...     "path/to/audio1.wav": "speech",
...     "path/to/audio2.wav": "music",
... }
>>> class_mapping = {"speech": 0, "music": 1}
>>> dataset = audio_classification_dataset_from_dictionary(
...     file_to_class_mapping=file_to_class_mapping,
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=None,
... )
- deepaudiox.audio_classification_dataset_from_dir(root_dir, sample_rate, class_mapping, segment_duration=None)[source]
Create an AudioClassificationDataset from a directory structure.
- Parameters:
root_dir (str) – Root directory containing class sub-folders. Only .wav and .mp3 files are used.
sample_rate (int) – Target sampling rate for audio loading.
class_mapping (dict[str, int]) – Mapping from string labels to integer IDs.
segment_duration (float | None) – Duration of audio segments in seconds. If None, load full audio. When set, the last partial segment is dropped.
- Returns:
The constructed dataset.
- Return type:
AudioClassificationDataset
Example
>>> from deepaudiox import audio_classification_dataset_from_dir, get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> dataset = audio_classification_dataset_from_dir(
...     root_dir="path/to/data",
...     sample_rate=16_000,
...     class_mapping=class_mapping,
...     segment_duration=2.0,
... )
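Conceptually, building a dataset from a directory amounts to deriving a file_to_class_mapping from the class sub-folders and keeping only .wav and .mp3 files. The sketch below shows that step with the standard library. The helper `collect_audio_files` is hypothetical, and both the case-insensitive extension check and the non-recursive scan are assumptions rather than documented behaviour.

```python
import tempfile
from pathlib import Path
from typing import Dict

AUDIO_EXTENSIONS = {".wav", ".mp3"}


def collect_audio_files(root_dir: str) -> Dict[str, str]:
    """Map each audio file path to the name of its class sub-folder."""
    mapping: Dict[str, str] = {}
    class_dirs = sorted(p for p in Path(root_dir).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        for path in sorted(class_dir.iterdir()):
            # Skip anything that is not a .wav or .mp3 file.
            if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
                mapping[str(path)] = class_dir.name
    return mapping


# Build a throwaway directory tree to demonstrate the filtering.
with tempfile.TemporaryDirectory() as root:
    for rel in ("speech/a.wav", "speech/notes.txt", "music/b.mp3"):
        target = Path(root) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    demo_mapping = collect_audio_files(root)
    demo_labels = sorted(demo_mapping.values())

print(demo_labels)  # ['music', 'speech']
```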
- deepaudiox.get_class_mapping_from_dir(root_dir)[source]
Load the class mapping given a folder of class sub-folders.
Expected directory structure:
root_dir/
├── class_a/
│   ├── audio1.wav
│   └── audio2.wav
└── class_b/
    ├── audio3.wav
    └── audio4.wav
- Parameters:
root_dir (str) – The path to the root folder.
- Returns:
The class mapping dictionary, ordered alphabetically by folder name.
- Return type:
dict[str, int]
Example
>>> from deepaudiox import get_class_mapping_from_dir
>>> class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")
>>> # Example output:
>>> # {'class_a': 0, 'class_b': 1}
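The alphabetical ordering by folder name can be reproduced with a few lines of standard-library code. This is a sketch of the documented behaviour only: `class_mapping_from_subfolders` is a hypothetical name, and details such as how hidden or empty folders are handled are not covered by the description above.

```python
import tempfile
from pathlib import Path
from typing import Dict


def class_mapping_from_subfolders(root_dir: str) -> Dict[str, int]:
    """Assign consecutive integer IDs to sub-folder names in sorted order."""
    names = sorted(p.name for p in Path(root_dir).iterdir() if p.is_dir())
    return {name: idx for idx, name in enumerate(names)}


# Create the folders in reverse order to show that sorting decides the IDs.
with tempfile.TemporaryDirectory() as root:
    for name in ("class_b", "class_a"):
        (Path(root) / name).mkdir()
    demo_class_mapping = class_mapping_from_subfolders(root)

print(demo_class_mapping)  # {'class_a': 0, 'class_b': 1}
```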
- deepaudiox.get_class_mapping_from_list(labels, sort_alphabetically=True)[source]
Get a class mapping dictionary given a list of class names.
- Parameters:
labels (list[str]) – List of class names.
sort_alphabetically (bool) – Determines whether alphabetical sorting is applied to class names.
- Returns:
The class mapping dictionary.
- Return type:
dict[str, int]
Example
>>> from deepaudiox import get_class_mapping_from_list
>>> labels = ["speech", "music", "noise"]
>>> class_mapping = get_class_mapping_from_list(labels, sort_alphabetically=True)
>>> # Example output:
>>> # {'music': 0, 'noise': 1, 'speech': 2}
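The effect of sort_alphabetically can be seen in a minimal standalone sketch. The function `class_mapping_from_labels` is hypothetical, and two behaviours in it are assumptions not confirmed by the documentation: duplicate labels are collapsed, and the input order is preserved when sorting is disabled.

```python
from typing import Dict, List


def class_mapping_from_labels(
    labels: List[str], sort_alphabetically: bool = True
) -> Dict[str, int]:
    """Map class names to consecutive integer IDs."""
    # Assumption: drop duplicates while keeping first-seen order.
    unique_labels = list(dict.fromkeys(labels))
    if sort_alphabetically:
        unique_labels = sorted(unique_labels)
    return {label: idx for idx, label in enumerate(unique_labels)}


labels = ["speech", "music", "noise"]
print(class_mapping_from_labels(labels))
# {'music': 0, 'noise': 1, 'speech': 2}
print(class_mapping_from_labels(labels, sort_alphabetically=False))
# {'speech': 0, 'music': 1, 'noise': 2}
```

Sorting makes the IDs independent of the order in which labels happen to be listed, which helps keep a mapping consistent between training and evaluation runs.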