# Getting started

## Installation
Requires Python 3.12 and Git LFS. Git LFS is needed because bundled model checkpoints are stored as LFS objects — if you clone without it the checkpoint files will be stubs and the models will fail to load.
With micromamba (recommended):

```sh
micromamba create -y -n vocalpy python=3.12
micromamba activate vocalpy
git clone https://github.com/gumadeiras/vocalpy.git
cd vocalpy
pip install --upgrade pip
pip install -r requirements-dev.txt
```
With venv:

```sh
python3.12 -m venv .venv
source .venv/bin/activate
git clone https://github.com/gumadeiras/vocalpy.git
cd vocalpy
pip install --upgrade pip
pip install -r requirements-dev.txt
```
## Running the CLI

```sh
vocalpy -p /path/to/recording.wav
```

This runs the full pipeline on a single file: the audio is split into overlapping chunks and processed in parallel, detected vocalizations are filtered for noise and then labeled by type, and all results are written to an output folder next to the audio file. By default the mouse pipeline is used. Pass `-a rat` or `-a guineapig` to switch species.
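The chunk layout can be sketched as follows. The `chunk_bounds` helper and the 0.5 s overlap are illustrative assumptions, not the pipeline's internals; in the real CLI, `-b` sets the chunk size in seconds.

```python
def chunk_bounds(duration_s, chunk_s, overlap_s=0.5):
    """Yield (start, end) times in seconds of overlapping chunks
    covering a recording of duration_s seconds."""
    step = chunk_s - overlap_s  # consecutive chunks share overlap_s seconds
    start = 0.0
    while start < duration_s:
        yield start, min(start + chunk_s, duration_s)
        start += step

# A 10 s recording in 4 s chunks with 0.5 s overlap:
bounds = list(chunk_bounds(10.0, 4.0))
# bounds == [(0.0, 4.0), (3.5, 7.5), (7.0, 10.0)]
```

Each chunk can then be handed to a separate worker, and detections near chunk edges are covered by the overlap.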
To process all `.wav` files in a directory at once:

```sh
vocalpy -p /path/to/recordings/
```

Each file gets its own `{name}_outputs/` directory.
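The per-file naming convention can be sketched with `pathlib`; `output_dir` here is a hypothetical helper for illustration, not part of the CLI:

```python
from pathlib import Path

def output_dir(wav_path):
    """Derive the per-file output directory, e.g.
    session1.wav -> session1_outputs/ next to the audio file."""
    p = Path(wav_path)
    return p.parent / f"{p.stem}_outputs"

print(output_dir("/data/session1.wav"))
```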
## CLI options

| Flag | Description | Default |
|---|---|---|
| `-a` | Species pipeline: `mouse`, `rat`, or `guineapig` | `mouse` |
| `-p` | Path to a `.wav` file or a directory of `.wav` files | required |
| `-b` | Audio chunk size in seconds for parallel processing | |
| `-lf` | Low frequency cutoff in Hz — signals below this are ignored | Species default |
| `-hf` | High frequency cutoff in Hz — signals above this are ignored | Species default |
| | Number of parallel workers | |
| | Print detailed progress to the terminal | off |
| `-l` | Save spectrogram-overlay images for manual review of detections | off |
| `--segmenter` | Run autoencoder-based segmentation (SqueakOut) after detection and classification | off |
| `--segmentation_model_path` | Path to a custom SqueakOut checkpoint file | Bundled SqueakOut checkpoint |
| `--segmentation_threshold` | Probability threshold for converting segmentation output to a binary mask | `0.51` |
Tuning tips:

- If you’re getting too many false positives, try narrowing the frequency range with `-lf`/`-hf` to match what you expect in your recordings.
- For long recordings, increasing `-b` reduces overhead; decreasing it can help on machines with many cores.
- Use `-l` when you’re setting up a new recording type or debugging detections — the overlay images show exactly what the detector found on the spectrogram.
## Species defaults

Each species pipeline has tuned spectrogram and detection parameters. The frequency ranges reflect the typical call frequencies for each species. You can override the cutoffs with `-lf` / `-hf` if your recordings differ.
| Species | Frequency range | Window type | Window size | NFFT |
|---|---|---|---|---|
| mouse | 45,000–125,000 Hz | Hamming | 256 | 1024 |
| rat | 18,000–125,000 Hz | Hamming | 256 | 1024 |
| guineapig | 250–20,000 Hz | Barthann | 512 | 1024 |
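For reference, the mouse defaults above correspond to a scipy spectrogram call like the one below. The 250 kHz sample rate and the noise signal are illustrative assumptions (the recording only needs to be sampled at least twice the 125 kHz cutoff); the window, window size, and NFFT come from the table.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 250_000                 # assumed sample rate, not specified by the pipeline
rng = np.random.default_rng(0)
x = rng.standard_normal(fs)  # 1 s of noise standing in for audio

# Mouse defaults from the table: Hamming window, window size 256, NFFT 1024
f, t, Sxx = spectrogram(x, fs=fs, window="hamming", nperseg=256, nfft=1024)

# Restrict to the 45-125 kHz band the mouse pipeline analyzes
band = (f >= 45_000) & (f <= 125_000)
Sxx_band = Sxx[band]
```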
What runs per species:

- Mouse and rat: detection → noise classification (removes non-vocalizations) → type classification (labels each call) → optional segmentation
- Guinea pig: detection only — no classifier is available, so all candidates are kept as-is. Neural segmentation via `--segmenter` is still available.
## Output

For a file named `recording.wav`, outputs are written to `recording_outputs/` in the same directory:
```
recording_outputs/
├── recording.wav.csv                        # vocalization metadata table — start here
├── recording_without_spectrograms.vocalpy   # full Recording object, reloadable in Python
├── list_of_vocals.vocalpy                   # ListOfVocals object, reloadable in Python
├── params.yml                               # exact parameters used for this run
├── spectrogram/                             # per-vocal spectrogram image (one PNG per call)
├── mask/                                    # per-vocal binary detection mask (one PNG per call)
├── spectrogram_validation/                  # spectrogram + mask overlay images (-l flag only)
└── cnn_mask/                                # autoencoder-based segmentation masks (--segmenter only)
```
The CSV is the fastest way to inspect results — open it in any spreadsheet tool or load it with pandas. The .vocalpy files let you reload the full pipeline output in Python for further analysis, filtering, or visualization without rerunning detection. The spectrogram images are useful for quickly browsing individual calls. The validation overlays (-l) show the detection mask drawn on top of the spectrogram, which is helpful for understanding what the detector found and catching misdetections.
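A minimal sketch of loading the CSV with pandas. The two-row sample and its column names (`start_s`, `duration_s`) are invented for illustration and are not the pipeline's actual schema:

```python
import io
import pandas as pd

# Invented sample standing in for recording_outputs/recording.wav.csv
sample = io.StringIO(
    "start_s,duration_s\n"
    "1.25,0.042\n"
    "3.80,0.061\n"
)
df = pd.read_csv(sample)  # in practice: pd.read_csv("recording_outputs/recording.wav.csv")
print(len(df), "vocalizations, mean duration:", df["duration_s"].mean())
```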
### CSV columns

Each row is one detected vocalization. For mouse and rat, `top1` and `top2` contain the classifier’s type predictions.

| Column | Description |
|---|---|
| | Absolute time of the call in seconds from the start of the recording |
| | Call duration in seconds |
| | Silence between this call and the previous one, in seconds |
| | Lowest, highest, and mean frequency of the call in Hz |
| | Frequency span of the call |
| | Spectrogram intensity range and mean within the detected region |
| | Size of the detected region in spectrogram pixels — larger values mean longer or wider calls |
| | Center of mass of the detected region as (time, frequency) coordinates in the spectrogram |
| | Angle of the principal axis of the detected region — useful for characterizing call shape |
| `top1`, `top2` | Top-1 and top-2 class label predictions from the type classifier (mouse and rat only) |
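To make the geometric columns concrete, here is a plain-NumPy sketch of how area, centroid, and principal-axis angle can be computed from a binary detection mask via image moments. This mirrors the standard region-properties definitions and is not the pipeline's own code:

```python
import numpy as np

def region_stats(mask):
    """Area (pixel count), centroid (row, col), and principal-axis
    angle (radians) of a binary mask, via central image moments."""
    ys, xs = np.nonzero(mask)
    area = ys.size
    cy, cx = ys.mean(), xs.mean()
    # Central second moments
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
    return area, (cy, cx), theta

# A 3x5 rectangle of "on" pixels: the principal axis is horizontal
mask = np.zeros((10, 10), dtype=bool)
mask[4:7, 2:7] = True
area, centroid, theta = region_stats(mask)
# area == 15, centroid == (5.0, 4.0), theta == 0.0
```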
## Autoencoder-based segmentation

```sh
vocalpy -p /path/to/recording.wav --segmenter
```

After detection and classification, SqueakOut runs on each detected vocal crop and produces a pixel-level binary mask that outlines the vocalization within the spectrogram. This gives finer spatial information than the bounding-box style detection mask. Masks are saved as PNG images under `cnn_mask/`, one per detected call.
- Input: each detected call’s spectrogram crop is resized to grayscale `1×512×512` before being fed to the model
- Output: a binary mask at the same resolution as the crop, where white pixels mark the vocalization
- Threshold: the raw model output is a probability map; pixels above the threshold become the mask. The default is `0.51` — lower it to include more of the call boundary, raise it to be more conservative. Override with `--segmentation_threshold`
- Custom model: the bundled SqueakOut checkpoint is used by default. To use a different checkpoint, pass its path with `--segmentation_model_path`
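Applying the probability threshold is a one-liner; this sketch assumes a NumPy probability map with values in [0, 1], with a toy 2×2 array standing in for the model output:

```python
import numpy as np

prob = np.array([[0.10, 0.60],
                 [0.52, 0.49]])  # toy 2x2 probability map from the model

threshold = 0.51  # the default --segmentation_threshold
mask = (prob > threshold).astype(np.uint8) * 255  # white pixels mark the vocalization
# mask == [[0, 255], [255, 0]]
```

Lowering `threshold` turns more borderline pixels white; raising it keeps only the most confident ones.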
## Serialized outputs

The `.vocalpy` files are Python objects serialized to disk. They let you reload a prior run in Python without re-running the pipeline:

```python
from vocalpy.utils.io import load_vocalpy_file

recording = load_vocalpy_file("recording_outputs/recording_without_spectrograms.vocalpy")
list_of_vocals = load_vocalpy_file("recording_outputs/list_of_vocals.vocalpy")
```
Files use a versioned envelope format with object-type metadata. Legacy raw-pickle `.vocalpy` files written by older versions load automatically for backward compatibility.
## Packaging notes

- Project metadata: `pyproject.toml`
- Tested dependency pins: `constraints/base.txt` and `constraints/dev.txt`
- Bundled model checkpoints and sidecar metadata: `vocalpy/nn/pretrained/`
- Default pipeline parameters per species: `vocalpy/configs/pipelines_parameters.yml`