Getting started

Installation

Requires Python 3.12 and Git LFS. Git LFS is needed because bundled model checkpoints are stored as LFS objects — if you clone without it the checkpoint files will be stubs and the models will fail to load.

With micromamba (recommended):

micromamba create -y -n vocalpy python=3.12
micromamba activate vocalpy
git clone https://github.com/gumadeiras/vocalpy.git
cd vocalpy
pip install --upgrade pip
pip install -r requirements-dev.txt

With venv:

python3.12 -m venv .venv
source .venv/bin/activate
git clone https://github.com/gumadeiras/vocalpy.git
cd vocalpy
pip install --upgrade pip
pip install -r requirements-dev.txt

Running the CLI

vocalpy -p /path/to/recording.wav

This runs the full pipeline on a single file: the audio is split into overlapping chunks and processed in parallel, detected vocalizations are filtered for noise and then labeled by type, and all results are written to an output folder next to the audio file. By default the mouse pipeline is used. Pass -a rat or -a guineapig to switch species.
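The chunking step can be pictured with a short sketch. This is not vocalpy's internal code: the sample rate and the 1 s overlap below are assumptions for illustration, with only the 60 s bin length coming from the -b default.

```python
import numpy as np

def chunk_signal(x, sr, bin_size_s=60.0, overlap_s=1.0):
    """Split a 1-D signal into bins of bin_size_s seconds that overlap by overlap_s seconds."""
    size = int(bin_size_s * sr)          # samples per chunk (-b controls bin_size_s)
    step = size - int(overlap_s * sr)    # hop between chunk starts; the last chunk may be short
    return [x[i:i + size] for i in range(0, len(x), step)]

# 120 s of fake audio at a toy 1 kHz sample rate -> three overlapping chunks
chunks = chunk_signal(np.arange(120_000), sr=1_000)
```

Each chunk can then be processed by an independent worker, which is why -b and -t interact: smaller bins mean more chunks to spread across threads.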

To process all .wav files in a directory at once:

vocalpy -p /path/to/recordings/

Each file gets its own {name}_outputs/ directory.
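If you need to locate those output directories from Python, the naming convention is simple to reproduce. The helper below is hypothetical, not a vocalpy API; it just mirrors the {name}_outputs/ rule described above.

```python
from pathlib import Path

def output_dir_for(wav_path):
    """Where vocalpy writes results for a given .wav file: {name}_outputs/ next to it."""
    p = Path(wav_path)
    return p.with_name(p.stem + "_outputs")

out = output_dir_for("/data/session1/recording.wav")   # -> /data/session1/recording_outputs
```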

CLI options

Flag                             Description                                                                         Default
-a / --animal                    Species pipeline: mouse, rat, guineapig                                             mouse
-p / --path_to_audio             Path to a .wav file or a directory of .wav files                                    required
-b / --bin_size                  Audio chunk size in seconds for parallel processing                                 60
-lf / --lower_frequency_cutoff   Low frequency cutoff in Hz; signals below this are ignored                          species default
-hf / --higher_frequency_cutoff  High frequency cutoff in Hz; signals above this are ignored                         species default
-t / --threads                   Number of parallel workers (-1 = half of available cores)                           -1
-v / --verbose                   Print detailed progress to the terminal                                             off
-l / --validation                Save spectrogram-overlay images for manual review of detections                     off
--segmenter                      Run autoencoder-based segmentation (SqueakOut) after detection and classification   off
--segmentation_model_path        Path to a custom SqueakOut checkpoint file                                          bundled
--segmentation_threshold         Probability threshold for converting segmentation output to a binary mask           0.51

Tuning tips:

  • If you’re getting too many false positives, try narrowing the frequency range with -lf / -hf to match what you expect in your recordings.

  • For long recordings, increasing -b reduces overhead; decreasing it can help on machines with many cores.

  • Use -l when you’re setting up a new recording type or debugging detections — the overlay images show exactly what the detector found on the spectrogram.

Species defaults

Each species pipeline has tuned spectrogram and detection parameters. The frequency ranges reflect the typical call frequencies for each species. You can override the cutoffs with -lf / -hf if your recordings differ.

Species     Frequency range      Window type   Window size   NFFT
mouse       45,000–125,000 Hz    Hamming       256           1024
rat         18,000–125,000 Hz    Hamming       256           1024
guineapig   250–20,000 Hz        Barthann      512           1024
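The table's spectrogram settings map onto standard scipy usage. vocalpy's actual implementation is internal, so the snippet below is only a rough equivalent of the mouse/rat defaults, and the 250 kHz sample rate is an assumption.

```python
import numpy as np
from scipy import signal

sr = 250_000                                        # assumed sample rate for ultrasonic recordings
x = np.random.default_rng(0).standard_normal(sr)    # 1 s of noise standing in for audio

# Mouse/rat defaults from the table: Hamming window, 256-sample window, NFFT = 1024
f, t, sxx = signal.spectrogram(x, fs=sr, window="hamming", nperseg=256, nfft=1024)

# Restrict to the mouse band (45-125 kHz), as -lf / -hf would
band = (f >= 45_000) & (f <= 125_000)
sxx_band = sxx[band]
```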

What runs per species:

  • Mouse and rat: detection → noise classification (removes non-vocalizations) → type classification (labels each call) → optional segmentation

  • Guinea pig: detection only — no classifier is available, so all candidates are kept as-is. Neural segmentation via --segmenter is still available.

Output

For a file named recording.wav, outputs are written to recording_outputs/ in the same directory:

recording_outputs/
├── recording.wav.csv                       # vocalization metadata table — start here
├── recording_without_spectrograms.vocalpy  # full Recording object, reloadable in Python
├── list_of_vocals.vocalpy                  # ListOfVocals object, reloadable in Python
├── params.yml                              # exact parameters used for this run
├── spectrogram/                            # per-vocal spectrogram images (one PNG per call)
├── mask/                                   # per-vocal binary detection masks (one PNG per call)
├── spectrogram_validation/                 # spectrogram + mask overlay images (-l flag only)
└── cnn_mask/                               # autoencoder-based segmentation masks (--segmenter only)

The CSV is the fastest way to inspect results — open it in any spreadsheet tool or load it with pandas. The .vocalpy files let you reload the full pipeline output in Python for further analysis, filtering, or visualization without rerunning detection. The spectrogram images are useful for quickly browsing individual calls. The validation overlays (-l) show the detection mask drawn on top of the spectrogram, which is helpful for understanding what the detector found and catching misdetections.
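A quick way to triage the CSV in Python is plain pandas; nothing here is a vocalpy API. The column names match the CSV columns section, but the inline rows and type labels below are made up so the snippet runs anywhere (in a real run you would read recording_outputs/recording.wav.csv directly).

```python
import io
import pandas as pd

# In a real run: df = pd.read_csv("recording_outputs/recording.wav.csv")
# Inline stand-in with a few of the real columns (labels here are invented):
csv_text = """start,end,duration,top1
1.20,1.25,0.050,type_a
2.10,2.11,0.010,type_b
3.00,3.09,0.090,type_a
"""
df = pd.read_csv(io.StringIO(csv_text))

# Example triage: drop very short detections, then count predicted call types
long_calls = df[df["duration"] > 0.02]
counts = long_calls["top1"].value_counts()
```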

CSV columns

Each row is one detected vocalization. For mouse and rat, top1 and top2 contain the classifier’s type predictions.

Column                                         Description
start / end                                    Absolute time of the call in seconds from the start of the recording
duration                                       Call duration in seconds
interval                                       Silence between this call and the previous one, in seconds
min_freq / max_freq / avg_freq                 Lowest, highest, and mean frequency of the call in Hz
bandwidth                                      Frequency span of the call (max_freq - min_freq) in Hz
min_intensity / max_intensity / avg_intensity  Spectrogram intensity range and mean within the detected region
area                                           Size of the detected region in spectrogram pixels; larger values mean longer or wider calls
centroid                                       Center of mass of the detected region as (time, frequency) coordinates in the spectrogram
orientation                                    Angle of the principal axis of the detected region; useful for characterizing call shape
top1 / top2                                    Top-1 and top-2 class label predictions from the type classifier (mouse and rat only)
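The timing columns are related to one another: duration follows from start and end, and interval is the gap back to the previous call's end. A small pandas sketch with toy values (the convention for the first call's interval is not specified here, so it is left undefined):

```python
import pandas as pd

# Toy timing columns; real values come from recording.wav.csv
calls = pd.DataFrame({"start": [1.00, 2.50, 4.00],
                      "end":   [1.20, 2.80, 4.10]})

# duration and interval as the CSV defines them
calls["duration"] = calls["end"] - calls["start"]
calls["interval"] = calls["start"] - calls["end"].shift(1)   # NaN for the first call
```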

Autoencoder-based segmentation

vocalpy -p /path/to/recording.wav --segmenter

After detection and classification, SqueakOut runs on each detected vocal crop and produces a pixel-level binary mask that outlines the vocalization within the spectrogram. This gives finer spatial information than the bounding-box style detection mask. Masks are saved as PNG images under cnn_mask/, one per detected call.

  • Input: each detected call’s spectrogram crop is resized to grayscale 1×512×512 before being fed to the model

  • Output: a binary mask at the same resolution as the crop, where white pixels mark the vocalization

  • Threshold: the raw model output is a probability map; pixels above the threshold become the mask. The default is 0.51 — lower it to include more of the call boundary, raise it to be more conservative. Override with --segmentation_threshold

  • Custom model: the bundled SqueakOut checkpoint is used by default. To use a different checkpoint, pass its path with --segmentation_model_path
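The thresholding step amounts to a single comparison per pixel. The numpy sketch below uses the 512×512 size and the 0.51 default from this section; the random array stands in for the model's probability map, and the model call itself is omitted.

```python
import numpy as np

def to_binary_mask(prob_map, threshold=0.51):
    """Threshold a probability map into a 0/255 mask image (white = vocalization)."""
    return (prob_map > threshold).astype(np.uint8) * 255

prob = np.random.default_rng(0).random((512, 512))   # stand-in for SqueakOut's output
mask = to_binary_mask(prob)
```

Lowering the threshold turns more borderline pixels white, which is why the docs suggest reducing it to capture more of the call boundary.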

Serialized outputs

The .vocalpy files are Python objects serialized to disk. They let you reload a prior run in Python without re-running the pipeline:

from vocalpy.utils.io import load_vocalpy_file

recording = load_vocalpy_file("recording_outputs/recording_without_spectrograms.vocalpy")
list_of_vocals = load_vocalpy_file("recording_outputs/list_of_vocals.vocalpy")

Files use a versioned envelope format with object-type metadata. Legacy raw-pickle .vocalpy files written by older versions load automatically for backward compatibility.

Packaging notes

  • Project metadata: pyproject.toml

  • Tested dependency pins: constraints/base.txt and constraints/dev.txt

  • Bundled model checkpoints and sidecar metadata: vocalpy/nn/pretrained/

  • Default pipeline parameters per species: vocalpy/configs/pipelines_parameters.yml