Voice Quality Metrics: Jitter, Shimmer, HNR, and AVQI Explained
What you'll learn: precise definitions and computation formulas for the four core acoustic voice biomarkers — jitter, shimmer, HNR, and AVQI — how to extract them in Python using parselmouth/Praat, their clinical interpretation ranges, and how recording quality directly affects the reliability of every metric.
A clinician listens to a patient sustain the vowel /a/ for three seconds and describes the voice as "rough and slightly breathy." An acoustic analysis of the same recording reports a jitter of 1.8%, a shimmer of 4.2%, an HNR of 14.3 dB, and an AVQI of 3.7. These two descriptions encode the same perceptual reality, but the numbers are reproducible, comparable across clinics, and trackable over the course of therapy.
This article explains exactly what those four numbers mean, how they are computed from the raw audio waveform, what their normal and pathological ranges are, and why the recording conditions (level, noise floor, clipping) can silently corrupt them if you are not careful.
1. Why Voice Quality Metrics Matter
Dysphonia, meaning any disorder that impairs normal voice production, affects an estimated 7% of the general population and a much higher proportion of professional voice users and patients with neurological conditions. Assessing dysphonia has traditionally relied on perceptual scales such as GRBAS (Grade, Roughness, Breathiness, Asthenia, Strain) and on self-report questionnaires such as the VHI (Voice Handicap Index). Both are subjective: GRBAS scores depend on the rater, and VHI scores depend on the patient's own perception.
Acoustic biomarkers extracted from a sustained phonation recording — typically the vowel /a/ held for 3–5 seconds at a comfortable loudness level and microphone distance of 30 cm — provide an objective, repeatable complement to perceptual scales. They are used for:
- Initial screening and severity quantification of dysphonia
- Pre- and post-surgical voice assessment (e.g., vocal fold surgery)
- Longitudinal tracking of response to voice therapy
- Multi-site clinical research where inter-rater reliability is impractical
- Feature engineering for machine learning models targeting pathological voice detection
The four metrics covered here — jitter, shimmer, HNR, and AVQI — are the most widely validated and cited in the clinical literature. They are computed natively by Praat (the de-facto phonetic analysis tool) and accessible from Python via the parselmouth binding.
2. Jitter — Cycle-to-Cycle Frequency Perturbation
Definition and Formula
Jitter measures the variability of the fundamental period (T0 = 1/F0) from one glottal cycle to the next. A perfectly periodic voice would have identical successive periods; real voices deviate slightly, and pathological voices deviate considerably more.
The most commonly reported variant is local relative jitter:
Jitter (local, %) = [Σ|T(i) - T(i+1)| / (N-1)] / [ΣT(i) / N] × 100
where:
T(i) = duration of the i-th glottal period (seconds)
N = total number of extracted periods
|·| = absolute value
Numerator: mean absolute difference between consecutive periods
Denominator: mean period (= 1 / mean F0)
Other Praat jitter variants include RAP (relative average perturbation, 3-period smoothing) and PPQ5 (5-period perturbation quotient). Local jitter is the most sensitive but also the most susceptible to noise.
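The local jitter formula can be checked by hand on a short list of period durations. The following is a minimal NumPy sketch, not Praat's implementation: the function name `local_jitter_percent` and the example periods are illustrative assumptions, and no period filtering (minimum/maximum period, maximum period factor) is applied.

```python
import numpy as np

def local_jitter_percent(periods_s):
    """Local relative jitter (%) from glottal period durations in seconds.

    Hypothetical helper for illustration; it applies the formula directly
    and does not reproduce Praat's period filtering.
    """
    periods = np.asarray(periods_s, dtype=float)
    if periods.size < 2:
        raise ValueError("need at least two periods")
    # Numerator: mean absolute difference between consecutive periods
    mean_abs_diff = np.mean(np.abs(np.diff(periods)))
    # Denominator: mean period
    mean_period = np.mean(periods)
    return 100.0 * mean_abs_diff / mean_period

# Five made-up periods around 10 ms (about 100 Hz)
periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]
print(f"Jitter (local): {local_jitter_percent(periods):.2f}%")
```

For these made-up periods the mean absolute difference is 0.2 ms and the mean period is 10.04 ms, so the result is about 1.99%.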
Clinical Ranges
| Category | Jitter (local, %) | Clinical interpretation |
|---|---|---|
| Normal | < 1.04 % | Stable glottal cycle duration |
| Borderline | 1.04 – 2.0 % | Mild perturbation; monitor |
| Pathological | > 2.0 % | Roughness, dysphonia; further evaluation needed |
High jitter is associated with vocal fold nodules, polyps, unilateral paralysis, Parkinson's disease, and functional dysphonia. It is also elevated transiently in voice fatigue after prolonged vocal effort.
Python / Praat Computation
import parselmouth
from parselmouth.praat import call
def compute_jitter(audio_path: str) -> dict:
    """Extract local jitter, RAP, and PPQ5 from a sustained vowel recording."""
    snd = parselmouth.Sound(audio_path)
    # Extract pitch (glottal period) with recommended clinical settings
    pitch = call(snd, "To Pitch", 0.0, 75, 500)  # time_step=0 (auto), f0_min=75 Hz, f0_max=500 Hz
    # PointProcess: marks individual glottal pulse onsets
    point_process = call([snd, pitch], "To PointProcess (cc)")
    # Compute jitter variants
    jitter_local = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
    jitter_rap = call(point_process, "Get jitter (rap)", 0, 0, 0.0001, 0.02, 1.3)
    jitter_ppq5 = call(point_process, "Get jitter (ppq5)", 0, 0, 0.0001, 0.02, 1.3)
    return {
        "jitter_local_pct": jitter_local * 100,  # convert to percent
        "jitter_rap_pct": jitter_rap * 100,
        "jitter_ppq5_pct": jitter_ppq5 * 100,
    }

result = compute_jitter("sustained_a.wav")
print(f"Jitter (local): {result['jitter_local_pct']:.2f}%")
Parameter notes:
The arguments to Get jitter are: time range (0,0 = whole file), minimum/maximum period (0.0001 s = 10 000 Hz, 0.02 s = 50 Hz), and maximum period factor (1.3). These guard against octave errors and aperiodic segments being misidentified as valid periods.
3. Shimmer — Cycle-to-Cycle Amplitude Perturbation
Definition and Formula
Shimmer captures instability in vocal fold vibration amplitude rather than timing. Where jitter measures irregularity in when each cycle occurs, shimmer measures variability in how strongly the folds close and re-open. The two most reported variants are shimmer (local, percent) and shimmer (local, dB):
Shimmer (local, %) = [Σ|A(i) - A(i+1)| / (N-1)] / [ΣA(i) / N] × 100
Shimmer (local, dB) = (1/(N-1)) × Σ |20 × log10(A(i+1) / A(i))|
where:
A(i) = peak amplitude of the i-th glottal cycle
N = number of periods
Clinical Ranges
| Category | Shimmer (local, %) | Shimmer (local, dB) |
|---|---|---|
| Normal | < 3.81 % | < 0.35 dB |
| Borderline | 3.81 – 6.0 % | 0.35 – 0.60 dB |
| Pathological | > 6.0 % | > 0.60 dB |
High shimmer is typically associated with breathiness and irregularity of vocal fold closure. It is elevated in cases of vocal fold edema, sulcus vocalis, and Reinke's edema. Both shimmer and jitter tend to co-increase in pathological voices, but their relative magnitudes can help differentiate voice disorders.
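Both shimmer definitions can be verified on a toy sequence of per-cycle peak amplitudes. This is a minimal NumPy sketch under stated assumptions: the function name `local_shimmer` and the amplitude values are illustrative, and the dB variant averages the absolute value of each consecutive dB difference, so that rises and falls in amplitude do not cancel.

```python
import numpy as np

def local_shimmer(amplitudes):
    """Return (shimmer_local_pct, shimmer_local_db) from per-cycle peak amplitudes.

    Hypothetical helper for illustration; amplitudes must be positive.
    """
    amps = np.asarray(amplitudes, dtype=float)
    if amps.size < 2:
        raise ValueError("need at least two amplitudes")
    if np.any(amps <= 0):
        raise ValueError("amplitudes must be positive")
    # Percent variant: mean absolute difference over mean amplitude
    pct = 100.0 * np.mean(np.abs(np.diff(amps))) / np.mean(amps)
    # dB variant: mean absolute dB ratio between consecutive cycles
    db = np.mean(np.abs(20.0 * np.log10(amps[1:] / amps[:-1])))
    return float(pct), float(db)

# Made-up peak amplitudes (linear scale) for six consecutive cycles
pct, db = local_shimmer([0.50, 0.52, 0.49, 0.51, 0.50, 0.48])
print(f"Shimmer (local): {pct:.2f}%  |  {db:.2f} dB")
```

A perfectly steady amplitude sequence gives 0% and 0 dB under both definitions.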
import parselmouth
from parselmouth.praat import call

def compute_shimmer(audio_path: str) -> dict:
    """Extract shimmer (local %, dB, APQ3, APQ5) from a sustained vowel."""
    snd = parselmouth.Sound(audio_path)
    pitch = call(snd, "To Pitch", 0.0, 75, 500)
    point_process = call([snd, pitch], "To PointProcess (cc)")
    # Arguments: time range (0, 0 = whole file), shortest/longest period (s),
    # maximum period factor (1.3), maximum amplitude factor (1.6)
    shimmer_local = call([snd, point_process], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
    shimmer_local_dB = call([snd, point_process], "Get shimmer (local_dB)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
    shimmer_apq3 = call([snd, point_process], "Get shimmer (apq3)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
    shimmer_apq5 = call([snd, point_process], "Get shimmer (apq5)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
    return {
        "shimmer_local_pct": shimmer_local * 100,  # convert to percent
        "shimmer_local_dB": shimmer_local_dB,      # already in dB
        "shimmer_apq3_pct": shimmer_apq3 * 100,
        "shimmer_apq5_pct": shimmer_apq5 * 100,
    }
4. HNR — Harmonics-to-Noise Ratio
Definition
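The Harmonics-to-Noise Ratio (HNR) expresses, in decibels, how much of the voice signal is periodic (harmonic) versus aperiodic (noise-like). It is derived from the normalized autocorrelation of the acoustic waveform (or, in Praat's "cc" variant, the cross-correlation) at a lag of one fundamental period: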
HNR (dB) = 10 × log10(r / (1 - r))
where:
r = normalized autocorrelation peak at the fundamental period lag T0
Equivalently:
HNR = 10 × log10(Energy_harmonic / Energy_noise)
A perfect sine wave has r → 1 → HNR → +∞ dB
White noise has r ≈ 0 → HNR → -∞ dB
Clinical Ranges
| Category | HNR (dB) | Perceptual correlate |
|---|---|---|
| Normal | > 20 dB | Clear, modal phonation |
| Mild dysphonia | 12 – 20 dB | Slight breathiness or roughness |
| Moderate dysphonia | 7 – 12 dB | Clearly audible breathiness |
| Severe dysphonia | < 7 dB | Severely impaired phonation |
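The relationship between the autocorrelation peak r and HNR, together with the ranges in the table above, can be sketched in a few lines. The function names are illustrative, and the handling of values that fall exactly on a boundary (12 dB and 7 dB assigned to the milder category) is an assumption, since the table does not specify it.

```python
import math

def hnr_db_from_r(r):
    """Convert a normalized autocorrelation peak r (0 < r < 1) to HNR in dB."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    return 10.0 * math.log10(r / (1.0 - r))

def hnr_category(hnr_db):
    """Map an HNR value to the categories of the table above.

    Boundary handling is an assumption made for this sketch.
    """
    if hnr_db > 20.0:
        return "normal"
    if hnr_db >= 12.0:
        return "mild dysphonia"
    if hnr_db >= 7.0:
        return "moderate dysphonia"
    return "severe dysphonia"

for r in (0.995, 0.99, 0.90, 0.50):
    hnr = hnr_db_from_r(r)
    print(f"r = {r:.3f} -> HNR = {hnr:5.1f} dB ({hnr_category(hnr)})")
```

Note how steep the mapping is near r = 1: r = 0.99 corresponds to just under 20 dB, while r = 0.5 means harmonic and noise energy are equal (0 dB).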
import parselmouth
from parselmouth.praat import call

def compute_hnr(audio_path: str) -> float:
    """Compute mean HNR (dB) using Praat's cross-correlation method."""
    snd = parselmouth.Sound(audio_path)
    # "To Harmonicity (cc)" is the cross-correlation variant;
    # "To Harmonicity (ac)" is the autocorrelation variant.
    # Arguments: time_step (s), min_pitch (Hz), silence_threshold, periods_per_window
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    hnr_mean = call(harmonicity, "Get mean", 0, 0)  # 0, 0 = full file
    return hnr_mean

print(f"HNR: {compute_hnr('sustained_a.wav'):.1f} dB")
HNR vs. NHR:
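Some literature reports the inverse quantity, NHR (Noise-to-Harmonics Ratio). A normal NHR is < 0.19 (linear scale). HNR and NHR describe the same harmonic-versus-noise balance, but they are expressed on different scales and may be computed differently by different tools, so always check which one a paper or tool reports before comparing values.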
5. AVQI — Acoustic Voice Quality Index
What AVQI Is and Why It Exists
Jitter, shimmer, and HNR each capture one aspect of voice quality. A voice can have normal jitter but severely impaired HNR. The Acoustic Voice Quality Index (AVQI), developed by Maryn et al. and validated across multiple languages, collapses four complementary features into a single continuous scale that correlates strongly with expert perceptual ratings of overall dysphonia severity.
AVQI v03 Formula
AVQI = 3.284 + (0.181 × ShdB) + (−0.098 × HNR) + (−0.218 × CPP) + (0.008 × SpSlo)
where:
ShdB = shimmer (local, dB) [perturbation]
HNR = harmonics-to-noise ratio (dB) [periodic/aperiodic ratio]
CPP = cepstral peak prominence (dB) [spectral regularity of voicing]
SpSlo = spectral slope (dB/octave or linear fit) [spectral tilt, breathiness correlate]
Interpretation:
AVQI < 2.95 → normal voice quality
AVQI ≥ 2.95 → dysphonia likely; clinical evaluation recommended
CPP (Cepstral Peak Prominence) is the height of the dominant peak in the cepstrum relative to a regression line through the cepstral envelope. A strong CPP indicates well-defined periodicity at F0; low CPP characterises breathy or irregular voices. Spectral slope captures how steeply the harmonic energy falls off with frequency; breathiness tends to produce a steeper slope.
import numpy as np
import parselmouth
from parselmouth.praat import call
def compute_avqi(audio_path: str) -> float:
    """
    Compute the AVQI score with the coefficients given in this article,
    from a sustained vowel recording only.
    Returns a continuous score: < 2.95 = normal, >= 2.95 = dysphonic.
    """
    snd = parselmouth.Sound(audio_path)
    pitch = call(snd, "To Pitch", 0.0, 75, 500)
    pp = call([snd, pitch], "To PointProcess (cc)")
    # Shimmer (local, dB)
    sh_db = call([snd, pp], "Get shimmer (local_dB)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
    # HNR
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)
    # CPP via Praat's PowerCepstrogram (smoothed CPP, CPPS)
    pc = call(snd, "To PowerCepstrogram", 60, 0.002, 5000, 50)
    cpp = call(pc, "Get CPPS", "yes", 0.02, 0.0, 60, 330, 0.05, "Parabolic", 0.001, 0, "Straight", "Robust")
    # Spectral slope: "Get slope" is an Ltas command (not a Spectrum command).
    # It returns the energy difference (dB) between the 0-4 kHz and 4-8 kHz bands
    # of the long-term average spectrum, so the recording must be sampled at 16 kHz or higher.
    ltas = call(snd, "To Ltas", 1)  # bandwidth = 1 Hz
    sp_slo = call(ltas, "Get slope", 0, 4000, 4000, 8000, "energy")
    avqi = 3.284 + (0.181 * sh_db) + (-0.098 * hnr) + (-0.218 * cpp) + (0.008 * sp_slo)
    return avqi

score = compute_avqi("sustained_a.wav")
print(f"AVQI: {score:.2f} ({'dysphonic' if score >= 2.95 else 'normal range'})")
Validation note:
AVQI v03 was validated in Dutch, English, Portuguese, and Korean, consistently achieving AUC > 0.90 for distinguishing normal from dysphonic voices. The 2.95 cut-off was derived against expert GRBAS ratings; some labs use 2.97 or a language-specific variant. Always report which AVQI version and cut-off you used.
6. How Recording Quality Affects Every Metric
All four metrics — jitter, shimmer, HNR, and AVQI — are computed from the raw audio waveform. Any artefact in that waveform propagates directly into the extracted features. The two most common culprits in clinical recordings are an inadequate recording level and background noise.
Recording Level: The dBFS Window
For reliable biomarker extraction, the recommended peak recording level during sustained phonation is between −20 dBFS and −12 dBFS. This window balances two opposing risks:
- Too quiet (below −40 dBFS peak): The microphone self-noise floor dominates. Jitter and shimmer are inflated by noise-induced period-detection errors. HNR is artificially lowered because the aperiodic noise energy rivals the harmonic energy of the voice.
- Too loud (peaks reaching 0 dBFS): A digital signal cannot exceed 0 dBFS, so anything louder is clipped, which introduces hard nonlinearities into the waveform. The resulting flat-topped peaks look like amplitude perturbation, inflating shimmer. The harmonic structure is distorted, crushing HNR. A clipped recording cannot be rescued in post-processing.
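A level check along these lines can run before any metric is extracted. The sketch below is an assumption-laden illustration rather than a validated protocol: the function name `check_recording_level`, the `clip_level` of 0.999, and the synthetic test tone are all hypothetical. It expects floating-point samples scaled to [-1, 1] (for example, what parselmouth exposes as `Sound.values`).

```python
import numpy as np

def check_recording_level(samples, low_dbfs=-20.0, high_dbfs=-12.0, clip_level=0.999):
    """Report peak level in dBFS, whether it sits in the target window, and clipping.

    Hypothetical helper; `samples` must be floats scaled to [-1, 1].
    """
    x = np.asarray(samples, dtype=float).ravel()
    if x.size == 0:
        raise ValueError("empty signal")
    peak = float(np.max(np.abs(x)))
    # 0 dBFS corresponds to a full-scale sample magnitude of 1.0
    peak_dbfs = 20.0 * np.log10(peak) if peak > 0 else float("-inf")
    # Fraction of samples sitting at (or numerically near) full scale
    clipped_fraction = float(np.mean(np.abs(x) >= clip_level))
    return {
        "peak_dbfs": float(peak_dbfs),
        "in_window": bool(low_dbfs <= peak_dbfs <= high_dbfs),
        "clipped_fraction": clipped_fraction,
    }

# Synthetic 150 Hz tone standing in for a sustained vowel
sr = 16000
t = np.arange(sr) / sr
tone = 0.2 * np.sin(2 * np.pi * 150 * t)

print(check_recording_level(tone))                            # peak near -14 dBFS
print(check_recording_level(np.clip(10 * tone, -1.0, 1.0)))   # heavily clipped
```

Any non-zero `clipped_fraction` is a reason to re-record rather than to attempt a repair.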
Understanding the dBFS scale and how it relates to physical sound pressure is a prerequisite for setting up a valid clinical recording protocol. If you are new to this distinction, our article on dB SPL vs. dBFS — and how to convert between them covers the calibration workflow in detail.
Minimum SNR Requirements
| Metric | Minimum SNR for reliable extraction | What fails below threshold |
|---|---|---|
| Jitter | ≥ 30 dB SNR | Noise causes false period boundaries → jitter inflated |
| Shimmer | ≥ 25 dB SNR | Peak amplitude estimates corrupted → shimmer inflated |
| HNR | ≥ 20 dB SNR | Aperiodic noise floor reduces apparent harmonic energy |
| AVQI | ≥ 30 dB SNR | Propagated errors in ShdB + HNR corrupt composite score |
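One way to apply the thresholds in this table is to estimate SNR from a stretch of room noise recorded just before phonation. The sketch below assumes such a noise-only segment is available and that noise and voice are uncorrelated; the names `estimate_snr_db`, `MIN_SNR_DB`, and `reliable_metrics` are hypothetical, and the signals are synthetic.

```python
import numpy as np

# Minimum SNR (dB) per metric, copied from the table above
MIN_SNR_DB = {"jitter": 30.0, "shimmer": 25.0, "hnr": 20.0, "avqi": 30.0}

def estimate_snr_db(voiced, noise_only):
    """Estimate SNR (dB) from a voiced segment and a noise-only segment.

    Assumes the noise in both segments has the same power and is
    uncorrelated with the voice, so signal power = total power - noise power.
    """
    p_total = float(np.mean(np.square(np.asarray(voiced, dtype=float))))
    p_noise = float(np.mean(np.square(np.asarray(noise_only, dtype=float))))
    if p_noise <= 0.0:
        return float("inf")
    p_signal = max(p_total - p_noise, np.finfo(float).tiny)
    return 10.0 * np.log10(p_signal / p_noise)

def reliable_metrics(snr_db):
    """List the metrics whose minimum SNR requirement is met."""
    return [name for name, threshold in MIN_SNR_DB.items() if snr_db >= threshold]

# Synthetic example: 150 Hz tone plus low-level Gaussian noise
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
noise_only = 0.001 * rng.standard_normal(sr)
voiced = 0.1 * np.sin(2 * np.pi * 150 * t) + 0.001 * rng.standard_normal(sr)

snr = estimate_snr_db(voiced, noise_only)
print(f"Estimated SNR: {snr:.1f} dB -> reliable: {reliable_metrics(snr)}")
```

With these synthetic signals the true SNR is about 37 dB, so all four metrics pass; at an SNR of 22 dB only HNR would meet its threshold.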
For deep learning models that operate on the same recordings — such as Wav2Vec2-based pathology classifiers — input normalisation and noise floor management are equally critical. Our Wav2Vec2 clinical audio guide covers the preprocessing chain for ML pipelines built on pathological voice data.
7. Practical Reference: Metrics at a Glance
| Metric | Praat command | Normal range | Pathological threshold | Python library |
|---|---|---|---|---|
| Jitter (local) | Get jitter (local) | < 1.04 % | > 2.0 % | parselmouth |
| Shimmer (local, dB) | Get shimmer (local_dB) | < 0.35 dB | > 0.60 dB | parselmouth |
| HNR | To Harmonicity (cc) | > 20 dB | < 12 dB | parselmouth |
| CPP | Get CPPS | > 12 dB | < 7 dB | parselmouth |
| AVQI v03 | Composite formula | < 2.95 | ≥ 2.95 | parselmouth + numpy |
Frequently Asked Questions
What is the normal range for jitter in voice analysis?
Normal local relative jitter is below 1.04% in healthy adult voices. Values above this suggest abnormal cycle-to-cycle frequency instability, with values above 2% considered clearly pathological. Praat also reports RAP and PPQ5 variants; the 1.04% local jitter threshold is the most widely cited in clinical literature for dysphonia screening.
What does a low HNR value mean clinically?
A low HNR indicates that a large proportion of the voice signal is aperiodic noise rather than harmonic energy. HNR below 20 dB during sustained phonation is associated with breathiness, roughness, or hoarseness. Values below 7 dB suggest severe dysphonia. HNR can also be reduced artifactually by a noisy recording environment — always verify recording SNR before interpreting low HNR as clinical evidence.
What is the AVQI and how is it interpreted?
The Acoustic Voice Quality Index (AVQI v03) is a composite score combining shimmer (dB), HNR, Cepstral Peak Prominence, and spectral slope: AVQI = 3.284 + (0.181 × ShdB) − (0.098 × HNR) − (0.218 × CPP) + (0.008 × SpSlo). A score below 2.95 indicates normal voice quality; at or above 2.95 suggests dysphonia. The index has been validated across Dutch, English, Portuguese, and Korean with AUC > 0.90.
How does recording level in dBFS affect jitter, shimmer, and HNR?
Recording level directly impacts metric reliability. A signal below −40 dBFS peak buries the voice in microphone noise, inflating jitter and shimmer through false period detection while dragging HNR down. A clipped signal (peaks reaching 0 dBFS) introduces waveform distortion that mimics pathological amplitude perturbation. The safe window for clinical voice recording is a peak between −20 dBFS and −12 dBFS.
Automate These Metrics in Your Clinical Workflow
Computing jitter, shimmer, HNR, and AVQI manually for every patient is time-consuming and error-prone. Vocametrix is a platform that automates all of these measurements — and more — from a single recorded audio file. It handles recording level validation, period detection, cepstral analysis, and AVQI scoring through a single API call, with results interpretable by speech-language pathologists and ML practitioners alike.