Clinical Audio
June 2026 · 9 min read

Voice Quality Metrics: Jitter, Shimmer, HNR, and AVQI Explained

What you'll learn: precise definitions and computation formulas for the four core acoustic voice biomarkers — jitter, shimmer, HNR, and AVQI — how to extract them in Python using parselmouth/Praat, their clinical interpretation ranges, and how recording quality directly affects the reliability of every metric.

A clinician listens to a patient sustain the vowel /a/ for three seconds and describes the voice as "rough and slightly breathy." An acoustic analysis of the same recording reports a jitter of 1.8%, a shimmer of 4.2%, an HNR of 14.3 dB, and an AVQI of 3.7. These two descriptions encode the same perceptual reality, but the numbers are reproducible, comparable across clinics, and trackable over the course of therapy.

This article explains exactly what those four numbers mean, how they are computed from the raw audio waveform, what their normal and pathological ranges are, and why the recording conditions (level, noise floor, clipping) can silently corrupt them if you are not careful.

1. Why Voice Quality Metrics Matter

Dysphonia (any disorder that impairs normal voice production) affects an estimated 7% of the general population and a much higher proportion of professional voice users and patients with neurological conditions. Assessment has traditionally relied on perceptual rating scales such as GRBAS (Grade, Roughness, Breathiness, Asthenia, Strain) and on patient self-report questionnaires such as the VHI (Voice Handicap Index). Both are subjective: GRBAS scores depend on the rater, and VHI scores depend on the patient's own perception.

Acoustic biomarkers extracted from a sustained phonation recording — typically the vowel /a/ held for 3–5 seconds at a comfortable loudness level and microphone distance of 30 cm — provide an objective, repeatable complement to perceptual scales. They are used for:

  • Initial screening and severity quantification of dysphonia
  • Pre- and post-surgical voice assessment (e.g., vocal fold surgery)
  • Longitudinal tracking of response to voice therapy
  • Multi-site clinical research where inter-rater reliability is impractical
  • Feature engineering for machine learning models targeting pathological voice detection

The four metrics covered here — jitter, shimmer, HNR, and AVQI — are the most widely validated and cited in the clinical literature. They are computed natively by Praat (the de-facto phonetic analysis tool) and accessible from Python via the parselmouth binding.

2. Jitter — Cycle-to-Cycle Frequency Perturbation

Definition and Formula

Jitter measures the variability of the fundamental period (T0 = 1/F0) from one glottal cycle to the next. A perfectly periodic voice would have identical successive periods; real voices deviate slightly, and pathological voices deviate considerably.

The most commonly reported variant is local relative jitter:

Jitter (local, %) = [Σ|T(i) - T(i+1)| / (N-1)] / [ΣT(i) / N] × 100

where:
  T(i)  = duration of the i-th glottal period (seconds)
  N     = total number of extracted periods
  |·|   = absolute value

Numerator:   mean absolute difference between consecutive periods
Denominator: mean period (= 1 / mean F0)
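The formula above can be checked by hand on a short list of periods. The following is a minimal NumPy sketch, assuming the glottal periods have already been extracted; the function name `local_jitter_percent` and the example period values are illustrative and not part of Praat or parselmouth.

```python
import numpy as np

def local_jitter_percent(periods_s):
    """Local jitter (%) from consecutive glottal periods given in seconds."""
    periods = np.asarray(periods_s, dtype=float)
    if periods.size < 2:
        raise ValueError("need at least two periods")
    # Numerator: mean absolute difference between consecutive periods
    mean_abs_diff = np.mean(np.abs(np.diff(periods)))
    # Denominator: mean period
    mean_period = np.mean(periods)
    return 100.0 * mean_abs_diff / mean_period

# Hypothetical period sequences around 200 Hz (5 ms)
steady = [0.0050, 0.0050, 0.0050, 0.0050]
perturbed = [0.0050, 0.0051, 0.0049, 0.0050]

print(f"steady:    {local_jitter_percent(steady):.2f} %")
print(f"perturbed: {local_jitter_percent(perturbed):.2f} %")
```

Praat additionally screens out implausible periods before averaging, so values from this sketch will only match Praat on clean, fully voiced input.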

Other Praat jitter variants include RAP (relative average perturbation, 3-period smoothing) and PPQ5 (5-period perturbation quotient). Local jitter is the most sensitive but also the most susceptible to noise.
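RAP and PPQ5 follow the same pattern: each period is compared with the average of itself and its neighbours (3 periods for RAP, 5 for PPQ5), and the mean absolute deviation is divided by the mean period. A rough sketch under that definition, with the hypothetical helper name `perturbation_quotient`, might look like this:

```python
import numpy as np

def perturbation_quotient(periods_s, window):
    """Smoothed jitter (%): window=3 gives RAP, window=5 gives PPQ5."""
    periods = np.asarray(periods_s, dtype=float)
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd integer >= 3")
    if periods.size < window:
        raise ValueError("not enough periods for this window")
    half = window // 2
    # Moving average over `window` consecutive periods
    local_mean = np.convolve(periods, np.ones(window) / window, mode="valid")
    # Periods that have a full set of neighbours on both sides
    centre = periods[half:periods.size - half]
    return 100.0 * np.mean(np.abs(centre - local_mean)) / np.mean(periods)

# Hypothetical periods in seconds
periods = [0.0050, 0.0051, 0.0049, 0.0050]
print(f"RAP: {perturbation_quotient(periods, 3):.2f} %")
```

Because the neighbourhood average absorbs slow drift in F0, these smoothed variants are typically lower than local jitter on the same recording.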

Clinical Ranges

| Category | Jitter (local, %) | Clinical interpretation |
|---|---|---|
| Normal | < 1.04 % | Stable glottal cycle duration |
| Borderline | 1.04 – 2.0 % | Mild perturbation; monitor |
| Pathological | > 2.0 % | Roughness, dysphonia; further evaluation needed |

High jitter is associated with vocal fold nodules, polyps, unilateral paralysis, Parkinson's disease, and functional dysphonia. It is also elevated transiently in voice fatigue after prolonged vocal effort.

Python / Praat Computation

import parselmouth
from parselmouth.praat import call

def compute_jitter(audio_path: str) -> dict:
    """Extract local jitter, RAP, and PPQ5 from a sustained vowel recording."""
    snd = parselmouth.Sound(audio_path)

    # Extract pitch (glottal period) with recommended clinical settings
    pitch = call(snd, "To Pitch", 0.0, 75, 500)  # time_step=0 (auto), f0_min=75 Hz, f0_max=500 Hz

    # PointProcess: marks individual glottal pulse onsets
    point_process = call([snd, pitch], "To PointProcess (cc)")

    # Compute jitter variants
    jitter_local  = call(point_process, "Get jitter (local)",  0, 0, 0.0001, 0.02, 1.3)
    jitter_rap    = call(point_process, "Get jitter (rap)",    0, 0, 0.0001, 0.02, 1.3)
    jitter_ppq5   = call(point_process, "Get jitter (ppq5)",  0, 0, 0.0001, 0.02, 1.3)

    return {
        "jitter_local_pct": jitter_local * 100,  # convert to percent
        "jitter_rap_pct":   jitter_rap   * 100,
        "jitter_ppq5_pct":  jitter_ppq5  * 100,
    }

result = compute_jitter("sustained_a.wav")
print(f"Jitter (local): {result['jitter_local_pct']:.2f}%")

Parameter notes:

The arguments to Get jitter are: time range (0,0 = whole file), minimum/maximum period (0.0001 s = 10 000 Hz, 0.02 s = 50 Hz), and maximum period factor (1.3). These guard against octave errors and aperiodic segments being misidentified as valid periods.
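To illustrate what this kind of screening does, the sketch below drops any pair of consecutive periods that falls outside the floor/ceiling range or whose ratio exceeds the maximum period factor. It is a simplified approximation written for this article, not Praat's exact algorithm, and the function name `screened_local_jitter_percent` is hypothetical.

```python
import numpy as np

def screened_local_jitter_percent(periods_s, period_floor=0.0001,
                                  period_ceiling=0.02, max_period_factor=1.3):
    """Local jitter (%) over consecutive period pairs that pass screening."""
    periods = np.asarray(periods_s, dtype=float)
    if periods.size < 2 or np.any(periods <= 0):
        raise ValueError("need at least two positive periods")
    first, second = periods[:-1], periods[1:]
    in_range = ((first >= period_floor) & (first <= period_ceiling)
                & (second >= period_floor) & (second <= period_ceiling))
    # Ratio of the longer to the shorter period in each pair
    ratio = np.maximum(first, second) / np.minimum(first, second)
    keep = in_range & (ratio <= max_period_factor)
    if not keep.any():
        return float("nan")
    # Mean period over the periods that take part in at least one kept pair
    used = np.zeros(periods.size, dtype=bool)
    used[:-1] |= keep
    used[1:] |= keep
    mean_abs_diff = np.mean(np.abs(first[keep] - second[keep]))
    return 100.0 * mean_abs_diff / np.mean(periods[used])

# Hypothetical sequence with one octave error (a doubled period of 10 ms)
with_octave_error = [0.0050, 0.0051, 0.0100, 0.0049, 0.0050]
print(f"{screened_local_jitter_percent(with_octave_error):.2f} %")
```

Without the ratio check, the single doubled period would dominate the result and push the jitter value far above anything physiologically meaningful.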

3. Shimmer — Cycle-to-Cycle Amplitude Perturbation

Definition and Formula

Shimmer captures instability in vocal fold vibration amplitude rather than timing. Where jitter measures irregularity in when each cycle occurs, shimmer measures variability in how strongly the folds close and re-open. The two most reported variants are shimmer (local, percent) and shimmer (local, dB):

Shimmer (local, %) = [Σ|A(i) - A(i+1)| / (N-1)] / [ΣA(i) / N] × 100

Shimmer (local, dB) = (1/(N-1)) × Σ |20 × log10(A(i+1) / A(i))|

where:
  A(i)  = peak amplitude of the i-th glottal cycle
  N     = number of periods
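Both shimmer variants can be reproduced on a short list of per-cycle peak amplitudes. This is a minimal sketch assuming the amplitudes are already extracted and strictly positive; it takes the absolute value of each dB ratio, as in the standard definition, and the name `local_shimmer` is illustrative.

```python
import numpy as np

def local_shimmer(peak_amplitudes):
    """Return (shimmer in %, shimmer in dB) from per-cycle peak amplitudes."""
    amps = np.asarray(peak_amplitudes, dtype=float)
    if amps.size < 2 or np.any(amps <= 0):
        raise ValueError("need at least two strictly positive amplitudes")
    # Percent: mean absolute difference divided by mean amplitude
    shimmer_pct = 100.0 * np.mean(np.abs(np.diff(amps))) / np.mean(amps)
    # dB: mean absolute log-ratio of consecutive amplitudes
    shimmer_db = np.mean(np.abs(20.0 * np.log10(amps[1:] / amps[:-1])))
    return float(shimmer_pct), float(shimmer_db)

# Hypothetical amplitudes alternating by 10 %
pct, db = local_shimmer([1.0, 1.1, 1.0, 1.1])
print(f"shimmer: {pct:.2f} %, {db:.3f} dB")
```

Without the absolute value, rises and falls in amplitude would cancel out and a strongly alternating signal could report a shimmer near 0 dB.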

Clinical Ranges

| Category | Shimmer (local, %) | Shimmer (local, dB) |
|---|---|---|
| Normal | < 3.81 % | < 0.35 dB |
| Borderline | 3.81 – 6.0 % | 0.35 – 0.60 dB |
| Pathological | > 6.0 % | > 0.60 dB |

High shimmer is typically associated with breathiness and irregularity of vocal fold closure. It is elevated in cases of vocal fold edema, sulcus vocalis, and Reinke's edema. Both shimmer and jitter tend to co-increase in pathological voices, but their relative magnitudes can help differentiate voice disorders.

def compute_shimmer(audio_path: str) -> dict:
    """Extract shimmer (local %, dB, APQ3, APQ5) from a sustained vowel."""
    snd = parselmouth.Sound(audio_path)
    pitch = call(snd, "To Pitch", 0.0, 75, 500)
    point_process = call([snd, pitch], "To PointProcess (cc)")

    shimmer_local    = call([snd, point_process], "Get shimmer (local)",    0, 0, 0.0001, 0.02, 1.3, 1.6)
    shimmer_local_dB = call([snd, point_process], "Get shimmer (local_dB)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
    shimmer_apq3     = call([snd, point_process], "Get shimmer (apq3)",     0, 0, 0.0001, 0.02, 1.3, 1.6)
    shimmer_apq5     = call([snd, point_process], "Get shimmer (apq5)",     0, 0, 0.0001, 0.02, 1.3, 1.6)

    return {
        "shimmer_local_pct": shimmer_local    * 100,
        "shimmer_local_dB":  shimmer_local_dB,
        "shimmer_apq3_pct":  shimmer_apq3     * 100,
        "shimmer_apq5_pct":  shimmer_apq5     * 100,
    }

4. HNR — Harmonics-to-Noise Ratio

Definition

The Harmonics-to-Noise Ratio (HNR) expresses, in decibels, how much of the voice signal is periodic (harmonic) versus aperiodic (noise-like). It is computed via autocorrelation of the acoustic waveform:

HNR (dB) = 10 × log10(r / (1 - r))

where:
  r = normalized autocorrelation peak at the fundamental period lag T0

Equivalently:
  HNR = 10 × log10(Energy_harmonic / Energy_noise)

A perfect sine wave has r → 1 → HNR → +∞ dB
White noise has        r ≈ 0 → HNR → -∞ dB
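The mapping from autocorrelation peak to decibels is easy to tabulate. The sketch below applies the formula above directly; the function name `hnr_db_from_autocorrelation` is an illustrative choice, and the input is assumed to be a normalized peak strictly between 0 and 1.

```python
import math

def hnr_db_from_autocorrelation(r):
    """HNR in dB from a normalized autocorrelation peak r, with 0 < r < 1."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    return 10.0 * math.log10(r / (1.0 - r))

for r in (0.5, 0.9, 0.99):
    print(f"r = {r:.2f} -> HNR = {hnr_db_from_autocorrelation(r):.2f} dB")
```

At r = 0.5 the harmonic and noise energies are equal, which gives 0 dB; each additional "nine" in r adds roughly 10 dB.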

Clinical Ranges

| Category | HNR (dB) | Perceptual correlate |
|---|---|---|
| Normal | > 20 dB | Clear, modal phonation |
| Mild dysphonia | 12 – 20 dB | Slight breathiness or roughness |
| Moderate dysphonia | 7 – 12 dB | Clearly audible breathiness |
| Severe dysphonia | < 7 dB | Severely impaired phonation |

def compute_hnr(audio_path: str) -> float:
    """Compute mean HNR (dB) using Praat's cross-correlation method."""
    snd = parselmouth.Sound(audio_path)

    # "To Harmonicity (cc)" estimates harmonicity with Praat's cross-correlation method.
    # Arguments: time_step (s), min_pitch (Hz), silence_threshold, periods_per_window
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)

    hnr_mean = call(harmonicity, "Get mean", 0, 0)  # 0,0 = full file
    return hnr_mean

print(f"HNR: {compute_hnr('sustained_a.wav'):.1f} dB")

HNR vs. NHR:

Some literature reports the inverse — NHR (Noise-to-Harmonics Ratio). A normal NHR is < 0.19 (linear scale). HNR and NHR convey the same information; always check which one a paper or tool reports before comparing values.

5. AVQI — Acoustic Voice Quality Index

What AVQI Is and Why It Exists

Jitter, shimmer, and HNR each capture one aspect of voice quality. A voice can have normal jitter but severely impaired HNR. The Acoustic Voice Quality Index (AVQI), developed by Maryn et al. and validated across multiple languages, collapses four complementary features into a single continuous scale that correlates strongly with expert perceptual ratings of overall dysphonia severity.

AVQI v03 Formula

AVQI = 3.284 + (0.181 × ShdB) + (−0.098 × HNR) + (−0.218 × CPP) + (0.008 × SpSlo)

where:
  ShdB   = shimmer (local, dB)                       [perturbation]
  HNR    = harmonics-to-noise ratio (dB)              [periodic/aperiodic ratio]
  CPP    = cepstral peak prominence (dB)              [spectral regularity of voicing]
  SpSlo  = spectral slope (dB/octave or linear fit)   [spectral tilt, breathiness correlate]

Interpretation:
  AVQI < 2.95  →  normal voice quality
  AVQI ≥ 2.95  →  dysphonia likely; clinical evaluation recommended
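Separating the arithmetic from the feature extraction makes the weighting easier to inspect and unit-test. The sketch below uses the coefficients and the 2.95 cut-off exactly as written in this article; the function names and the input values are hypothetical and chosen only to exercise the arithmetic.

```python
def avqi_from_features(sh_db, hnr_db, cpp_db, sp_slope):
    """Combine four pre-computed features with the coefficients given above."""
    return (3.284
            + 0.181 * sh_db      # shimmer (local, dB)
            - 0.098 * hnr_db     # harmonics-to-noise ratio (dB)
            - 0.218 * cpp_db     # cepstral peak prominence (dB)
            + 0.008 * sp_slope)  # spectral slope

def is_dysphonic(avqi, cutoff=2.95):
    """Apply the cut-off stated in this article (assumed, may vary by language)."""
    return avqi >= cutoff

# Hypothetical feature values, not taken from a real recording
score = avqi_from_features(sh_db=0.3, hnr_db=20.0, cpp_db=14.0, sp_slope=-25.0)
print(f"AVQI = {score:.4f}, dysphonic: {is_dysphonic(score)}")
```

Keeping the cut-off as a parameter makes it straightforward to swap in a language-specific threshold without touching the formula.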

CPP (Cepstral Peak Prominence) is the height of the dominant peak in the cepstrum relative to a regression line through the cepstral envelope. A strong CPP indicates well-defined periodicity at F0; low CPP characterises breathy or irregular voices. Spectral slope captures how steeply the harmonic energy falls off with frequency — breathiness tends to produce a steeper slope.

import numpy as np
import parselmouth
from parselmouth.praat import call

def compute_avqi(audio_path: str) -> float:
    """
    Compute AVQI v03 (Maryn et al.) from a sustained vowel recording.
    Returns a continuous score: < 2.95 = normal, >= 2.95 = dysphonic.
    """
    snd = parselmouth.Sound(audio_path)
    pitch = call(snd, "To Pitch", 0.0, 75, 500)
    pp    = call([snd, pitch], "To PointProcess (cc)")

    # Shimmer (local, dB)
    sh_db = call([snd, pp], "Get shimmer (local_dB)", 0, 0, 0.0001, 0.02, 1.3, 1.6)

    # HNR
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)

    # CPP — via Praat's PowerCepstrogram
    pc  = call(snd, "To PowerCepstrogram", 60, 0.002, 5000, 50)
    cpp = call(pc,  "Get CPPS", "yes", 0.02, 0.0, 60, 330, 0.05, "Parabolic", 0.001, 0, "Straight", "Robust")

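    # Spectral slope: compares mean energy in the 0–4 kHz band with the 4–8 kHz band.
    # "Get slope" is a query on Praat's long-term average spectrum (Ltas) object,
    # so the sound is converted to an Ltas first.
    ltas   = call(snd, "To Ltas", 1)  # bandwidth (Hz)
    sp_slo = call(ltas, "Get slope", 0, 4000, 4000, 8000, "energy")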

    avqi = 3.284 + (0.181 * sh_db) + (-0.098 * hnr) + (-0.218 * cpp) + (0.008 * sp_slo)
    return avqi

score = compute_avqi("sustained_a.wav")
print(f"AVQI: {score:.2f} — {'dysphonic' if score >= 2.95 else 'normal range'}")

Validation note:

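AVQI v03 was validated in Dutch, English, Portuguese, and Korean, consistently achieving AUC > 0.90 for distinguishing normal from dysphonic voices. The 2.95 cut-off was derived against expert GRBAS ratings; some labs use 2.97 or a language-specific variant. Published AVQI protocols analyse a concatenation of continuous speech and a sustained vowel, so a score computed from a sustained vowel alone is not directly comparable with published cut-offs. Always report which AVQI version and cut-off you used, and check the coefficients against the publication for that version.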

6. How Recording Quality Affects Every Metric

All four metrics — jitter, shimmer, HNR, and AVQI — are computed from the raw audio waveform. Any artefact in that waveform propagates directly into the extracted features. The two most common culprits in clinical recordings are an inadequate recording level and background noise.

Recording Level: The dBFS Window

For reliable biomarker extraction, the recommended peak recording level during sustained phonation is between −20 dBFS and −12 dBFS. This window balances two opposing risks:

  • Too quiet (below −40 dBFS peak): The microphone self-noise floor dominates. Jitter and shimmer are inflated by noise-induced period-detection errors. HNR is artificially lowered because the aperiodic noise energy rivals the harmonic energy of the voice.
  • Too loud (peaks reaching 0 dBFS): A digital signal cannot exceed 0 dBFS, so anything louder is clipped at full scale. Clipping introduces hard nonlinearities into the waveform. The resulting flat-topped peaks look like amplitude perturbation, inflating shimmer. The harmonic structure is distorted, crushing HNR. A clipped recording cannot be rescued in post-processing.
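A level check along these lines can run before any metric is extracted. The sketch below assumes floating-point samples scaled to the range −1 to 1 and treats any sample at or above 0.999 of full scale as clipped; that threshold, the status labels, and the function name `check_recording_level` are assumptions made for illustration.

```python
import numpy as np

def check_recording_level(samples, low_dbfs=-20.0, high_dbfs=-12.0, clip_level=0.999):
    """Report peak level (dBFS), clipped sample count, and a coarse status."""
    x = np.asarray(samples, dtype=float)
    peak = float(np.max(np.abs(x)))
    peak_dbfs = float("-inf") if peak == 0.0 else float(20.0 * np.log10(peak))
    clipped_samples = int(np.count_nonzero(np.abs(x) >= clip_level))
    if clipped_samples > 0:
        status = "clipped"
    elif peak_dbfs < low_dbfs:
        status = "too quiet"
    elif peak_dbfs > high_dbfs:
        status = "too hot"
    else:
        status = "ok"
    return {"peak_dbfs": peak_dbfs, "clipped_samples": clipped_samples, "status": status}

# Synthetic 200 Hz tone, 1 s at 16 kHz, standing in for a sustained vowel
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 200 * t)

good = check_recording_level(0.2 * tone)
quiet = check_recording_level(0.005 * tone)
clipped = check_recording_level(np.clip(1.5 * tone, -1.0, 1.0))
print(good["status"], quiet["status"], clipped["status"])
```

A sample-count test like this catches hard digital clipping only; analogue clipping earlier in the chain can flatten peaks well below full scale and needs a visual or statistical check of the waveform.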

Understanding the dBFS scale and how it relates to physical sound pressure is a prerequisite for setting up a valid clinical recording protocol. If you are new to this distinction, our article on dB SPL vs. dBFS — and how to convert between them covers the calibration workflow in detail.

Minimum SNR Requirements

| Metric | Minimum SNR for reliable extraction | What fails below threshold |
|---|---|---|
| Jitter | ≥ 30 dB SNR | Noise causes false period boundaries → jitter inflated |
| Shimmer | ≥ 25 dB SNR | Peak amplitude estimates corrupted → shimmer inflated |
| HNR | ≥ 20 dB SNR | Aperiodic noise floor reduces apparent harmonic energy |
| AVQI | ≥ 30 dB SNR | Propagated errors in ShdB + HNR corrupt composite score |
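One simple way to apply these thresholds is to compare the RMS level of the phonation with the RMS level of a noise-only stretch recorded in the same session. The sketch below assumes the two segments have already been cut out by hand; the function names and the `MIN_SNR_DB` dictionary are illustrative, with values copied from the table above.

```python
import numpy as np

# Minimum SNR per metric, as listed in the table above
MIN_SNR_DB = {"jitter": 30.0, "shimmer": 25.0, "hnr": 20.0, "avqi": 30.0}

def rms(x):
    """Root-mean-square level of a signal segment."""
    return float(np.sqrt(np.mean(np.square(np.asarray(x, dtype=float)))))

def estimate_snr_db(voiced_segment, noise_segment):
    """SNR estimate (dB) from a phonation segment and a noise-only segment."""
    noise = rms(noise_segment)
    if noise == 0.0:
        return float("inf")
    return 20.0 * np.log10(rms(voiced_segment) / noise)

def metrics_at_risk(snr_db):
    """Metrics whose minimum SNR requirement is not met."""
    return [name for name, minimum in MIN_SNR_DB.items() if snr_db < minimum]

# Synthetic example: 200 Hz tone plus low-level noise, seeded for repeatability
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
noise_only = rng.normal(0.0, 0.001, t.size)
voiced = 0.1 * np.sin(2 * np.pi * 200 * t) + rng.normal(0.0, 0.001, t.size)

snr = estimate_snr_db(voiced, noise_only)
print(f"SNR = {snr:.1f} dB, at risk: {metrics_at_risk(snr)}")
```

The phonation segment still contains noise, so this estimate is slightly optimistic at low SNR; at the 20–30 dB levels in the table the bias is small.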

For deep learning models that operate on the same recordings — such as Wav2Vec2-based pathology classifiers — input normalisation and noise floor management are equally critical. Our Wav2Vec2 clinical audio guide covers the preprocessing chain for ML pipelines built on pathological voice data.

7. Practical Reference: Metrics at a Glance

| Metric | Praat command | Normal range | Pathological threshold | Python library |
|---|---|---|---|---|
| Jitter (local) | Get jitter (local) | < 1.04 % | > 2.0 % | parselmouth |
| Shimmer (local, dB) | Get shimmer (local_dB) | < 0.35 dB | > 0.60 dB | parselmouth |
| HNR | To Harmonicity (cc) | > 20 dB | < 12 dB | parselmouth |
| CPP | Get CPPS | > 12 dB | < 7 dB | parselmouth |
| AVQI v03 | Composite formula | < 2.95 | ≥ 2.95 | parselmouth + numpy |
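The thresholds in this table can be encoded as a small lookup for batch screening. This is an illustrative sketch only: the dictionary keys, the "borderline" label for values between the two limits, and the function name `classify` are assumptions, and the output is a screening flag rather than a diagnosis.

```python
# metric: (normal limit, pathological limit, higher_is_better)
THRESHOLDS = {
    "jitter_local_pct": (1.04, 2.0, False),
    "shimmer_local_db": (0.35, 0.60, False),
    "hnr_db": (20.0, 12.0, True),
    "cpp_db": (12.0, 7.0, True),
}

def classify(metric, value):
    """Return 'normal', 'borderline', or 'pathological' for one metric value."""
    normal_limit, pathological_limit, higher_is_better = THRESHOLDS[metric]
    if higher_is_better:
        if value > normal_limit:
            return "normal"
        if value < pathological_limit:
            return "pathological"
    else:
        if value < normal_limit:
            return "normal"
        if value > pathological_limit:
            return "pathological"
    return "borderline"

# Hypothetical measurements
for metric, value in [("jitter_local_pct", 1.8), ("hnr_db", 14.3), ("shimmer_local_db", 0.7)]:
    print(f"{metric} = {value}: {classify(metric, value)}")
```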

Frequently Asked Questions

What is the normal range for jitter in voice analysis?

Normal local relative jitter is below 1.04% in healthy adult voices. Values above this suggest abnormal cycle-to-cycle frequency instability, with values above 2% considered clearly pathological. Praat also reports RAP and PPQ5 variants; the 1.04% local jitter threshold is the most widely cited in clinical literature for dysphonia screening.

What does a low HNR value mean clinically?

A low HNR indicates that a large proportion of the voice signal is aperiodic noise rather than harmonic energy. HNR below 20 dB during sustained phonation is associated with breathiness, roughness, or hoarseness. Values below 7 dB suggest severe dysphonia. HNR can also be reduced artifactually by a noisy recording environment — always verify recording SNR before interpreting low HNR as clinical evidence.

What is the AVQI and how is it interpreted?

The Acoustic Voice Quality Index (AVQI v03) is a composite score combining shimmer (dB), HNR, Cepstral Peak Prominence, and spectral slope: AVQI = 3.284 + (0.181 × ShdB) − (0.098 × HNR) − (0.218 × CPP) + (0.008 × SpSlo). A score below 2.95 indicates normal voice quality; at or above 2.95 suggests dysphonia. The index has been validated across Dutch, English, Portuguese, and Korean with AUC > 0.90.

How does recording level in dBFS affect jitter, shimmer, and HNR?

Recording level directly impacts metric reliability. A signal below −40 dBFS peak buries the voice in microphone noise, inflating jitter and shimmer through false period detection while dragging HNR down. A clipped signal (peaks reaching 0 dBFS) introduces waveform distortion that mimics pathological amplitude perturbation. The safe window for clinical voice recording is a peak between −20 dBFS and −12 dBFS.

Automate These Metrics in Your Clinical Workflow

Computing jitter, shimmer, HNR, and AVQI manually for every patient is time-consuming and error-prone. Vocametrix is a platform that automates all of these measurements — and more — from a single recorded audio file. It handles recording level validation, period detection, cepstral analysis, and AVQI scoring through a single API call, with results interpretable by speech-language pathologists and ML practitioners alike.
