Voice Quality Metrics: Jitter, Shimmer, HNR, and AVQI Explained

What you'll learn: precise definitions and computation formulas for the four core acoustic voice biomarkers — jitter, shimmer, HNR, and AVQI — how to extract them in Python using parselmouth/Praat, their clinical interpretation ranges, and how recording quality directly affects the reliability of every metric.

A clinician listens to a patient sustain the vowel /a/ for three seconds and describes the voice as "rough and slightly breathy." An acoustic analysis of the same recording reports a jitter of 1.8%, a shimmer of 4.2%, an HNR of 14.3 dB, and an AVQI of 3.7. These two descriptions encode the same perceptual reality, but the numbers are reproducible, comparable across clinics, and trackable over the course of therapy.

This article explains exactly what those four numbers mean, how they are computed from the raw audio waveform, what their normal and pathological ranges are, and why the recording conditions (level, noise floor, clipping) can silently corrupt them if you are not careful.

1. Why Voice Quality Metrics Matter

Dysphonia, meaning any disorder that impairs normal voice production, affects an estimated 7% of the general population and a much higher proportion of professional voice users and patients with neurological conditions. Assessment has traditionally relied on auditory-perceptual scales such as GRBAS (Grade, Roughness, Breathiness, Asthenia, Strain) and on self-report questionnaires such as the VHI (Voice Handicap Index). Both are subjective: GRBAS scores depend on the rater, and VHI scores depend on the patient's own perception.

Acoustic biomarkers extracted from a sustained phonation recording — typically the vowel /a/ held for 3–5 seconds at a comfortable loudness level and microphone distance of 30 cm — provide an objective, repeatable complement to perceptual scales. They are used for:

Initial screening and severity quantification of dysphonia
Pre- and post-surgical voice assessment (e.g., vocal fold surgery)
Longitudinal tracking of response to voice therapy
Multi-site clinical research where inter-rater reliability is impractical
Feature engineering for machine learning models targeting pathological voice detection

The four metrics covered here — jitter, shimmer, HNR, and AVQI — are the most widely validated and cited in the clinical literature. They are computed natively by Praat (the de-facto phonetic analysis tool) and accessible from Python via the parselmouth binding.

2. Jitter — Cycle-to-Cycle Frequency Perturbation

Definition and Formula

Jitter measures the variability of the fundamental period (T0 = 1/F0) from one glottal cycle to the next. A perfectly periodic voice would have identical successive periods; real voices deviate slightly, and pathological voices deviate substantially.

The most commonly reported variant is local relative jitter:

Jitter (local, %) = [Σ|T(i) - T(i+1)| / (N-1)] / [ΣT(i) / N] × 100

where:
  T(i)  = duration of the i-th glottal period (seconds)
  N     = total number of extracted periods
  |·|   = absolute value

Numerator:   mean absolute difference between consecutive periods
Denominator: mean period (= 1 / mean F0)

Other Praat jitter variants include RAP (relative average perturbation, 3-period smoothing) and PPQ5 (5-period perturbation quotient). Local jitter is the most sensitive but also the most susceptible to noise.
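The local jitter formula above can be checked by hand on a short list of periods. The following is a minimal sketch in plain Python, assuming the glottal periods have already been extracted; the function name and the example periods are hypothetical, and the sketch applies none of Praat's period filtering, so it will not reproduce Praat's output on real recordings.

```python
from typing import Sequence


def local_jitter_percent(periods: Sequence[float]) -> float:
    """Local relative jitter (%) from consecutive glottal periods in seconds."""
    n = len(periods)
    if n < 2:
        raise ValueError("need at least two periods")
    # Numerator: mean absolute difference between consecutive periods
    mean_abs_diff = sum(abs(periods[i] - periods[i + 1]) for i in range(n - 1)) / (n - 1)
    # Denominator: mean period
    mean_period = sum(periods) / n
    return 100.0 * mean_abs_diff / mean_period


# Hypothetical periods around 5 ms (roughly 200 Hz)
example_periods = [0.0050, 0.0051, 0.0049, 0.0050, 0.0052]
print(f"Jitter (local): {local_jitter_percent(example_periods):.2f}%")  # 2.98%
```

Here the mean absolute difference is 0.15 ms and the mean period is 5.04 ms, giving about 2.98%.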

Clinical Ranges

Category	Jitter (local, %)	Clinical interpretation
Normal	< 1.04 %	Stable glottal cycle duration
Borderline	1.04 – 2.0 %	Mild perturbation; monitor
Pathological	> 2.0 %	Roughness, dysphonia; further evaluation needed

High jitter is associated with vocal fold nodules, polyps, unilateral paralysis, Parkinson's disease, and functional dysphonia. It is also elevated transiently in voice fatigue after prolonged vocal effort.

Python / Praat Computation

import parselmouth
from parselmouth.praat import call

def compute_jitter(audio_path: str) -> dict:
    """Extract local jitter, RAP, and PPQ5 from a sustained vowel recording."""
    snd = parselmouth.Sound(audio_path)

    # Extract pitch (glottal period) with recommended clinical settings
    pitch = call(snd, "To Pitch", 0.0, 75, 500)  # time_step=0 (auto), f0_min=75 Hz, f0_max=500 Hz

    # PointProcess: marks individual glottal pulse onsets
    point_process = call([snd, pitch], "To PointProcess (cc)")

    # Compute jitter variants
    jitter_local  = call(point_process, "Get jitter (local)",  0, 0, 0.0001, 0.02, 1.3)
    jitter_rap    = call(point_process, "Get jitter (rap)",    0, 0, 0.0001, 0.02, 1.3)
    jitter_ppq5   = call(point_process, "Get jitter (ppq5)",  0, 0, 0.0001, 0.02, 1.3)

    return {
        "jitter_local_pct": jitter_local * 100,  # convert to percent
        "jitter_rap_pct":   jitter_rap   * 100,
        "jitter_ppq5_pct":  jitter_ppq5  * 100,
    }

result = compute_jitter("sustained_a.wav")
print(f"Jitter (local): {result['jitter_local_pct']:.2f}%")

Parameter notes:

The arguments to Get jitter are: time range (0,0 = whole file), minimum/maximum period (0.0001 s = 10 000 Hz, 0.02 s = 50 Hz), and maximum period factor (1.3). These guard against octave errors and aperiodic segments being misidentified as valid periods.
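To make the effect of these guards concrete, here is a simplified sketch of the pair-selection rule: a pair of consecutive periods is kept only if both fall inside the allowed range and their ratio does not exceed the maximum period factor. The function name and the sample periods are hypothetical, and Praat's internal implementation may differ in detail.

```python
from typing import List, Sequence, Tuple


def usable_period_pairs(
    periods: Sequence[float],
    shortest: float = 0.0001,   # minimum period (s); assumed to be > 0
    longest: float = 0.02,      # maximum period (s)
    max_factor: float = 1.3,    # maximum ratio between consecutive periods
) -> List[Tuple[float, float]]:
    """Return the consecutive period pairs that pass all three guards."""
    pairs = []
    for p1, p2 in zip(periods[:-1], periods[1:]):
        if not (shortest <= p1 <= longest and shortest <= p2 <= longest):
            continue  # one of the two periods is implausibly short or long
        if max(p1, p2) / min(p1, p2) > max_factor:
            continue  # jump too large: likely an octave error or a voicing break
        pairs.append((p1, p2))
    return pairs


# Hypothetical sequence with one doubled period (octave error) and one 30 ms gap
periods = [0.0050, 0.0051, 0.0102, 0.0050, 0.0300, 0.0051]
print(usable_period_pairs(periods))  # only the first pair survives
```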

3. Shimmer — Cycle-to-Cycle Amplitude Perturbation

Definition and Formula

Shimmer captures instability in vocal fold vibration amplitude rather than timing. Where jitter measures irregularity in when each cycle occurs, shimmer measures variability in how strongly the folds close and re-open. The two most reported variants are shimmer (local, percent) and shimmer (local, dB):

Shimmer (local, %) = [Σ|A(i) - A(i+1)| / (N-1)] / [ΣA(i) / N] × 100

Shimmer (local, dB) = (1/(N-1)) × Σ |20 × log10(A(i+1) / A(i))|

where:
  A(i)  = peak amplitude of the i-th glottal cycle
  N     = number of periods
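Both shimmer variants can be computed directly from a list of per-cycle peak amplitudes. The sketch below assumes those amplitudes are already available and strictly positive; the function name and example values are hypothetical. It takes the absolute value of each dB ratio, since otherwise rises and falls in amplitude would cancel in the sum.

```python
import math
from typing import Sequence, Tuple


def local_shimmer(amplitudes: Sequence[float]) -> Tuple[float, float]:
    """Return (shimmer in percent, shimmer in dB) from per-cycle peak amplitudes."""
    n = len(amplitudes)
    if n < 2:
        raise ValueError("need at least two amplitudes")
    if min(amplitudes) <= 0:
        raise ValueError("amplitudes must be strictly positive")
    pairs = list(zip(amplitudes[:-1], amplitudes[1:]))
    # Percent variant: mean absolute difference relative to the mean amplitude
    mean_abs_diff = sum(abs(a - b) for a, b in pairs) / (n - 1)
    shimmer_pct = 100.0 * mean_abs_diff / (sum(amplitudes) / n)
    # dB variant: mean absolute log-ratio of consecutive amplitudes
    shimmer_db = sum(abs(20.0 * math.log10(b / a)) for a, b in pairs) / (n - 1)
    return shimmer_pct, shimmer_db


# Hypothetical peak amplitudes (linear scale)
pct, db = local_shimmer([0.50, 0.52, 0.49, 0.51, 0.50])
print(f"Shimmer: {pct:.2f}% / {db:.2f} dB")
```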

Clinical Ranges

Category	Shimmer (local, %)	Shimmer (local, dB)
Normal	< 3.81 %	< 0.35 dB
Borderline	3.81 – 6.0 %	0.35 – 0.60 dB
Pathological	> 6.0 %	> 0.60 dB

High shimmer is typically associated with breathiness and irregularity of vocal fold closure. It is elevated in cases of vocal fold edema, sulcus vocalis, and Reinke's edema. Both shimmer and jitter tend to co-increase in pathological voices, but their relative magnitudes can help differentiate voice disorders.

def compute_shimmer(audio_path: str) -> dict:
    """Extract shimmer (local %, dB, APQ3, APQ5) from a sustained vowel."""
    snd = parselmouth.Sound(audio_path)
    pitch = call(snd, "To Pitch", 0.0, 75, 500)
    point_process = call([snd, pitch], "To PointProcess (cc)")

    shimmer_local    = call([snd, point_process], "Get shimmer (local)",    0, 0, 0.0001, 0.02, 1.3, 1.6)
    shimmer_local_dB = call([snd, point_process], "Get shimmer (local_dB)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
    shimmer_apq3     = call([snd, point_process], "Get shimmer (apq3)",     0, 0, 0.0001, 0.02, 1.3, 1.6)
    shimmer_apq5     = call([snd, point_process], "Get shimmer (apq5)",     0, 0, 0.0001, 0.02, 1.3, 1.6)

    return {
        "shimmer_local_pct": shimmer_local    * 100,
        "shimmer_local_dB":  shimmer_local_dB,
        "shimmer_apq3_pct":  shimmer_apq3     * 100,
        "shimmer_apq5_pct":  shimmer_apq5     * 100,
    }

4. HNR — Harmonics-to-Noise Ratio

Definition

The Harmonics-to-Noise Ratio (HNR) expresses, in decibels, how much of the voice signal is periodic (harmonic) versus aperiodic (noise-like). It is computed via autocorrelation of the acoustic waveform:

HNR (dB) = 10 × log10(r / (1 - r))

where:
  r = normalized autocorrelation peak at the fundamental period lag T0

Equivalently:
  HNR = 10 × log10(Energy_harmonic / Energy_noise)

A perfect sine wave has r → 1 → HNR → +∞ dB
White noise has        r ≈ 0 → HNR → -∞ dB
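A quick numeric check of the formula shows how steeply HNR grows as r approaches 1. This is a minimal sketch with a hypothetical function name; it only evaluates the formula and does not estimate r from audio.

```python
import math


def hnr_from_autocorrelation(r: float) -> float:
    """HNR in dB from the normalized autocorrelation peak r, with 0 < r < 1."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    return 10.0 * math.log10(r / (1.0 - r))


for r in (0.5, 0.9, 0.99):
    print(f"r = {r:.2f} -> HNR = {hnr_from_autocorrelation(r):.2f} dB")
# r = 0.50 -> HNR = 0.00 dB
# r = 0.90 -> HNR = 9.54 dB
# r = 0.99 -> HNR = 19.96 dB
```

An HNR of 20 dB therefore corresponds to r of roughly 0.99: about 99% of the signal energy is periodic.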

Clinical Ranges

Category	HNR (dB)	Perceptual correlate
Normal	> 20 dB	Clear, modal phonation
Mild dysphonia	12 – 20 dB	Slight breathiness or roughness
Moderate dysphonia	7 – 12 dB	Clearly audible breathiness
Severe dysphonia	< 7 dB	Severely impaired phonation
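The table above can be turned into a small lookup for reports or batch screening. The sketch below is illustrative only: the function name is hypothetical, and the handling of values that fall exactly on a boundary (20, 12, or 7 dB) is an assumption, since the table does not specify it.

```python
def classify_hnr(hnr_db: float) -> str:
    """Map a mean HNR value (dB) to the category used in the table above."""
    if hnr_db > 20.0:
        return "normal"
    if hnr_db >= 12.0:   # assumption: exactly 20 dB counts as mild
        return "mild dysphonia"
    if hnr_db >= 7.0:    # assumption: exactly 12 dB counts as mild, 7 dB as moderate
        return "moderate dysphonia"
    return "severe dysphonia"


print(classify_hnr(14.3))  # mild dysphonia
```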

def compute_hnr(audio_path: str) -> float:
    """Compute mean HNR (dB) using Praat's cross-correlation method."""
    snd = parselmouth.Sound(audio_path)

    # "To Harmonicity (cc)" estimates harmonicity by cross-correlation.
    # Arguments: time_step (s), min_pitch (Hz), silence_threshold, periods_per_window
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)

    hnr_mean = call(harmonicity, "Get mean", 0, 0)  # 0,0 = full file
    return hnr_mean

print(f"HNR: {compute_hnr('sustained_a.wav'):.1f} dB")

HNR vs. NHR:

Some literature reports the inverse — NHR (Noise-to-Harmonics Ratio). A normal NHR is < 0.19 (linear scale). HNR and NHR convey the same information; always check which one a paper or tool reports before comparing values.

5. AVQI — Acoustic Voice Quality Index

What AVQI Is and Why It Exists

Jitter, shimmer, and HNR each capture one aspect of voice quality. A voice can have normal jitter but severely impaired HNR. The Acoustic Voice Quality Index (AVQI), developed by Maryn et al. and validated across multiple languages, collapses four complementary features into a single continuous scale that correlates strongly with expert perceptual ratings of overall dysphonia severity.

AVQI v03 Formula

AVQI = 3.284 + (0.181 × ShdB) + (−0.098 × HNR) + (−0.218 × CPP) + (0.008 × SpSlo)

where:
  ShdB   = shimmer (local, dB)                       [perturbation]
  HNR    = harmonics-to-noise ratio (dB)              [periodic/aperiodic ratio]
  CPP    = cepstral peak prominence (dB)              [spectral regularity of voicing]
  SpSlo  = spectral slope (dB/octave or linear fit)   [spectral tilt, breathiness correlate]

Interpretation:
  AVQI < 2.95  →  normal voice quality
  AVQI ≥ 2.95  →  dysphonia likely; clinical evaluation recommended

CPP (Cepstral Peak Prominence) is the height of the dominant peak in the cepstrum relative to a regression line through the cepstral envelope. A strong CPP indicates well-defined periodicity at F0; low CPP characterises breathy or irregular voices. Spectral slope captures how steeply the harmonic energy falls off with frequency — breathiness tends to produce a steeper slope.

import numpy as np
import parselmouth
from parselmouth.praat import call

def compute_avqi(audio_path: str) -> float:
    """
    Compute AVQI v03 (Maryn et al.) from a sustained vowel recording.
    Returns a continuous score: < 2.95 = normal, >= 2.95 = dysphonic.
    """
    snd = parselmouth.Sound(audio_path)
    pitch = call(snd, "To Pitch", 0.0, 75, 500)
    pp    = call([snd, pitch], "To PointProcess (cc)")

    # Shimmer (local, dB)
    sh_db = call([snd, pp], "Get shimmer (local_dB)", 0, 0, 0.0001, 0.02, 1.3, 1.6)

    # HNR
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    hnr = call(harmonicity, "Get mean", 0, 0)

    # CPP — via Praat's PowerCepstrogram
    pc  = call(snd, "To PowerCepstrogram", 60, 0.002, 5000, 50)
    cpp = call(pc,  "Get CPPS", "yes", 0.02, 0.0, 60, 330, 0.05, "Parabolic", 0.001, 0, "Straight", "Robust")

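    # Spectral slope: Praat exposes "Get slope" on Ltas objects, so build a
    # long-term average spectrum first (the 100 Hz bandwidth is an assumed setting).
    # The result is the energy difference (dB) between the 4-8 kHz and 0-4 kHz bands.
    ltas   = call(snd, "To Ltas", 100)
    sp_slo = call(ltas, "Get slope", 0, 4000, 4000, 8000, "energy")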

    avqi = 3.284 + (0.181 * sh_db) + (-0.098 * hnr) + (-0.218 * cpp) + (0.008 * sp_slo)
    return avqi

score = compute_avqi("sustained_a.wav")
print(f"AVQI: {score:.2f} — {'dysphonic' if score >= 2.95 else 'normal range'}")

Validation note:

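AVQI v03 was validated in Dutch, English, Portuguese, and Korean, consistently achieving AUC > 0.90 for distinguishing normal from dysphonic voices. The 2.95 cut-off was derived against expert GRBAS ratings; some labs use 2.97 or a language-specific variant. The published AVQI protocol analyses a concatenation of continuous speech and a sustained vowel, so a score computed from a sustained vowel alone should be treated as an approximation. Always report which AVQI version and cut-off you used.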

6. How Recording Quality Affects Every Metric

All four metrics — jitter, shimmer, HNR, and AVQI — are computed from the raw audio waveform. Any artefact in that waveform propagates directly into the extracted features. The two most common culprits in clinical recordings are an inadequate recording level and background noise.

Recording Level: The dBFS Window

For reliable biomarker extraction, the recommended peak recording level during sustained phonation is between −20 dBFS and −12 dBFS. This window balances two opposing risks:

Too quiet (below −40 dBFS peak): The microphone self-noise floor dominates. Jitter and shimmer are inflated by noise-induced period-detection errors. HNR is artificially lowered because the aperiodic noise energy rivals the harmonic energy of the voice.
Too loud (peaks reaching 0 dBFS): A digital signal cannot exceed 0 dBFS, so any louder input is clipped. Clipping introduces hard nonlinearities into the waveform. The resulting flat-topped peaks look like amplitude perturbation, inflating shimmer. The harmonic structure is destroyed, crushing HNR. A clipped recording cannot be rescued in post-processing.
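A level check of this kind is easy to run before any metric is extracted. The sketch below assumes the samples are floats normalised so that full scale is 1.0; the function names and return labels are hypothetical, and the thresholds are the ones quoted in this section.

```python
import math
from typing import Sequence


def peak_dbfs(samples: Sequence[float], full_scale: float = 1.0) -> float:
    """Peak level in dBFS for samples normalised to +/- full_scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return float("-inf")  # digital silence
    return 20.0 * math.log10(peak / full_scale)


def check_recording_level(samples: Sequence[float]) -> str:
    """Classify a recording by peak level, using the thresholds from this section."""
    level = peak_dbfs(samples)
    if level >= 0.0:
        return "possible clipping"
    if level < -40.0:
        return "too quiet"
    if -20.0 <= level <= -12.0:
        return "ok"
    return "outside recommended window"


# Synthetic 200 Hz tone at 16 kHz with a peak amplitude of 0.2 (about -14 dBFS)
tone = [0.2 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(800)]
print(f"{peak_dbfs(tone):.1f} dBFS -> {check_recording_level(tone)}")
```

A peak at exactly full scale does not prove that clipping occurred, so the label is deliberately cautious; inspecting the waveform for flat-topped runs of samples is a sensible follow-up.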

Understanding the dBFS scale and how it relates to physical sound pressure is a prerequisite for setting up a valid clinical recording protocol. If you are new to this distinction, our article on dB SPL vs. dBFS — and how to convert between them covers the calibration workflow in detail.

Minimum SNR Requirements

Metric	Minimum SNR for reliable extraction	What fails below threshold
Jitter	≥ 30 dB SNR	Noise causes false period boundaries → jitter inflated
Shimmer	≥ 25 dB SNR	Peak amplitude estimates corrupted → shimmer inflated
HNR	≥ 20 dB SNR	Aperiodic noise floor reduces apparent harmonic energy
AVQI	≥ 30 dB SNR	Propagated errors in ShdB + HNR corrupt composite score
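One simple way to apply this table is to compare the RMS level of the phonation with the RMS level of a stretch of room noise recorded before the patient starts. The sketch below assumes the two segments have already been separated; the function names are hypothetical, and because the voiced segment also contains the noise, the estimate is only close to the true SNR when the voice is much louder than the background.

```python
import math
from typing import List, Sequence

# Minimum SNR (dB) per metric, copied from the table above
MIN_SNR_DB = {"jitter": 30.0, "shimmer": 25.0, "hnr": 20.0, "avqi": 30.0}


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a segment."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_snr_db(voiced: Sequence[float], noise_only: Sequence[float]) -> float:
    """Approximate SNR as the RMS ratio of a voiced segment to a noise-only segment."""
    return 20.0 * math.log10(rms(voiced) / rms(noise_only))


def reliable_metrics(snr_db: float) -> List[str]:
    """Metrics whose minimum SNR requirement is met."""
    return [name for name, minimum in MIN_SNR_DB.items() if snr_db >= minimum]


# Toy segments: voice RMS 0.1, noise RMS 0.005, a ratio of 20 (about 26 dB)
voiced = [0.1, -0.1] * 100
noise = [0.005, -0.005] * 100
snr = estimate_snr_db(voiced, noise)
print(f"SNR = {snr:.1f} dB, reliable: {reliable_metrics(snr)}")
```

At about 26 dB, only shimmer and HNR clear their thresholds in this example; jitter and AVQI would need a quieter room or a closer microphone.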

For deep learning models that operate on the same recordings — such as Wav2Vec2-based pathology classifiers — input normalisation and noise floor management are equally critical. Our Wav2Vec2 clinical audio guide covers the preprocessing chain for ML pipelines built on pathological voice data.

7. Practical Reference: Metrics at a Glance

Metric	Praat command	Normal range	Pathological threshold	Python library
Jitter (local)	`Get jitter (local)`	< 1.04 %	> 2.0 %	parselmouth
Shimmer (local, dB)	`Get shimmer (local_dB)`	< 0.35 dB	> 0.60 dB	parselmouth
HNR	`To Harmonicity (cc)`	> 20 dB	< 12 dB	parselmouth
CPP	`Get CPPS`	> 12 dB	< 7 dB	parselmouth
AVQI v03	Composite formula	< 2.95	≥ 2.95	parselmouth + numpy

Frequently Asked Questions

What is the normal range for jitter in voice analysis?

Normal local relative jitter is below 1.04% in healthy adult voices. Values above this suggest abnormal cycle-to-cycle frequency instability, with values above 2% considered clearly pathological. Praat also reports RAP and PPQ5 variants, which have their own thresholds; the 1.04% local jitter threshold is the one used throughout this article for dysphonia screening.

What does a low HNR value mean clinically?

A low HNR indicates that a large proportion of the voice signal is aperiodic noise rather than harmonic energy. HNR below 20 dB during sustained phonation is associated with breathiness, roughness, or hoarseness. Values below 7 dB suggest severe dysphonia. HNR can also be reduced artifactually by a noisy recording environment — always verify recording SNR before interpreting low HNR as clinical evidence.

What is the AVQI and how is it interpreted?

The Acoustic Voice Quality Index (AVQI v03) is a composite score combining shimmer (dB), HNR, Cepstral Peak Prominence, and spectral slope: AVQI = 3.284 + (0.181 × ShdB) − (0.098 × HNR) − (0.218 × CPP) + (0.008 × SpSlo). A score below 2.95 indicates normal voice quality; at or above 2.95 suggests dysphonia. The index has been validated across Dutch, English, Portuguese, and Korean with AUC > 0.90.

How does recording level in dBFS affect jitter, shimmer, and HNR?

Recording level directly impacts metric reliability. A signal below −40 dBFS peak buries the voice in microphone noise, inflating jitter and shimmer through false period detection while dragging HNR down. A clipped signal (peaks reaching 0 dBFS) introduces waveform distortion that mimics pathological amplitude perturbation. The safe window for clinical voice recording is a peak between −20 dBFS and −12 dBFS.

Automate These Metrics in Your Clinical Workflow

Computing jitter, shimmer, HNR, and AVQI manually for every patient is time-consuming and error-prone. Vocametrix is a platform that automates all of these measurements — and more — from a single recorded audio file. It handles recording level validation, period detection, cepstral analysis, and AVQI scoring through a single API call, with results interpretable by speech-language pathologists and ML practitioners alike.
