Predicting smoking status from short voice recordings under small-sample constraints: A calibrated leave-one-speaker-out study

Aydoğan, Yiğit; Duygun, Oğuzhan; CANTÜRK, İsmail

doi:10.1016/j.bspc.2026.109915

Aydoğan Y., Duygun O., CANTÜRK İ.

Biomedical Signal Processing and Control, vol. 119, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 119
Publication Date: 2026
DOI: 10.1016/j.bspc.2026.109915
Journal Name: Biomedical Signal Processing and Control
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, EMBASE
Keywords: Calibration, Decision-curve analysis, LOSO evaluation, Small-sample learning, Smoking detection, Voice biomarkers
Affiliated with Yıldız Technical University: Yes

Abstract

The feasibility of inferring smoking status from short voice recordings was examined under small-sample, speaker-independent constraints, emphasizing calibrated probabilities and decision utility for screening. Sustained /a/ phonations recorded on smartphones (44.1 kHz, mono) were analyzed from 64 unique speakers (30 smokers, 34 non-smokers; prevalence 0.469; one recording per speaker). Two representation families were compared: (i) a physiology-informed handcrafted prosody–spectral set (208 variables) summarizing perturbation, harmonicity/noise structure, spectral-energy distribution, and formants; and (ii) pretrained embeddings (YAMNet, wav2vec 2.0, WavLM) pooled to utterance vectors and classified with partial least squares plus logistic regression. Models were evaluated with strict leave-one-speaker-out (LOSO) validation, nested hyperparameter selection, fold-safe preprocessing, and within-fold Platt scaling. The handcrafted elastic-net logistic model (PS_ENet) achieved the strongest discrimination (AUC = 0.885), with accuracy 0.844, F1 0.828, average precision 0.894, and Brier score 0.193. Embedding baselines underperformed (AUC: YAM_PLS 0.475; W2V2_PLS 0.561; WAVLM_PLS 0.525). A probability-averaging ensemble favored sensitivity (recall 0.833) with AUC 0.797. Demographics alone were partially predictive (age + gender LOSO AUC = 0.708), but PS_ENet retained incremental discrimination under restricted permutation within age × gender strata and demographic matching. Speaker-level bootstrapping yielded PS_ENet AUC 0.886 (95% CI 0.790–0.962), and full-pipeline permutation testing supported discrimination beyond chance (p ≈ 0.005). Decision-curve analysis showed positive net benefit for thresholds 0.05–0.30 (exceeding treat-all for thresholds ≥ 0.15). Prospective planning suggested N ≈ 44 speakers for 80% power to detect AUC = 0.70 under the observed prevalence.
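
The decision-curve comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the formula used, so this assumes the standard net-benefit definition, NB(pt) = TP/N − (FP/N) · pt/(1 − pt), where pt is the threshold probability. The function names `net_benefit` and `net_benefit_treat_all` are hypothetical and are not taken from the paper. Only the class counts (30 smokers, 34 non-smokers) come from the abstract; no model predictions from the study are reproduced here.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of flagging every case whose probability is >= threshold.

    Assumes the standard decision-curve definition (not confirmed by the
    abstract): NB = TP/N - (FP/N) * threshold / (1 - threshold).
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    # Count true and false positives among the flagged cases.
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # Odds of the threshold: the harm assigned to one false positive
    # relative to the benefit of one true positive.
    weight = threshold / (1.0 - threshold)
    return tp / n - weight * fp / n


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: flag every speaker regardless of the model."""
    return net_benefit(y_true, [1.0] * len(y_true), threshold)


# Class counts from the abstract: 30 smokers (1) and 34 non-smokers (0).
labels = [1] * 30 + [0] * 34
for pt in (0.05, 0.15, 0.30):
    print(pt, round(net_benefit_treat_all(labels, pt), 3))
```

A model is useful at a given threshold when its net benefit exceeds both this treat-all reference and the treat-none reference, whose net benefit is zero by definition.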