SelfSupervised-ViT-Noisy-Voice-Spectrograms

Self-Supervised Learning with Vision Transformer (ViT) for Noisy Real-World Voice Data

This repository contains the implementation of a self-supervised pretraining + supervised fine-tuning pipeline for classifying psychologically stable versus psychologically unstable voice samples under noisy, real-world conditions.

The workflow first learns representations from unlabeled spectrograms using masked spectrogram reconstruction (self-supervised learning), then transfers the learned representations to a ViT-style Transformer classifier trained and evaluated on unprocessed / noisy voice data.


πŸ“„ Research Paper

Title: Self-Supervised Learning with Vision Transformer (ViT) for Noisy Real-World Data
Authors: Rafiul Islam, Dr. Md. Taimur Ahad, et al.
Status: Ongoing Research / Manuscript in Preparation
Note: The final paper link will be added here after submission/acceptance.


πŸ“Š Project Overview

Real-world voice data collected outside laboratory conditions is often noisy, untrimmed, and distribution-shifted, all of which significantly degrade the performance of fully supervised models trained on curated datasets. This project addresses the challenge by introducing a self-supervised representation-learning stage prior to supervised classification.

The proposed framework consists of two stages:

  1. Self-Supervised Pretraining on unlabeled voice spectrograms using a masked reconstruction objective.
  2. Supervised Fine-tuning of a Vision Transformer (ViT)-style classifier on noisy / unprocessed data.

πŸ”‘ Key Highlights


🧠 Methodology

1) Self-Supervised Pretraining (Masked Reconstruction)
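The README does not spell out the pretraining details, so here is a minimal NumPy sketch of the core masking step only (the 128×128 spectrogram size, 16×16 patch size, and 75% mask ratio are illustrative assumptions, not the experimental settings). During pretraining, the model would be trained to reconstruct the zeroed regions from the visible ones.

```python
import numpy as np

def mask_patches(spec, patch=16, mask_ratio=0.75, rng=None):
    """Zero out a random subset of non-overlapping patches of a spectrogram.

    Returns the masked spectrogram and the boolean patch mask; the
    reconstruction loss is computed on the masked (zeroed) patches.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = spec.shape
    ph, pw = h // patch, w // patch
    n_masked = int(ph * pw * mask_ratio)
    mask = np.zeros(ph * pw, dtype=bool)
    mask[rng.choice(ph * pw, n_masked, replace=False)] = True
    masked = spec.copy()
    for i in np.flatnonzero(mask):
        r, c = divmod(i, pw)
        masked[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return masked, mask.reshape(ph, pw)

spec = np.random.rand(128, 128).astype("float32")
masked, mask = mask_patches(spec)
# With defaults, 48 of the 8x8 = 64 patches are zeroed (75%)
```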

2) Supervised Classification Model (ViT-style)

The classifier integrates multiple components:
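As a rough illustration of what a ViT-style spectrogram classifier can look like in Keras, the sketch below uses patch embedding via a strided convolution, learnable positional embeddings, and pre-norm Transformer blocks. All sizes (input shape, embedding dimension, depth, heads) are assumptions for illustration, not the components or hyperparameters actually used in this project.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class AddPositionEmbedding(layers.Layer):
    """Learnable positional embedding added to the patch tokens."""
    def build(self, input_shape):
        self.pos = self.add_weight(
            name="pos_embed",
            shape=(1, input_shape[1], input_shape[2]),
            initializer="zeros",
        )
    def call(self, x):
        return x + self.pos

def build_vit_classifier(input_shape=(128, 128, 1), patch=16,
                         dim=64, heads=4, depth=2, n_classes=2):
    inputs = tf.keras.Input(shape=input_shape)
    # Patch embedding: non-overlapping patches via a strided convolution.
    x = layers.Conv2D(dim, patch, strides=patch)(inputs)
    x = layers.Reshape((-1, dim))(x)              # (num_patches, dim) tokens
    x = AddPositionEmbedding()(x)
    for _ in range(depth):                        # pre-norm Transformer blocks
        a = layers.LayerNormalization()(x)
        a = layers.MultiHeadAttention(num_heads=heads, key_dim=dim // heads)(a, a)
        x = layers.Add()([x, a])
        m = layers.LayerNormalization()(x)
        m = layers.Dense(dim * 2, activation="gelu")(m)
        m = layers.Dense(dim)(m)
        x = layers.Add()([x, m])
    x = layers.GlobalAveragePooling1D()(x)        # mean-pool tokens instead of a CLS token
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```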

3) Transfer Learning (SSL β†’ Classifier)
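One common way to realize an SSL → classifier transfer in Keras is to share the encoder object between the pretraining autoencoder and the downstream classifier, so the pretrained weights carry over by reference. The toy sketch below (dense layers, 64-dim inputs, one pretraining epoch) is an assumed illustration of this pattern, not the project's actual architecture.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Shared encoder: pretrained inside an autoencoder, then reused by the classifier.
encoder = tf.keras.Sequential(
    [tf.keras.Input(shape=(64,)), layers.Dense(32, activation="relu")],
    name="encoder",
)
autoencoder = tf.keras.Sequential([encoder, layers.Dense(64, name="decoder")])
autoencoder.compile(optimizer="adam", loss="mse")

x = np.random.rand(16, 64).astype("float32")
autoencoder.fit(x, x, epochs=1, verbose=0)  # stand-in for masked-reconstruction pretraining

# The classifier reuses the same encoder, so pretrained weights transfer directly.
classifier = tf.keras.Sequential([encoder, layers.Dense(2, activation="softmax")])
encoder.trainable = False  # freeze for linear probing; unfreeze to fine-tune end to end
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```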


πŸ“‹ Requirements


πŸ›  How to Run

1) Install Dependencies

```bash
pip install librosa matplotlib numpy pandas seaborn tensorflow keras scikit-learn tqdm imbalanced-learn visualkeras opencv-python scikit-image
```

2) Prepare Dataset

Create the following folder structure:

```
data/
├── processed/
│   ├── mentally_stable/
│   └── mentally_unstable/
└── unprocessed_noisy/
    ├── mentally_stable/
    └── mentally_unstable/
```

Recommended audio and feature settings (from experiments):

⚠️ The dataset is not included in this repository due to ethical and privacy considerations.
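Given the folder layout above, file discovery could be sketched as follows. The 0/1 label mapping and the `.wav` extension are assumptions; the commented librosa parameters are purely illustrative and not the experimental settings mentioned above.

```python
from pathlib import Path

def scan_dataset(root):
    """Collect (file, label) pairs from the two class subfolders."""
    labels = {"mentally_stable": 0, "mentally_unstable": 1}  # assumed mapping
    files, ys = [], []
    for name, label in labels.items():
        for f in sorted(Path(root, name).glob("*.wav")):
            files.append(f)
            ys.append(label)
    return files, ys

# Feature extraction could then use librosa, e.g. (parameters are illustrative):
# import librosa
# audio, sr = librosa.load(path, sr=16000)
# S = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)
```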

3) Run Notebook

Open and execute:


πŸ“Š Results

βœ… Model Architecture

High-level visualization of the reconstruction and classification backbone:

Model Architecture


βœ… 5-Fold Cross-Validation (Training Dynamics)

Accuracy and loss curves across all folds demonstrate stable convergence under noisy conditions:

K-Fold Accuracy and Loss
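The 5-fold protocol above can be sketched with scikit-learn's `StratifiedKFold`; the toy features and labels below are stand-ins, and the seed is an assumption.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def make_folds(X, y, k=5, seed=42):
    """Class-balanced train/validation index splits for k-fold CV."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    return list(skf.split(X, y))

# Toy stand-in for spectrogram features and binary labels.
X = np.arange(20, dtype=np.float32).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)
folds = make_folds(X, y)
# For each (train_idx, val_idx), train one model and record its accuracy/loss history.
```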


βœ… ROC Curves Across All Folds

Receiver Operating Characteristic (ROC) curves for each fold indicate both per-fold variability and overall robustness under distribution shift:

ROC Curves
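Per-fold ROC curves of this kind are typically computed with scikit-learn; a minimal sketch with toy validation scores (the numbers are invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def fold_roc(y_true, y_score):
    """ROC curve and AUC for one fold's validation scores."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return fpr, tpr, auc(fpr, tpr)

fpr, tpr, fold_auc = fold_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# Plot one curve per fold, e.g. plt.plot(fpr, tpr, label=f"AUC = {fold_auc:.2f}")
```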


βœ… Ensemble Confusion Matrix (Top Folds)

An ensemble of top-performing folds is used for final evaluation on the test set:

Overall ensemble performance:

Ensemble Confusion Matrix
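One simple way to ensemble the selected folds is to average their softmax probabilities before taking the argmax; the sketch below assumes this soft-voting scheme (the fold probabilities are invented toy data).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def ensemble_predict(fold_probs):
    """Average per-fold softmax probabilities, then take the argmax class."""
    return np.mean(np.stack(fold_probs), axis=0).argmax(axis=1)

# Toy example: two top folds, three test samples, two classes.
p1 = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
p2 = np.array([[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]])
y_pred = ensemble_predict([p1, p2])
cm = confusion_matrix([0, 0, 1], y_pred)
```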


πŸ“œ Citation

If you use this work, please cite this repository (until the paper is published):


πŸ§‘β€πŸ’» Author


⚠️ Disclaimer

This project is for research and educational purposes only and does not provide medical advice or clinical diagnosis. Always consult qualified professionals for mental health-related decisions.