Hybrid-DenseViTGRU-XAI-Voice

Python Deep Learning XAI

Hybrid DenseNet–ViT–GRU With XAI for Mental Stability Detection From Voice Data

This repository contains the implementation of a hybrid DenseNet–Vision Transformer (ViT)–GRU model with Explainable AI (XAI) for detecting mental stability from voice data. The pipeline converts raw audio into log-mel spectrograms, trains the hybrid model with SMOTE-based balancing and data augmentation, and explains predictions using Grad-CAM, LIME, and SHAP.


📄 Research Paper

Title: Hybrid Deep Models for Mental Health Detection with XAI Techniques (DenseNet–ViT–GRU)
Authors: Rafiul Islam, Dr. Md. Taimur Ahad, et al.
Status: Ongoing Research / Manuscript in Preparation
Note: The final paper link will be added here after submission/acceptance.


📊 Project Overview

Mental health diagnostics are often subjective, relying on self-reports and clinician observation. This research proposes a voice-based diagnostic approach using a hybrid deep learning architecture that combines:

- DenseNet: convolutional feature extraction from log-mel spectrograms
- Vision Transformer (ViT): global attention across spectrogram patches
- GRU: sequential modeling of temporal dynamics

Key Highlights:

- SMOTE-based class balancing plus spectrogram augmentation
- 5-fold cross validation: combined accuracy of ~89.81%, mean AUC ≈ 0.962
- Post-hoc explanations with Grad-CAM, LIME, and SHAP


📋 Requirements

- Python 3.x
- librosa, numpy, pandas, matplotlib, seaborn
- tensorflow, keras, scikit-learn, imbalanced-learn
- opencv-python, scikit-image, tqdm
- lime, shap


🛠 How to Run

  1. Install Dependencies:
    pip install librosa matplotlib numpy pandas seaborn tensorflow keras scikit-learn tqdm imbalanced-learn opencv-python scikit-image lime shap
    
  2. Prepare Dataset: Create the following folder structure:
    data/
    ├── mentally_stable/
    │   ├── *.wav
    └── mentally_unstable/
        ├── *.wav
    

Recommended audio settings (from experiments):

⚠️ Dataset is not included in this repository due to privacy/ethical constraints.
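Because the dataset is not shipped, a loader only needs to walk the folder layout above and derive labels from the folder names. A minimal sketch, demonstrated on a temporary directory so it runs without real data:

```python
import tempfile
from pathlib import Path

LABELS = {"mentally_stable": 0, "mentally_unstable": 1}

def list_audio_files(root):
    """Return (path, label) pairs for every .wav under the two class folders."""
    pairs = []
    for folder, label in LABELS.items():
        pairs += [(p, label) for p in sorted(Path(root, folder).glob("*.wav"))]
    return pairs

# Demo on a stand-in tree; real recordings would live in these folders
with tempfile.TemporaryDirectory() as root:
    for folder in LABELS:
        d = Path(root, folder)
        d.mkdir()
        (d / "sample.wav").touch()
    pairs = list_audio_files(root)
    print(len(pairs))  # 2
```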

  3. Run Notebook: Open and execute:
    • notebooks/Final_Result_Dense_ViT_GRU_With_XAI.ipynb

📊 Results

✅ Data & Preprocessing Visuals


✅ 5-Fold Cross Validation Training Behavior


✅ Confusion Matrix (Combined Across All Folds)

The combined confusion matrix across folds shows strong separation between classes:

This yields an overall combined accuracy of ~89.81%.

Confusion Matrix
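Combining per-fold confusion matrices is an element-wise sum of each fold's 2×2 matrix. A sketch with scikit-learn on synthetic labels (the fold predictions here are random placeholders, not the paper's results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
combined = np.zeros((2, 2), dtype=int)
n_per_fold = 40
for fold in range(5):
    y_true = rng.integers(0, 2, n_per_fold)
    # flip ~10% of the true labels to simulate an imperfect classifier
    y_pred = np.where(rng.random(n_per_fold) < 0.9, y_true, 1 - y_true)
    combined += confusion_matrix(y_true, y_pred, labels=[0, 1])

accuracy = np.trace(combined) / combined.sum()
print(combined, accuracy)
```

The combined accuracy is simply the trace (correct predictions) over the total count, pooled across all folds.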


✅ ROC Curves (All Folds)

AUC per fold:

Mean AUC ≈ 0.962.

ROC Curves
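Per-fold ROC/AUC values like these can be produced with scikit-learn's `roc_curve`/`auc` on each fold's held-out probabilities. A self-contained sketch on synthetic data, with `LogisticRegression` standing in for the hybrid model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
aucs = []
for train, test in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    proba = clf.predict_proba(X[test])[:, 1]       # positive-class probabilities
    fpr, tpr, _ = roc_curve(y[test], proba)
    aucs.append(auc(fpr, tpr))

print([round(a, 3) for a in aucs], round(float(np.mean(aucs)), 3))
```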


🔍 Explainable AI (XAI) Results

This project includes three explanation techniques to interpret model decisions on spectrogram inputs.

1) LIME (Local Explanations)

LIME highlights positive (green) and negative (red) contributing regions for individual predictions.

LIME Examples


2) SHAP (Global + Local Explanations)

SHAP visualizes contribution strength using Shapley-value inspired attribution.


3) Grad-CAM (Model Attention Heatmap)

Grad-CAM highlights the most influential time–frequency regions used by the model.

GradCAM
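A minimal Grad-CAM implementation for a Keras model follows the standard gradient-of-class-score pattern. The tiny CNN below is a placeholder for the hybrid network, and `"last_conv"` is an assumed name for its final convolutional layer, not the actual layer name in the notebook:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

# Placeholder model; "last_conv" stands in for the real final conv layer
inputs = tf.keras.Input(shape=(32, 32, 1))
x = layers.Conv2D(8, 3, activation="relu", name="last_conv")(inputs)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(2, activation="softmax")(x)
model = Model(inputs, outputs)

def grad_cam(model, image, conv_layer_name):
    """Normalized heatmap of the conv layer's influence on the predicted class."""
    grad_model = Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        class_index = tf.argmax(preds[0])
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)           # d(score)/d(activations)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))  # pooled gradient per channel
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

heatmap = grad_cam(model, np.random.rand(32, 32, 1).astype("float32"), "last_conv")
print(heatmap.shape)  # (30, 30)
```

Upsampling the heatmap to the spectrogram's resolution and overlaying it yields the attention visualization shown above.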


📜 Citation

If you use this work, please cite this repository (until the paper is published):

(Replace this with the final paper citation after publication.)


🧑‍💻 Author


⚠️ Disclaimer

This project is for research and educational use only and does not provide medical advice or clinical diagnosis. Always consult qualified professionals for mental health concerns.