Mental_Health_Diagnosis_Voice_with_ViT-CNN

Mental Health Diagnosis From Voice Data Using Convolutional Neural Networks and Vision Transformers

This repository contains the implementation of the hybrid ViT-CNN model developed to classify mental stability from voice data. The associated research paper is published in the Journal of Voice.

📄 Research Paper

Title: Mental Health Diagnosis From Voice Data Using Convolutional Neural Networks and Vision Transformers
Authors: Rafiul Islam, Dr. Md. Taimur Ahad, et al.
Journal: Journal of Voice (Q1 Journal)
Access the paper: https://doi.org/10.1016/j.jvoice.2024.10.010

📊 Project Overview

Mental health diagnostics are traditionally subjective, relying on self-reports and clinician observations. This study proposes a voice-based diagnostic approach using a hybrid Vision Transformer (ViT) and Convolutional Neural Network (CNN) model to classify mental stability.
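To make the hybrid idea concrete, here is a minimal Keras sketch of one way a CNN stem can feed a Transformer encoder block. It is an illustration only, not the architecture from the paper: the function name `build_hybrid_vit_cnn`, the input shape, the layer sizes, and the single encoder block are all assumptions, and positional embeddings are omitted for brevity.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers


def build_hybrid_vit_cnn(input_shape=(128, 128, 1), embed_dim=64,
                         num_heads=4, num_classes=2):
    """Hypothetical CNN + Transformer hybrid; all sizes are illustrative."""
    inputs = layers.Input(shape=input_shape)

    # CNN stem: convolutions capture local time-frequency patterns.
    x = inputs
    for filters in (16, 32, embed_dim):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)

    # Flatten the feature map into a sequence of tokens (one per location).
    height, width = x.shape[1], x.shape[2]
    tokens = layers.Reshape((height * width, embed_dim))(x)

    # One Transformer encoder block: self-attention relates distant tokens.
    attention = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_dim // num_heads
    )(tokens, tokens)
    x = layers.LayerNormalization(epsilon=1e-6)(layers.Add()([tokens, attention]))
    mlp = layers.Dense(embed_dim * 2, activation="gelu")(x)
    mlp = layers.Dense(embed_dim)(mlp)
    x = layers.LayerNormalization(epsilon=1e-6)(layers.Add()([x, mlp]))

    # Pool the tokens and classify (stable vs. unstable).
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


model = build_hybrid_vit_cnn()
# Forward pass on a dummy batch of two spectrogram-shaped inputs.
probabilities = model(np.zeros((2, 128, 128, 1), dtype="float32")).numpy()
```

The three pooling steps reduce a 128 x 128 input to a 16 x 16 feature map, so attention runs over 256 tokens instead of every pixel.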

Key Highlights:

Feature Extraction: Log-mel spectrograms generated from audio signals.
Model Architecture: Combines CNNs for local feature extraction and Vision Transformers for capturing long-range dependencies.
Performance: Achieved 91% classification accuracy.
Dataset: 85 recordings collected ethically from Bangladeshi participants.

📋 Requirements

Python 3.8 or above
TensorFlow, Keras, Librosa, Matplotlib, NumPy, Pandas, Seaborn, Scikit-learn, tqdm, imbalanced-learn (provides SMOTE)

🛠 How to Run

Install Dependencies:
- pip install librosa matplotlib numpy seaborn tensorflow keras pandas scikit-learn tqdm imbalanced-learn
Prepare Dataset:
- Download the dataset from https://data.mendeley.com/datasets/s5j25b5tjk/1
- Place the stable and unstable audio files in the data/mentally_stable/ and data/mentally_unstable/ folders, respectively.
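Once the folders are in place, the files can be paired with class labels before feature extraction. The sketch below uses only the standard library. The helper name `list_labeled_audio`, the label mapping (0 = stable, 1 = unstable), and the accepted file extensions are assumptions for illustration; the notebook may organize this differently.

```python
from pathlib import Path

# Assumed label mapping: folder name -> integer class label.
CLASS_DIRS = {"mentally_stable": 0, "mentally_unstable": 1}
# Assumed set of audio extensions; adjust to match the downloaded files.
AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}


def list_labeled_audio(data_root="data"):
    """Return a sorted list of (path, label) pairs for both class folders."""
    root = Path(data_root)
    samples = []
    for folder, label in CLASS_DIRS.items():
        class_dir = root / folder
        if not class_dir.is_dir():
            raise FileNotFoundError(f"Expected folder not found: {class_dir}")
        for path in sorted(class_dir.iterdir()):
            # Skip subfolders and non-audio files such as notes or metadata.
            if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
                samples.append((path, label))
    return samples
```

Calling `list_labeled_audio("data")` from the repository root returns the pairs in a stable order, and it fails early with a clear error if either folder is missing.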
Run Notebook:
- Open and execute notebooks/Updated_ViT_CNN_Model.ipynb.

📊 Results

Accuracy: 91%

Key Visualizations:

Training/validation accuracy and loss curves
Confusion matrix
ROC curves
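For reference, the quantities behind the confusion matrix and ROC curve can be computed with scikit-learn as sketched below. The helper name `evaluate_binary`, the 0.5 decision threshold, the convention that label 1 is the positive class, and the toy labels and scores are assumptions for illustration. They are not results from the paper.

```python
import numpy as np
from sklearn.metrics import accuracy_score, auc, confusion_matrix, roc_curve


def evaluate_binary(y_true, y_score, threshold=0.5):
    """Compute accuracy, confusion matrix, and ROC data for binary labels."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Turn predicted probabilities into hard 0/1 predictions.
    y_pred = (y_score >= threshold).astype(int)
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
        "fpr": fpr,
        "tpr": tpr,
        "roc_auc": auc(fpr, tpr),
    }


# Toy example with made-up labels and predicted probabilities.
labels = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]
report = evaluate_binary(labels, scores)
```

The `fpr` and `tpr` arrays can be passed directly to Matplotlib to draw the ROC curve.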

📜 Citation

If you use this work, please cite our paper:

Islam, R., Ahad, M. T., Ahmed, F., Song, B., & Li, Y. (2024). Mental health diagnosis from voice data using convolutional neural networks and vision transformers. Journal of Voice. DOI:10.1016/j.jvoice.2024.10.010

🧑‍💻 Author

Rafiul Islam
Researcher at Daffodil International University
