Libraries used in this project

(library badges)

Overview


There are two ways to get a prediction from our model.

  • Preprocess -> Embedding Model -> Classifier.
  • Preprocess -> Embedding Model -> Label correction -> Classifier.

Method 1 achieves an average accuracy of 94%; adding the label-correction step (method 2) raises the average accuracy to 96%.

Reference: "Deep Learning for Suicide and Depression Identification with Unsupervised Label Correction" by Ayaan Haque*, Viraaj Reddi*, and Tyler Giallanza from Saratoga High School and Princeton University. In ICANN, 2021.

Methodology

Step 1

Preprocess data

Keep only posts shorter than 100 words (~150,000 posts remain); remove URLs and emoji; unpack contractions; normalize case.
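
A minimal sketch of this preprocessing step in Python. The regexes and the small contraction map are illustrative placeholders, not the project's exact cleaning rules:

```python
import re
from typing import Optional

# Illustrative contraction map; a real pipeline would use a fuller list
# (or a package such as `contractions`).
CONTRACTIONS = {
    "can't": "cannot", "won't": "will not", "i'm": "i am",
    "don't": "do not", "it's": "it is", "isn't": "is not",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji ranges; sufficient for a sketch.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(text: str, max_words: int = 100) -> Optional[str]:
    """Clean one post; return None if it fails the length filter."""
    text = URL_RE.sub(" ", text)            # remove URLs
    text = EMOJI_RE.sub(" ", text)          # remove emoji
    text = text.lower()                     # case normalization
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)    # unpack contractions
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text.split()) < max_words else None
```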

Step 2

Embedding Models

BERT (Bidirectional Encoder Representations from Transformers) is a popular Transformer-based word embedding model. Instead of processing text word by word sequentially like an RNN/LSTM, it avoids recurrence entirely by processing each sentence as a whole and learning the relationships between its words.

We feed the ~150,000 posts into DistilBERT and keep only the [CLS] token's 768-dimensional embedding as the feature vector for our classification task.
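
A sketch of this extraction with the Hugging Face transformers library. The distilbert-base-uncased checkpoint, the max length, and single-batch processing are assumptions, not confirmed project settings:

```python
import torch
from transformers import DistilBertModel, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
model.eval()

def embed(posts):
    """Return the (n, 768) [CLS] embeddings for a list of post strings."""
    inputs = tokenizer(posts, padding=True, truncation=True,
                       max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (n, seq_len, 768)
    return hidden[:, 0, :]   # position 0 holds the [CLS] token
```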

Step 3

Decompose and cluster

In this step, we use PCA (principal component analysis) to reduce the data from shape (n, 768) to (n, 2), where n is the number of posts and 768 is the number of features extracted from BERT.
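
With scikit-learn this reduction is a few lines; `embeddings` below stands for the (n, 768) matrix produced in Step 2:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
reduced = pca.fit_transform(embeddings)   # (n, 768) -> (n, 2)
print(pca.explained_variance_ratio_)      # variance kept by the two axes
```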

Step 4

GMM and Relabel

Cluster the data into two groups with a GMM (Gaussian Mixture Model) and relabel the points whose cluster assignment disagrees with their initial label.

  • Get the posterior probabilities from the GMM (e.g. 90% in group 0 and 10% in group 1).
  • Relabel a point only if its GMM cluster differs from its initial label AND the GMM's confidence is higher than 95% (see the sketch below).
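
A sketch of this correction rule using scikit-learn's GaussianMixture. Here `reduced` is the PCA output from Step 3, `labels` is the (n,) NumPy array of initial labels, and the GMM's cluster indices are assumed to have been aligned with the label indices beforehand (e.g. by majority vote):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=2, random_state=0).fit(reduced)
proba = gmm.predict_proba(reduced)        # (n, 2) posterior probabilities
cluster = np.argmax(proba, axis=1)        # hard cluster assignment
confidence = np.max(proba, axis=1)        # confidence in that assignment

# Flip a label only when the cluster disagrees with the initial label
# AND the model is more than 95% confident.
corrected = labels.copy()
flip = (cluster != labels) & (confidence > 0.95)
corrected[flip] = cluster[flip]
```
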
Step 5

DNN classifier

The network consists of an input layer of size 768, a hidden ReLU layer of size 128, and another hidden ReLU layer of size 64. The output layer uses a sigmoid activation. The Adam optimizer is used with binary cross-entropy loss.
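
A minimal PyTorch sketch of this architecture; training details such as epochs, batch size, and learning rate are not specified above and are left out:

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(768, 128), nn.ReLU(),   # hidden ReLU layer, size 128
    nn.Linear(128, 64), nn.ReLU(),    # hidden ReLU layer, size 64
    nn.Linear(64, 1), nn.Sigmoid(),   # sigmoid output for the binary label
)
loss_fn = nn.BCELoss()                                # binary cross-entropy
optimizer = torch.optim.Adam(classifier.parameters())
```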

Check out the results

Our fully dense network with unsupervised GMM-based label correction achieved 96% test accuracy in determining whether input sentences portray depressive or suicidal sentiment.