A Novel Preprocessing Technique for Toxic Comment Classification

Introduction:

The threat of online abuse and harassment in the cyber community is growing day by day. To tackle this problem, many platforms have devised policies, but these policies require that inappropriate and offensive content be identified first. Furthermore, the data contains various aspects of negativity: a single comment can express disgust, disbelief, and threat at the same time. In other words, the negativity or toxicity exhibited in a comment can have several facets. Hence, the challenge is to identify exactly what a comment exhibits so that the respective policies can be formulated and applied to penalize the offender.

Abstract:

This study uses two approaches to identify the underlying toxicities in comments. The first trains a separate classifier for each facet of toxicity. The second treats the task as a multi-label classification problem. Several machine learning methods are employed, including logistic regression, Naive Bayes, and decision tree classification. The dataset is taken from Kaggle, and 10-fold cross-validation is used to assess the robustness of the models. The study also uses a novel preprocessing scheme that transforms the multi-label classification problem into a multi-class classification problem. This preprocessing strategy significantly improved accuracy when applied to simple classification models, which encourages its use with more sophisticated models as well.
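The multi-label to multi-class transformation mentioned above can be illustrated with a label-powerset encoding, in which every distinct combination of labels becomes one class. This is only a sketch under assumptions: the text does not say how the study's preprocessing scheme performs the transformation, the label names are those of the Kaggle toxic comment challenge rather than names given here, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: label names of the Kaggle toxic comment challenge.
LABELS = ("toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate")

LabelVector = Tuple[int, ...]


def to_multiclass(rows: Sequence[Sequence[int]]) -> Tuple[List[int], Dict[LabelVector, int]]:
    """Map each distinct 0/1 label vector to a single integer class id.

    Class ids are assigned in order of first appearance.
    """
    class_of: Dict[LabelVector, int] = {}
    y: List[int] = []
    for row in rows:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)
        y.append(class_of[key])
    return y, class_of


def to_multilabel(class_id: int, class_of: Dict[LabelVector, int]) -> LabelVector:
    """Recover the original 0/1 label vector from a class id."""
    inverse = {cid: vector for vector, cid in class_of.items()}
    return inverse[class_id]
```

With this encoding, an ordinary multi-class classifier can be trained on the list returned by `to_multiclass`, and each predicted class id can be turned back into a label vector with `to_multilabel`. One known trade-off is that label combinations never seen during training cannot be predicted.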

Existing work:

In February 2018, a paper titled “Convolutional Neural Networks for Toxic Comment Classification” introduced a new model for toxic comment classification based on a convolutional neural network (CNN) [14]. Instead of using the traditional bag-of-words model, the paper applies a CNN over a document-term matrix (DTM) and compares the results with other well-known classification models such as k-Nearest Neighbors (KNN), support vector machine (SVM), Linear Discriminant Analysis (LDA), and Naive Bayes (NB). The authors used the same Kaggle toxic comment challenge dataset. The use of a CNN made it possible to achieve higher accuracy than previously achieved by others (i.e., approx. 91.2%).
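For readers unfamiliar with the term, a document-term matrix has one row per comment and one column per vocabulary term, with each cell holding a term count. The sketch below builds one in plain Python. The tokenizer and function names are hypothetical, and the code does not attempt to reproduce the cited paper's pipeline.

```python
import re
from collections import Counter
from typing import List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def document_term_matrix(docs: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    tokenized = [tokenize(doc) for doc in docs]
    # Sorted vocabulary gives a stable column order.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    column = {term: j for j, term in enumerate(vocab)}
    matrix: List[List[int]] = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for term, count in Counter(tokens).items():
            row[column[term]] = count
        matrix.append(row)
    return vocab, matrix
```

Calling `document_term_matrix(["You are great", "you are YOU"])` yields a three-term vocabulary and one count row per comment.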

Disadvantage:

Most researchers in this field have generated their own datasets. Manually generating a dataset is not a problem in itself, but such datasets are all constrained in size.

Proposed work:

This study uses two approaches to identify the underlying toxicities in comments. The first trains a separate classifier for each facet of toxicity. The second treats the task as a multi-label classification problem. Several machine learning methods are employed, including logistic regression, Naive Bayes, and decision tree classification. The dataset is taken from Kaggle, and 10-fold cross-validation is used to assess the robustness of the models.
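The first approach, one binary classifier per facet of toxicity, can be sketched as follows. This is an illustration under assumptions, not the study's implementation: `TinyNaiveBayes` is a hypothetical, minimal multinomial Naive Bayes with add-one smoothing that stands in for whichever learner is used, and comments are assumed to be already tokenized into lists of words.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


class TinyNaiveBayes:
    """Binary multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs: Sequence[List[str]], y: Sequence[int]) -> "TinyNaiveBayes":
        self.vocab = {term for tokens in docs for term in tokens}
        self.counts = {0: Counter(), 1: Counter()}  # term counts per class
        self.n_docs = {0: 0, 1: 0}                  # document counts per class
        for tokens, label in zip(docs, y):
            self.counts[label].update(tokens)
            self.n_docs[label] += 1
        return self

    def _log_score(self, tokens: List[str], label: int) -> float:
        total_terms = sum(self.counts[label].values())
        # Smoothed class prior, so a class with no examples still gets a score.
        score = math.log((self.n_docs[label] + 1) / (sum(self.n_docs.values()) + 2))
        for term in tokens:
            if term in self.vocab:  # terms unseen in training are ignored
                score += math.log(
                    (self.counts[label][term] + 1) / (total_terms + len(self.vocab))
                )
        return score

    def predict(self, tokens: List[str]) -> int:
        return int(self._log_score(tokens, 1) > self._log_score(tokens, 0))


def fit_per_label(
    docs: Sequence[List[str]], Y: Sequence[Sequence[int]], labels: Sequence[str]
) -> Dict[str, TinyNaiveBayes]:
    """Train one independent binary classifier for each label column of Y."""
    models: Dict[str, TinyNaiveBayes] = {}
    for j, name in enumerate(labels):
        column = [row[j] for row in Y]
        models[name] = TinyNaiveBayes().fit(docs, column)
    return models


def predict_labels(models: Dict[str, TinyNaiveBayes], tokens: List[str]) -> Dict[str, int]:
    """Ask every per-label classifier for its own 0/1 decision."""
    return {name: model.predict(tokens) for name, model in models.items()}
```

Because each label has its own model, a comment can be flagged for several facets at once or for none. Correlations between labels are ignored, which is the main limitation of this scheme.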

Advantage:

The approach applies to both the binary classification and the multi-label classification settings.

Algorithm: logistic regression, Naive Bayes, and decision tree.

System requirements:

  Software requirements:

  • Operating system   :   Windows.
  • Coding Language  :   Python.

  Hardware components:

  • System   :   Pentium IV 2.4 GHz or Intel.
  • Hard Disk   :   40 GB.
  • Floppy Drive   :   1.44 MB.
  • Mouse   :   Optical Mouse.
  • RAM   :   512 MB.

Conclusion:

The growth of online discussions since the advent of social media has brought a large increase in cyber harassment. Various social discussion portals and social networking sites have devised policies to penalize offenders. However, identifying the type and level of negativity or toxicity contained in a comment remains a major challenge. The study therefore proposes two approaches to this problem. First, a dataset downloaded from the Kaggle competition is analyzed, and key features that are significant indicators of toxic comments are identified. Then two classification approaches are applied: binary classification against each facet of toxicity, and multi-label classification. Various machine learning methods are applied under each classification scheme, and 10-fold cross-validation is used for evaluation.
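The 10-fold cross-validation used for evaluation can be sketched in plain Python as below. The function names, the fixed shuffle seed, and the `fit_and_score` callback are hypothetical. The callback is assumed to train a model on the training indices and return a score, such as accuracy, on the held-out indices.

```python
import random
from typing import Callable, Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(
    n_samples: int, fit_and_score: Callable[[List[int], List[int]], float], k: int = 10
) -> float:
    """Average the score returned by fit_and_score over all k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Every sample is held out exactly once, so the averaged score reflects performance on data the model did not see during training in that fold.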

March 14, 2022
