The Impact of Text Representation on Classification Accuracy
Keywords: text representation, classification task, classification datasets, Naïve Bayes model, Logistic Regression classifier, Support Vector Machines

Abstract
People readily use social media to express their opinions, and sentiment analysis techniques enable the community to harness the wealth of valuable data contained in unstructured social media content. Before designing and deploying machine learning models, we must ensure that the dataset is of high quality. Moreover, to run ML algorithms on text data, the content must first be converted into a numerical representation using one of the text representation methods during the data preprocessing stage. This study explores eight text representation methods combined with three ML models, evaluated on a Twitter sentiment classification dataset. In particular, the experiments aim to measure the impact of text length on classification results; for this reason, they were conducted over several ranges of text length. To run the experiments and evaluate the results, we use three common classification algorithms: Logistic Regression, Naïve Bayes, and Support Vector Machines. The results showed that all models performed best when the text was represented with Count Vectorizer using n-gram ranges of 2, 3, and 4, followed by TF-IDF, while Doc2Vec, Word2Vec, and GloVe performed moderately. Regarding text length, the performance of all models decreased as text length increased. It was also noted that Doc2Vec, Word2Vec, and GloVe maintained the models' performance despite changes in text length, generally achieving average accuracy compared with Count Vectorizer. Overall, the SVM classifier outperformed all other techniques across the entire set of experiments.
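As an illustration only (not the authors' code), the experimental setup described above can be sketched with scikit-learn: a text representation step (here, Count Vectorizer with an n-gram range and TF-IDF, two of the eight methods studied) feeding each of the three classifiers. The toy corpus and labels below are placeholders standing in for the Twitter sentiment dataset.

```python
# Sketch of the study's pipeline: text representation -> classifier.
# Assumptions: scikit-learn defaults, a toy corpus in place of the
# real Twitter dataset, and word-level n-grams for the (2,3,4) range.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Placeholder corpus and binary sentiment labels (1 = positive).
texts = [
    "I love this phone", "worst service ever",
    "great value, very happy", "terrible, do not buy",
    "absolutely fantastic experience", "awful quality and slow",
    "pretty good overall", "bad, very bad product",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Two of the eight representations compared in the study.
vectorizers = {
    "count_ngrams_2-4": CountVectorizer(ngram_range=(2, 4)),
    "tfidf": TfidfVectorizer(),
}
# The three classifiers used in the study.
classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "NaiveBayes": MultinomialNB(),
    "SVM": LinearSVC(),
}

# Cross-validated accuracy for every representation/model pair.
for vname, vec in vectorizers.items():
    for cname, clf in classifiers.items():
        pipe = make_pipeline(vec, clf)
        scores = cross_val_score(pipe, texts, labels, cv=2)
        print(f"{vname} + {cname}: mean accuracy = {scores.mean():.2f}")
```

The pipeline keeps vectorizer fitting inside each cross-validation fold, so the vocabulary is learned only from training text; the full study would additionally bucket the corpus by text length to reproduce the length-sensitivity comparison.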