Punctuation Marks Effect on Arabic Authorship Attribution Using a Variable Length of Character N-grams
Keywords:
Authorship Attribution, Text classification, Punctuation marks, Character n-grams, Machine Learning
Abstract
The problem of Authorship Attribution (AA) relies on distinguishing features that capture the writing style of an author. Character n-gram models have been identified as the most successful features for representing the stylistic properties of a text. This study explores the use of punctuation marks within character n-grams as a feature representation of a document for Arabic AA of short texts. Feature vectors were generated from character n-grams of variable length (2-, 3-, 4-, and 5-grams), and the experiments were conducted independently for each feature condition, using the Chi-squared feature selection method with varying feature set sizes. Several machine learning classifiers were trained to estimate the probability that a text belongs to a given author. The study showed that, when punctuation is included in the construction of character n-grams, the longer 4-grams and 5-grams improved classification performance more than the shorter 2-grams and 3-grams. The results confirmed high attribution effectiveness for short texts, with a Macro F1-measure of 0.93. Using punctuation marks within character n-grams improved AA performance by 7.5% in Macro F1-measure. Punctuation therefore provides further insight into the writing style of the author. This study contributes to improving attribution performance for Arabic authorship attribution when texts are short.
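To make the two feature conditions concrete, the following is a minimal sketch of how variable-length character n-grams might be extracted with and without punctuation. It is an illustration only, not the implementation used in the study: the function names `strip_punctuation` and `char_ngrams` are hypothetical, and removing punctuation by Unicode category is an assumption about how the "without punctuation" condition could be built.

```python
import unicodedata
from collections import Counter


def strip_punctuation(text):
    """Drop every character whose Unicode category is punctuation (P*).

    Assumption: this covers Arabic marks such as the Arabic comma and
    question mark as well as Latin punctuation.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def char_ngrams(text, n_values=(2, 3, 4, 5), keep_punctuation=True):
    """Count character n-grams for every length in n_values."""
    if not keep_punctuation:
        text = strip_punctuation(text)
    # Collapse runs of whitespace so removed marks do not leave double spaces.
    text = " ".join(text.split())
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Hypothetical short text containing an Arabic comma and an exclamation mark.
sample = "نعم، هذا صحيح!"
with_punct = char_ngrams(sample, keep_punctuation=True)
without_punct = char_ngrams(sample, keep_punctuation=False)
print(len(with_punct), len(without_punct))
```

The resulting counts could then be turned into feature vectors and passed to a feature selection step such as Chi-squared; that step is omitted here to keep the sketch dependency-free.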