Punctuation Marks Effect on Arabic Authorship Attribution Using a Variable Length of Character N-grams

Fatma Howedi; Souad Alharm

Authors

Fatma Howedi
f.howedi@asmarya.edu.ly
Computer Science, Information Technology College, Alasmarya Islamic University, Zliten, Libya
Souad Alharm
Computer Science, Information Technology College, Alasmarya Islamic University, Zliten, Libya

Keywords:

Authorship Attribution, Text classification, Punctuation marks, Character n-grams, Machine Learning

Abstract

Authorship Attribution (AA) relies on distinguishing features that capture an author's writing style. Character n-gram models have been identified as the most successful features for representing the stylistic properties of a text. This study explores the use of punctuation marks within character n-grams as a document feature representation for Arabic AA of short texts. Feature vectors were generated from character n-grams of variable length (2-, 3-, 4-, and 5-grams), and experiments were conducted independently for each feature condition, using the Chi-squared selection method with varying feature set sizes. Several machine learning classifiers were trained to estimate the probability that a text belongs to a given author. The study showed that, when punctuation is included in the construction of character n-grams, the longer 5-grams and 4-grams improve classification performance more than the shorter 2-grams and 3-grams. The results confirmed high attribution effectiveness for AA of short texts, with a Macro F₁-measure of 0.93. Including punctuation marks within character n-grams improves AA performance by 7.5% in Macro F₁-measure. Punctuation therefore provides further insight into the writing style of the author. This study contributes to improving attribution performance on short texts for Arabic authorship attribution.
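The feature construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `is_punctuation` and `char_ngrams`, the whitespace normalization, and the sample sentence are assumptions made for illustration. It only shows how character 2- to 5-grams differ when punctuation is kept or stripped; Chi-squared selection and classifier training are not shown.

```python
import unicodedata
from collections import Counter


def is_punctuation(ch: str) -> bool:
    # Unicode general categories starting with "P" cover both Latin
    # punctuation and Arabic marks such as the Arabic comma and question mark.
    return unicodedata.category(ch).startswith("P")


def char_ngrams(text: str, n_values=(2, 3, 4, 5), keep_punctuation=True) -> Counter:
    """Count character n-grams for each length in n_values.

    Hypothetical helper: when keep_punctuation is False, punctuation
    characters are removed before the n-grams are extracted.
    """
    if not keep_punctuation:
        text = "".join(ch for ch in text if not is_punctuation(ch))
    # Collapse runs of whitespace so removed punctuation leaves no double spaces.
    text = " ".join(text.split())
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Illustrative short Arabic text (an assumption, not taken from the study's corpus).
sample = "مرحبا، كيف الحال؟"
with_punct = char_ngrams(sample, keep_punctuation=True)
without_punct = char_ngrams(sample, keep_punctuation=False)
```

In a full pipeline, counts like these would be turned into feature vectors per document, one set per feature condition (with and without punctuation), before feature selection and classification.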
