Comparing the Performance of Basic Machine Learning Algorithms in Classifying Persian Poems as Allusive or Non-Allusive (Scientific Article, Ministry of Science)

Academic rank: Scientific journal (Ministry of Science)

Authors: Parisa Mohammadian Kalkhoran, Mohammad Bahrani

Source: Elm-e Zaban (Language Science), Year 12, Spring and Summer 1404, Issue 21

Keywords: allusion, Persian poetry, text classification, machine learning, natural language processing

Subject areas: Linguistics

doi: 10.22054/ls.2021.60784.1453

Pages: 45-76

Abstract

This study examines the performance of several machine learning methods in classifying Persian poems into two groups, allusive and non-allusive. To this end, the supervised methods Naive Bayes, support vector machine, decision tree, random forest, k-nearest neighbors, logistic regression, and the multilayer perceptron were used. After the labeled data were collected in two text files, each couplet was converted into a numeric vector. The data were then merged and split into training and test sets, and each algorithm was trained on the training data and tested on the test data to measure its accuracy. The output of each algorithm was the machine-predicted label for the couplets in question, and the algorithms were evaluated using LOOCV. The evaluation showed that Naive Bayes (76.09%), logistic regression (76.09%), multilayer perceptron (75.22%), and support vector machine (74.35%) performed better than the other algorithms. Overall, taking other criteria into account, including the F1 score and execution time, the best performance belonged to Naive Bayes.

Comparing Basic ML Algorithms for Classifying Allusive vs. Non-Allusive Persian Poetry

The present study aims to evaluate the performance of several machine learning methods in classifying Persian poetry into two categories: verses containing allusion and those without allusion. To this end, the following supervised learning algorithms were applied: Naive Bayes, Support Vector Machine (SVM), Decision Tree, Random Forest, k-Nearest Neighbors (k-NN), Logistic Regression, and Multilayer Perceptron (MLP). Labeled data were collected and stored in two text files. Each verse was then transformed into a numerical vector representation. After merging the datasets and splitting them into training and testing subsets, each algorithm was implemented on the training data and evaluated on the test set to assess its classification accuracy. The output of each algorithm consisted of predicted labels assigned to the verses. The evaluation methodology used was Leave-One-Out Cross-Validation (LOOCV). The results indicate that Naive Bayes (76.09%), Logistic Regression (76.09%), Multilayer Perceptron (75.22%), and Support Vector Machine (74.35%) performed better than the other algorithms in terms of accuracy. Overall, considering additional metrics such as F1-score and execution time, Naive Bayes demonstrated the best performance among the tested methods.

Introduction

Persian poetry is a rich cultural heritage filled with literary devices, one of which is allusion, a rhetorical technique that enriches meaning through indirect references to well-known stories, sayings, or cultural elements. As society becomes increasingly intertwined with technology, there is a growing need to leverage machine learning to preserve and study such literary elements. This study explores the application of supervised machine learning algorithms to classify Persian couplets as either containing allusion or not, thus contributing to the intersection of literary analysis and artificial intelligence.

Literature Review

Allusion has long been recognized as a central rhetorical device in Persian literature. Classical scholars such as Qeys Razi (1372: 279) and the author of Anvār al-Balāgha (17th century AD) defined it as a brief, implicit reference to well-known texts, events, or sayings. Shamisa (1381: 22) expanded the definition to include references to folklore, customs, and scientific concepts as part of the broader allusive framework. These definitions consistently highlight conciseness, implicitness, and the need for cultural familiarity to interpret the text accurately. Many literary studies have identified allusions in the works of specific poets using manual, qualitative methods.

Computational literary studies remain relatively underexplored for Persian, though some efforts have been made. For instance, Majiri and Minaei (2008) applied text mining techniques to prosodic analysis, while Azin & Bahrani (2014) and Javanmardi & Akbari (2017) explored authorial style classification. Machine learning, particularly supervised learning, has recently emerged as a promising approach to text classification, with applications in authorship attribution, genre detection, and sentiment analysis. However, its use for detecting literary devices such as allusion remains limited, especially in Persian literary texts. Two general approaches are used for text classification: vocabulary-based (dictionary-driven) and machine learning-based. This study focuses on the latter, aiming to classify verses automatically using statistical and linguistic features rather than predefined word lists.

Methodology

This study followed a seven-step machine learning pipeline implemented in Python:

Data Collection: Allusive verses were extracted from Farhang-e Asatir va Dastanvareha by Mohammad Jafar Yahaghi (1388) and typed manually. Non-allusive verses were collected from the Ganjoor website (ganjoor.net). The dataset comprised 300 allusive and 160 non-allusive couplets (460 in total), stored in separate .txt files.

Data Preprocessing: Text normalization and cleaning involved punctuation removal, spacing corrections, and standardization. Tokenization and stop-word removal were conducted using a curated stop-word list tailored to the dataset. Because of the limited data size and the variability of poetic language, lemmatization was omitted to preserve the original word forms.

Feature Extraction and Vectorization: The 2107 unique words remaining after preprocessing were treated as features. Two vectorization techniques were used: TF-IDF (Term Frequency–Inverse Document Frequency) and binary encoding (presence or absence of each word). The vectorization method affected classification performance most visibly for the Logistic Regression and MLP classifiers.

Dataset Splitting: The data were split into training and test sets using Leave-One-Out Cross-Validation (LOOCV), in which each couplet in turn serves as the single test item while the model is trained on all the others, so that a small dataset is used as fully as possible.

Model Training: Seven supervised machine learning algorithms were trained: Naive Bayes, Support Vector Machine (SVM), Decision Tree, Random Forest, K-Nearest Neighbors (KNN), Logistic Regression, and Multilayer Perceptron (MLP).

Model Evaluation: Accuracy, F1-score, and execution time were used as performance metrics. LOOCV ensured that every data point was tested exactly once by a model that had not seen it during training.

Output Generation: The predicted label for each couplet was obtained from the trained models and compared with the actual label to compute accuracy.

Results

The accuracy scores for the best-performing models were: Naive Bayes 76.09%, Logistic Regression 76.09%, Multilayer Perceptron 75.22%, and SVM 74.35%. The other models (Decision Tree, Random Forest, and KNN) performed worse.

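As a rough consistency check, and assuming that each of the 300 + 160 = 460 couplets is held out exactly once under LOOCV, the reported accuracies can be converted back into counts of correct predictions. The sketch below does that arithmetic and also computes the accuracy of always predicting the larger class, a baseline the source does not report. The function and constant names are illustrative only.

```python
# Dataset sizes as reported in the methodology.
N_ALLUSIVE = 300
N_NON_ALLUSIVE = 160
N_TOTAL = N_ALLUSIVE + N_NON_ALLUSIVE  # 460 couplets, each tested once under LOOCV (assumption)

# Accuracies (percent) as reported in the results.
REPORTED = {
    "Naive Bayes": 76.09,
    "Logistic Regression": 76.09,
    "Multilayer Perceptron": 75.22,
    "Support Vector Machine": 74.35,
}


def implied_correct(accuracy_percent: float, total: int = N_TOTAL) -> int:
    """Nearest whole number of correct predictions implied by an accuracy."""
    return round(accuracy_percent / 100 * total)


def majority_baseline(total: int = N_TOTAL) -> float:
    """Accuracy (percent) of always predicting the larger class."""
    return 100 * max(N_ALLUSIVE, N_NON_ALLUSIVE) / total


for name, accuracy in REPORTED.items():
    correct = implied_correct(accuracy)
    print(f"{name}: {correct}/{N_TOTAL} = {100 * correct / N_TOTAL:.2f}%")
print(f"Majority-class baseline: {majority_baseline():.2f}%")
```

Under that assumption, 76.09% corresponds to 350 of 460 couplets, and a classifier that always answers "allusive" would already score about 65.22%, which gives some context for the reported figures.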
Notably, most models showed negligible performance differences between the two vectorization methods; the exceptions were Logistic Regression and MLP, which exhibited a 10% accuracy gap.

Discussion

The results suggest that simpler models such as Naive Bayes can perform competitively in literary text classification, especially when the dataset is small. For most models, the choice of vectorization method had minimal impact on accuracy, underscoring the importance of careful feature selection over sophisticated encoding. Despite the challenges of working with poetic texts, including varied structure and archaic language, machine learning offers promising results in detecting literary features such as allusion.

Conclusion

This study evaluated the performance of several supervised machine learning algorithms in classifying Persian poetry into two categories: verses containing allusions and those without. Naive Bayes, Support Vector Machine, Decision Tree, Random Forest, k-Nearest Neighbors, Logistic Regression, and Multilayer Perceptron were applied, and their performance was assessed using Leave-One-Out Cross-Validation (LOOCV). The research highlighted several challenges in automatic text classification, especially for Persian poetry, including high feature dimensionality, lexical ambiguity, flexible sentence structure, and stylistic diversity among poets. In particular, identifying literary allusions proved complex because many references are implicit, allusion types vary (e.g., Quranic, mythological, national), and allusive language evolves over time. Despite these difficulties, some algorithms produced promising results, suggesting that larger, more diverse datasets and advanced techniques such as transfer learning and pre-trained language models could yield higher accuracy and robustness in future work. The intersection of computational methods and literary analysis presents a valuable avenue for deeper understanding and automated processing of Persian literary texts.
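The evaluation loop described in the methodology can be illustrated with a minimal, standard-library-only sketch: binary word-presence features, a Bernoulli Naive Bayes classifier with Laplace smoothing, and LOOCV. The source does not state which Naive Bayes variant or library was used, so the variant, the function names, and the toy token lists below are assumptions made purely for illustration. They are not the authors' implementation or data.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Fit Bernoulli Naive Bayes on binary (presence/absence) word features."""
    vocab = {word for doc in docs for word in doc}
    class_counts = Counter(labels)                    # number of documents per class
    doc_freq = {label: Counter() for label in class_counts}
    for doc, label in zip(docs, labels):
        doc_freq[label].update(set(doc))              # count each word once per document
    return vocab, class_counts, doc_freq


def predict_nb(model, doc):
    """Return the most probable class for one tokenized document."""
    vocab, class_counts, doc_freq = model
    total_docs = sum(class_counts.values())
    present = set(doc) & vocab                        # words unseen in training are ignored
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)         # log prior
        for word in vocab:
            # Laplace-smoothed probability that `word` occurs in a document of this class.
            p = (doc_freq[label][word] + 1) / (n_docs + 2)
            score += math.log(p if word in present else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def loocv_accuracy(docs, labels):
    """Leave-one-out CV: hold out each document once and train on the rest."""
    correct = 0
    for i in range(len(docs)):
        model = train_nb(docs[:i] + docs[i + 1:], labels[:i] + labels[i + 1:])
        correct += predict_nb(model, docs[i]) == labels[i]
    return correct / len(docs)


# Hypothetical, already-tokenized toy "couplets"; NOT taken from the study's dataset.
TOY_DOCS = [
    ["yusuf", "night"], ["yusuf", "sorrow"], ["yusuf", "heart"],
    ["garden", "night"], ["garden", "sorrow"], ["garden", "heart"],
]
TOY_LABELS = ["allusive"] * 3 + ["non_allusive"] * 3

print(f"LOOCV accuracy on toy data: {loocv_accuracy(TOY_DOCS, TOY_LABELS):.2%}")
```

On real data, `docs` would hold the tokenized, stop-word-filtered couplets and `labels` their annotations. Swapping in TF-IDF weights or a different classifier would change the training and prediction functions but leave the LOOCV loop as it is.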