Natural Language Processing

1.

Text complexity of reading comprehension passages in the National Matriculation English Test in China: The development from 1996 to 2020

Author: Xiaoli Yu

Source: International Journal of Language Testing, Volume 11, Issue 2, Summer and Autumn 2021, pp. 142-167

Keywords: corpus linguistics, High Stakes Exam, Natural Language Processing, Reading Comprehension, Text Complexity

Subject areas: Linguistics

Views: 610 Downloads: 281

This study examined how the text complexity of reading comprehension passages in the National Matriculation English Test (NMET) in China developed over the past 25 years. The text complexity of 206 reading passages was measured longitudinally at the lexical, syntactic, and discourse levels and compared across the years. The natural language processing tools used in the study included TAALES, TAALED, TAASSC, and TAACO. ANOVA and MANOVA tests were conducted to compare the years at each level of text complexity. The results suggested that lexical-level complexity changed most evidently: the lexical sophistication, density, and diversity of passages from the most recent years increased remarkably compared with the early years. Syntactic-level complexity rose moderately toward the recent years. At the discourse level, cohesion fluctuated insignificantly throughout the years, and the general trend was not necessarily upward. Taken together, the results indicated that the text complexity of NMET reading comprehension passages had been steadily increasing over the past 25 years through more low-frequency and academic vocabulary, more diverse vocabulary, and more complex sentence and grammatical structures. The results were further examined against the general curriculum standards and guidelines to analyze whether the changes were reflected in the policies. This comparison showed that the exams required a much larger vocabulary size than the number indicated in the guidelines. Suggestions for test designers and pedagogical practices were provided accordingly.
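As a rough illustration of what lexical-level measures capture, the sketch below computes a type-token ratio (a simple diversity measure) and a crude lexical density for a passage. It is not how TAALES or TAALED work internally; the function names and the tiny function-word list are assumptions made only for this example.

```python
import re

# Tiny illustrative function-word list (an assumption; real tools rely on
# full word lists or part-of-speech tags).
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "or",
    "is", "are", "was", "were", "it", "that",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Share of distinct word forms among all tokens (lexical diversity)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lexical_density(tokens):
    """Share of tokens that are not in the function-word list."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)


tokens = tokenize("The cat sat on the mat")
print(type_token_ratio(tokens), lexical_density(tokens))
```

Raw type-token ratio is sensitive to passage length, so comparisons across passages of different lengths would need a length-corrected variant.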

2.

Generation of Syntax Parser on South Indian Language using Bottom-Up Parsing Technique and PCFG (scholarly article, Ministry of Science)

Authors: M. Rajani Shree, Shambhavi B. R.

Source: Journal of Information Technology Management, Volume 15, Special Issue, 2023, pp. 19-33

Keywords: Natural Language Processing, Artificial Intelligence, Syntax Parser, CYK Parsing Algorithm, Probabilistic Context Free Grammar

Subject areas: Management > Knowledge Management and IT

Views: 719 Downloads: 352

In this research, we present a statistical syntax parsing method and test it on texts in Kannada, an official language of Karnataka, India. The dataset was downloaded from the TDIL website. In the first stage, we used the Cocke-Younger-Kasami (CYK) parsing technique to generate a Kannada Treebank dataset from 1000 annotated sentences. The resulting Treebank contains 1000 syntactically structured sentences and serves as input for training the syntax parser model in the second stage. We adopted a Probabilistic Context Free Grammar (PCFG) while training the parser model and extracting the Chomsky Normal Form (CNF) grammar from the Treebank dataset. The developed syntax parser model was tested on 150 raw Kannada sentences. It outputs the most likely parse tree for each sentence, which is verified against the gold Treebank. The syntax parser model achieved 74.2% precision, 79.4% recall, and a 75.3% F1-score. A similar technique may be adopted for other low-resource languages.
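To make the idea of finding the most likely parse tree under a PCFG in Chomsky Normal Form concrete, here is a minimal probabilistic CYK sketch. The toy English grammar, its probabilities, and the function name `cyk_parse` are hypothetical and chosen only for illustration; they are not the authors' Kannada grammar or implementation.

```python
# Toy PCFG in Chomsky Normal Form (hypothetical rules and probabilities).
BINARY = {  # (B, C) -> [(A, prob)] for each rule A -> B C
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
    ("Det", "N"): [("NP", 0.6)],
}
LEXICAL = {  # word -> [(A, prob)] for each rule A -> word
    "she": [("NP", 0.4)],
    "reads": [("V", 1.0)],
    "the": [("Det", 1.0)],
    "book": [("N", 1.0)],
}


def cyk_parse(words, start="S"):
    """Return (probability, tree) of the best parse, or None if none exists."""
    n = len(words)
    # best[i][j][A] = (prob, backpointer) for the best A spanning words[i:j]
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for a, p in LEXICAL.get(word, []):
            best[i][i + 1][a] = (p, word)
    # Bottom-up: fill longer spans from shorter ones.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b, (pb, _) in best[i][k].items():
                    for c, (pc, _) in best[k][j].items():
                        for a, pr in BINARY.get((b, c), []):
                            p = pr * pb * pc
                            if p > best[i][j].get(a, (0.0, None))[0]:
                                best[i][j][a] = (p, (k, b, c))
    if start not in best[0][n]:
        return None

    def build(i, j, a):
        # Follow backpointers to rebuild the tree as nested tuples.
        _, back = best[i][j][a]
        if isinstance(back, str):
            return (a, back)
        k, b, c = back
        return (a, build(i, k, b), build(k, j, c))

    return best[0][n][start][0], build(0, n, start)


print(cyk_parse("she reads the book".split()))
```

In a treebank-trained setting, the rule probabilities would be estimated from rule counts in the treebank rather than written by hand as they are here.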

3.

Preprocessing of Aspect-based English Telugu Code Mixed Sentiment Analysis (scholarly article, Ministry of Science)

Authors: Arun Kodirekka, Ayyagari Srinagesh

Source: Journal of Information Technology Management, Volume 15, Special Issue: Digital Twin Enabled Neural Networks Architecture Management for Sustainable Computing, 2023, pp. 150-163

Keywords: English-Telugu code-mixed data, Natural Language Processing, Telugu Senti Wordnet, Machine Learning, deep learning

Views: 416 Downloads: 246

Extracting sentiments from English-Telugu code-mixed data can be challenging and is still a relatively new research area. The data, obtained from the Twitter API, must be in English-Telugu code-mixed language. It is free-form, noisy text that contains lexical borrowings, code-mixing, phonetic typing, and misspellings. The first step is language identification and the assignment of a sentiment class label to each tweet in the dataset. The second step is data normalization, and the final step is classification, which can be achieved using three different methods: lexicon-based, machine learning, and deep learning. In the lexicon-based approach, each tweet is tokenized and each token is given a language tag. If the language tag is Telugu, the Roman-script token is transliterated into native Telugu script. The Telugu words are checked against TeluguSentiWordNet to extract their sentiments, and English SentiWordNet is used to extract sentiments from the English tokens. In this paper, an aspect-based sentiment analysis approach is proposed and applied to the normalized data. In addition, deep learning and machine learning techniques are applied to extract sentiment ratings, and the results are compared with prior work.
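The lexicon-based route described above can be sketched as follows. Every dictionary entry here is an illustrative assumption rather than data from TeluguSentiWordNet or SentiWordNet, and the names `score_tweet` and `label` are hypothetical; the sketch only shows how per-token language tags route each token to the right lexicon.

```python
# Toy resources: all entries are illustrative assumptions, not taken from
# TeluguSentiWordNet or SentiWordNet.
TRANSLIT = {"bagundi": "బాగుంది", "chetta": "చెత్త"}  # Roman script -> Telugu script
TELUGU_LEXICON = {"బాగుంది": 1.0, "చెత్త": -1.0}
ENGLISH_LEXICON = {"good": 1.0, "love": 1.0, "bad": -1.0, "boring": -1.0}


def score_tweet(tagged_tokens):
    """Sum lexicon scores over (token, language_tag) pairs.

    Tags are assumed to be "te" for Telugu and "en" for English;
    tokens with any other tag are ignored.
    """
    total = 0.0
    for token, lang in tagged_tokens:
        word = token.lower()
        if lang == "te":
            # Transliterate Roman-script Telugu before the lexicon lookup.
            native = TRANSLIT.get(word, word)
            total += TELUGU_LEXICON.get(native, 0.0)
        elif lang == "en":
            total += ENGLISH_LEXICON.get(word, 0.0)
    return total


def label(score):
    """Map a summed score to a sentiment class."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tweet = [("movie", "en"), ("chala", "te"), ("bagundi", "te")]
print(label(score_tweet(tweet)))
```

Tokens missing from both the transliteration table and the lexicons contribute nothing, which is one reason the normalization step matters for noisy, phonetically typed text.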

4.

The Study on Qur'anic surahs' Structured-ness and their Order Organization Using NLP Techniques (scholarly article, Ministry of Science)

Authors: احسان خدنگی, محمدمعین فاضلی, مهدی نقوی

Source: Interdisciplinary Qur'anic Studies, Volume 1, Issue 2, December 2022, pp. 29-56

Keywords: Natural Language Processing, Word2vec, Quran, Topic Sameness, Surahs' Structuredness, TF-IDF

Subject areas: Islamic Sciences > Exegesis and Qur'anic Sciences > Qur'anic Sciences

Views: 450 Downloads: 285

The structure of surahs has attracted researchers' attention in recent years. One theory in this area is Topic Sameness, which holds that each surah of the Quran is built around a single topic. The theory of Introduction and Explanation, one of the most important branches of Topic Sameness, proposes that the Almighty states the topic of each surah in its first section, elaborates on it in different parts of the surah in forms such as stories, signs of nature, and predictions of the future, and draws a conclusion from the stated contents in the final part. In this paper, we study the two theories using NLP techniques for the first time. To this end, the similarity of Quranic roots is computed with three methods: tf-idf, word2vec, and the co-occurrence of roots in verses. Then, the similarity of the concepts within each surah to one another is calculated and compared with a random baseline. The results show that the studied surahs are internally coherent in their concepts, in that each is built around a single topic or a few related topics. In addition, the similarity between the first section and the body of each surah suggests that, under the designed methodology, the Introduction and Explanation structure holds for many surahs. Finally, by comparing the similarity between surahs against their distance in the order of the Quran and in time of revelation, we found that the Quran as a whole is also relatively organized in terms of surah ordering.
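One of the three similarity methods named above, tf-idf, can be illustrated with a short sketch that builds tf-idf vectors and compares them by cosine similarity. This is a generic textbook formulation applied to toy token lists, not the authors' pipeline; the sample tokens and function names are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "sections" represented as lists of (hypothetical) root tokens.
sections = [
    ["mercy", "guidance", "mercy"],
    ["mercy", "guidance"],
    ["trade", "debt"],
]
vecs = tfidf_vectors(sections)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Comparing such scores with those obtained after shuffling tokens across sections is one simple way to build the kind of random baseline the abstract mentions.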

5.

Emotion Detection from the Text of the Qur’an Using Advance Roberta Deep Learning Net (scholarly article, Ministry of Science)

Authors: مصطفی کرمی, علیرضا طالب پور, فرزانه تاج آبادی, زینب حاجی محمدی

Source: Interdisciplinary Qur'anic Studies, Volume 2, Issue 1, June 2023, pp. 147-165

Keywords: Emotion detection, Natural Language Processing, Transformers, Parts of speech, Dependency Parsing, Qur’an text mining

Subject areas: Islamic Sciences > Exegesis and Qur'anic Sciences

Views: 378 Downloads: 266

As data and context continue to expand, a vast amount of textual content, including books, blogs, and papers, is produced and distributed electronically. Analyzing such large amounts of content manually is time-consuming. Automatic detection of feelings and emotions in these texts is crucial, as it helps to identify the emotions conveyed by the author, understand the author's writing style, and determine the target audience for these texts. The Qur’an, regarded as the word of God and a divine miracle, serves as a comprehensive guide and a reflection of human life. Detecting emotions and feelings within the content of the Qur’an contributes to a deeper understanding of God's commandments. Recent advancements, particularly the application of transformer-based language models in natural language processing, have yielded state-of-the-art results that are difficult to surpass. In this paper, we propose a method to enhance the accuracy and generality of these models by incorporating syntactic features such as part-of-speech (POS) and dependency parsing tags. Our approach aims to improve the performance of emotion detection models, making them more robust and applicable across diverse contexts. For model training and evaluation, we used the ISEAR dataset, a well-established and extensive dataset in this field. The results indicate that the proposed model outperforms existing models, achieving an accuracy of 77% on this dataset. Finally, we applied the proposed model to recognize the feelings and emotions conveyed in the Itani English translation of the Qur’an. The results revealed that joy makes the most significant contribution to the emotional content of the Holy Qur’an.

6.

Study of the Organization of the Qur’anic Surahs Using the Similarity-Based Approach in Deep Learning (scholarly article, Ministry of Science)

Authors: احسان خدنگی, محسن شعبانی

Source: Interdisciplinary Qur'anic Studies, Volume 2, Issue 2, December 2023, pp. 69-85

Keywords: the Qur’an, deep learning, Deep Neural Network, Clustering, surah similarity, Natural Language Processing

Subject areas: Islamic Sciences > Exegesis and Qur'anic Sciences

Views: 258 Downloads: 162

According to numerous studies, the Qur’anic surahs exhibit internal structure and organization, with each surah serving a distinct purpose. Although each surah focuses on a specific theme and the Qur’an identifies 114 broad themes, the arrangement of the surahs and the remarkable similarity between adjacent (neighboring) surahs underscore the chain-like, deliberate positioning of the surahs within the Qur’an. To investigate this phenomenon, a compound model was developed, comprising two main parts: embedding and autoencoding. In the first part, the words and roots of the Qur’anic text were prepared and passed through the BERT model to obtain a meaning-topic representation. In the second part, the data was clustered in a soft labeling mode by the autoencoder. Analysis of the distribution of surahs within clusters revealed that neighboring surahs exhibited an average similarity of 80, while surahs at a greater distance showed an average similarity of 20. The findings support the placement of similar surahs in close proximity, substantiating the organized sequence of Qur’anic surahs. To conclude, the results provide compelling evidence for the structured arrangement of Qur’anic surahs.

7.

Empowering Students with Innovative AI-Language Learning Tools and Pedagogy to Master Speaking Skills (scholarly article, Ministry of Science)

Authors: حامد فتحی, عارف شریفی, حسین احمدی

Source: Iranian Journal of Applied Language Studies, Vol. 17, No. 1, 2025, pp. 197-220

Keywords: Artificial Intelligence, AI applications, Natural Language Processing, Computer-assisted language learning, AI-supported language teaching

Subject areas: Linguistics

Views: 153 Downloads: 106

Artificial Intelligence (AI) is increasingly transforming the landscape of education, particularly within the domain of language learning, as evidenced by a growing body of research published in computer-assisted language learning (CALL) journals. These studies have examined the application of various AI technologies, including natural language processing (NLP), AI-driven educational platforms, automatic speech recognition, and chatbots, in facilitating language acquisition. The present study investigated the perceptions of 386 Iranian high school EFL students, utilizing the Students’ Perceived EFL Teacher Support Scale to evaluate the impact of AI-powered speaking assistance technologies, educational level, and learning setting on perceived teacher support. The findings revealed a tri-factorial structure underlying EFL teacher support, highlighting the compatibility of AI technologies with traditional pedagogical methods. This suggests that the integration of AI-powered tools into classroom instruction can enhance the overall effectiveness of language teaching and learning. To ensure optimal outcomes, educators are encouraged to strategically incorporate AI within pedagogically sound frameworks that maintain human-centered support. The study offers important implications for sustaining and enhancing teacher support in technology-enriched learning environments and underscores the need for further empirical research in this evolving area of applied linguistics and educational technology.

8.

Transformer-Based Personality Trait Recognition Enhanced by Contextual Augmentation (scholarly article, Ministry of Science)

Authors: حسین صابری, رضا روانمهر

Source: International Journal of Web Research, Volume 9, Issue 1, 2026, pp. 1-24

Keywords: Personality recognition, Natural Language Processing, transformer models, Electra, Big Five Personality Traits, Computational Psychology

Subject areas: Information Science and Knowledge Studies

Views: 12 Downloads: 12

psychological research, it often suffers from label interference, vocabulary-driven overfitting, and limited labeled datasets. As a result, models are brittle: they can fail with small training samples and behave inconsistently across trait ranges. To address this, we employ a practical single-trait approach that uses five independent ELECTRA-based classifiers, each corresponding to one of the Big Five dimensions, and train them as separate binary tasks to prevent cross-trait interference. To reduce lexical bias and double the Pennebaker and King essay corpus from 2,467 to 4,934 samples, we applied careful synonym-replacement augmentation using WordNet and additionally incorporated contextual augmentation generated by the Gemma model. Models were tuned methodically to ensure fair comparisons. With test AUCs above 0.75, the ensemble achieves an average test accuracy of 0.724 on the Pennebaker and King benchmark, with per-trait accuracies of 0.72, 0.71, 0.74, 0.73, and 0.72 for openness, conscientiousness, extraversion, agreeableness, and neuroticism (OCEAN), respectively. The approach substantially reduces inter-trait interference while matching or surpassing LIWC baselines and other transformer approaches.
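The reported average accuracy can be reproduced from the five per-trait figures with a short check. The numbers are taken from the abstract; the variable names are only for illustration.

```python
# Per-trait test accuracies as reported in the abstract (OCEAN order).
per_trait_accuracy = {
    "openness": 0.72,
    "conscientiousness": 0.71,
    "extraversion": 0.74,
    "agreeableness": 0.73,
    "neuroticism": 0.72,
}

# Unweighted mean over the five independent binary classifiers.
average = sum(per_trait_accuracy.values()) / len(per_trait_accuracy)
print(round(average, 3))  # prints 0.724
```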