A Corpus of Light Verb Constructions in Persian (Scientific Article, Ministry of Science)
Academic rank: Scientific journal (Ministry of Science)
Abstract
A linguistic corpus is a large collection of linguistic data based on the usage of a language's speakers, which gives researchers access to real patterns of language use. The advantage of corpora over other data sources lies not only in their large volume of data but also in making it possible to use computers in linguistic studies. This article introduces the first corpus of Persian light verb constructions. Understanding the nature of these constructions and having access to a list of them is important not only theoretically but also practically. These findings have applications in artificial intelligence research related to natural language processing, machine translation, Persian language teaching, grammar writing, and lexicography. The target corpus of this study is called the "Linguistic Corpus of Persian Light Verb Constructions", or LCP. To build it, the monolingual corpus of the Research Institute of Information and Communication Technology (Bijan Khan, 2018), which contains 950,000 text files, was selected as the base corpus. The complex verbal constructions formed with 21 productive Persian light verbs were extracted from it and, after annotation within the framework of Distributed Morphology (Halle & Marantz, 1993; Marantz, 2013), presented as a corpus comprising more than 6,000 light verb constructions in more than 2,000,000 linguistic contexts. A comparison of the number of lexical verbs in Persian with the number of light verb constructions in the present corpus is the most obvious reason why such a corpus is needed among Persian language resources. Moreover, the nature of this corpus, namely its presentation of light verb constructions in varied linguistic contexts, can help researchers answer open questions about these constructions, reject or confirm hypotheses, and propose new theories.
A Corpus of Light Verb Constructions in Persian
Abstract
A linguistic corpus is a collection of linguistic data drawn from texts that presents researchers with real patterns of language use. The advantage of a corpus over other linguistic resources stems from the amount of data it contains and from the possibility of using computers in linguistic studies. The present study introduces an annotated monolingual linguistic corpus of Persian Light Verb Constructions (LVCs), called LCP, developed by the authors. The corpus contains more than 6,000 LVCs used in more than 2,000,000 linguistic contexts. A simple comparison of the number of LVCs with the number of simple verbs in Persian is enough to show the importance of this type of language resource. The annotated corpus presents LVCs formed by 21 Persian Light Verbs (LVs) as they are used in real contexts. As the first resource of its kind, it can readily provide a large body of machine-readable data with which researchers can assess existing hypotheses and put forward new ones.
Keywords: Persian Language, Language Resources, Linguistic Corpus, Light Verb Constructions, Natural Language Processing
Introduction
Light verbs are a group of verbs that have lost part of their semantic content over the course of language change. In combination with a preverbal element such as a noun, an adjective, or a prepositional phrase, they form Light Verb Constructions (LVCs) in Persian. The study of LVCs is important both theoretically and practically, and all the more so in Persian, whose verbal system consists largely of LVCs. Nevertheless, many studies have pointed out the challenges that Persian LVCs pose for computational systems, emphasizing the lack of appropriate computational resources and the need for studies that provide researchers with the standard patterns of these constructions (Maerefat, 2004; Hasas Sediqi, 2010; Taslimipoor, 2012; Askariyan, 2012; Barfi, 2016, among others). Although valuable Persian corpora have already been developed by specialists such as Bijan Khan (2004, 2018), Asi (2005), and Al-e-Ahmad et al. (2010), no corpus comprehensively represents the LVCs of all productive Persian Light Verbs (LVs). The only available corpus dealing with Persian LVCs is PersPred (Samvelian & Faghiri, 2013), which covers the LVCs of only one of the twenty-one productive Persian LVs (zadan). To address this need, we developed the first corpus of Persian LVCs.[1] This annotated corpus presents the LVCs formed by 21 Persian LVs as they are used in real contexts. As the first resource of its kind, it can readily provide researchers with a large body of varied, machine-readable data.
Materials and Methods
The corpus was developed in the following steps: designing the structure of the corpus, selecting a base corpus, normalizing the texts, defining the search nodes, writing macros in Visual Basic for Applications (VBA) to build the search tool, extracting all sentences containing the verbs under investigation (whether used as light or lexical verbs), extracting the sentences with LVCs, and finally selecting an annotation model and applying it to the results. The corpus was designed as a synchronic monolingual corpus of Persian LVCs. As a base, we chose the corpus developed by Bijan Khan (2018) at the Research Institute of Information and Communication Technology, which contains 950,000 text files. We first normalized the texts and then used VBA macros to extract the LVCs formed by 21 Persian LVs (da:shtan: have, kardan: do, shodan: become, gashtan: turn, goza:shtan: put, keshidan: pull, didan: see, da:dan: give, bakhshidan: give/grant, gereftan: get, ya:ftan: obtain, ?a:madan: come, ?a:vardan: bring, residan: arrive, raftan: go, ?ofta:dan: fall, ?anda:khtan: throw, bordan: take, khordan: collide, zadan: hit, and bastan: tie). Then, constituency tests (topicalization, coordination, deletion, and substitution) were applied to distinguish LVCs from lexical verbs. The LVCs were annotated at the word level within a Distributed Morphology framework (Halle & Marantz, 1993; Marantz, 2013). Preverbal elements were treated as categoryless elements and annotated as Pre-Verbs (PVs), while LVs were treated as categorizers and annotated as LVs. In addition, the present and past lemmas of each LVC were given, and its separability or inseparability was annotated as SEP or INSEP.
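The step of extracting every sentence that contains one of the target verbs, whether light or lexical, can be sketched as follows. This is a minimal Python illustration, not the authors' VBA macros: the stem table, the affix lists, and the function names are assumptions made for the example, and the input is assumed to be already normalized and written in the transliteration used in this paper.

```python
import re

# Hypothetical stem table (infinitive -> (past stem, present stem)).
# Only two of the 21 light verbs are listed to keep the sketch short.
LIGHT_VERB_STEMS = {
    "kardan": ("kard", "kon"),
    "zadan": ("zad", "zan"),
}

def build_pattern(past_stem, present_stem):
    """Compile a pattern for simple inflected forms of one verb.

    Assumed affixes: negative na-/ne-, imperfective mi-, subjunctive be-,
    and the personal endings -am, -i, -ad, -im, -id, -and.
    """
    stems = f"{re.escape(past_stem)}|{re.escape(present_stem)}"
    return re.compile(
        rf"(?:na|ne)?(?:mi|be)?(?:{stems})(?:am|i|ad|im|id|and)?"
    )

def find_candidate_sentences(sentences, stems=LIGHT_VERB_STEMS):
    """Return (sentence, infinitive) pairs for sentences with a matching token.

    No attempt is made here to separate light from lexical uses, so the
    result is a candidate list that still needs linguistic filtering.
    """
    patterns = {inf: build_pattern(*pair) for inf, pair in stems.items()}
    hits = []
    for sentence in sentences:
        tokens = sentence.split()
        for infinitive, pattern in patterns.items():
            if any(pattern.fullmatch(token) for token in tokens):
                hits.append((sentence, infinitive))
    return hits

if __name__ == "__main__":
    sample = [
        "?u ba: man harf zad",
        "ma: keta:b kha:ndim",
        "?a:nha: ka:r mikonand",
    ]
    for sentence, verb in find_candidate_sentences(sample):
        print(verb, "|", sentence)
```

Matching on stems over-generates (for example, the noun zan 'woman' matches the present stem of zadan), which is why a filtering step such as the constituency tests is still needed to separate LVCs from other uses.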
It should be noted that, in line with Karimi-Doostan (2011), cases in which the preverbal element and the LV were separated by a negative particle (neg), the imperfective morpheme (mi), modals and auxiliaries such as ba:yad (should, must), kha:stan (will) as a future auxiliary verb, and da:shtan (to have) as a progressive auxiliary verb, or clitic pronouns like -esh (it) were annotated as INSEP. Table 1 presents these tags and the colors used for each of them.
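The SEP/INSEP decision described above can be expressed as a small rule. The sketch below is only an illustration under stated assumptions: the category labels (NEG, IMPF, MODAL, AUX, CLITIC) and the function name are hypothetical, only the SEP and INSEP tags come from the annotation scheme, and the treatment of all other intervening material as SEP is an assumption rather than something the description spells out.

```python
# Hypothetical labels for material that may stand between the preverbal
# element (PV) and the light verb (LV) without counting as separation.
NON_SEPARATING = {"NEG", "IMPF", "MODAL", "AUX", "CLITIC"}

def separability_tag(intervening_categories):
    """Return "INSEP" or "SEP" for one occurrence of an LVC.

    intervening_categories lists the categories of the elements found
    between the PV and the LV. An empty list (PV and LV adjacent) is
    treated as INSEP, which is an assumption of this sketch.
    """
    if all(cat in NON_SEPARATING for cat in intervening_categories):
        return "INSEP"
    return "SEP"

if __name__ == "__main__":
    print(separability_tag(["NEG", "IMPF"]))  # INSEP
    print(separability_tag(["CLITIC"]))       # INSEP
    print(separability_tag(["ADJ"]))          # SEP
```

Keeping the non-separating categories in a single set makes the rule easy to audit against the annotation guidelines and to extend if further elements are added.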
Discussion of Results and Conclusion
Light Verb Constructions (LVCs), as a subset of complex or multi-word predicates, are among the most challenging topics in the study of language. The present study developed a monolingual corpus of Persian LVCs with the aim of providing researchers with a large body of machine-readable data on these constructions and improving the reliability of studies conducted in this field. The corpus includes more than 6,000 LVCs in more than 2,000,000 contexts. In contrast, the number of lexical verbs in Persian is about 200. This comparison highlights how significant such a linguistic resource can be for a language and its researchers. The corpus can be used in machine translation, artificial intelligence and language processing programs, data recovery programs, language learning, grammar books, and dictionaries.
[1] The corpus of Light Verb Constructions of Persian is available at https://literature.ut.ac.ir/compound-verb.