A Mechanism to Manage Time and Increase Data Accuracy When Using the Selenium Library (Scientific Article, Ministry of Science)

Academic rank: Scientific journal (Ministry of Science)

Authors: فرناز تقی زاده کورایم، محمدرضا کاباران زاده قدیم، سید عبدالله امین موسوی

Source: Knowledge Retrieval and Semantic Systems, Year 11, Winter 1403 (Solar Hijri), No. 41

Keywords: web crawler, web scraping, Selenium library, data asymmetry, information accuracy, information retrieval time

Subject areas:

Information Science and Knowledge Studies

doi: 10.22054/jks.2024.80235.1660

Pages: 199-225

Abstract

Today, data is one of the most valuable assets of organizations and industries and plays an important role in the development and growth of businesses. Every organization draws on various sources to collect its data. One of these sources is the web, where large volumes of data are produced and published every day by users, and even bots, around the world. Collecting and analyzing such data can provide an organization with useful information. To this end, various tools have been developed over the past decades that greatly help with extracting information from the web, including the Requests, Selenium, Scrapy, and Beautiful Soup libraries in the Python programming language, among others. However, each of these libraries faces challenges. In this article, having studied the Selenium library and in view of its numerous challenges, we present a solution for managing time and mitigating its asymmetry challenge. Our experiments show that the proposed solution increases the accuracy of the information extracted from the web, thereby mitigating the asymmetry challenge, and also reduces the time needed to extract that information.

A Mechanism to Manage Time and Increase Data Accuracy When Using the Selenium Library

Introduction

The Internet is a powerful source of information that can be collected with various tools and techniques and, after analysis, used to make better and more efficient decisions. According to previous researchers, Selenium is always the best option for automatically extracting information from the web; however, this library has many challenges. One challenge of using the Selenium library is its asynchronous behavior and another is its slowness. This article investigates both and tries to improve them.

Research Question(s)

How can the slowness and asynchrony challenges of the Selenium library be mitigated?

Literature Review

The Selenium library, one of the best web scraping tools, has been used in different studies for different purposes. It is a free, open-source automated testing framework used to verify web applications across multiple platforms and browsers. Selenium test scripts can be written in various programming languages such as Java, C#, and Python (Teotia et al., 2023). Despite its many advantages, Selenium also has disadvantages, including: 1. slowness, 2. brittleness, 3. flakiness, 4. maintainability, 5. asynchrony, 6. time consumption, 7. cross-browser issues, 8. failure analysis, 9. infrastructure, 10. scalability, 11. assertability, 12. documentation, and 13. support (Leotta et al., 2023).

Methodology

The first issue we examined in the Selenium library is the lack of time management: the time the library has to wait for information to be downloaded from the web is not known in advance.
To address the slowness and asynchrony of the Selenium library, we used a solution with three steps:

Step 1) Based on manual checks, we first define the variable t1:

    t1 = 0.5

where t1 is the value passed to the sleep function (the time, in seconds, required to open the main page under normal conditions).

Step 2) We use a while loop with a try/except block inside it. If the page does not open within 3 seconds, that is, if the desired page and products do not appear after 6 attempts with different sleep times, an error is printed stating that the site is unavailable or the internet is slow:

    import time

    # driver is an already created Selenium WebDriver instance
    t1 = 0.5
    page_loaded = False
    while t1 <= 3:
        try:
            driver.get("""The Study site""")  # open the page
            time.sleep(t1)  # wait for the page to open
            # …
            # code related to page scrolling
            time.sleep(t1)  # wait for information to be displayed after scrolling
            # information collection code
            page_loaded = True
            break  # success: keep the current t1 for the next pages
        except Exception:
            t1 += 0.5  # the page was not ready: wait half a second longer
    if not page_loaded:
        print('The internet speed is very slow or the intended site is down')

In this code, the program first tries to display the full page information with t1 = 0.5. If it fails, it adds half a second to t1, and this is repeated up to 6 times. If the page is displayed in full, the resulting value of t1 is used for the next pages.

Step 3) If the page opens, we move on to the third step, collecting the basic product information, where the asynchrony problem must be avoided. Because each product has five types of information (product name, price, discount amount, price after discount, and discount type), we first define five empty lists for product information. Then, using the information collection commands, we retrieve the information for each product and append it to the corresponding list.
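The retry logic of Step 2 can also be sketched as a small reusable function. This is only an illustration under stated assumptions, not the authors' code: the names `load_with_adaptive_wait`, `load_page`, and `simulated_page` are hypothetical, and `load_page` stands in for whatever Selenium calls open the page, sleep, scroll, and collect data, raising an exception when the content is not yet available.

```python
def load_with_adaptive_wait(load_page, start=0.5, step=0.5, max_wait=3.0):
    """Call load_page(wait) with growing waits.

    Returns the first wait (in seconds) for which load_page did not raise,
    or None if every wait up to max_wait failed.
    load_page is a hypothetical caller-supplied function, not a Selenium API.
    """
    wait = start
    while wait <= max_wait:
        try:
            load_page(wait)
            return wait  # success: reuse this wait for the next pages
        except Exception:
            wait += step  # page not ready: try again with a longer wait
    return None


def simulated_page(wait):
    """Stand-in for a real page: it only 'loads' when the wait is at least 1.5 s."""
    if wait < 1.5:
        raise RuntimeError("content not ready")


if __name__ == "__main__":
    print(load_with_adaptive_wait(simulated_page))  # prints 1.5
```

With the default arguments the function makes at most six attempts (0.5, 1.0, ..., 3.0 seconds), matching the limit described in Step 2. A `None` result corresponds to the case where the site is unavailable or the connection is too slow.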
We then run the following script to keep wrong items out of the lists:

    check1 = [product_prices, product_off, product_prices2, Product_off_type]
    for j in range(0, 4):
        # compare the name list with each of the other four lists
        if len(product_name) < len(check1[j]):
            product_name.append('error')
        elif len(product_name) > len(check1[j]):
            check1[j].append('error')

In other words, after a product's information has been retrieved and added to the predefined main lists, the program places the lists in check1 and uses the len function to count the items in each. Then, in the for loop, the length of the name list is compared with the length of each of the other lists, and the word "error" is appended to whichever list is shorter.

Results

To evaluate the proposed solution, we collected the information on supplementary drugs listed on the DigiKala site (almost 3000 different medicines) 13 times on 13 different dates, from September 23, 2023, to March 15, 2024. The code written to retrieve information from the site was executed once with the proposed solution and once without it. In each execution, both the retrieval time and the number of items in each list (that is, each column) were recorded and compared. The error rate and its percentage were calculated from the difference in collection time between the two executions and the difference in the number of items collected for each column.
After applying the proposed solution, the investigations show that the accuracy and correctness of the collected information increased compared with not using it, and the information collection time also improved.

Discussion and Conclusion

In this article, we studied and evaluated the slowness, time consumption, and asymmetry challenges of the Selenium library. Our studies were conducted using the Python programming language. They show that checking that the lists have the same length after collecting each product from the web is very important: without this check, errors occurred in 12 of the 13 collection runs. Also, using a constant value for the sleep function significantly increases the retrieval time compared with using a variable value. In general, the findings show that using the proposed solution with the Selenium library to extract information from the web helps increase the accuracy of the information and also reduces the time needed for complete retrieval.
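The list-length check emphasized above can be illustrated with a short, self-contained sketch. It is a generalization rather than the script from the methodology: instead of appending a single "error" per comparison, it pads every column up to the length of the longest one. The function name `pad_to_same_length` and the sample values are hypothetical.

```python
def pad_to_same_length(columns, filler="error"):
    """Pad every list in columns (a dict of name -> list) in place so that
    all lists end up as long as the longest one. Returns that length."""
    target = max((len(values) for values in columns.values()), default=0)
    for values in columns.values():
        # add as many filler entries as this column is missing
        values.extend([filler] * (target - len(values)))
    return target


if __name__ == "__main__":
    # Hypothetical state after the second product: its price was not found.
    columns = {
        "product_name": ["A", "B"],
        "product_prices": ["10"],
        "product_off": ["5%", "0%"],
    }
    pad_to_same_length(columns)
    print(columns["product_prices"])  # prints ['10', 'error']
```

Running such a check after each product keeps row i of every column tied to the same product, so a missing field shows up as an explicit "error" entry instead of silently shifting the values that follow.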