Estimating the Number of Topics in Topic Modeling of Persian Scientific Articles

Mozafari, Niloofar

doi:10.22034/jipm.2023.701394

Article type: Research article

Author

Niloofar Mozafari

Regional Information Center for Science and Technology

Abstract

This article presents a method for finding the number of topics in Persian scientific articles, which is one of the main challenges in topic modeling. Topic modeling is the process of automatically identifying the topics in a text with the aim of discovering hidden patterns.
This applied study estimates the number-of-topics parameter for articles in Persian journals by comparing two methods, one based on grid search and the other based on renormalization theory. The grid-search method defines a criterion for evaluating a topic model, computes that criterion for different values of the number of topics, and uses the results to estimate the optimal number of topics. The other algorithm is based on renormalization theory, a mathematical formalism for constructing a procedure that changes the scale of the system under study in such a way that the system's behavior is preserved and its trend does not change. Using this theory together with information from the previous step, the number of topics can be estimated quickly. In addition, the execution times of both algorithms on articles from different Persian journals are reported and compared.
The findings indicate the efficiency of the renormalization-based method in estimating the number of topics in Persian journal articles.
The results show that the renormalization-based method can estimate the number of topics faster than the grid-search method. The method can be used to estimate the number-of-topics parameter for Persian journal articles, which ultimately leads to topic modeling of Persian journals based on the articles published in them.
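
The grid-search procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the study's implementation: `fit_and_score` is a hypothetical callable that would train a topic model with a given number of topics and return the evaluation criterion, `toy_score` is an invented stand-in so that the example runs without a corpus, and the convention that a lower criterion is better is an assumption.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def grid_search_num_topics(
    candidates: Iterable[int],
    fit_and_score: Callable[[int], float],
) -> Tuple[int, Dict[int, float], float]:
    """Evaluate every candidate topic count and return the best one.

    Returns (best_count, scores_by_count, elapsed_seconds).
    Assumption: a lower score means a better model.
    """
    start = time.perf_counter()
    # One full model fit per candidate: this is what makes grid search slow.
    scores = {t: fit_and_score(t) for t in candidates}
    best = min(scores, key=scores.get)
    elapsed = time.perf_counter() - start
    return best, scores, elapsed


def toy_score(num_topics: int) -> float:
    """Hypothetical stand-in for 'train a model, then compute the criterion'."""
    return (num_topics - 12) ** 2 + 3.0


best, scores, elapsed = grid_search_num_topics(range(2, 31), toy_score)
print(f"best number of topics: {best} ({len(scores)} fits, {elapsed:.6f} s)")
```

With a real corpus, each call to `fit_and_score` would train a separate model, so the running time grows with the number of candidate values; this is the cost the abstract compares against the renormalization-based method.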

Keywords

Renormalization theory

Rényi entropy

Grid search

Dirichlet distribution

Article Title (English)

Estimating Number of Topics in Topic Modeling on Persian Research Articles

Author (English)

Niloofar Mozafari

Abstract (English)

This article presents a method for finding the number of topics in Persian research articles, which is one of the main challenges in topic modeling. Topic modeling is the process of automatically recognizing the topics in a text with the aim of discovering hidden patterns.
This study estimates the number of topics for Persian research articles using two approaches. The first is based on grid search; the second uses renormalization theory, a mathematical formalism for constructing a procedure that changes the scale of a system while preserving the system's behavior. The execution times of both algorithms on Persian academic articles are also compared.
The findings indicate that the renormalization approach estimates the number of topics in Persian research articles in less time than the grid-search-based approach.
The renormalization-based approach is highly efficient for estimating the number of topics in Persian academic articles.
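
A scale-changing procedure of this kind can be pictured as repeatedly merging topics of an already trained model, so that each smaller topic count is scored without retraining. The sketch below is an illustration under stated assumptions, not the algorithm used in the study: topics are merged by averaging the two closest word distributions (closeness measured by symmetric KL divergence), and `distinctness` is a hypothetical criterion standing in for whatever measure the study uses. The toy input contains three topics, each present twice in slightly perturbed form.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List

Topics = List[List[float]]  # one row per topic; each row is a word distribution


def sym_kl(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """Symmetric KL divergence, KL(p||q) + KL(q||p)."""
    return sum((a - b) * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def merge_closest(phi: Topics) -> Topics:
    """One coarse-graining step: replace the two closest topics by their average."""
    i, j = min(combinations(range(len(phi)), 2),
               key=lambda ij: sym_kl(phi[ij[0]], phi[ij[1]]))
    merged = [(a + b) / 2.0 for a, b in zip(phi[i], phi[j])]
    return [row for k, row in enumerate(phi) if k not in (i, j)] + [merged]


def distinctness(phi: Topics) -> float:
    """Hypothetical criterion: mean pairwise divergence (higher is better)."""
    pairs = list(combinations(phi, 2))
    return sum(sym_kl(p, q) for p, q in pairs) / len(pairs)


def rescale_and_score(phi: Topics, score: Callable[[Topics], float]) -> Dict[int, float]:
    """Score every topic count from len(phi) down to 2, reusing the previous step."""
    trace = {len(phi): score(phi)}
    while len(phi) > 2:
        phi = merge_closest(phi)      # no retraining, only a merge
        trace[len(phi)] = score(phi)
    return trace


def toy_topic(start: int, first: float, second: float) -> List[float]:
    """Invented 6-word topic whose mass sits on words start and start + 1."""
    row = [0.025] * 6
    row[start], row[start + 1] = first, second
    return row


# Three underlying topics, each duplicated with a small perturbation.
phi0 = [toy_topic(s, a, b) for s in (0, 2, 4) for a, b in ((0.45, 0.45), (0.44, 0.46))]
trace = rescale_and_score(phi0, distinctness)
best = max(trace, key=trace.get)
print(f"estimated number of topics: {best}")
```

On this toy input the criterion peaks at three topics. The model is trained only once, at the largest topic count, which is where a speed advantage over grid search would come from.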

Keywords (English)

Renormalization Theory

Rényi Entropy

Grid Search

Latent Dirichlet Allocation
