Comparison of the Performance of Approaches to Discovering and Extracting E-book Topics

Authors

1 University of Isfahan, Isfahan, Iran

2 Faculty of Educational Sciences and Psychology, University of Isfahan, Isfahan, Iran

3 Department of Artificial Intelligence, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran

Abstract

Keyword extraction is one of the important problems in text processing and analysis, as it provides a high-level, accurate summary of a text; choosing a suitable method for extracting a text's keywords is therefore essential. The aim of the present study is to compare the performance of three approaches to discovering and extracting the subject keywords of e-books using text mining and machine learning techniques. To this end, three experimental approaches are introduced and compared: (1) successive runs of the clustering process, semantic refinement of the clusters, and enrichment of domain-specific stop words; (2) use of a template of specialized keywords; and (3) use of the important parts of the text to discover and extract its key terms and main topics. The statistical population consists of 1,000 e-book titles from the subject subfields of library and information science according to the Library of Congress Classification; after their bibliographic records were obtained from the Library of Congress database, the full texts were acquired. Subject keyword extraction and clustering of the training data were performed with the non-negative matrix factorization algorithm under the three experimental approaches, and the quality and performance of the resulting topic clusters were compared in the automatic classification of the test data with a support vector machine. The findings showed that the Hamming loss (0.020), i.e., the error rate in correctly classifying the test texts, was lower for the third approach (using the important parts of the text to extract subject keywords) than for the other two. Likewise, the F1 score (0.82), the harmonic mean of precision (0.87) and recall (0.78) and a reflection of how correctly the classification process assigns subject labels to texts, was better for the third approach than for the other two. The analyses showed that the quality and semantic coherence of the topic clusters produced by the third approach, i.e., using the important parts of the text for topic discovery and extraction, were better than those of the other two approaches. Moreover, the keywords obtained from the topic clusters of the third approach can be applied to undescribed and unknown collections in order to extract the latent subject content of the whole collection.
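The extraction stage described above can be illustrated with a minimal sketch, assuming scikit-learn and a tiny placeholder corpus (the `documents` list below is hypothetical, not the study's data). It shows the general idea only: TF-IDF weighting followed by non-negative matrix factorization, with the top-weighted terms of each factor read off as candidate subject keywords. It is not the authors' actual implementation.

```python
# Minimal sketch (assumption, not the study's code): NMF-based topic keyword discovery.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

documents = [
    "digital libraries and electronic book collections",
    "machine learning methods for text classification",
    "subject indexing and keyword extraction from full text",
]  # hypothetical placeholder corpus

# Term weighting: documents -> TF-IDF matrix (domain-specific stop words could be added here).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Non-negative matrix factorization: X ~ W . H, where rows of H act as topic clusters.
n_topics = 2  # chosen arbitrarily for the sketch
nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)   # document-topic weights
H = nmf.components_        # topic-term weights

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(H):
    top_terms = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top_terms)}")
```

Under the third approach, the input to such a step would be restricted to the important parts of each text rather than the full body; under the first approach, the clustering would be rerun with semantic refinement and an enriched stop-word list.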

Keywords


Article Title [English]

Comparison of the performance of approaches to discovering and extracting e-book topics

Authors [English]

  • Fatemeh Zarmehr 1
  • Ali Mansouri 2
  • Hossein Karshenas 3
Abstract [English]

Keyword extraction is one of the most important problems in text processing and analysis, as it provides a high-level, accurate summary of a text. Choosing the right method for extracting keywords from a text is therefore important. The aim of the present study was to compare the performance of three approaches to discovering and extracting the subject keywords of e-books using text mining and machine learning techniques. To this end, three experimental approaches were introduced and compared: (1) successive runs of the clustering process, semantic refinement of the clusters, and enrichment of domain-specific stop words; (2) use of a template of specialized keywords; and (3) use of the important parts of the text to discover and extract key words and important topics. The statistical population includes 1,000 e-book titles from the subject subfields of library and information science based on the Library of Congress Classification system; after the bibliographic records of the e-books were obtained from the Library of Congress database, their full texts were prepared. Subject keyword extraction and clustering of the training data were performed using the non-negative matrix factorization algorithm under the three experimental approaches, and the quality and performance of the resulting subject clusters were compared in the automatic classification of the test data using a support vector machine. The findings showed that the Hamming loss (0.020), i.e., the error rate in correctly classifying the test texts, was lower for the third approach than for the other two. Likewise, the F1 score (0.82), the harmonic mean of precision (0.87) and recall (0.78) and a reflection of how correctly the classification process assigns subject labels to texts, was better for the third approach than for the other two. The results showed that the quality and semantic coherence of the subject clusters obtained from the third approach, i.e., using the important parts of the text to discover and extract topics, were better than those of the other two approaches. In this approach, focusing on the main parts of the data, which represent the core content and theme of the text, produced more meaningful topic clusters. In addition, the keywords obtained from the topic clusters of the third approach can be applied to undescribed and unknown collections in order to extract the latent subject content of the whole collection. In terms of precision, recall, and classification error rate, the third approach likewise outperformed the other two.
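As a companion to the reported metrics, the sketch below (again an assumption built on scikit-learn, not the study's code) shows how a multi-label SVM classifier can be scored with Hamming loss, precision, recall, and F1; the feature and label matrices are hypothetical toy values. Note that the reported F1 is consistent with the harmonic mean 2 × 0.87 × 0.78 / (0.87 + 0.78) ≈ 0.82.

```python
# Minimal sketch (assumption, not the study's code): multi-label SVM evaluation.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import hamming_loss, precision_score, recall_score, f1_score

# Hypothetical document-topic features (e.g., NMF weights) and binary subject labels.
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
X_test = np.array([[0.85, 0.15], [0.15, 0.85]])
y_test = np.array([[1, 0], [0, 1]])

# One-vs-rest SVM assigns each test document zero or more subject labels.
clf = OneVsRestClassifier(LinearSVC()).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Hamming loss: fraction of label slots predicted incorrectly (lower is better).
print("Hamming loss:", hamming_loss(y_test, y_pred))
# Micro-averaged precision, recall, and F1 over all label assignments.
print("precision:", precision_score(y_test, y_pred, average="micro"))
print("recall:", recall_score(y_test, y_pred, average="micro"))
print("F1:", f1_score(y_test, y_pred, average="micro"))
```

The micro-averaging choice here is an illustrative assumption; the study's abstract does not state which averaging scheme was used.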

Keywords [English]

  • E-book
  • Extraction
  • Subject Keywords
  • Text Mining
  • Topic Modeling