Automatic Detection of the Boundary between Metadata and Body in Persian Theses using BA_SVM

Abstract

Metadata extraction facilitates indexing and improves information retrieval, and automating it is more efficient than extracting metadata manually. Examples of thesis metadata include the names of students and professors, the title, field, degree, abstract, and keywords. This paper aims to automatically detect the boundary between the metadata and the main body in Persian theses. To this end, 250 theses were collected from the IRANDOC system. Features were extracted from the paragraphs of each thesis, and the paragraphs were then classified by a support vector machine (SVM) into two classes: metadata and body. The Bat algorithm (BA) is used to set the SVM parameters. The results show that the proposed method predicts the type of a paragraph with 96.6 percent accuracy.
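The parameter search described above can be illustrated with a minimal sketch of a basic bat algorithm. It is not the authors' implementation. Several things here are assumptions made only for illustration: that the tuned SVM parameters are the penalty C and an RBF kernel width gamma, that they are searched on a log2 scale within the bounds shown, and that the hypothetical function `surrogate_cv_error` can stand in for the cross-validated error of an SVM trained on the paragraph features. The update rules follow the common textbook form of the algorithm, and all names and default values are hypothetical.

```python
import math
import random


def bat_search(objective, bounds, n_bats=15, n_iter=60, f_min=0.0, f_max=2.0,
               alpha=0.9, gamma=0.9, r0=0.5, seed=0):
    """Minimise `objective` inside the box `bounds` with a basic bat algorithm."""
    rng = random.Random(seed)
    dim = len(bounds)

    def clip(x):
        # Keep every coordinate inside its (low, high) bound.
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in pos]
    loud = [1.0] * n_bats   # loudness A_i, shrinks after each accepted move
    rate = [0.0] * n_bats   # pulse rate r_i, grows towards r0 after accepted moves
    i_best = min(range(n_bats), key=lambda i: fit[i])
    best, best_fit = pos[i_best][:], fit[i_best]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Frequency-driven velocity and position update.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq
                      for v, x, b in zip(vel[i], pos[i], best)]
            cand = clip([x + v for x, v in zip(pos[i], vel[i])])
            # With probability 1 - r_i, replace the move by a local step near the best.
            if rng.random() > rate[i]:
                cand = clip([b + rng.uniform(-1.0, 1.0) * 0.1 * (hi - lo) * mean_loud
                             for b, (lo, hi) in zip(best, bounds)])
            cand_fit = objective(cand)
            # Accept a non-worse candidate with probability equal to the loudness.
            if cand_fit <= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = r0 * (1.0 - math.exp(-gamma * t))
            # Track the best point evaluated so far.
            if cand_fit < best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit


def surrogate_cv_error(point):
    # Hypothetical stand-in for "1 - cross-validated SVM accuracy":
    # a smooth bowl whose minimum sits at log2(C) = 3, log2(gamma) = -5.
    log2_c, log2_gamma = point
    return (log2_c - 3.0) ** 2 + (log2_gamma + 5.0) ** 2


# Assumed search ranges for log2(C) and log2(gamma).
bounds = [(-5.0, 15.0), (-15.0, 3.0)]
best, best_fit = bat_search(surrogate_cv_error, bounds)
print("C = 2 **", round(best[0], 2), " gamma = 2 **", round(best[1], 2))
```

In a real experiment the surrogate would be replaced by a function that trains an SVM with the candidate parameters on labelled paragraphs and returns its validation error. Only the objective changes; the search loop stays the same.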
