A Persian Citation Parsing Method Using Support Vector Machine



Human readers can easily split a bibliographic reference into its constituent fields, such as authors, title, journal, and year. Automating this task is difficult, however, because citation formats vary and authors make errors when citing documents. Many solutions exist for this problem, known in the literature as the citation parsing problem, but none of them is compatible with the Persian language, mainly because these solutions are highly language-sensitive. Given the important role of citation parsing in tasks such as autonomous citation indexing and information retrieval, this paper proposes an intelligent method for citation parsing in Persian. The proposed method uses support vector machine (SVM) classification at its core. Tested on a dataset designed for this task, the method achieves an average of 95% for precision, recall, and F1 in extracting the different fields of a bibliographic reference, which is a satisfactory result.
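To make the reported measures concrete, the following minimal sketch shows one common way to compute per-field precision, recall, and F1 and then average them across fields. It assumes token-level labels and an unweighted (macro) average; the abstract does not say which granularity or averaging scheme the authors used. The field names, the label sequences, and the helper functions `per_field_scores` and `macro_average` are hypothetical and serve only as illustration.

```python
from collections import Counter


def per_field_scores(gold, predicted):
    """Return {field: (precision, recall, f1)} for two equal-length label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # field p was predicted but is wrong
            fn[g] += 1  # gold field g was missed
    scores = {}
    for field in sorted(set(gold) | set(predicted)):
        predicted_total = tp[field] + fp[field]
        gold_total = tp[field] + fn[field]
        precision = tp[field] / predicted_total if predicted_total else 0.0
        recall = tp[field] / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[field] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of (precision, recall, f1) over all fields."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical token labels for a single reference (illustration only).
gold = ["author", "author", "year", "title", "title", "title", "journal"]
predicted = ["author", "title", "year", "title", "title", "title", "journal"]

scores = per_field_scores(gold, predicted)
macro = macro_average(scores)
for field, (p, r, f) in scores.items():
    print(f"{field:8s} P={p:.3f} R={r:.3f} F1={f:.3f}")
print("macro    P={:.3f} R={:.3f} F1={:.3f}".format(*macro))
```

In this toy case a single mislabeled author token lowers the recall of the author field and the precision of the title field, which shows why per-field scores are more informative than one overall accuracy figure.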

