Introducing and Evaluating the PerCause Causality Corpus for Identifying Persian Causal Relations

Authors

Natural Language Processing Laboratory, Shahid Beheshti University, Tehran, Iran

Abstract

Identifying causal relations and determining the boundaries of causal elements in text are among the challenging problems in natural language processing (NLP), especially in low-resource languages such as Persian. In this study, to train a system that identifies causal relations and the boundaries of their elements, we introduce a human-annotated causality corpus for Persian. The collection comprises 4446 sentences (extracted from the Bijankhan corpus and the text of a number of books) and 5128 causal relations; where present, three labels (cause, effect, and causal marker) are specified for each relation. We used this corpus to train a system for detecting the boundaries of causal elements. We also present a causality detection benchmark based on this corpus, covering three machine learning methods and two deep learning systems. Performance evaluations show that the best overall result is obtained with the CRF classifier, which yields an F-measure of 76%. In addition, the best accuracy (91.4%) is obtained with the BiLSTM-CRF deep learning method. It appears that the presence of a CRF improves the system's precision because it models context.

Keywords


Article Title [English]

Persian Causality Corpus (PerCause) and the Causality Detection Benchmark

Authors [English]

  • Zeinab Rahimi
  • Mehrnoush ShamsFard
Abstract [English]

Recognizing causal elements and causal relations in text is among the challenging issues in natural language processing (NLP), specifically in low-resource languages such as Persian. In this research, we prepare a human-annotated causality corpus for the Persian language. This corpus consists of 4446 sentences and 5128 causal relations. Three labels (Cause, Effect, and Causal mark) are assigned to each relation where present. We used this corpus to train a system for detecting causal elements’ boundaries.
We also present a causality detection benchmark based on this corpus, covering three machine-learning methods and two deep learning systems. Performance evaluations indicate that the best overall result is obtained with the CRF classifier, which provides an F-measure of 0.76. In addition, the best accuracy (91.4%) is obtained with the BiLSTM-CRF deep learning method.
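
The gap between the reported accuracy and F-measure is easier to read with a small worked example. The sketch below is illustrative only: it assumes a token-level BIO tagging scheme with hypothetical tag names (`B-CAUSE`, `I-CAUSE`, `B-EFFECT`, `I-EFFECT`, `B-MARK`, `O`) and a micro-averaged F-measure that ignores the `O` tag. The abstract does not state the tag scheme or the averaging used, and the sentence tags are invented for illustration. Because most tokens in a sentence usually lie outside any causal element, accuracy counts many easy `O` matches and can be higher than the F-measure.

```python
# Hypothetical gold and predicted tags for one ten-token sentence.
# The BIO tag names are an assumption, not taken from the PerCause paper.
gold = ["O", "B-CAUSE", "I-CAUSE", "O", "B-MARK", "O", "B-EFFECT", "I-EFFECT", "O", "O"]
pred = ["O", "B-CAUSE", "O", "O", "B-MARK", "O", "O", "I-EFFECT", "O", "O"]


def token_accuracy(gold_tags, pred_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    matches = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return matches / len(gold_tags)


def micro_f1(gold_tags, pred_tags, ignore="O"):
    """Micro-averaged token-level F1 over all tags except `ignore`."""
    pairs = list(zip(gold_tags, pred_tags))
    tp = sum(g == p and g != ignore for g, p in pairs)  # correct element tags
    fp = sum(p != ignore and p != g for g, p in pairs)  # wrongly predicted element tags
    fn = sum(g != ignore and g != p for g, p in pairs)  # missed element tags
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f"accuracy = {token_accuracy(gold, pred):.2f}")
print(f"micro F1 = {micro_f1(gold, pred):.2f}")
```

In this toy sentence the tagger misses two element tokens and labels them `O`, which lowers recall and therefore F1, while the many correct `O` tokens keep accuracy higher.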

Keywords [English]

  • PerCause
  • Causality annotated corpus
  • causality detection
  • Deep Learning
  • CRF