Iranian Journal of Information Processing and Management

Using Computational Methods for Persian Collocations Identification and Extraction

Document Type : Original Article

Authors
1 Assistant Professor; School of Intelligent Systems; University of Tehran
2 Master of Computational Linguistics, Tehran University; Tehran, Iran
3 Professor; Faculty of Literature and Humanities; University of Tehran
4 Associate Professor; Faculty of New Sciences and Technologies, Tehran University; Tehran, Iran
Abstract
This article explores the recognition of collocations in the Persian language. Previous research in this field has been primarily statistical and comparative in nature. The objective of this study is to identify collocations using a corpus-based, computational approach. To this end, the Persian language database is used as the research corpus. Because no comprehensive collocation dictionary exists for Persian, a dataset of collocations was constructed from the Advanced Learners’ Persian Dictionary. Using FastText embedding vectors, a language model is trained with a Long Short-Term Memory (LSTM) network. ParsBERT is also fine-tuned for the same task, and model performance is evaluated using lists of a thousand collocations and a thousand non-collocations. Finally, a comparative analysis of collocation recognition is conducted with Google Translate by translating a thousand Persian sentences, each containing at least one collocation, into English. The results indicate that the ParsBERT model achieves recall rates of 93.95% and 85.8% for collocation and non-collocation recognition, respectively. In contrast, the LSTM-based language model achieves recall rates of 6.6% and 0% for collocation and non-collocation recognition, respectively. The analysis of Google Translate's accuracy in translating collocations revealed three outcomes: 1) the collocation was correctly recognized and translated; 2) the collocation was not correctly recognized, resulting in a literal, word-for-word translation; and 3) the collocation was not recognized, leading to an incorrect translation.
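The per-class recall figures reported above can be read as follows: for each class (collocation or non-collocation), recall is the share of gold items of that class that the model labels correctly. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code; the function name `per_class_recall`, the label strings, and the toy data are all hypothetical.

```python
def per_class_recall(gold, predicted, label):
    """Recall for one class: correct predictions of `label` / gold items of `label`."""
    # Keep only the predictions made for items whose gold label is `label`.
    relevant = [p for g, p in zip(gold, predicted) if g == label]
    if not relevant:
        raise ValueError(f"no gold items with label {label!r}")
    return sum(p == label for p in relevant) / len(relevant)


# Toy, made-up labels for illustration only (not the study's data).
gold = ["colloc", "colloc", "colloc", "colloc", "non", "non", "non", "non"]
predicted = ["colloc", "colloc", "colloc", "non", "non", "non", "colloc", "colloc"]

recall_colloc = per_class_recall(gold, predicted, "colloc")  # 3 of 4 gold collocations found
recall_non = per_class_recall(gold, predicted, "non")        # 2 of 4 gold non-collocations found
```

Reporting recall separately for each class, as the abstract does, shows whether a model finds collocations at the cost of mislabeling non-collocations, which a single pooled score would hide.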


  • Receive Date 03 June 2024
  • Revise Date 06 October 2024
  • Accept Date 02 November 2024