Iranian Journal of Information Processing and Management

Ferdows-Lex: A Lexical Corpus of Persian Language Teaching Materials for Teaching Non-Persian Learners

Document Type: Original Article

Authors
1 PhD Candidate in General Linguistics, Department of Linguistics, Faculty of Letters and Humanities, Ferdowsi University of Mashhad, Mashhad, Iran
2 Associate Professor, Department of Persian Language and Literature and Department of Linguistics, Faculty of Letters and Humanities, Ferdowsi University of Mashhad, Mashhad, Iran
3 Associate Professor, Department of Linguistics, Faculty of Letters and Humanities, Ferdowsi University of Mashhad, Mashhad, Iran
4 PhD Candidate in General Linguistics, Department of Linguistics, Faculty of Letters and Humanities, Ferdowsi University of Mashhad, Mashhad, Iran
Abstract
The purpose of this study was to develop a corpus based on vocabulary overlap in materials for Teaching Persian to Non-Persian Speakers (TPNPS) at the elementary, intermediate, and advanced levels. Computational tools and a corpus-informed approach were applied in a three-step protocol. First, the research data were prepared. The data were selected from 26 TPNPS textbooks, including Parsa, Mina, Shiraz, Parfa, and Amozash e Novin e Zaban e Farsi, at three language proficiency levels. The research dataset contained 15,585 tokens in total. The data were typed out, and then computational pre-processing and part-of-speech (POS) tagging were carried out. Normalization was performed mainly with the Dadmatools package, while tokenization, lemmatization, and POS tagging were carried out with the standard Stanza package. Second, the vocabulary overlap range across all textbooks at each level, and across all levels, was analyzed using Python. Finally, the corpus was encoded in XML. The corpus has three proficiency levels, each including vocabulary information such as lemma, overlap range, alphabet, token, POS, and metadata. The results showed that the vocabulary overlap rate was stable at first and then decreased as proficiency increased: it stood at about 36 percent at the elementary level and 36.5 percent at the intermediate level, but declined to 13 percent at the advanced level. Furthermore, in the POS analysis, nouns, verbs, and adjectives were the most frequently repeated categories at all three levels. When vocabulary overlap was compared across levels (elementary to intermediate, intermediate to advanced, and elementary to advanced), nouns had the highest share. The findings underscore the need for systematic development of teaching materials to support the gradual improvement of language skills.
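The overlap analysis and XML encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define how the overlap range was computed, so the sketch assumes it is the number of textbooks at a level in which a lemma occurs, and that the overlap rate is the share of distinct lemmas found in at least two textbooks. The textbook names, transliterated lemmas, and XML element and attribute names are all hypothetical, and the lemma and POS pairs are assumed to come from an earlier tagging step.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical toy data: textbook name -> (lemma, POS) pairs, as they might
# come out of a lemmatizer and POS tagger. Lemmas are transliterated.
TEXTBOOKS = {
    "book_a": [("ketab", "NOUN"), ("khandan", "VERB"), ("bozorg", "ADJ")],
    "book_b": [("ketab", "NOUN"), ("raftan", "VERB"), ("bozorg", "ADJ")],
    "book_c": [("ketab", "NOUN"), ("khane", "NOUN")],
}


def lemma_range(textbooks):
    """Count, for each (lemma, POS) pair, how many textbooks contain it."""
    counts = Counter()
    for tagged in textbooks.values():
        counts.update(set(tagged))  # set(): each book counts at most once
    return counts


def overlap_rate(counts, min_books=2):
    """Share of distinct lemmas found in at least `min_books` textbooks.

    This definition is an assumption made for illustration only.
    """
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n >= min_books)
    return shared / len(counts)


def to_xml(counts, level):
    """Build one <level> element with an <entry> per lemma (assumed schema)."""
    root = ET.Element("level", name=level)
    # Highest range first, then alphabetical by (lemma, POS).
    for (lemma, pos), n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        entry = ET.SubElement(root, "entry", pos=pos, range=str(n))
        entry.text = lemma
    return root


counts = lemma_range(TEXTBOOKS)
rate = overlap_rate(counts)
xml_text = ET.tostring(to_xml(counts, "elementary"), encoding="unicode")
print(f"overlap rate: {rate:.1%}")
print(xml_text)
```

Counting each textbook at most once per lemma keeps the measure a range (dispersion across books) rather than a raw frequency, so a word repeated many times in a single book does not count as shared vocabulary.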

Volume 40, Issue 4 - Serial Number 124
Summer 2025
Pages 1179-1218

  • Received: 03 March 2025
  • Revised: 01 May 2025
  • Accepted: 04 May 2025