Natural Language Text Corpus: Design, Construction and Management

Asadi, Hamideh; Naghshineh, Nader; Hosseini Beheshti, Moluk Sadat

doi:10.22034/jipm.2025.709151

Document Type: Original Article

Authors

Hamideh Asadi ¹

Nader Naghshineh ²

Moluk Sadat Hosseini Beheshti ³

¹ PhD in Library and Information Science-Information Retrieval; University of Tehran

² PhD in Library and Information Science; Associate Professor in Library and Information Science; University of Tehran

³ PhD in Linguistics; Associate Professor in Terminology & Ontology Research Group; Iranian Research Institute for Information Science and Technology (IranDoc)

Abstract

Corpora play a role in many fields of study, and a general corpus is needed to make processes that extract or use natural language text more efficient and effective. This study therefore addresses the design and automatic construction of a natural language text corpus, together with software for managing it.
A technology-based method was used to construct a monolingual Persian corpus. The corpus is produced automatically from web data, and its sources are news texts published by Persian-language news agencies.
The study produced a corpus of natural language texts in Persian. Because construction is automatic, software is needed to manage the corpus during both construction and information extraction; this software was designed, built, and implemented as part of the study.
A general corpus of natural language texts serves various research purposes, and the proposed method and the tools introduced in this study can facilitate corpus construction. In addition, corpus management software saves construction time and cost and makes it possible to extract information from the corpus.

Keywords

Corpus, Data Set, Natural Language Processing, NLP, Corpus Linguistics, Artificial Intelligence

Subjects

Language and Linguistic Tools

Iranian Journal of Information Processing and Management

Volume 41, Issue 1 - Serial Number 126
Autumn 2025
Pages 71-98

Receive Date 28 June 2023
Revise Date 25 November 2023
Accept Date 25 November 2023
