Multi-level Persian Dataset for Information Retrieval

Abedzadeh, Ali; Ramezani, Reza; Fatemi Khorasgani, Afsaneh

doi:10.22034/jipm.2024.710246

Document Type: Original Article

Authors

Ali Abedzadeh ¹

Reza Ramezani ²

Afsaneh Fatemi Khorasgani ³

¹ Master of Software Engineering; Faculty of Computer Engineering; University of Isfahan

² Ph.D. in Computer Engineering; Associate Professor; Faculty of Computer Engineering; University of Isfahan.

³ Ph.D. in Computer Engineering; Associate Professor; Faculty of Computer Engineering; University of Isfahan.

Abstract

Information retrieval systems are an essential part of many intelligent systems, with applications that include search engines such as Google and Bing, question-answering systems, and modern databases. An information retrieval system tries to retrieve the documents related to a question or query from a large collection, whose size can range from a few thousand to millions of documents. In recent years, much research has gone into building information retrieval systems on top of language models. For the Persian language, however, no such research has been done, and one of the main reasons is the lack of a suitable Persian dataset for training language models. This research first presents a Persian dataset for information retrieval and then investigates methods for enriching it. The enrichment is done by defining multi-level relationships between a document and a question: the new dataset expresses the relationship between question and document at four levels (unrelated, related, highly related, completely related) instead of two (completely unrelated, completely related). The resulting dataset is named PersianMLIR. Experiments show that using multi-level relationships improves system performance for both Persian and English, with an improvement of 1.87% for Persian. The results indicate that enriching information retrieval datasets by increasing the number of relation levels between query and document leads to better performance of information retrieval systems.
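To make the difference between two-level and four-level labels concrete, the following is a minimal sketch, not the authors' method. It assumes the four levels map to integer gains 0 to 3 and uses nDCG as the ranking metric; the abstract does not state which metric or label encoding the paper uses, so the names Relevance, to_binary, dcg, and ndcg are illustrative only.

```python
import math
from enum import IntEnum


class Relevance(IntEnum):
    """Hypothetical integer encoding of the four levels named in the abstract."""
    UNRELATED = 0
    RELATED = 1
    HIGHLY_RELATED = 2
    COMPLETELY_RELATED = 3


def to_binary(label):
    """Collapse a four-level label to two levels (assumed mapping: only the top level counts)."""
    return 1 if label == Relevance.COMPLETELY_RELATED else 0


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount (rank starts at 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Two made-up rankings of the same three documents for one query.
ranking_a = [Relevance.HIGHLY_RELATED, Relevance.UNRELATED, Relevance.COMPLETELY_RELATED]
ranking_b = [Relevance.UNRELATED, Relevance.HIGHLY_RELATED, Relevance.COMPLETELY_RELATED]

graded_a, graded_b = ndcg(ranking_a), ndcg(ranking_b)
binary_a = ndcg([to_binary(r) for r in ranking_a])
binary_b = ndcg([to_binary(r) for r in ranking_b])

print(f"graded nDCG: A={graded_a:.3f}, B={graded_b:.3f}")
print(f"binary nDCG: A={binary_a:.3f}, B={binary_b:.3f}")
```

With two-level labels the rankings score the same, because the highly related document is treated as unrelated. With four-level labels, ranking A scores higher for placing that document first, which is the kind of extra signal a multi-level dataset can provide.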

Keywords

Information Retrieval

Language Models

Information Retrieval Dataset

Persian Dataset

Subjects

Other New Fields and Topics in Information and Knowledge Management

Iranian Journal of Information Processing and Management

Volume 39, Issue 3 - Serial Number 118
Spring 2024
Pages 1109-1137

Receive Date 18 March 2023
Revise Date 16 November 2023
Accept Date 18 November 2023
