Extraction of Effective Textual and Semantic Features in Learning to Rank for Web Document Retrieval

Abstract

Ranking algorithms, the core of web search systems, are responsible for finding the documents in the crawled and indexed corpus that are most relevant to a user's information need and ranking them. With the ever-increasing amount of available training data, ranking technologies are moving towards Machine Learning methods, known as Learning to Rank algorithms. Basic Learning to Rank systems have mainly used textual features while ignoring semantic features. With the advent of the Semantic Web, there is growing interest in developing and using semantic features for Learning to Rank systems. An important challenge is that there is currently no comprehensive study of the combined use of textual and semantic features in Learning to Rank systems. In this paper, we define and implement four new sets of semantic features based on Knowledge Graph, Entity Repetition, Textual Fields, and Vector Representation of Words and Texts. For the experimental analysis, we used the MQ2007 dataset from LETOR 4.0, which includes a set of textual features. The results of running six standard Learning to Rank algorithms show that using semantic features, either in isolation or in combination with textual features, significantly increases performance. The increase is even larger when we limit the tests to hard queries. We also implemented an existing Feature Selection algorithm to test whether it could improve the results further. It improved the results for some Learning to Rank algorithms but not for others.
