A Genetic-Based Method for Disambiguating the Names of Article Authors

Author

Regional Information Center for Science and Technology; Islamic World Science Citation Center; Shiraz, Iran

Abstract

Today, with the steadily growing volume of articles on the one hand and the use of the Internet and search engine services on the other, methods for disambiguating researchers' names have attracted considerable attention. Various methods have been proposed so far to solve this problem, each with its own advantages and disadvantages. The aim of this article is to present an approach for identifying the multiple records that belong to a single author. To this end, after the internal and external features of authors are extracted, a new measure is proposed for determining the degree of similarity between two records. The importance of each proposed feature is determined by a genetic-based algorithm with two different fitness functions, so that the optimal coefficients are obtained by learning from the available samples. The proposed method was evaluated and compared with both fitness functions on experimental data, and the results show that, with either fitness function, it achieves higher accuracy than the previous method.
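As a rough illustration of a weighted similarity between two author records, the sketch below combines per-field string similarities using one coefficient per field. It is not the measure defined in the article: the use of normalised Levenshtein distance for every field, the field names (`name`, `affiliation`, `email`), and the function names are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def field_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty fields count as a match
    return 1.0 - levenshtein(a, b) / longest


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of field similarities; weights play the role of coefficients."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = sum(w * field_similarity(r1.get(f, ""), r2.get(f, ""))
                for f, w in weights.items())
    return score / total


# Hypothetical records and coefficients, for illustration only.
rec_a = {"name": "N. Mozafari", "affiliation": "RICeST", "email": "a@example.org"}
rec_b = {"name": "Niloofar Mozafari", "affiliation": "RICeST", "email": "a@example.org"}
coeffs = {"name": 0.5, "affiliation": 0.3, "email": 0.2}
similarity = record_similarity(rec_a, rec_b, coeffs)
```

Two records would then be treated as the same author when `similarity` exceeds some threshold; both the threshold and the coefficients are what a learning step would have to tune.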

Article title [English]

A Genetic-based Approach for Author Name Disambiguation Problem

Author [English]

  • Niloofar Mozafari
Abstract [English]

In recent years, with the growing volume of articles and the widespread use of the Internet and search engine services, the author name disambiguation problem has received a lot of attention. Name ambiguity arises when one seeks the list of publications of an author who has used different name variations, and also when multiple other authors share the same name. So far, various methods have been proposed to solve this problem, each with its own advantages and disadvantages. Despite years of research, the name disambiguation problem remains largely unresolved. In this study, we propose an algorithm to identify the multiple records that belong to a single author. For this purpose, a new criterion is proposed to determine the similarity between two records. Since this study addresses the approximate matching of authors' records, the importance of each field in a record is expressed by a coefficient. To obtain the optimal coefficients, we propose a genetic algorithm that learns them from the available samples. The proposed method has been evaluated with two fitness functions on experimental data, and the results are promising.
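To make the idea of learning field coefficients with a genetic algorithm concrete, here is a minimal sketch. It is not the article's algorithm: the abstract does not name its two fitness functions, so the accuracy-based fitness, the 0.5 decision threshold, truncation selection, one-point crossover, Gaussian mutation, and all function and parameter names below are assumptions. Each training sample is assumed to be a list of precomputed per-field similarities in [0, 1] plus a label saying whether the two records belong to the same author.

```python
import random


def predict(weights, sims, threshold=0.5):
    """Classify a record pair as 'same author' if the weighted mean similarity is high."""
    total = sum(weights)
    if total == 0:
        return False
    score = sum(w * s for w, s in zip(weights, sims)) / total
    return score >= threshold


def accuracy_fitness(weights, samples):
    """Assumed fitness: fraction of labelled pairs classified correctly."""
    correct = sum(predict(weights, sims) == label for sims, label in samples)
    return correct / len(samples)


def evolve(samples, n_fields, fitness=accuracy_fitness, pop_size=30,
           generations=40, mutation_rate=0.2, seed=0):
    """Search for field coefficients in [0, 1] that maximise the given fitness."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_fields)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda w: fitness(w, samples), reverse=True)
        elite = ranked[: pop_size // 2]  # truncation selection: keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n_fields) if n_fields > 1 else 0
            child = p1[:cut] + p2[cut:]  # one-point crossover
            for i in range(n_fields):
                if rng.random() < mutation_rate:
                    # Gaussian mutation, clamped back into [0, 1]
                    child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        population = elite + children
    return max(population, key=lambda w: fitness(w, samples))


# Toy labelled pairs: field 0 is informative, field 1 is mostly noise.
samples = [([0.9, 0.2], True), ([0.8, 0.9], True),
           ([0.2, 0.8], False), ([0.1, 0.1], False)]
best = evolve(samples, n_fields=2)
```

Because `fitness` is a parameter, a second fitness function (for example, one based on precision and recall) could be passed in and compared against the first without changing the search loop.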

Keywords [English]

  • Name Disambiguation Problem
  • Levenshtein Distance
  • Genetic Algorithm
  • Fitness Function