Iranian Journal of Information Processing and Management

Performance Evaluation and Accuracy Improvement in Individual Record Linking Problems Using Decision Tree Algorithm in Machine Learning

Document Type: Exploring the Relationship between Data Quality and Business Process Management

Authors
1 Department of Industrial Engineering, Payame Noor University, PO Box 3697-19395, Tehran, Iran
2 Associate Professor, Department of Industrial Engineering, Payame Noor University, Tehran
3 Assistant Professor, Department of Industrial Engineering, Payame Noor University
4 Associate Professor, Statistics Department, Research Institute of Statistics
Abstract
Record linkage is vital for consolidating data from different sources, particularly for Persian records, where diverse data structures and formats present challenges. To tackle these complexities, this work uses an expert system built on decision tree algorithms to support precise record linkage and data aggregation. By incorporating decision trees into an expert system framework, adaptation operations are created from predefined rules, which simplifies the aggregation of disparate data sources. This method is more effective and easier to use than traditional approaches such as IF-THEN rules, and its intuitive nature makes it more accessible to non-technical users. Integrating probabilistic record linkage results into the decision tree model within the expert system automates the linkage process while still allowing users to customize string metrics and thresholds for optimal outcomes. The model achieves an accuracy of over 95% on test data, which indicates that it predicts and adjusts to data variations well and supports its reliability across various record linkage scenarios. Using machine learning decision trees alongside probabilistic record linkage in an expert system is the main contribution of this work, and it provides a robust solution for data aggregation in intricate environments and large-scale projects involving Persian records. The approach streamlines the consolidation of diverse data sources and improves the accuracy and efficiency of record linkage operations. By leveraging machine learning techniques and automated decision-making processes, organizations can achieve significant improvements in data quality and consistency, paving the way for more reliable and insightful analytical results when implementing statistical registers.
In conclusion, integrating decision trees and probabilistic record linkage in an expert system offers an effective solution to data aggregation challenges in Persian records and beyond.
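The general idea of feeding string-similarity scores into a decision tree that learns match thresholds can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (`name`, `city`), the toy transliterated records, the use of `difflib.SequenceMatcher` as the string metric, and the small Gini-based tree are all assumptions made for illustration, using only the Python standard library.

```python
from difflib import SequenceMatcher

FIELDS = ["name", "city"]  # hypothetical field names, for illustration only


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for any user-chosen string metric."""
    return SequenceMatcher(None, a, b).ratio()


def features(rec_a, rec_b, fields, metric=similarity):
    """One similarity score per compared field for a candidate record pair."""
    return [metric(rec_a[f], rec_b[f]) for f in fields]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(X, y):
    """Return (impurity, feature index, threshold) of the best split, or None."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between two observed scores
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


def build_tree(X, y, depth=0, max_depth=3):
    """Grow a small binary tree; leaves are 1 (match) or 0 (non-match)."""
    majority = int(2 * sum(y) > len(y))  # ties fall back to non-match
    if len(set(y)) == 1 or depth == max_depth:
        return majority
    split = best_split(X, y)
    if split is None:
        return majority
    _, j, t = split
    low = [(row, lab) for row, lab in zip(X, y) if row[j] <= t]
    high = [(row, lab) for row, lab in zip(X, y) if row[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "low": build_tree([r for r, _ in low], [l for _, l in low], depth + 1, max_depth),
        "high": build_tree([r for r, _ in high], [l for _, l in high], depth + 1, max_depth),
    }


def predict(tree, x):
    """Walk the tree for one feature vector and return 1 (match) or 0."""
    while isinstance(tree, dict):
        branch = "low" if x[tree["feature"]] <= tree["threshold"] else "high"
        tree = tree[branch]
    return tree


# Toy labelled pairs (invented for this sketch): 1 = same person, 0 = different.
pairs = [
    ({"name": "Mohammad Rezaei", "city": "Tehran"}, {"name": "Mohamad Rezaie", "city": "Tehran"}, 1),
    ({"name": "Fatemeh Karimi", "city": "Shiraz"}, {"name": "Fatemeh Karimy", "city": "Shiraz"}, 1),
    ({"name": "Ali Hosseini", "city": "Tabriz"}, {"name": "Ali Hoseini", "city": "Tabriz"}, 1),
    ({"name": "Mohammad Rezaei", "city": "Tehran"}, {"name": "Maryam Ahmadi", "city": "Isfahan"}, 0),
    ({"name": "Ali Hosseini", "city": "Tabriz"}, {"name": "Sara Moradi", "city": "Tehran"}, 0),
    ({"name": "Fatemeh Karimi", "city": "Shiraz"}, {"name": "Hassan Jafari", "city": "Mashhad"}, 0),
]
X = [features(a, b, FIELDS) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
tree = build_tree(X, y)

new_pair = ({"name": "Zahra Mousavi", "city": "Qom"}, {"name": "Zahra Musavi", "city": "Qom"})
print(predict(tree, features(*new_pair, FIELDS)))
```

In this sketch the string metric is passed in as a function, so a different measure can be swapped in, and the learned `threshold` values remain visible in the tree, which reflects the kind of user-adjustable metrics and thresholds the abstract describes. A production system would more likely rely on an established decision tree library and far more training pairs.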


  • Receive Date 16 December 2023
  • Revise Date 15 May 2024
  • Accept Date 02 June 2024