1 Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran
2 Amirkabir University of Technology, Tehran, Iran
Research information databases and search engines are among the main resources that researchers use every day. To retrieve information from these databases accurately, the data must be stored correctly, yet manual data quality control is costly and time-consuming. Here we propose data mining methods for controlling the quality of a research database. To this end, the common errors seen in the database are first collected, and the metadata of every record, together with its error codes, is saved in a dataset. Statistical and data mining methods are then applied to this dataset to discover patterns of errors and the relationships among them. As a case study, we considered Iran's scientific information database (Ganj). Experts defined 59 error types. Key features of every record, such as its subject, the authors' names, and the name of the university, were saved in a dataset along with its error codes, yielding a dataset of 41,021 records. Statistical methods and association rule mining were applied to the dataset, revealing relationships between errors and their patterns of repetition. Based on our results, on average, considering 25% of the error types in each subject covers up to 80% of the errors across all records in that subject. All records were also clustered using K-means. Although there was some similarity between records of different subjects, no evident relationship was observed between the pattern of error repetition and the subject of the records.
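The two techniques the abstract names, association rule mining over error codes and K-means clustering of records, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual pipeline: the data are synthetic, the error codes and their dependency are invented, and K-means is implemented directly (Lloyd's algorithm) rather than via a library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the error dataset described in the abstract: each row is one
# record, each column a binary flag for one error code. (The sizes and the
# injected dependency below are illustrative, not the paper's 59 real codes
# or 41,021 real records.)
N_RECORDS, N_ERRORS = 200, 6
X = (rng.random((N_RECORDS, N_ERRORS)) < 0.3).astype(float)
# Inject a dependency so a rule is discoverable: error 3 usually co-occurs with error 0.
X[X[:, 0] == 1, 3] = (rng.random(int(X[:, 0].sum())) < 0.9).astype(float)

def rule_metrics(X, a, b):
    """Support and confidence of the association rule 'error a -> error b'."""
    both = np.mean((X[:, a] == 1) & (X[:, b] == 1))
    support_a = np.mean(X[:, a] == 1)
    return both, (both / support_a if support_a else 0.0)

def kmeans(X, k, iters=50, seed=0):
    """Minimal K-means (Lloyd's algorithm) over binary error profiles."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each record to its nearest center (squared Euclidean distance).
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of the records assigned to it.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

support, confidence = rule_metrics(X, 0, 3)
labels, centers = kmeans(X, k=3)
print(f"rule e0->e3: support={support:.2f}, confidence={confidence:.2f}")
print("cluster sizes:", np.bincount(labels, minlength=3))
```

In the paper's setting, the rows would be Ganj records, the columns the 59 expert-defined error codes, and the discovered high-confidence rules would point to errors that tend to repeat together within a subject.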