Authors
1 Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran
2 Department of Computer Engineering, Quchan Branch, Islamic Azad University, Quchan, Iran
3 Department of Knowledge and Information Science, Ferdowsi University of Mashhad, Iran
Abstract
Keywords
Article Title [English]
Authors [English]
Improving information retrieval performance depends on how knowledge is extracted from the large volume of textual information on the web. Text classification is one way to extract such knowledge using supervised machine learning. This paper proposes a K-Nearest-Neighbor (KNN) classifier with Kullback-Leibler divergence for classifying features extracted through term weighting combined with the Latent Dirichlet Allocation (LDA) algorithm. LDA is a topic modeling method, closely related to non-negative matrix factorization, used for dimensionality reduction of the high-dimensional feature space. In traditional LDA, each component value is assigned using the information-retrieval Term Frequency measure. While this weighting seems well suited to information retrieval, it is not clear that it is the best choice for text classification, since it does not leverage the information implicitly contained in the categorization task when representing documents. In this paper, we introduce a new weighting method based on Pointwise Mutual Information (PMI) for assessing the importance of a word to a specific latent concept; each document is then classified based on its probability distribution over the latent topics. Experimental results show that using the PMI measure for term weighting and K-Nearest-Neighbor with Kullback-Leibler distance for classification achieves an accuracy of 82.5%, on par with probabilistic deep learning methods.
Keywords [English]
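Below is a minimal sketch of the pipeline described in the abstract, assuming PMI weights computed between terms and class labels, scikit-learn's LatentDirichletAllocation for topic modeling, and a brute-force KNN with a smoothed KL divergence. The corpus (20 Newsgroups), the hyperparameters, and the exact PMI formulation are illustrative assumptions, not the authors' implementation or dataset.

# Illustrative sketch: PMI-based term weighting, LDA topic features,
# and KNN classification with Kullback-Leibler distance.
# Corpus, hyperparameters, and the PMI formulation (term vs. class label)
# are assumptions, not the authors' released code.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def pmi_weights(X, y):
    """Positive PMI between each term and each class label (assumed formulation).

    Returns one importance weight per term: its maximum positive PMI over classes.
    """
    presence = (X > 0).astype(float)          # document-level presence of each term
    n_docs = presence.shape[0]
    p_term = presence.sum(axis=0) / n_docs    # P(term)
    weights = np.zeros(presence.shape[1])
    for c in np.unique(y):
        mask = (y == c)
        p_class = mask.mean()                             # P(class)
        p_joint = presence[mask].sum(axis=0) / n_docs     # P(term, class)
        pmi = np.log((p_joint + 1e-12) / (p_term * p_class + 1e-12))
        weights = np.maximum(weights, np.clip(pmi, 0.0, None))
    return weights


def kl_distance(p, q, eps=1e-10):
    """Smoothed KL divergence KL(p || q) between two topic distributions."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


if __name__ == "__main__":
    cats = ["sci.space", "rec.autos", "comp.graphics", "talk.politics.mideast"]
    data = fetch_20newsgroups(subset="all", categories=cats,
                              remove=("headers", "footers", "quotes"))
    X = CountVectorizer(max_features=2000, stop_words="english") \
        .fit_transform(data.data).toarray().astype(float)
    y = np.asarray(data.target)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # Re-weight raw term counts by PMI importance (estimated on the training split).
    w = pmi_weights(X_tr, y_tr)
    X_tr_w, X_te_w = X_tr * w, X_te * w

    # LDA turns each document into a probability distribution over latent topics.
    lda = LatentDirichletAllocation(n_components=20, random_state=0)
    T_tr = lda.fit_transform(X_tr_w)
    T_te = lda.transform(X_te_w)

    # KNN in topic space with KL divergence as the distance measure.
    knn = KNeighborsClassifier(n_neighbors=15, metric=kl_distance)
    knn.fit(T_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, knn.predict(T_te)))

Note that passing a callable metric to scikit-learn's KNeighborsClassifier forces brute-force neighbor search, which is acceptable here because the topic space is low-dimensional. The 82.5% accuracy reported in the abstract refers to the authors' own experimental setup and should not be expected from this sketch.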