Research Article Open Access

COMPARATIVE STUDY OF K-MEANS AND K-MEANS++ CLUSTERING ALGORITHMS ON CRIME DOMAIN

Bashar Aubaidan1, Masnizah Mohd2 and Mohammed Albared2
  • 1 , Iraq
  • 2 Universiti Kebangsaan Malaysia, Malaysia

Abstract

This paper reports an experimental comparison of two document clustering techniques, k-means and k-means++, applied to crime documents. A drawback of k-means is that the user must define the initial centroid points. This is more critical in document clustering, because each center point is represented by a word and calculating the distance between words is not a trivial task. K-means++ was introduced to overcome this problem by finding good initial center points. Since k-means++ has not been applied before to crime document clustering, this study compares k-means and k-means++ to investigate whether the initialization process in k-means++ helps to obtain better results than k-means. We propose using the k-means++ clustering algorithm to identify the best seeds for the initial cluster centers when clustering crime documents. The method includes a pre-processing phase, which involves tokenization, stop-word removal and stemming. In addition, we evaluate the impact of two similarity/distance measures (cosine similarity and the Jaccard coefficient) on the results of the two clustering algorithms. Experimental results on several settings of the crime data set showed that, by identifying the best seeds for the initial cluster centers, k-means++ can perform significantly better than k-means (at the 95% confidence level). These results demonstrate the accuracy of the k-means++ clustering algorithm in clustering crime documents.
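
To make the seeding idea concrete, the following is a minimal Python sketch of k-means++ style seed selection over bag-of-words documents, together with the two similarity measures named in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the function names (`preprocess`, `cosine_similarity`, `jaccard_coefficient`, `kmeanspp_seeds`), the tiny stop-word list and the use of 1 minus similarity as the distance are all hypothetical choices, and stemming is omitted.

```python
import math
import random
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (not the list used in the paper).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "was", "were", "by"}


def preprocess(text):
    """Lowercase, tokenize and drop stop words. Stemming is omitted in this sketch."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Jaccard coefficient on the sets of distinct terms in two documents."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    if not union:
        return 0.0
    return len(terms_a & terms_b) / len(union)


def kmeanspp_seeds(docs, k, similarity=cosine_similarity, rng=None):
    """Pick k seed indices: the first uniformly at random, each later one with
    probability proportional to its squared distance from the nearest seed.
    Distance is taken as 1 - similarity (an assumption of this sketch)."""
    if not 1 <= k <= len(docs):
        raise ValueError("k must be between 1 and the number of documents")
    if rng is None:
        rng = random.Random(0)
    seeds = [rng.randrange(len(docs))]
    while len(seeds) < k:
        weights = []
        for i, doc in enumerate(docs):
            if i in seeds:
                weights.append(0.0)  # never pick the same document twice
                continue
            dist = min(1.0 - similarity(doc, docs[s]) for s in seeds)
            weights.append(max(dist, 0.0) ** 2)
        if sum(weights) == 0:
            # Every remaining document coincides with a seed: fall back to uniform.
            remaining = [i for i in range(len(docs)) if i not in seeds]
            seeds.append(rng.choice(remaining))
        else:
            seeds.append(rng.choices(range(len(docs)), weights=weights, k=1)[0])
    return seeds


# Usage on four made-up example sentences (not data from the paper).
docs = [preprocess(t) for t in [
    "The robbery of the bank",
    "Bank robbery by a gang",
    "Tax fraud and audit",
    "Fraud in the tax audit",
]]
print(kmeanspp_seeds(docs, 2))
print(kmeanspp_seeds(docs, 2, similarity=jaccard_coefficient))
```

The returned indices would then serve as the initial cluster centers for an ordinary k-means loop. Plain k-means differs only in this first step, where the initial centers are supplied by the user or drawn uniformly at random.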

Journal of Computer Science
Volume 10 No. 7, 2014, 1197-1206

DOI: https://doi.org/10.3844/jcssp.2014.1197.1206

Submitted On: 21 November 2013 Published On: 19 February 2014

How to Cite: Aubaidan, B., Mohd, M. & Albared, M. (2014). COMPARATIVE STUDY OF K-MEANS AND K-MEANS++ CLUSTERING ALGORITHMS ON CRIME DOMAIN. Journal of Computer Science, 10(7), 1197-1206. https://doi.org/10.3844/jcssp.2014.1197.1206

Keywords

  • Crime Document Clustering
  • K-Means++
  • K-Means Algorithm
  • Similarity/Distance Measures