Research Article Open Access

Improving the Performance of Multivariate Bernoulli Model based Documents Clustering Algorithms using Transformation Techniques

Perumal Pitchandi and Nedunchezhian Raju

Abstract

Problem statement: Document clustering is one of the most important areas of data mining and is currently the subject of significant global research, since it strengthens web intelligence, web mining, web search engine design and related applications. Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Approach: This study explores the suitability of a multivariate Bernoulli model based probabilistic algorithm for text clustering. In a multivariate Bernoulli model, a document is represented as a binary vector over the space of words, where 1 indicates that a word occurs in the document and 0 that it does not. The number of occurrences is not considered, so word frequency information is lost. In this work, we propose an FFT based transformation technique for improving the clustering performance of the multivariate Bernoulli model based probabilistic algorithm. The technique transforms the actual term frequency count data into a time domain signal, so that the weight of each word's frequency is distributed throughout each row of records. If the multivariate Bernoulli model is then applied to the transformed values, separating those less than zero from those greater than zero, performance is expected to increase, since this representation avoids the loss of frequency information. Results: Bernoulli model-based clustering and the improved version are implemented and evaluated using suitable metrics, and the results are shown. Conclusion: The transformation technique improves the performance of multivariate Bernoulli model based document clustering significantly.
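The representation issue described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which transform direction or which component of the FFT output is used, so the code below assumes a forward discrete Fourier transform, keeps only the real part, and maps values greater than zero to 1 and all others to 0. The function names and the two sample documents are hypothetical.

```python
import cmath

def binarize_counts(row):
    """Standard multivariate Bernoulli representation: 1 if the word occurs, else 0."""
    return [1 if count > 0 else 0 for count in row]

def dft_real(row):
    """Real part of the discrete Fourier transform of one row (naive O(n^2) version)."""
    n = len(row)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(row)).real
        for k in range(n)
    ]

def binarize_transformed(row, eps=1e-9):
    """Assumed variant: 1 where the transformed value is greater than zero, else 0."""
    return [1 if value > eps else 0 for value in dft_real(row)]

# Two hypothetical documents that use the same words with different frequencies.
doc_a = [3, 0, 1, 0]
doc_b = [1, 0, 5, 0]

print(binarize_counts(doc_a), binarize_counts(doc_b))
print(binarize_transformed(doc_a), binarize_transformed(doc_b))
```

With plain binarization the two documents become identical vectors, because only word presence survives. After the assumed transform their binary vectors differ, which is the sense in which the transformed representation can retain some frequency information.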

Journal of Computer Science
Volume 7 No. 5, 2011, 762-769

DOI: https://doi.org/10.3844/jcssp.2011.762.769

Submitted On: 9 March 2011 Published On: 7 May 2011

How to Cite: Pitchandi, P. & Raju, N. (2011). Improving the Performance of Multivariate Bernoulli Model based Documents Clustering Algorithms using Transformation Techniques. Journal of Computer Science, 7(5), 762-769. https://doi.org/10.3844/jcssp.2011.762.769

Keywords

  • Text clustering
  • text classification
  • document clustering
  • model based clustering
  • term document matrix
  • Text to Matrix Generator (TMG)
  • Bernoulli model
  • Fast Fourier Transformation (FFT)
  • transformation technique
  • clustering algorithms