Combined Heuristic Technique for Optimization of Bloom Filter in Spam Filtering
Abstract
Problem statement: Spam is an irrelevant or inappropriate message sent on the internet to a large number of newsgroups or users. A spam word is a list of well-known words that often appear in spam mails. Bloom Filter (BF) is used for identification of spam word. Approach: BF is a simple but powerful data structure that can check membership to a static set. The trade-off to use BF is a certain configurable risk of false positives. The odds of a false positive can be made very low if the hash bitmap is sufficiently large. Bin Bloom Filter (BBF) has number of BFs which assign group of words into bins with different false positive rates based on weight of the spam words. Genetic Algorithm (GA) was employed to minimize the total membership invalidation cost of BBF. GA had premature convergence problem. Simulated Annealing (SA) was incorporated with GA to prevent the premature convergence effectively. Results: The experimental results of total membership invalidation cost are analyzed for various sizes of bins. The results showed that the combined GA-SA model outperforms SA and GA model. Conclusion: GA has premature convergence due to its genetic operators that are not able to generate offsprings which are superior to the parents. So more number of similar chromosomes presented on the population. When GA is incorporated with SA new genes were introduced which causes diversity in the population and prevents premature convergence. The combined GA-SA outperforms GA and SA.
DOI: https://doi.org/10.3844/jcssp.2011.1439.1447
Copyright: © 2011 Arulanand Natarajan, S. Subramanian and K. Premalatha. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- 3,361 Views
- 2,847 Downloads
- 1 Citations
Download
Keywords
- Spam word
- hash function
- genetic algorithm
- simulated annealing
- static set
- premature convergence
- long vector
- bit array