D. Kelleher. Spam filtering using contextual network graphs. Master’s thesis, Computer Science Department, Trinity College, Dublin, Ireland, 2004. [pdf]
This paper present a comparaison between three differnet classification methods based on Naive-bayesian, Latent Semantic Indexing and Contextual Network Graph for the purpose of improving common spam filtering mechanisms.
The author show how LSI and CNG outperform the Naive-Bayes Classifier for the ability that those two methods have of dealing with text polysemy and synonymy. Polysemy reduces in fact the strenght of keywords by increasing the number of classes in which a polysemous word can appear. Synonymy reduces the strenght of keywords by spreading their value among a number of synonyms. The two methods above have a powerful approach to classification of natural language text because they use a context-based search that take into account the semantic links between words and search over a semantic space rather than simply a list of keywords.
The foundamental problem of the naive-bayes probabilistic approach is that it makes the assumption that each word in a document is indipendent from the others. This independence assumption makes this classifier unsuitable for natural language processing since it loses the contextual relations between words and fails to address the issues above.
The author uses its own implementation of the above methods using a public available dataset and a Stop-Word removal. The implementation was done in JAMA a Java Matrix Package.
The success of a certain method was measured using some machine learning markers: a k-fold cross validation testing mechanism.The author report how the most important results from the test was not the accuracy, but the fact that the high complexity of SVD makes its operation significantly slower compared to the other two.
Additionally, results showed than the ratio between the initial energy activation and the energy threshold of CNG had to be adjusted to maintain the accuracy while increasing the corpus size. This may be related to the fact that growing the dataset the threshold value should be adapted as suggested also by Kimoto and Iwadera (p.236): ‘this suggest that the threshold value should increase with trasversal distance.’
In conclusion the results shows that CGN should be prefered over LSI because of the computational speed and because it is not patented.