PM2.5 Prediction Using Homogenous and Heterogenous Ensemble Learning: A Comprehensive Evaluation

Shrabani Medhi; Minakshi Gogoi

doi:10.3844/jcssp.2024.931.954

Abstract

Air pollution is a global issue, and PM2.5 is considered the most dangerous pollutant. Predicting PM2.5 concentration is important so that effective measures can be taken beforehand. Many machine learning methods have been employed to forecast PM2.5 levels, using diverse combinations of ensemble classifiers and regressors. However, three important issues must be addressed when constructing ensemble classifiers and regressors. The first is the selection of the base regressor or classifier technique. The second is the choice of the combination technique used to assemble multiple regressors or classifiers. The third is determining the optimal number of regressors or classifiers to be ensembled. Only a limited number of related studies address these issues. We conducted a comprehensive comparative analysis of ensemble methods, including bagging and boosting for homogeneous ensembles and blending and super-learning (stacking) for heterogeneous ensembles, to predict PM2.5 concentration levels. The performance of ensemble regressors and classifiers based on these techniques has not been fully scrutinized in the literature, and the issues we address have not previously been examined in the context of PM2.5 concentration prediction. We used artificial neural networks (ANN), support vector machines (SVM) and decision trees to construct 24 different ensemble regressors and classifiers. In constructing the decision tree, we employed the information gain approach to determine the most suitable attribute for each node of the generated tree. For the SVM, we used the Radial Basis Function (RBF) kernel. For the ANN model, we used the Adam (adaptive moment estimation) optimizer, and the softmax activation function is used in each layer.
We compared the models using execution time, accuracy and error metrics on three air pollution datasets, for Guwahati City, Delhi City and Kolkata City, obtained from the Central Pollution Control Board, India. The results reveal that, on average, the heterogeneous ensemble techniques, namely stacking (90-100%) and blending (80-100%), offer better prediction accuracy than the homogeneous ensemble techniques, namely bagging (50-98%) and boosting (50-97%), over all the datasets. The root mean square error reveals that heterogeneous ensemble classifiers and regressors fit better than homogeneous classifiers and regressors. In conclusion, our findings indicate that an innovative approach to PM2.5 concentration prediction could incorporate both homogeneous and heterogeneous ensemble techniques into its algorithms. Our ethical data collection approach relies on the open dissemination of information by the Central Pollution Control Board, fostering a spirit of shared responsibility in advancing air quality research and public health initiatives.
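
To make the distinction between the two ensemble families concrete, the following is a minimal illustrative sketch and not the implementation evaluated in this study. It assumes two hypothetical toy base learners (a least-squares line and a constant mean predictor) in place of the ANN, SVM and decision tree models. Bagging trains the same learner on bootstrap resamples and averages the predictions, while the blending function fits a one-parameter meta-learner on a hold-out set to combine two different learners. All function and parameter names are assumptions introduced for illustration.

```python
import math
import random
from statistics import mean


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def fit_line(xs, ys):
    """Hypothetical base learner 1: one-dimensional least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def fit_mean(xs, ys):
    """Hypothetical base learner 2: always predicts the training mean."""
    m = mean(ys)
    return lambda x: m


def bagging(fit, xs, ys, n_estimators, rng):
    """Homogeneous ensemble: one learner type, bootstrap resamples, averaged."""
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)


def blending(fits, xs, ys, n_holdout):
    """Heterogeneous ensemble of two learners combined on a hold-out set.

    The meta-learner is a single weight alpha, chosen by least squares on the
    hold-out predictions, giving alpha * p1 + (1 - alpha) * p2.
    """
    split = len(xs) - n_holdout
    m1, m2 = (fit(xs[:split], ys[:split]) for fit in fits)
    diffs = [m1(x) - m2(x) for x in xs[split:]]
    resid = [y - m2(x) for x, y in zip(xs[split:], ys[split:])]
    den = sum(d * d for d in diffs)
    alpha = sum(r * d for r, d in zip(resid, diffs)) / den if den else 0.5
    return lambda x: alpha * m1(x) + (1 - alpha) * m2(x)


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data standing in for a pollutant series (not real CPCB data).
    xs = [rng.uniform(0, 10) for _ in range(60)]
    ys = [2.0 * x + 5.0 + rng.gauss(0, 1) for x in xs]
    bagged = bagging(fit_line, xs[:40], ys[:40], 10, rng)
    blended = blending([fit_line, fit_mean], xs[:40], ys[:40], 12)
    print("bagging RMSE :", rmse(ys[40:], [bagged(x) for x in xs[40:]]))
    print("blending RMSE:", rmse(ys[40:], [blended(x) for x in xs[40:]]))
```

Stacking differs from this blending sketch mainly in that the meta-learner is trained on out-of-fold predictions from cross-validation rather than on a single hold-out split.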
