Predicting Kereh River's Water Quality: A comparative study of machine learning models


  • Norashikin Nasaruddin, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Kedah Branch, 08400 Merbok, Kedah, Malaysia
  • Afida Ahmad, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Kedah Branch, 08400 Merbok, Kedah, Malaysia
  • Shahida Farhan Zakaria, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Kedah Branch, 08400 Merbok, Kedah, Malaysia
  • Ahmad Zia Ul-Saufie, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Shah Alam 40450, Selangor, Malaysia
  • Mohamed Syazwan Osman, Faculty of Chemical Engineering, Universiti Teknologi MARA, Penang Branch, 14300 Penang, Malaysia



Water Quality, Machine Learning, Decision Tree, Random Forest


This study presents a machine learning approach for predicting the water quality of the Kereh River and classifying it as either 'polluted' or 'slightly polluted'. Three algorithms were compared: decision tree, random forest (RF), and boosted regression tree, using data from 2010 to 2019. In this comparison the RF model performed best, with an accuracy of 97.30%, sensitivity of 100.00%, specificity of 94.74%, and precision of 95.00%. The RF model also identified dissolved oxygen (DO) as the most important variable for predicting water quality.
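
The four evaluation measures named in the abstract (accuracy, sensitivity, specificity, precision) are all derived from the counts of a binary confusion matrix. The Python sketch below shows the standard formulas. It is an illustration only, not the authors' code: the class `ConfusionCounts`, the function `classification_metrics`, and the counts are hypothetical, and the choice of 'polluted' as the positive class is an assumption, since the abstract does not state which class was treated as positive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a binary confusion matrix (hypothetical container).

    Assumption: 'polluted' is the positive class and
    'slightly polluted' is the negative class.
    """
    tp: int  # polluted samples predicted as polluted
    fn: int  # polluted samples predicted as slightly polluted
    fp: int  # slightly polluted samples predicted as polluted
    tn: int  # slightly polluted samples predicted as slightly polluted


def classification_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return the four measures named in the abstract as fractions in [0, 1]."""
    total = c.tp + c.fn + c.fp + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total,      # share of all samples classified correctly
        "sensitivity": c.tp / (c.tp + c.fn),    # share of actual positives that were found
        "specificity": c.tn / (c.tn + c.fp),    # share of actual negatives that were found
        "precision": c.tp / (c.tp + c.fp),      # share of predicted positives that are correct
    }


# Illustrative counts only; these are NOT the study's data.
hypothetical = ConfusionCounts(tp=40, fn=10, fp=5, tn=45)
metrics = classification_metrics(hypothetical)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

A sensitivity of 100.00%, as reported for the RF model, corresponds to `fn == 0` in this notation: no sample of the positive class was misclassified.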






How to Cite

Nasaruddin, N., Ahmad, A., Zakaria, S. F., Ul-Saufie, A. Z., & Osman, M. S. (2023). Predicting Kereh River’s Water Quality: A comparative study of machine learning models. Environment-Behaviour Proceedings Journal, 8(SI15), 213–219.