On the Cluster Validity Test(s) in Unsupervised Machine Learning TDA Approach for Atmospheric River Patterns on Flood Detection in Nigeria

F. O. Ohanuba^a,b*, M. T. Ismail^a, M. K. Majahar Ali^a, E. Alih^c and P. N. Ezra^b

^aSchool of Mathematical Sciences, Universiti Sains Malaysia, 11800 Penang, Malaysia;
^bFaculty of Physical Sciences, Department of Statistics, University of Nigeria, Nsukka, Nigeria;
^cDepartment of Mathematics and Statistics, Federal Polytechnic, Idah, Kogi State

Correspondence details: Felix Obi, Mohd Tahir Ismail, and Majid Khan Majahar Ali

Topological Data Analysis (TDA) has recently emerged as a reliable and active research area in statistics for extracting shape (information) from data. In this study, we propose an automated method that combines TDA and machine learning (ML) to identify floods associated with atmospheric rivers (ARs) in big data. The process gives vital details on time series trends, which help mitigate the negative effects of ARs, such as flooding. Spatial data (1970 - 2018) from the Nigeria Hydrological Services Agency (NIHSA) on four weather parameters were used. The daily datasets were converted to monthly datasets before the proposed method was applied, and the method was implemented in Python. The findings are expected to help reduce disasters caused by extreme events such as floods and to support the achievement of flood-related Sustainable Development Goals (SDGs). A second objective is to identify potential flooding and no flooding in each zone. The work fills a gap by using a real dataset and four variables that other studies have not used. After training the model, we obtained the best grouping at k = 2, where the Silhouette coefficient is highest in each of the seven states. The total average range (0.3 - 0.8) indicates a reasonable structure, which gives an efficiency outcome of approximately 80%. The summary of the clustered feature patterns shows the potential flood zones and no-flood zones. We conducted cluster validity tests on our results in R, and the tests validated the best grouping at the same number of clusters, k = 2. The Gap statistic shows efficiency ranging from 65% to 80% in the seven states. Figure 11 shows that only the Silhouette plot obtained optimal values at exactly k = 2. The extent of the spread from the centroid was computed in Excel.

Keywords: clustering; extreme climate; flood menace; machine learning; topology; big data; sustainable development goal (SDG)
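
As a rough illustration of how a Silhouette coefficient can be used to pick the number of clusters, the sketch below scores k = 2, 3, 4 on a toy dataset and keeps the k with the highest mean Silhouette. It is not the authors' code: the points are made up, k-means with a deterministic farthest-point initialisation is an assumed clustering step, and the helper names `kmeans` and `mean_silhouette` are hypothetical.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; farthest-point init keeps the sketch deterministic."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def mean_silhouette(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (made up): two compact, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On this toy data the highest mean Silhouette occurs at k = 2, the same selection rule the abstract describes for each state.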

Felix O. Ohanuba is a PhD student, School of Mathematical Sciences, Universiti Sains Malaysia (email: *felix.ohanuba@student.usm.my); Mohd T. Ismail is a Professor of Statistics, School of Mathematical Sciences, Universiti Sains Malaysia (email: m.tahir@usm.my); Majid K. Majahar Ali is a Senior Lecturer, School of Mathematical Sciences, Universiti Sains Malaysia (email: majidkhanmajaharali@usm.my); Ekele Alih is a Senior Lecturer, Department of Mathematics and Statistics, Federal Polytechnic, Idah, Kogi State (email: ekelson200@yahoo.com); Precious N. Ezra is a Lecturer, Faculty of Physical Sciences, Department of Statistics, University of Nigeria, Nsukka, Nigeria (email: precious.ezra@unn.edu.ng).

Organization detail