On the Cluster Validity Test(s) in Unsupervised Machine Learning TDA Approach for Atmospheric River Patterns on Flood Detection in Nigeria
F. O. Ohanuba (a, b, *), M. T. Ismail (a), M. K. Majahar Ali (a), E. Alih (c) and P. N. Ezra (b)
(a) School of Mathematical Sciences, Universiti Sains Malaysia, 11800 Penang, Malaysia;
(b) Faculty of Physical Sciences, Department of Statistics, University of Nigeria, Nsukka, Nigeria;
(c) Department of Mathematics and Statistics, Federal Polytechnic, Idah, Kogi State
Correspondence details: Felix Obi, Mohd Tahir Ismail, and Majid Khan Majahar Ali
Topological Data Analysis (TDA) has recently emerged as a reliable and active research area in
Statistics for extracting shape (information) from data. In this study, we propose an automated
method that combines TDA and machine learning (ML) to identify atmospheric river (AR) patterns
associated with floods in big data. The method provides vital details on time series trends, which
help mitigate the negative effects of ARs, such as flooding. Spatial data (1970-2018) from the
Nigeria Hydrological Services Agency (NIHSA) on four weather parameters were used. The daily
datasets were converted to monthly datasets before the proposed method was applied, and the
method was implemented in Python. The findings could substantially reduce disasters caused by
extreme events such as floods and support the flood-related Sustainable Development Goals (SDGs).
A second objective is to identify potential flooding and no flooding in each zone. The work fills
a gap by using a real dataset and four variables that other studies have not used. After training
the model, we obtained the best grouping at k = 2, which gave the highest Silhouette coefficient
in each of the seven states. The overall average range (0.3-0.8) indicates a reasonable structure,
which gives an efficiency of approximately 80%. A summary of the clustered feature patterns shows
the potential flood and no-flood zones. We tested the cluster validity of our results in R, and
the tests confirmed the best grouping at the same value, k = 2. The Gap statistic shows efficiency
ranging from 65% to 80% in the seven states. Figure 11 shows that only the Silhouette plot attained
its optimal value at exactly k = 2. The extent of the spread from the centroid was computed in
Excel.
Keywords: clustering; extreme climate; flood menace; machine learning; topology; big data; sustainable development goal (SDG)
Felix O. Ohanuba is a PhD student, School of Mathematical Sciences, Universiti Sains Malaysia (email:
*felix.ohanuba@student.usm.my); Mohd T. Ismail is a Professor of Statistics, School of Mathematical
Sciences, Universiti Sains Malaysia (email: m.tahir@usm.my); Majid K. Majahar Ali is a Senior Lecturer,
School of Mathematical Sciences, Universiti Sains Malaysia (email: majidkhanmajaharali@usm.my);
Ekele Alih is a Senior Lecturer, Department of Mathematics and Statistics, Federal Polytechnic, Idah,
Kogi State (email: ekelson200@yahoo.com); Precious N. Ezra is a Lecturer, Faculty of Physical Sciences, Department of Statistics, University of Nigeria, Nsukka, Nigeria (email: precious.ezra@unn.edu.ng).
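
The abstract selects the number of clusters by the highest Silhouette coefficient. As a minimal illustration of that criterion only, the sketch below computes the mean Silhouette coefficient for two candidate groupings of a toy dataset and keeps the better one. The data points, the candidate labelings, and the helper name `silhouette_score` are hypothetical and written for this example; they are not the NIHSA data or the authors' code, and a real analysis would obtain the labelings from a clustering algorithm rather than by hand.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean Silhouette coefficient over all points.

    For each point, a is the mean distance to the other members of its
    own cluster and b is the smallest mean distance to the members of
    any other cluster; the point's score is (b - a) / max(a, b).
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the Silhouette coefficient needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # dist(p, p) == 0, so summing over the whole cluster and dividing
        # by (size - 1) gives the mean distance to the *other* members.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            mean(dist(p, q) for q in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical toy data: two well-separated groups in a 2-D feature space.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]

# Hypothetical candidate groupings for k = 2 and k = 3.
candidate_labels = {
    2: [0, 0, 0, 0, 1, 1, 1, 1],
    3: [0, 0, 0, 0, 1, 1, 2, 2],
}

scores = {k: silhouette_score(points, labels) for k, labels in candidate_labels.items()}
best_k = max(scores, key=scores.get)

for k, s in sorted(scores.items()):
    print(f"k = {k}: mean Silhouette = {s:.3f}")
print(f"best k by Silhouette: {best_k}")
```

On this toy data the k = 3 labeling splits one compact group in two, which lowers the scores of the split points and so favours k = 2. Values close to 1 indicate well-separated clusters, values near 0 indicate overlapping clusters.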