Predicting COVID-19 Incidence Using Data Mining Techniques: A Case Study of Pakistan

4 Department of Computer Science, Lahore Garrison University, Lahore, Pakistan

Abstract: The outbreak of coronavirus disease (COVID-19) emerged in early December 2019, with the earliest cases reported in Wuhan City, Hubei Province, China. By May 18, 2020, 198 countries had been affected by this life-threatening disease. The most common known symptoms of COVID-19 are tiredness, fever, and dry cough. In this paper, we discuss a predictive data mining approach for COVID-19 prediction. In predictive data mining, a model is developed and trained using supervised learning and then predicts the behavior of the data provided to it. Predictive data mining is a well-established technique used by many health organizations for the classification and prediction of diseases such as heart disease and various types of cancer. Models can be compared on several factors, including accuracy, scalability, and interpretability; the predictive model in this study is evaluated on the basis of its accuracy. We use WEKA because it provides a large collection of machine learning algorithms. The main objective of this paper is to forecast the possible future incidence of coronavirus cases in Pakistan. The study concludes that the number of cases will increase swiftly. If the government takes proactive steps and strictly implements precautionary measures, Pakistan may be able to overcome this pandemic.


Introduction
COVID-19 originated in Wuhan city, Hubei province, China. The outbreak coincided with Chunyun, the period of mass migration for the annual Spring Festival. To limit the spread of COVID-19, Chinese authorities adopted extraordinary measures on 23 January 2020. These measures comprised nationwide quarantine, strict travel restrictions, and extensive surveillance of suspected COVID-19 cases.
COVID-19 was confirmed to have reached Pakistan on February 26, 2020, when a student returning from Iran tested positive. By March 18, cases had been reported across the country. As of June 04, 2020, there had been about 85264 confirmed cases, with 28923 recoveries and 1688 deaths in the country.
The aim of this study is to forecast the outbreak trend in Pakistan based on the outbreak patterns in China and Spain, as these countries are densely populated. We also wanted to illustrate how precautionary measures restricted the outbreak.

Current Condition in Pakistan
According to the Ministry of Health, Government of Pakistan, there were 85264 confirmed cases and 1770 deaths in total as of Thursday, June 06, 2020. Sindh and Punjab are the most affected provinces, with 32910 and 31104 confirmed cases respectively, followed by Khyber Pakhtunkhwa (11373), Baluchistan (5224), the Federal city (3054), Gilgit Baltistan (824), and Kashmir (285). The results are shown in Table 1. The overall COVID-19 case history of Pakistan is shown in Figure 1.

Method
The predictive model is based on a cumulative time-series dataset of confirmed coronavirus cases, recoveries, and deaths.

Definitions
S. defines coronavirus disease (COVID-19) as a transmissible disease caused by a recently discovered coronavirus. A positive case of COVID-19 infection is defined as a case with a positive result for viral nucleic acid testing in respiratory specimens. A suspected case is defined as a case with symptoms of COVID-19 infection that has not been confirmed by viral nucleic acid testing.

Dataset
Datasets are collected from Humandata.org, which tracks global time-series data. The data extracted for the model is current as of June 04, 2020.

Model Development
To forecast COVID-19 incidence in Pakistan, a linear regression-based approach is proposed, with model performance assessed mainly through RMSE and MAE.
BRAIN. Broad Research in Artificial Intelligence and Neuroscience, December 2020, Volume 11, Issue 4

Literature Review
Twenty publications were reviewed for this study. The objective of this literature review is to discover how different models perform in the given scenarios.
One study predicted the epidemic trend of the coronavirus in China using a modified SEIR model with an AI approach trained on the 2003 SARS dataset. It concluded that the outbreak would start to decline by the end of April.
Another study (S. ) estimated the reproductive number of COVID-19 in order to predict daily cases on the Diamond Princess cruise ship. It used the serial interval distribution and the existing daily incidence to estimate the reproductive number, assuming an approximately Poisson distribution. The study found that the number of new cases would gradually increase and that cumulative COVID-19 cases might reach 1514 within the next ten days.
Binti Hamzah et al. (2020) developed an online tracker for daily statistics and analysis of coronavirus cases, aiming to forecast active confirmed and recovered COVID-19 cases within and outside China. The study uses the Susceptible-Exposed-Infectious-Recovered (SEIR) model for prediction and concludes that the outbreak will peak in late May 2020, with cases exceeding 76000, and start to decline in early July 2020.
Qasim et al. (2020) use a mathematical model, sequence mean weight (TSMW), to predict COVID-19 cases across Pakistan. The model estimates that the patient count may reach 77,905, with at least 8285 confirmed cases and 1382 deaths, in the 45 days up to 29th April 2020.
The autoregressive moving-average (ARMA) model, a hypothesis-testing-based model first proposed in 1951, is mainly used for stationary time-series data. The same heuristic can be used in simulation modeling (L. ), in which an early digital prototype is developed to analyze the performance of the model before deploying it. This study presented a baseline of the COVID-19 transmission process using a new model based on Gaussian distribution theory, and identified key factors in the spread of the virus, such as the incubation period, the reproductive number, and daily infections.
The study (Y. ) developed a dynamic time-series model, based on several mathematical formulas, to forecast the short-term trend of COVID-19 spread inside China. It concludes that total coronavirus cases in China may reach 36,343 after one week (February 8, 2020) and that confirmed cases in Wuhan will peak in March 2020, after which the infection rate will start to decline throughout China. The study ignores some factors that can affect the result, such as birth rate and natural deaths. Combining artificial intelligence with another predictive model can provide more realistic figures.
Table 2 below provides a summary of the literature review.

Statement of the Problem
On February 26, 2020, Pakistan registered its first Covid-19 case, and on March 25, 2020, Pakistan confirmed its first death in Lahore due to Covid-19.
Since February 26, the COVID-19 outbreak has continued to spread in Pakistan. As the government of Pakistan has lifted most of the lockdown, it is essential to predict whether Pakistan can afford this easing: whether it will be able to overcome the pandemic, or whether it will become another America or Wuhan.

Objectives of the Study
The main objective of this study is to predict coronavirus incidence and its trends across the different regions of the country using an efficient and robust model.
• Analysing the current trend of Covid-19 in Pakistan.
• Developing a reliable predictive model.
• Predicting the coronavirus cases using Linear Regression.

Methodology
Based on the gathered outbreak data, this model attempts to discover the transmission pattern of the coronavirus and forecast the course of the outbreak. The coronavirus dataset is collected from humadata.org and verified against figures provided by the Ministry of Health of Pakistan. To carry out the prediction, we use Weka, a data mining tool developed at the University of Waikato, New Zealand. Weka applies different algorithms to datasets and reports the results. The model has four major phases. In the first phase, data pre-processing and data transformation are carried out. The second phase comprises model training, in which Linear Regression is used as the forecasting algorithm. During training, the cumulative confirmed, recovered, and death counts are fed to the model as the dependent variable, and the time-series variable as the independent variable.
Linear regression fits a straight line to a scatter plot, so the possibility of outliers is minimal; however, it has been observed that total daily cases rise or fall relative to the fitted line, causing outliers. In this regard, it is better to use the mode instead of the median, as it provides a more accurate real-time average. The third phase validates the accuracy of the model, for which RMSE and MAE are considered. The fourth and final phase provides results and forecasts for the next 58 days. Figure 2 presents a detailed overview of the proposed methodology with all four phases of the model.
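The regression step described above can be illustrated with a minimal sketch in Python. This is not the Weka workflow used in the study; the function names (fit_line, forecast) and the sample counts are hypothetical and serve only to show how a straight line fitted to cumulative counts against a day index yields a 58-day forecast.

```python
from typing import List, Sequence, Tuple


def fit_line(days: Sequence[float], counts: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of counts = slope * day + intercept."""
    n = len(days)
    if n < 2 or n != len(counts):
        raise ValueError("need at least two paired observations")
    mean_x = sum(days) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("day values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(slope: float, intercept: float, last_day: float, horizon: int) -> List[float]:
    """Extrapolate the fitted line for `horizon` days after `last_day`."""
    return [slope * (last_day + h) + intercept for h in range(1, horizon + 1)]


# Hypothetical cumulative counts for illustration only (not the study's data).
days = [0, 1, 2, 3, 4]
cumulative = [100.0, 150.0, 200.0, 250.0, 300.0]

slope, intercept = fit_line(days, cumulative)
next_58_days = forecast(slope, intercept, last_day=days[-1], horizon=58)
```

Because the fitted line is extrapolated unchanged, a sketch like this cannot capture acceleration or saturation in the case curve, which is one reason the error measures in the validation phase matter.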

Data-Transformation:
Data pre-processing provides the national dataset required by the model. Because Weka is used as the data mining tool and its default data format is .arff, a data transformation step is applied after the meaningful data has been extracted, converting the data from CSV to ARFF.
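As an illustration of the CSV-to-ARFF transformation, the following Python sketch writes the ARFF header (@relation, @attribute, @data) followed by the data rows. It assumes every column is numeric and that column names contain no spaces; the function name csv_to_arff, the relation name, and the column names are hypothetical, and the study may instead have used Weka's own conversion facilities.

```python
import csv
import io


def csv_to_arff(csv_text: str, relation: str) -> str:
    """Convert CSV text (header row + numeric columns) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    # Assumption: every column is numeric, so each gets a 'numeric' attribute.
    for name in header:
        lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical sample rows for illustration only (not the study's data).
sample_csv = """day,confirmed
0,2
1,4
2,4
"""
arff_text = csv_to_arff(sample_csv, relation="covid_pk_confirmed")
```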
Training and evaluation of data:
For training and testing of this model, 80% of the data is used for training and 20% for evaluation.

Model Training:
The main objective of training is for the model to learn from the data so that it can generate predictions. This model uses Linear regression for forecasting. 80% of the data is fed to the model, and the remaining 20% is used to evaluate its performance and efficiency.
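A minimal Python sketch of an 80/20 split is given here. The text does not state whether the split is chronological or random; the sketch assumes a chronological split, in which the earliest 80% of observations are used for training and the most recent 20% are held out, since this is a common choice for time-series data. The function name and the sample series are hypothetical.

```python
from typing import List, Sequence, Tuple


def chronological_split(
    series: Sequence[float], train_fraction: float = 0.8
) -> Tuple[List[float], List[float]]:
    """Split a time-ordered series into an early training part and a late hold-out part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(series) * train_fraction)
    return list(series[:cut]), list(series[cut:])


# Hypothetical series of 100 daily cumulative counts (not the study's data).
cumulative = [float(10 * day) for day in range(100)]
train, holdout = chronological_split(cumulative)
```

Keeping the hold-out portion at the end of the series means the model is always evaluated on dates later than those it was trained on, which mirrors how a forecast is used in practice.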

Validation of Model:
To evaluate the accuracy of the model, RMSE is used as the primary measure.

Results
This is a dynamic predictive model that can follow changes in the data and can predict cases on a daily or weekly basis. The model works with three datasets and predicts the cumulative confirmed cases of patients infected with COVID-19, the death toll, and the recovered cases. The number of confirmed cases has increased rapidly in the last three weeks, and Figure 3, generated by this model, indicates that confirmed cases in Pakistan are expected to keep increasing rapidly. The model expects the number of confirmed coronavirus cases to grow at a rapid rate until July 31, 2020, in the absence of an effective policy encouraging the public to practice social distancing and other safety measures. The model also predicts recovered cases and deaths for this period. Recoveries will also increase swiftly, to nearly 90,000, as shown in Figure 4. According to Figure 5, the death toll for this period is expected to remain at 12,000.

Evaluation of results
The model is evaluated using two well-known measures of accuracy: mean absolute error (MAE) and root mean square error (RMSE). The 58-day evaluation of predicted cases is presented in Figures 6, 7, and 8.
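For reference, the two measures can be computed as in the following Python sketch. MAE is the mean of the absolute differences between actual and predicted values, and RMSE is the square root of the mean of the squared differences. The sample values are hypothetical and are not results from this study.

```python
import math
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """MAE: average of |actual - predicted| over all observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE: square root of the average of (actual - predicted)^2."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical values for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 400.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

RMSE is never smaller than MAE on the same data and penalizes large individual errors more heavily, so a wide gap between the two suggests that a few days are predicted much worse than the rest.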

Discussion and Limitations
In this paper, we have reviewed many data mining publications and their techniques. The Susceptible-Exposed-Infectious-Recovered (SEIR) model is a mathematical model that describes how an individual with an infectious disease becomes a source of infection for others. As the name suggests, the model has four stages; its parameters are β (beta), which controls the rate of spread, α (alpha), the incubation rate, and γ (gamma), the recovery rate. The technique is based on the SIR model and forecasts the course of diseases such as COVID-19, which spread through close contact with infected people.
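The four stages and three parameters can be made concrete with a small simulation. The Python sketch below uses the standard textbook form of the SEIR equations with a simple Euler step; the function name, the step size, and all parameter values are assumptions chosen for illustration and are not fitted to Pakistani data or taken from the reviewed studies.

```python
from typing import List, Tuple


def simulate_seir(
    population: float,
    initial_exposed: float,
    initial_infectious: float,
    beta: float,
    alpha: float,
    gamma: float,
    days: int,
    steps_per_day: int = 10,
) -> List[Tuple[float, float, float, float]]:
    """Euler integration of the standard SEIR equations:

        dS/dt = -beta * S * I / N
        dE/dt =  beta * S * I / N - alpha * E
        dI/dt =  alpha * E - gamma * I
        dR/dt =  gamma * I
    """
    s = population - initial_exposed - initial_infectious
    e, i, r = initial_exposed, initial_infectious, 0.0
    dt = 1.0 / steps_per_day
    history = [(s, e, i, r)]  # one (S, E, I, R) tuple per day
    for _day in range(days):
        for _step in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = alpha * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Illustrative parameter values only (assumptions, not estimates).
history = simulate_seir(
    population=1_000_000.0,
    initial_exposed=0.0,
    initial_infectious=10.0,
    beta=0.5,
    alpha=0.2,
    gamma=0.1,
    days=60,
)
```

Each flow leaves one compartment and enters the next, so the total S + E + I + R stays equal to the population throughout the simulation.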
In early 2020, one of the biggest pandemics of the 21st century came to light, with many deaths and infections. The virus spread very quickly in both developed and underdeveloped countries, and it was essential to predict and analyze this rapid spread; since then, many researchers have proposed techniques and models. The forecast in this study has a comparatively low prediction error because it uses machine learning algorithms to predict coronavirus cases. Machine learning is one of the significant developments of the last ten years. It is an application of artificial intelligence in which a machine is trained on the available data and can then use artificial neural networks to detect patterns and provide results. Similarly, one of the most critical applications of AI is deep learning, a more advanced form of machine learning. The concepts of deep learning were proposed in the early 2000s, but the breakthrough in deep learning came after the AI winter in 2010.
The heuristic model uses early profit-estimation techniques to find the most cost-effective solutions; completeness is not guaranteed at points where backtracking is not possible or not efficient.
This study does not consider social and economic factors, such as educational, economic, or relational beliefs, which may affect the spread.

Conclusion
The coronavirus pandemic emerged in early December 2019, and by June 04, 2020, almost the whole world had been affected by it. The virus that causes COVID-19 is closely related to bat coronaviruses. As explained earlier, the known symptoms of the disease include tiredness, fever, and dry cough. COVID-19 spreads exponentially, resulting in a rapid infection rate. The infection rate is even faster in developing and highly populated countries such as India, Bangladesh, and Pakistan. The current situation in Pakistan is not satisfactory, as the infection rate is continuously rising while financial and medical resources are very limited. Pakistan must take proactive measures to gain control over this pandemic, and this study can help policymakers take comprehensive action and plan for future needs in the health sector. We carried out this study to determine the future trend of the situation using the data mining technique of linear regression, three cumulative datasets of recovered, deceased, and confirmed cases, and our proposed methodology. The model finds that the infection rate will gradually increase, but it is also observed that the recovery rate will increase rapidly compared to the death rate. In future work, we will analyze this rapid rate of recovery relative to deaths, along with other social factors such as public awareness and personal beliefs regarding the reality and severity of the coronavirus.