COVID-19 and Data Science
Data mining is the process of examining large amounts of data to find anomalies, patterns, and relationships that help predict outcomes. A business can use this information to increase efficiency, revenue, and profit and to reduce risk. For this task, we used the Cross-Industry Standard Process for Data Mining, also known as CRISP-DM, which describes the life cycle of a data science project. It helps us organize, plan, and implement our data science projects. The process contains six sequential phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Business understanding describes the needs and requirements of the business. In data understanding, we collect and explore the data and check its accuracy. In data preparation, we clean and organize the data so that it is ready for modeling. In modeling, we build models on the historical data. In evaluation and deployment, we check which model meets the business requirements and decide how the business will access the results.
We used the Jupyter Notebook tool to analyze the dataset. The topic chosen was COVID-19 and data science. The data is used to predict the future spread of the disease from historical data, trends, and patterns.
In the business understanding phase, we defined the business requirements: check the accuracy of the data, analyze the high-risk areas to prevent contamination, reopen offices in green zone areas, and predict future contamination. Data science makes it easy to collect data on the state of the pandemic worldwide. With that data, we can analyze the trends and patterns of the pandemic and calculate the total number of cases in each country. We also found that the dataset contains some null values, which we cleaned up before checking its accuracy, because the data must be correct for the prediction results to be correct.
In the data discovery and understanding phase, we chose a sample dataset from the Kaggle website and made some changes to it to explore new results. COVID-19 causes various health problems and has led to many deaths all over the world, so we chose this dataset to study the problem. The dataset contains COVID-19 details for each state of India. Its attributes give the total number of confirmed, cured, and active cases and the number of deaths in each state, along with the negative and positive test results. In this phase, we followed these steps: data collection, data cleaning, data analysis, and predictive analysis. In the data cleaning step, we checked the dataset and dealt with missing values, null values, outliers, hidden values, and inaccurate data. After analyzing the data, we organized the dataset for modeling.
First, we import some basic libraries (NumPy, pandas, seaborn, and Matplotlib), and then we load our dataset.
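A minimal sketch of the loading step is given here. It only needs pandas; NumPy, seaborn, and Matplotlib would be imported in the same way. The small inline CSV stands in for the downloaded file so that the snippet runs anywhere, and its column names, state names, and values are assumptions made for illustration, not rows from the real dataset.

```python
import io

import pandas as pd

# Tiny inline sample standing in for the downloaded CSV file.
# Column names and values are illustrative assumptions, not real data.
SAMPLE_CSV = """Date,State,TotalSamples,Negative,Positive
2020-04-17,State A,16475,16000,395
2020-04-17,State B,21000,,1640
2020-04-18,State A,17400,,399
"""


def load_dataset(source):
    """Read a CSV (file path or file-like object) and parse the Date column."""
    return pd.read_csv(source, parse_dates=["Date"])


df = load_dataset(io.StringIO(SAMPLE_CSV))

# A first look at the size and column types of the data.
print(df.shape)
print(df.dtypes)
```

With the real file, the same function can be called with the path of the downloaded CSV instead of the in-memory sample.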
Next, we check whether there are any null values in the dataset. We found null values in the columns named "Negative" and "Positive", and we filled them with 0.
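The null check and the fill can be sketched as follows. The "Negative" and "Positive" column names come from the text above; the sample rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; None becomes NaN (a null value) in the numeric columns.
df = pd.DataFrame({
    "State": ["State A", "State B", "State A"],
    "Negative": [16000.0, None, None],
    "Positive": [395.0, 1640.0, None],
})

# Count the null values in every column.
null_counts = df.isnull().sum()
print(null_counts)

# Replace the nulls in the two test-result columns with 0.
df[["Negative", "Positive"]] = df[["Negative", "Positive"]].fillna(0)
print(df.isnull().sum())
```

Filling with 0 keeps every row, but it also treats an unreported count as zero tests, which is worth keeping in mind when reading totals computed from these columns.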
Next, we check the unique values of the state names to make sure each state appears under a single name. We found that some states are written in different ways, so we removed every * after the names to make them consistent.
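One way to do this clean-up, sketched with a hypothetical "State" column and made-up name variants:

```python
import pandas as pd

# Made-up examples of the same state written with and without trailing asterisks.
df = pd.DataFrame({
    "State": ["Maharashtra", "Maharashtra***", "Bihar", "Bihar****", "Delhi"],
})

# Inspect the distinct spellings before cleaning.
print(df["State"].unique())

# Strip trailing asterisks, then any leftover surrounding whitespace.
df["State"] = df["State"].str.rstrip("*").str.strip()

# After cleaning, each state should appear under exactly one name.
print(df["State"].unique())
```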
Below are the visualizations used for the data analysis:
Number of deaths, cured, confirmed and active ratio
Figure 1 is a bar graph that represents the total number of confirmed, cured, and active cases and deaths. The graph shows that the deaths and confirmed cases are fewer than the cured and active cases.
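A rough sketch of how the totals behind such a bar graph could be computed is given here. It rests on several assumptions that the article does not state: the data has one row per state per date with cumulative counts, the columns are named "Confirmed", "Cured", and "Deaths", and active cases are derived as confirmed minus cured minus deaths. The numbers are made up.

```python
import pandas as pd

# Made-up cumulative counts, one row per state per date (assumed layout).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-05-01", "2020-05-02", "2020-05-01", "2020-05-02"]),
    "State": ["State A", "State A", "State B", "State B"],
    "Confirmed": [100, 150, 40, 60],
    "Cured": [30, 70, 10, 25],
    "Deaths": [5, 8, 1, 2],
})

# Counts are cumulative, so keep only the most recent row for each state.
latest = df.sort_values("Date").groupby("State", as_index=False).last()

# Assumption: active cases = confirmed - cured - deaths.
latest["Active"] = latest["Confirmed"] - latest["Cured"] - latest["Deaths"]

# Country-wide totals for the four categories.
totals = latest[["Confirmed", "Cured", "Deaths", "Active"]].sum()
print(totals)


def plot_totals(series):
    """Draw the totals as a bar chart (pandas uses Matplotlib for this)."""
    ax = series.plot(kind="bar", title="Total cases by category")
    ax.set_ylabel("Number of cases")
    return ax
```

In a notebook, calling plot_totals(totals) in a cell draws the chart.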
Top Five Contaminated States Graph
Figure 2 is a bar plot of the top five high-risk states, where the confirmed cases are highest. This plot was used to calculate the total cases for each high-risk state so that contamination can be reduced by focusing on those areas. We found that the top five highly contaminated states are Delhi, Maharashtra, Gujarat, Haryana, and Jharkhand. This view meets the business requirements well because it lets us find the red zone areas.
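Selecting the top five states by confirmed cases can be sketched as follows. The table is assumed to hold one row per state with its latest confirmed count; the state labels and numbers are placeholders, not values from the dataset.

```python
import pandas as pd

# Placeholder per-state snapshot; labels and counts are made up.
latest = pd.DataFrame({
    "State": ["A", "B", "C", "D", "E", "F", "G"],
    "Confirmed": [500, 1200, 80, 950, 300, 40, 700],
})

# Keep the five rows with the largest confirmed counts, highest first.
top5 = latest.nlargest(5, "Confirmed").reset_index(drop=True)
print(top5)
```

The resulting table can be passed to a bar plot, for example with seaborn's barplot using "State" on the x-axis and "Confirmed" on the y-axis.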
Figure 3 shows the death ratio of the top five contaminated states. Even though Delhi has a high number of cases, its number of deaths is lower. Maharashtra has the highest death rate. Using this result, we can plan more hospitals and health care arrangements so that people get proper treatment and deaths are reduced. This analysis helps the business reduce risk.
Figure 4 shows the cure ratio of the top five contaminated states. Even though Maharashtra has the highest death rate, its cure ratio is also higher than that of the other states.
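The article does not define the death ratio and cure ratio, so the sketch here assumes the common definitions: deaths (or cured cases) divided by confirmed cases, expressed as a percentage. The column names and all numbers are illustrative.

```python
import pandas as pd

# Placeholder figures for five states; not taken from the dataset.
top5 = pd.DataFrame({
    "State": ["A", "B", "C", "D", "E"],
    "Confirmed": [1000, 800, 500, 400, 200],
    "Cured": [600, 400, 350, 100, 150],
    "Deaths": [50, 16, 5, 20, 2],
})

# Assumed definitions: share of confirmed cases, as a percentage.
top5["DeathRatio"] = (top5["Deaths"] / top5["Confirmed"] * 100).round(2)
top5["CureRatio"] = (top5["Cured"] / top5["Confirmed"] * 100).round(2)

print(top5[["State", "DeathRatio", "CureRatio"]])
```

Comparing ratios instead of raw counts is what makes it possible for a state with many cases to show a relatively low death ratio.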
Confusion Matrix and accuracy result
Figure 5 shows the confusion matrix, which we use to calculate the accuracy of the models. After analyzing the models, we found that the accuracy is 82 percent and that the models meet the business requirements.
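To make the link between a confusion matrix and accuracy concrete, here is a small pure-Python sketch for a two-class problem. The article does not say which model or labels were used, so the labels (1 for a hypothetical "high-risk" class, 0 otherwise) and the predictions are made up, and the resulting number is not the article's 82 percent.

```python
def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return [[tn, fp], [fn, tp]]


def accuracy(matrix):
    """Accuracy is the share of correct predictions: (TP + TN) / total."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return (tp + tn) / total if total else 0.0


# Made-up true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

matrix = confusion_matrix(y_true, y_pred)
print(matrix)
print(accuracy(matrix))
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the sum of all four cells.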
In the evaluation phase, we look for the model that best meets the business requirements. This phase consists of three tasks. In the first task, we evaluate the results by checking whether the models fulfill the business requirements and, if so, whether we can approve them. In the second task, we review all the steps we followed during the process. In the third task, we determine whether the models can proceed to deployment, need further iteration, or call for a new project. In short, to evaluate the models, we check which of them best meet the requirements and how accurate they are.
In the deployment phase, we decide how the business will use the results. A model is useful only if it is accurate and the business can access it. This phase contains four tasks. In the first task, we plan the deployment of the model. In the second task, we plan how to monitor and maintain the model. In the third task, we produce the final report of the data mining results. In the fourth task, we review the project and analyze the results. Before deployment, we check that the statistical model is accurate and ready, and afterwards we monitor it. If everything goes well, we make the project's final presentation and show the data mining results. According to the results, various states are in the red zone, and we must reduce contamination there by warning them. The model fits the business requirements well.
After going through all the steps, we conclude that the model fits the business needs. The graphs show the number of deaths, active cases, and positive cases in each area. This analysis helps reduce the risk to people's lives. Using the data modeling, we can identify the contaminated areas in each state and classify them into green, orange, and red zones. The green zone is the safest, with no COVID-19 contamination. The red zone is the danger zone, where contamination is increasing at a high rate, and we call those areas high-risk. We can impose a lockdown in red zone areas to limit contamination. In orange zone areas, we can warn people to stay safe at home. In green zone areas, we can reopen offices and call people back safely. We can also forecast future conditions to decide when to resume work. This process lets us analyze the trends and patterns of the disease. In high-risk areas, we can speed up vaccination to reduce contamination. The model provides an overall picture of the pandemic and will be very helpful for businesses that want to resume work in low-risk areas. Through this analysis, we can help save many lives.
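The zone rules described above can be written as a small function. This is only a sketch: the article gives no numeric thresholds, so the growth-rate cut-off, the function name, and the sample areas are all hypothetical.

```python
def classify_zone(active_cases, daily_growth_rate, red_growth_threshold=0.05):
    """Return "green", "orange", or "red" for one area.

    red_growth_threshold is a hypothetical cut-off chosen for illustration;
    it does not come from the article or from any official guideline.
    """
    if active_cases == 0:
        # No contamination at all: safest zone.
        return "green"
    if daily_growth_rate >= red_growth_threshold:
        # Cases are increasing at a high rate: high-risk zone.
        return "red"
    # Cases exist but are not growing quickly.
    return "orange"


# Made-up areas: (active cases, daily growth rate of cases).
areas = {
    "Area 1": (0, 0.0),
    "Area 2": (120, 0.01),
    "Area 3": (900, 0.12),
}

zones = {name: classify_zone(active, growth) for name, (active, growth) in areas.items()}
print(zones)
```

In practice the cut-off would have to come from public health guidance, and the growth rate would be estimated from the historical case counts.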
You can download the dataset for practice by using the link below:
Statewise testing details-
After downloading the dataset and following the steps given above, you can make predictions from the historical data.
Thank You and Happy Learning!