Following today's class discussion of the CDC datasets, my project work has moved on to analyzing the county-level U.S. data and the correlation between % diabetes and % inactivity. To prepare for further analysis, and to understand residual plots and scatter plots, I reviewed the basics of pandas, NumPy, and Matplotlib using Jupyter Notebook.
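As a minimal sketch of what the correlation and residuals mean in practice, the snippet below computes a Pearson correlation, fits a straight line, and derives the residuals that a residual plot would display. The six value pairs are made up for illustration and are not taken from the CDC data; the variable names are assumptions as well.

```python
import numpy as np

# Made-up percentages for six hypothetical counties (NOT CDC values).
inactivity = np.array([14.0, 16.5, 18.0, 20.5, 22.0, 25.0])
diabetes = np.array([6.5, 7.4, 7.9, 8.6, 9.5, 10.1])

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(inactivity, diabetes)[0, 1]

# Least-squares line: diabetes ~ slope * inactivity + intercept.
slope, intercept = np.polyfit(inactivity, diabetes, 1)

# Residuals are observed minus fitted values; a residual plot
# shows these against the predictor (or the fitted values).
fitted = slope * inactivity + intercept
residuals = diabetes - fitted

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
print("residuals:", np.round(residuals, 3))
```

With Matplotlib, `plt.scatter(inactivity, diabetes)` would draw the scatter plot and `plt.scatter(inactivity, residuals)` the residual plot.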
As a result, I wrote code to analyze the diabetes data for the U.S. counties. I used df.info() to get the essential details of the data frame, such as the column names, data types, and non-null counts, and df.describe() to get summary statistics for the numeric columns. The mean is about 8.719796 and the standard deviation is approximately 1.794854. The minimum is 3.8, the first quartile is 7.3, the median is 8.4, the third quartile is 9.7, and the maximum is 17.9. I also calculated the skewness and kurtosis, which are approximately 0.9744 and 1.0317, respectively. The positive skewness agrees with the mean lying above the median, which points to a longer right tail.
Note: The figures above describe only the diabetes dataset.
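The steps above can be sketched roughly as follows. This is only an illustration: the column name `pct_diabetes` and the eight values are invented stand-ins, not the CDC data, so the printed numbers will not match the ones I reported.

```python
import pandas as pd

# Invented values standing in for the % diabetes column (NOT CDC data).
df = pd.DataFrame(
    {"pct_diabetes": [6.1, 7.3, 7.9, 8.4, 8.8, 9.7, 10.5, 13.2]}
)

# Structure of the data frame: column names, dtypes, non-null counts.
df.info()

# Summary statistics: count, mean, std, min, 25%, 50%, 75%, max.
summary = df["pct_diabetes"].describe()
print(summary)

# Sample skewness and kurtosis. pandas reports *excess* kurtosis,
# so a normal distribution scores about 0 rather than 3.
skewness = df["pct_diabetes"].skew()
kurt = df["pct_diabetes"].kurt()
print(f"skewness = {skewness:.4f}, kurtosis = {kurt:.4f}")
```

Note that df.info() only describes the structure of the data frame; the mean, standard deviation, and quartiles come from df.describe().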
In addition, in today's class I learned a new term, the p-value. Hypothesis testing starts by formulating a null hypothesis (H0) and an alternative hypothesis (Ha or H1). The p-value is a statistical measure of the strength of the evidence against the null hypothesis: it is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
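To make the definition concrete, here is a small sketch that estimates a p-value with a permutation test, which is one possible approach and not necessarily the one used in class. H0 is "there is no association between inactivity and diabetes", and the p-value is the share of random shuffles whose correlation is at least as strong as the observed one. The data and the function names `pearson_r` and `permutation_p_value` are made up for illustration.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided p-value for H0: no association between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        # Shuffling y breaks any real link to x, which simulates H0.
        rng.shuffle(y_shuffled)
        if abs(pearson_r(x, y_shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Made-up percentages for six hypothetical counties (NOT CDC values).
inactivity = [14.0, 16.5, 18.0, 20.5, 22.0, 25.0]
diabetes = [6.5, 7.4, 7.9, 8.6, 9.5, 10.1]

p = permutation_p_value(inactivity, diabetes)
print(f"estimated p-value = {p:.4f}")
```

A small p-value means a correlation this strong would rarely arise by chance if H0 were true, so the data give evidence against H0.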