US and COVID Analysis

The datasets used in this project support our exploration of the effect of COVID-19 vaccinations on deaths by state as we continue to navigate this pandemic. They provide detailed information about total deaths, causes of death, total cases, recoveries, population, and completed vaccinations for individuals ages 12 and up, all of which are factors that influence death rates. Studying each of these variables is essential to understanding not only our variable of interest but also the factors that influence it. We expect to see fewer deaths in states with higher vaccination rates and more deaths among older individuals. Furthermore, we expect to see a relationship between states and their percentage of COVID-19
We performed PAM clustering on the variables Total Cases, Total Deaths, and meanpplfullyvacc to investigate how they relate to one another and to group observations based on the distances between them.
The average silhouette width was 0.78, a value conventionally read as indicating strong cluster structure, and the graph shows that the best fit is achieved with only 2 clusters.
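To make the silhouette figure concrete, here is a minimal sketch of how an average silhouette width can be computed. It is written in plain Python purely for illustration and is not the project's own code; the function name, the toy points, and the labels are all made up and are not taken from the COVID data.

```python
from math import dist
from statistics import mean


def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points.

    a = mean distance from point i to the other members of its own cluster
    b = smallest mean distance from point i to the members of another cluster
    Assumes at least two distinct cluster labels.
    """
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # singleton cluster: width is defined as 0
            continue
        a = mean(own)
        b = min(
            mean(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            for c in clusters if c != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return mean(widths)


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]    # deliberately scrambled assignment

print(average_silhouette_width(toy_points, good_labels))
print(average_silhouette_width(toy_points, bad_labels))
```

Values near 1 mean that points sit much closer to their own cluster than to any other, so the scrambled assignment scores far lower than the natural one.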
Using the pam function, we found that the observation with ID 802 is the center (medoid) of cluster 1 and observation 963 is the center of cluster 2. We then reduced the dimensions using PCA. The cluster plot shows that the first dimension (the first principal component) accounts for about 74% of the variance in the data, and the second dimension accounts for about 25%. The cumulative percentage of explained variance for PC1 and PC2 shows that we retained 99.4% of the information. This means that the two dimensions represent our data adequately, despite the reduction.
Then, we saved the cluster assignment as a column in order to analyze our clusters further. After scaling our data, we computed the mean of each variable within each cluster and found that the values in cluster 1 were slightly below the overall average. In cluster 2, the values were well above the average.
Lastly, we visualized the clusters using ggpairs(). The graph shows that the two variables Total Deaths and meanpplfullyvacc have dramatically different distributions in cluster 1 compared to cluster 2. Across all 3 variables, cluster 2 tends to have higher variability than cluster 1.
A principal components analysis was performed on “Total Cases”, “Population”, and “casesperpop” from the dataset “USA_Covid_Data”. We used prcomp to find the three principal components: PC1, PC2, and PC3. PC1 and PC2 were extracted and bound together using the cbind function. To visualize PC1 and PC2, the fviz_pca_biplot function was used to create a biplot. It shows that an observation with a higher value for PC1 has higher total cases and a larger population, with a lower casesperpop. An observation with a higher value for PC2 has a higher casesperpop, with no real difference in total cases or population size. The components are ordered by how much variability they account for in the data: PC1 explains the most, then PC2, and PC3 the least.
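The ordering of the components by explained variability follows from simple arithmetic on the component standard deviations. The sketch below shows that arithmetic in plain Python for illustration only. The function name and the three standard deviations are hypothetical and are not the values obtained from USA_Covid_Data.

```python
from itertools import accumulate


def explained_variance(sdev):
    """Return (proportion, cumulative) of variance explained per component.

    sdev holds one standard deviation per principal component,
    ordered PC1, PC2, ... Each component's variance is sdev squared.
    """
    variances = [s ** 2 for s in sdev]
    total = sum(variances)
    proportion = [v / total for v in variances]
    cumulative = list(accumulate(proportion))
    return proportion, cumulative


# Hypothetical standard deviations for PC1, PC2, PC3 (not the project's values).
example_sdev = [1.5, 0.8, 0.3]
proportion, cumulative = explained_variance(example_sdev)

for name, p, c in zip(["PC1", "PC2", "PC3"], proportion, cumulative):
    print(f"{name}: proportion={p:.3f}, cumulative={c:.3f}")
```

Because the standard deviations are non-increasing from PC1 onward, the proportions are too, and the cumulative value for the last component is always 1.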
We found that Total Tests is not a good predictor of gender in our dataset: the AUC is poor, and the mean AUC across all folds is about the same (0.48). Both the initial train/test split and the later k-fold cross-validation, which gives a more reliable average across folds, produced AUC values around 0.5, which is no better than random guessing. This is to be expected, since testing is not inherently dependent on gender and COVID does not affect one sex more than the other. Testing is not limited by biological sex differences.
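To illustrate why an AUC near 0.5 means the predictor carries no useful signal, the following sketch computes AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, then averages it over folds. It is plain Python written for illustration, not the project's code. The function names and the fold data are hypothetical, and the sketch assumes each fold already holds predicted scores and true 0/1 labels for its held-out observations.

```python
from statistics import mean


def auc(scores, labels):
    """AUC as P(score of a positive > score of a negative); ties count 0.5.

    labels are 0/1 and must contain at least one of each class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc_over_folds(folds):
    """Average the held-out AUC over a list of (scores, labels) folds."""
    return mean(auc(scores, labels) for scores, labels in folds)


# Hypothetical held-out predictions for two folds (made-up numbers).
example_folds = [
    ([0.3, 0.2, 0.5, 0.6], [0, 1, 0, 1]),
    ([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]),
]

print(mean_auc_over_folds(example_folds))
print(auc([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1]))  # constant score, no information
```

A score that ranks every positive above every negative gives 1.0, while a score unrelated to the label hovers around 0.5, which is the situation described above.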

Summary

PAM Clustering

Dimension Reduction

Classification and Cross-Validation

samrudrava@gmail.com