Clustering, Regression, Goodness of Fit Statistics, Spatial Lag and Error Models

Loading and Viewing the COVID Data

Data used in this analysis: COVID data in Select U.S. counties between 2020 and 2021.

This analysis utilizes R Studio and GeoDa. R code is presented in light blue boxes, while the resulting output is provided in light grey boxes.

Moran’s I

The Moran’s I scatterplot of the standardized Covid death rate variable shows a positive best-fit line with a value of 0.386, and the majority of observations fall within quadrants 1 and 3. These results indicate that spatial autocorrelation between neighboring features is present within the data. Additionally, the z-score value of 34.7559 strongly supports the rejection of the null hypothesis that values of the Covid death rate variable are randomly distributed. (Rook contiguity weights were used).

Cluster and Significance Maps

In the LISA cluster map and the Getis-Ord local G-statistic cluster map evidence of spatial autocorrelation is seen. The high-high clusters of the LISA map reside within quadrant one of the Moran’s I scatterplot and indicate positive spatial autocorrelation (Murack, 2015). The low-low clusters of the LISA map reside within quadrant three of the Moran’s I scatterplot and indicate positive spatial autocorrelation. The LISA map’s high-low and low-high clusters represent locations within quadrants 2 and 4 of the Moran’s I scatterplot and indicate negative spatial autocorrelation or outliers. The G-statistic map shows areas of the country with a concentration of higher or lower Covid death rates, indicating positive or negative spatial autocorrelation, respectively (Rigby, 2023). Both maps suggest that Covid death rates are clustered and do not align with a random spatial pattern.

Regression

Regression analysis was conducted for the dependent variable of Covid death rates and the independent variables of share of the population that is of color, the share of the population that smokes, the share of the population in poverty, the share of the population considered obese, the share of the population classified as elderly, and share of the population registered as democratic voters (note all variables are normalized).

Partial Regression Coefficients

The results show that for every one unit of increase in the share of the population that smokes, Covid death rates decrease by 0.0194318 of a unit. For every one-unit increase in the share of the population in poverty, Covid death rates increase by 0.117537 of a unit. For every one-unit increase in the share of the population that identifies as a person of color, Covid death rates increase by 0.28321 of a unit. For every one-unit increase in the share of the population that is considered obese, Covid death rates increase by 0.148325 of a unit. For every one-unit increase in the share of the population aged sixty-five or older, Covid death rates increase by 0.127995 per unit. For every one-unit increase in the share of the population registered as a democratic voter, Covid death rates decrease by 0.168049 of a unit.

Goodness of Fit Statistics

The role of the goodness of fit statistics are to signify how adequately the regression model explains Covid death rate variation (Rigby, 2023a).

R2. One measure of goodness of fit is the coefficient of determination, commonly represented as R2, which “measures the proportion of the total variation in the dependent variable (Y) explained by the regression model” (Rigby, 2023a). The R2 value for this regression is 0.143199 (with an adjusted R2 value of 0.141541) which indicates that this model, the independent variables, explain about 14% of the variation in the Covid death rate dependent variable.

Residual Standard Error. The residual standard error is calculated as the “standard deviation of the residuals about the regression line” (Rigby, 2023b). The residual standard error value of 0.9227 for this regression indicates that the average error of the model is small.

T* Test and P-Values. The t-test statistics and p-values show that all variables are statistically significant and contain expected directionality except for the percentage of smokers in the population variable, which is neither significant nor the expected directionality. From previous theory and research findings, it is expected that increases in the shares of poverty, smoking, people of color, obesity, and people sixty-five and older in a population would result in increased Covid death rates. In contrast, increases in the share of Democratic voters in a population would decrease Covid death rates. However, the smoking variable produced results that do not align with this expectation. As a result of the significance outcomes, all null hypotheses (“that the sample could have been drawn from a population where the population slope coefficient was equal to zero”) would be rejected except for the smoker variable (Rigby, 2023b).

F-Statistic. The F-statistic value of 86.3793 indicates “that at least one of the slope coefficients is significantly different from zero” (Rigby, 2023b).

The residuals of the regression show spatial autocorrelation in the quantile map and the Moran’s I scatterplot. In the Moran’s I scatterplot, most observations fall into the first and third quadrants, the best-fit line is positive, and the Moran’s I value is 0.326, all indicating positive spatial autocorrelation of the OLS residuals. The differences in actual values compared to predicted values indicates that “it is likely that one or more of the assumptions underpinning statistical inference have been violated,” and as a result, examination of spatial lag and spatial error are warranted to determine the spatial dependence (Rigby, 2023c).

Spatial Lag Model

The spatial lag model shows a lag coefficient (Rho) and the W_cases_per variable coefficient with statistically significant values of 0.51735, which represents the “spatially lagged values of the dependent variable and their influence on the actual dependent variable” (Rigby, 2023c, 10:29). The Likelihood Ratio Test value of 596.2119 and probability value of 0.0 indicates that the spatial lag variables are significant regarding its effect on Covid death rates. The partial regression coefficients for the independent variables have not changed significantly, but exhibit nominal value decreases. With spatial lag controlled for, the estimations are more dependable. The R2 value indicates that this model accounts for about 33% of the variation in Covid death rates, an improvement from the R2 value of ~ 14% in the OLS model. The Akaike info criterion value has also decreased in comparison to the OLS model, indicating that the spatial lag model is a better fit.

Spatial Error Model

Lambda represents the spatial error in the model with a value of 0.538533 and a Likelihood Ratio Test value of 619.9093 with a probability of 0.0. The R2 value shows even further improvement, with the model accounting for about 34% of the variation in Covid death rates. The Likelihood Ratio Test value of 619.9093 and probability value of 0.0 is higher than the value from the spatial lag model, indicating that this model is “capturing more of the spatial dependence in the data” (Rigby, 2023). The partial regression coefficients for the independent variables have changed and exhibit value increases compared to the spatial lag model. The Akaike info criterion value has also decreased from the spatial lag model, indicating that the spatial error model is a better fit. The Likelihood Ratio Test value indicating more spatial dependence is captured indicates that the spatial error model is the model to select.

The maps and Moran’s I scatterplots also support the findings mentioned above with little autocorrelation, and the z-scores indicate no autocorrelation.

Loading the St. Louis Homicide Data

Data used in this analysis: Homicide data in the regions surrounding St. Louis, Missouri between 2020 and 2021.

The extent of the data is provided below.

This analysis utilizes R Studio and GeoDa. R code is presented in light blue boxes, while the resulting output is provided in light grey boxes.

Regression

regression <- lm(HR8893 ~ PE87 + RDAC90, data=stlouis)
summary(regression)
Call:
lm(formula = HR8893 ~ PE87 + RDAC90, data = stlouis)
Residuals:
Min 1Q Median 3Q Max
-8.8749 -2.7648 -0.6719 2.1715 20.2329
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03748 1.54263 0.024 0.981
PE87 1.56705 0.37128 4.221 6.75e-05 ***
RDAC90 5.29091 0.82492 6.414 1.14e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.692 on 75 degrees of freedom
Multiple R-squared: 0.4065, Adjusted R-squared: 0.3906
F-statistic: 25.68 on 2 and 75 DF, p-value: 3.194e-09

The regression shows that for every one-unit increase in police expenditure, there is a 1.56705 unit increase in the homicide rate. For the local deprivation index variable, every one-unit increase results in a 5.29091 unit increase in the homicide rate variable. The R2 value of 0.4065 indicates that this model accounts for 40% of the variation in homicide rates. The residual standard error value of 4.692 evidences error is present within the model. The F-statistic value of 25.68 indicates that at least one of the slope coefficients is significantly different from 0. The t-values and p-values indicate that they are statistically significant. Both variables carry the sign expected in relation to homicide rates.

Geographically Weighted Regression

Output.Areas <- readOGR(".", "stlouis")
OGR data source with driver: ESRI Shapefile
Source: "C:\Users\katee\Downloads\stlouis", layer: "stlouis"
with 78 features
It has 23 fields
Integer64 fields read as strings: STFIPS COFIPS FIPSNO
GWRbandwidth <- gwr.sel(HR8893 ~ PE87 + RDAC90, data=Output.Areas,adapt = T)
Adaptive q: 0.381966 CV score: 2462.794
Adaptive q: 0.618034 CV score: 2493.324
Adaptive q: 0.236068 CV score: 2309.391
Adaptive q: 0.145898 CV score: 2094.679
Adaptive q: 0.09016994 CV score: 1394.082
Adaptive q: 0.05572809 CV score: 1032.717
Adaptive q: 0.03444185 CV score: 903.0778
Adaptive q: 0.02128624 CV score: 840.6013
Adaptive q: 0.01315562 CV score: 843.9937
Adaptive q: 0.01808047 CV score: 837.933
Adaptive q: 0.01803978 CV score: 837.9085
Adaptive q: 0.0161742 CV score: 837.8115
Adaptive q: 0.01701687 CV score: 837.5713
Adaptive q: 0.01705756 CV score: 837.5728
Adaptive q: 0.01697618 CV score: 837.5709
Adaptive q: 0.01666985 CV score: 837.6052
Adaptive q: 0.01693549 CV score: 837.5716
Adaptive q: 0.01697618 CV score: 837.5709
gwr.model = gwr(HR8893 ~ PE87 + RDAC90, data=Output.Areas, adapt=GWRbandwidth, hatmatrix=TRUE, se.fit=TRUE)

gwr.model

Call:
gwr(formula = HR8893 ~ PE87 + RDAC90, data = Output.Areas, adapt = GWRbandwidth,
hatmatrix = TRUE, se.fit = TRUE)
Kernel function: gwr.Gauss
Adaptive quantile: 0.01697618 (about 1 of 78 data points)
Summary of GWR coefficient estimates at data points:
Min. 1st Qu. Median 3rd Qu. Max. Global
X.Intercept. -1.86007 1.34844 3.38473 5.33793 18.08941 0.0375
PE87 -3.31961 -0.18401 0.44314 1.00753 2.69020 1.5670
RDAC90 -9.17972 0.57759 1.87845 6.50702 11.94509 5.2909
Number of data points: 78
Effective number of parameters (residual: 2traceS - traceS'S): 47.66878
Effective degrees of freedom (residual: 2traceS - traceS'S): 30.33122
Sigma (residual: 2traceS - traceS'S): 2.672647
Effective number of parameters (model: traceS): 36.96118
Effective degrees of freedom (model: traceS): 41.03882
Sigma (model: traceS): 2.297678
Sigma (ML): 1.66663
AICc (GWR p. 61, eq 2.33; p. 96, eq. 4.21): 452.7336
AIC (GWR p. 96, eq. 4.22): 338.001
Residual sum of squares: 216.6572
Quasi-global R2: 0.9221223
results <-as.data.frame(gwr.model$SDF)
names(results)
[1] "sum.w" "X.Intercept." "PE87" "RDAC90" "X.Intercept._se"
[6] "PE87_se" "RDAC90_se" "gwr.e" "pred" "pred.se"
[11] "localR2" "X.Intercept._se_EDF" "PE87_se_EDF" "RDAC90_se_EDF" "pred.se.1"
gwr.map <- cbind(stlouis, as.matrix(results))
qtm(gwr.map, fill = "localR2") + tm_legend(legend.position = c("right", "bottom"), main.title = "St. Louis: Hom, PD Exp, Depr. Idx",main.title.position = "right")

The R2 map indicates that the model accounts for a larger portion of homicide rates in certain areas of St. Louis. The inner city and eastern edges of the St. Louis area seem to have homicide rates that can be 80-100% accounted for based on the deprivation index values and police expenditure. In contrast, homicide rates in the southern region of the city are least accounted for in the model, accounting for between 2 and 4% of the homicide rates. This map shows there is spatial autocorrelation regarding how effective the model is.

map <- tm_shape(gwr.map) + tm_fill("PE87.1", n = 5, style = "quantile", title = "Police Expenditure") + tm_layout(frame = FALSE, legend.text.size = 0.5, legend.position = c("right", "bottom"),legend.title.size = 0.6)

This map indicates that the impacts of police expenditure on homicide rates are different in areas of St. Louis. In some regions of St. Louis, police expenditure increase relates to increased homicide rates (green values) while other regions see the opposite (red and some white/yellow values).

map2 <- tm_shape(gwr.map) + tm_fill("RDAC90.1", n = 5, style = "quantile", title = "Local Depr. Idx", midpoint = NA) + tm_layout(frame = FALSE, legend.text.size = 0.5, legend.position = c("right", "bottom"),legend.title.size = 0.6)

This map indicates that the impacts of deprivation index values on homicide rates are different in areas of St. Louis. In some regions of St. Louis, deprivation index value increases relate to an increased homicide rates (orange and all green values), while other regions see the opposite (red).

References

Rigby, D. (2023a). Lecture Video 2: Introduction to Regression 2: 23W-GEOG-413-LEC-1 Applied Geospatial Statistics.

Rigby, D. (2023b). Lecture Video 3: Extension to Multiple Regression: 23W-GEOG-413-LEC-1 Applied Geospatial Statistics.

Rigby, D. (2023c). Lecture Video 4: Spatial Regression Models: 23W-GEOG-413-LEC-1 Applied Geospatial Statistics.

Previous
Previous

Spatial Autocorrelation