4 Regression

library(readxl)
library(ggplot2)
library(dplyr)
library(magrittr)
library(ggplot2)
library(GGally)

We process the dataset to be able to work with it.

colnames(cs) <- c("Fecha", "Territorio", "Confirmados_PCR", "Hospitalizados", "UCI", "Curados", "Defunciones", "Total_confirmados")
juntos <- 
  filter(cs, cs$Territorio == "Andalucía")
juntos <- mutate(juntos, Total = cumsum(Total_confirmados))
                        
                        
c <- juntos$Curados
h <- juntos$Hospitalizados
d <- juntos$Defunciones
uci <- juntos$UCI
conf <- juntos$Confirmados_PCR
totalconf <- juntos$Total_confirmados
salidac <- vector("numeric",length(c))
salidah <-vector("numeric",length(h))
salidad <- vector("numeric",length(d))
salidauci <- vector("numeric",length(uci))
salidaconf <- vector("numeric",length(conf))
salidatotalconf <- vector("numeric",length(totalconf))
for(i in seq_along(c)){
  
  salidac[length(c)+1-i] <- c[i]
  salidah[length(h)+1-i] <- h[i]
  salidad[length(d)+1-i] <- d[i]
  salidauci[length(uci)+1-i] <- uci[i]
  salidaconf[length(conf)+1-i] <- conf[i]
  salidatotalconf[length(totalconf)+1-i] <- totalconf[i]
  
}
juntos$Curados <- salidac
juntos$Hospitalizados <- salidah
juntos$Defunciones <- salidad
juntos$UCI <- salidauci
juntos$Confirmados_PCR <- salidaconf
juntos$Total_confirmados <- salidatotalconf
                     
  fechas <- as.Date(juntos$Fecha ,"%d/%m/%Y")
juntos$Fecha <- sort(fechas)                      

We make the graph of how the cases have increased as a function of time.

attach(juntos)
grafica <- juntos %>% ggplot(aes(x= Fecha, y= Total))+geom_point()
grafica

In this graph we can see how the number of cases up to May increased exponentially. However, starting from this date, the number of cases per day begins to decrease, reaching the famous peak.

4.1 Model of the number of infections with respect to the date

modFC <- lm(formula = juntos$Total ~ juntos$Fecha, data = juntos)
modFC
## 
## Call:
## lm(formula = juntos$Total ~ juntos$Fecha, data = juntos)
## 
## Coefficients:
##  (Intercept)  juntos$Fecha  
##   -4444803.2         242.5
summary(modFC)
## 
## Call:
## lm(formula = juntos$Total ~ juntos$Fecha, data = juntos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2038.5 -1147.8  -133.7  1165.0  2423.5 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.445e+06  1.110e+05  -40.03   <2e-16 ***
## juntos$Fecha  2.425e+02  6.047e+00   40.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1420 on 85 degrees of freedom
## Multiple R-squared:  0.9498, Adjusted R-squared:  0.9492 
## F-statistic:  1609 on 1 and 85 DF,  p-value: < 2.2e-16

We start by checking if there is a model, for this we look at the values of F-statistic and p-value. ** F-statistic ** is quite far from 1 (1609), so it indicates that there is a model. ** P-value **, is well below 0.005, confirming that there is a model (H1 is met) Now that we know there is a model, let’s study how good our model is.

We look at \(R^2\), which has a value of 0.9498, a pretty good value, since 94.98% of the cases are collected with this model. We also see that there is almost no difference between adjusted \(R^2\) and \(R^2\), so there is no overfitting in our model and that the values are relevant (indicated by ***) Let’s continue studying the model, for this we will see the graphs of the adjusted values and residuals.

plot(modFC)

In these graphs we can see how it begins adjusting to the values. However, as cases increase, there is more waste. This may be due to the results of the contingency measures that were taken. With all this we could say that we have a pretty good model. ## Model of the number of people who were hospitalized with respect to the number of infected.

mHospitalizados <- lm(formula = Hospitalizados ~ Total_confirmados, data = juntos)
mHospitalizados
## 
## Call:
## lm(formula = Hospitalizados ~ Total_confirmados, data = juntos)
## 
## Coefficients:
##       (Intercept)  Total_confirmados  
##          -31.5285             0.5212
summary(mHospitalizados)
## 
## Call:
## lm(formula = Hospitalizados ~ Total_confirmados, data = juntos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.696 -24.055   4.173  29.359  84.144 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -31.52846    5.98438  -5.268 1.02e-06 ***
## Total_confirmados   0.52117    0.02246  23.200  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.37 on 85 degrees of freedom
## Multiple R-squared:  0.8636, Adjusted R-squared:  0.862 
## F-statistic: 538.3 on 1 and 85 DF,  p-value: < 2.2e-16

We start by checking if there is a model, for this we look at the values of F-statistic and p-value. F-statistic is quite far from 1 (538), so it indicates that there is a model. P-value, is well below 0.005, which confirms that there is a model (H1 is met) Now that we know there is a model, let’s study how good our model is. We look at \(R^2\), which has a value of 0.8636, a pretty good value, since 86.36% of hospitalization cases are included in this model. We also see that there is almost no difference between adjusted \(R^2\) and \(R^2\), so there is no overfitting in our model and that the values are relevant (indicated by ***) Let’s continue studying the model, for this we will see the graphs of the adjusted values and residuals.

plot(mHospitalizados)

In the graphs we can see how there is dispersion of the values, and how it fits better at the beginning than at the end. We also observed that at the beginning of the period the number of hospitalized was very close to the number of cases. But as time progresses, the number of cases increases considerably, but not so much the number of hospitalized. This may be due to the fact that the population was informed of the first symptoms and the measures they had to take, so that an infected person could be detected in the early stages and monitored so that the severity could be reduced.

library(readxl)
library(tidyverse)
datos <- read_excel("cs_export.xls")
datos <- na.omit(datos)
names(datos) <- c(names(datos[1:2]),"Confirmados_PCR",names(datos[4:7]),"Total_confirmados")
varPred <- names(datos[c(3:6,8)])
datos$`Fecha declaración` <- as.Date(datos$`Fecha declaración`, "%d/%m/%Y")
datos <- arrange(datos, `Fecha declaración`)
filasandalucia <- filter(datos, Territorio=="Andalucía" )

We are going to produce a multivariate regression model that explains the variable Defunction using an iterative technique in which we will add each time the variable whose R-adjusted is greater.

We define a function to calculate the linear model of a sum of variables

linearAdjust <- function(df, y, x){
  mod1 <- lm(str_c(y,"~",str_c(x,collapse="+")),df)
}
calcModR2 <- function(df,y,x){
  mod <- linearAdjust(df,y,x)
  summary(mod)$adj.r.squared
}

We are adding variables while increasing the value of adjusted \(R^2\).

encontrarMejorAjuste <- function(df,varPos){
  bestVars <- character(0)
  aR2 <- 0
  repeat{
    aR2v <- map_dbl(varPos,~calcModR2(df, "Defunciones",c(bestVars,.)))
    i    <- which.max(aR2v)
    aR2M <- aR2v[i]
    
    if(aR2M<=aR2 ||length(varPos)<1) break
    #Valor del r-ajustado añadido y nombre de la variable elegida
    cat(sprintf("%1.4f %s\n",aR2M,varPos[i]))
    aR2 <- aR2M
    bestVars <- c(bestVars,varPos[i])
    varPos <- varPos[-i]
  }
  mod <- linearAdjust(df,"Defunciones",bestVars)
  list(vars=bestVars,mod=mod)
}
bestMod <- encontrarMejorAjuste(filasandalucia, varPred)
## 0.9499 Hospitalizados
## 0.9521 UCI
## 0.9530 Curados
## 0.9566 Total_confirmados
## 0.9642 Confirmados_PCR
bestMod
## $vars
## [1] "Hospitalizados"    "UCI"               "Curados"          
## [4] "Total_confirmados" "Confirmados_PCR"  
## 
## $mod
## 
## Call:
## lm(formula = str_c(y, "~", str_c(x, collapse = "+")), data = df)
## 
## Coefficients:
##       (Intercept)     Hospitalizados                UCI            Curados  
##           0.26562            0.11219            0.35878           -0.21053  
## Total_confirmados    Confirmados_PCR  
##           0.10853            0.08473
summary(bestMod$mod)
## 
## Call:
## lm(formula = str_c(y, "~", str_c(x, collapse = "+")), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2602  -2.2208  -0.3433   1.8867  13.9259 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.26562    0.82376   0.322 0.747941    
## Hospitalizados     0.11219    0.03151   3.561 0.000623 ***
## UCI                0.35878    0.10317   3.478 0.000816 ***
## Curados           -0.21053    0.03937  -5.347 8.07e-07 ***
## Total_confirmados  0.10853    0.02412   4.499 2.26e-05 ***
## Confirmados_PCR    0.08473    0.01978   4.284 5.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.285 on 81 degrees of freedom
## Multiple R-squared:  0.9663, Adjusted R-squared:  0.9642 
## F-statistic: 463.9 on 5 and 81 DF,  p-value: < 2.2e-16

The model obtained would be of the form:

\[ Defunciones=0.2656235 +0.1121855·Hospitalizados+0.3587845·UCI+-0.2105297·ConfirmadosPCR+0.1085301·Curados+0.0847283·TotalConfirmados \]

Finally we can represent the graph of the data (purple) with the regression obtained superimposed (red)

g <- ggplot(filasandalucia, aes(x=`Fecha declaración`,y=Defunciones))+
  ggtitle("Regresión multilineal")+
  geom_point(colour="purple")+
  geom_line(aes(`Fecha declaración`, predict.lm(bestMod$mod)), color="red")
  
g