Chapter 4 Missing values
4.1 Disaster dataset
= read.csv("data/disaster_missing.csv") %>%
disaster select(-X)
4.1.1 By row
rowSums(is.na(disaster)) %>%
sort(decreasing = TRUE) %>%
table()
## .
## 0 1 2 3 4
## 3555 1748 14669 4781 742
The table shows how many missing values each row contains: 3555 rows have no missing values, 1748 rows have 1 missing value, 14669 rows have 2 missing values, 4781 rows have 3 missing values, and 742 rows have 4 missing values.
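As a quick cross-check, the share of fully observed rows can also be computed with complete.cases(); this is a minimal sketch that assumes the disaster data frame loaded above.

# Proportion of rows with no missing values at all;
# 3555 / nrow(disaster) should give the same number.
mean(complete.cases(disaster))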
We also want to visualize the missing values by year.
missing <- disaster %>%
  group_by(Year) %>%
  summarise(sum.na = sum(is.na(Total.Deaths) + is.na(Total.Damages) +
                           is.na(Disaster.Subtype) + is.na(Total.Damages.Adjusted)))
ggplot(missing, aes(x = Year, y = sum.na)) +
geom_col(color = "blue", fill = "lightblue") +
ggtitle("Number of missing values by Year") +
xlab("") +
ylab("Number of missing station values(All Variables)") +
theme(axis.text.x = element_text(angle = 45)) +
scale_x_discrete(breaks = c('1900','1910','1920','1930','1940','1950','1960','1970','1980','1990','2000','2010','2020'))
The plot shows an increasing trend in missing values over the years, largely because the number of recorded disasters also grows over time.
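Since the raw counts partly just track the growing number of records, it may be worth normalising by the yearly record count. A minimal sketch (the names missing_rate and na.rate are chosen here only for illustration):

missing_rate <- disaster %>%
  group_by(Year) %>%
  summarise(n = n(),
            sum.na = sum(is.na(Total.Deaths) + is.na(Total.Damages) +
                           is.na(Disaster.Subtype) + is.na(Total.Damages.Adjusted)),
            na.rate = sum.na / n)  # NAs per recorded disaster

ggplot(missing_rate, aes(x = Year, y = na.rate)) +
  geom_col(fill = "lightblue") +
  ggtitle("Missing values per recorded disaster, by Year")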
4.1.2 By Column
colSums(is.na(disaster)) %>%
sort(decreasing = TRUE)
## Total.Damages.Adjusted Total.Damages Total.Deaths
## 19931 19917 5334
## Disaster.Subtype Year Disaster.Group
## 3215 0 0
## Disaster.Subgroup Disaster.Type Country
## 0 0 0
## ISO Region Continent
## 0 0 0
plot_missing(disaster, percent = FALSE)
It shows that Total.Damages and Total.Damages.Adjusted have the most NAs, followed by Total.Deaths and Disaster.Subtype.
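The same column-wise information can be expressed as percentages, which makes the severity easier to judge; a minimal sketch assuming the same disaster data frame:

# Percentage of rows that are NA in each column, most affected first
round(100 * colMeans(is.na(disaster)), 1) %>%
  sort(decreasing = TRUE)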
4.1.3 By Value
tidydata <- disaster %>%
  rownames_to_column("id") %>%
  gather(key, value, -id) %>%
  mutate(missing = ifelse(is.na(value), "yes", "no"))
ggplot(tidydata, aes(x = key, y = fct_rev(id), fill = missing)) +
geom_tile() +
ggtitle("ourdata with NAs") +
scale_fill_viridis_d() + # discrete scale
theme_bw()+
scale_y_discrete(breaks = c())
This gives a cell-level view of the NAs. However, there does not appear to be an obvious relationship between the NAs of the different variables.
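To probe whether the NAs are related, one quick check is a cross-tabulation of missingness indicators for a pair of columns; this is a sketch, not part of the original analysis.

# Do NAs in the two damage columns tend to occur in the same rows?
table(damages_na = is.na(disaster$Total.Damages),
      adjusted_na = is.na(disaster$Total.Damages.Adjusted))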
4.2 Covid Dataset
= read.csv("data/covid.csv") %>%
covid select(-X)
sum(is.na(covid))
## [1] 0
There are no NA values in this data set.
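Note that sum(is.na()) only counts true NA entries. As a purely precautionary sketch, one could also check for empty strings, which read.csv() would not have converted to NA (no such values are claimed here):

# Count blank-string entries that might represent hidden missing values
sum(covid == "", na.rm = TRUE)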
4.3 GDP Dataset
= read.csv("data/GDP.csv") GDP
This is time series data from 1960 to 2020 for countries around the world. We want to check the NAs for each country and for each year.
4.3.1 For countries
We want to see how many NAs there are for each country.
missing <- cbind(GDP[, 1:2],
                 NAs = rowSums(matrix(as.numeric(is.na(GDP[, -c(1, 2)])), nrow = nrow(GDP)))) %>%
  arrange(desc(NAs))
head(missing, 10)
## X Country.Name NAs
## 1 28 British Virgin Islands 61
## 2 76 Gibraltar 61
## 3 104 North Korea 61
## 4 184 Saint Martin (French part) 61
## 5 173 Sint Maarten 53
## 6 179 South Sudan 53
## 7 40 Channel Islands 51
## 8 51 Curaçao 51
## 9 138 Nauru 50
## 10 106 Kosovo 48
As we can see, missing values tend to occur for small countries and territories, where data are harder to collect.
4.3.2 For Year
Now we want to look at the missing values over time.
missing <- cbind(Year = 1960:2020,
                 NAs = colSums(matrix(as.numeric(is.na(GDP[, -c(1, 2)])), nrow = nrow(GDP))))
## Warning in cbind(Year = 1960:2020, NAs = colSums(matrix(as.numeric(is.na(GDP[, :
## number of rows of result is not a multiple of vector length (arg 1)
plot(missing, main = 'Missing values among year')
As we can see, the number of missing values decreases as the years progress. With the development of technology, GDP data has become easier to collect and report.
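The warning above occurs because cbind() recycles the hard-coded Year = 1960:2020 vector against a colSums() result of a different length. A safer sketch derives the years from the column names instead, assuming read.csv() prefixed them as "X1960" through "X2020":

# Keep only columns whose names look like "X" plus a four-digit year,
# so the Year labels and the NA counts stay aligned by construction.
year_cols <- grep("^X\\d{4}$", names(GDP), value = TRUE)
missing_by_year <- data.frame(Year = as.integer(sub("^X", "", year_cols)),
                              NAs = colSums(is.na(GDP[, year_cols])))
plot(missing_by_year, main = "Missing values by year")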