Chapter 3 Data transformation
3.1 Disaster Data
# Load disaster data and map message data
<- readxl::read_xlsx("data/emdat_public_2022_04_30_full.xlsx")
raw_data <- map_data("world") latitude.longtitude.data
We want to add location information to disaster data for graphical visualization. However, due to the large time span of our data, some countries have experienced splitting or merging in this 100-year period, so in order to ensure that our data can be merged with the data of geographic information through the country name, we need to clean our countries’ names first before doing the merge.
3.1.1 checking the country name
# Apply the countryname function to standardize the countryname of each dataset
countryname(unique(raw_data$Country))
countryname(unique(latitude.longtitude.data$region))
Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, :
Some values were not matched unambiguously: Azores Islands, Canary Is, Serbia Montenegro, Yemen P Dem Rep
Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, :
Some values were not matched unambiguously: Ascension Island, Azores, Barbuda, Canary Islands, Chagos Archipelago, Grenadines, Heard Island, Madeira Islands, Micronesia, Saba, Saint Martin, Siachen Glacier, Sint Eustatius, Virgin Islands
There are several reasons for mismatching:
- First situation, the countries are too small to be included in the package.(e.g. “Tuvalu”, “Micronesia (Federated States of)”)
- Second situation, the countries are no longer exist since our data contains data from 1900. (e.g. “Serbia Montenegro”, “Yemen P Dem Rep”)
- “Yemen P Dem Rep” merged with “Yemen Arab Rep” in 1990 and is called “Yemen” now.
- Third situation, this does not show up here, but we need to consider the situation that some of the countries disintegrated during the past 122 years. (e.g. “Czechoslovakia”, “Yugoslavia”) In this situation, we may need to change the data of one country into several.
- “Serbia Montenegro” split into two countries called “Serbia” and “Montengro” in 2006.
- “Yugoslavia” had split into six countries,“Slovenia”,“Croatia”, “Serbia”, “Montengro”, “Bosnia and Herzegovina” and “Macedonia”.
- “Czechoslovakia” had split into two countries, “Czech Republic” and “Slovakia”, in 1993.
3.1.2 Manual Adjustments
Based on the above three situation, we did some manual adjustment(splitting) on our data, and reread in the data.
Our split countries are:“Czechoslovakia”, “Yugoslavia”,and “Serbia Montenegro”.
# data after splitting certain countries
<- readxl::read_xlsx("data/disaster data.xlsx")
data_temp
# some manual adjustments
# delete "Micronesia (Federated States of)" and "Tuvalu", which does not included in the existing data, hence do not have latitude and longitude information.
<- data_temp[-which(data_temp$Country %in% c("Micronesia (Federated States of)","Tuvalu")),]
data_remove # manual match some small contries: Azores Islands - Azores; Canary Is - Canary Islands
$Country[data_remove$Country == "Azores Islands"] <- "Azores"
data_remove$Country[data_remove$Country == "Canary Is"] <- "Canary Islands"
data_remove# merge "Yemen P Dem Rep" and "Yemen Arab Rep"
$Country[data_remove$Country %in% c("Yemen P Dem Rep","Yemen Arab Rep")] <- "Yemen"
data_remove# Virgin Islands are one area in existing data, so merge together: Virgin Island (British) & Virgin Island (U.S.) - Virgin Islands
$Country[data_remove$Country == "Virgin Island (British)"] <- "Virgin Islands"
data_remove$Country[data_remove$Country == "Virgin Island (U.S.)"] <- "Virgin Islands"
data_remove# "Hong Kong" and "Macao" belong to China now
$Country[data_remove$Country == "Hong Kong"] <- "China"
data_remove$Country[data_remove$Country == "Macao"] <- "China"
data_remove# "Netherlands Antilles" belongs to "Caribbean Netherlands", but does not have independent geographic information, so including it in "Netherlands"
$Country[data_remove$Country == "Netherlands Antilles"] <- "Netherlands"
data_remove# same situation, including "Saint Martin (French part)" in "Saint Martin"
$Country[data_remove$Country == "Saint Martin (French Part)"] <- "Saint Martin"
data_remove# "Tokelau" is belong to "New Zealand"
$Country[data_remove$Country == "Tokelau"] <- "New Zealand"
data_remove
# Replace the column "Country" with the standard country names, and when mismatch then the country must be one of our adjusted countries, so just keep it as it is.
<- data_remove %>%
disaster_match mutate(Country = ifelse(is.na(countryname(data_remove$Country)),data_remove$Country,countryname(data_remove$Country)))
#The data frame has a total of 50 variables, but we won't use all of them, so we need to tidy the data.
<- disaster_match %>%
disaster select("Year","Disaster Group","Disaster Subgroup","Disaster Type","Disaster Subtype","Country","ISO","Region",
"Continent","Total Deaths","Total Damages ('000 US$)","Total Damages, Adjusted ('000 US$)")
colnames(disaster) <- c("Year","Disaster Group","Disaster Subgroup","Disaster Type","Disaster Subtype","Country","ISO","Region","Continent","Total Deaths","Total Damages","Total Damages Adjusted")
3.2 Covid Data
The number of deaths in this data is refreshed daily in the last column and is cumulative so we only need to keep the last column for the number of deaths in each country. Because some countries count the number of deaths according to different cities, we need to add up the number of deaths in different cities in each country and combine them into the total number of deaths in a country.
#Loading covid data
= read.csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv')
df_covid_raw = df_covid_raw %>% select(ncol(df_covid_raw))
covid_df = cbind("Country.Region" = df_covid_raw$Country.Region,
covid_df "Total.Death" = rowSums(covid_df)) %>% data.frame()
= covid_df %>% mutate("Total.Death" = as.numeric(Total.Death)) %>%
covid_df group_by(Country.Region) %>%
summarise("Total.Death" = sum(Total.Death, na.rm = T))
In order to compare the data together, we also need to unify the country names of the data according to the corrected disaster data.
#Apply the countryname function to standardize the countryname of dataset
countryname(unique(covid_df$Country.Region))
Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, :
Some values were not matched unambiguously: Diamond Princess, Micronesia, MS Zaandam, Summer Olympics 2020, Winter Olympics 2022
# remove Summer Olympics 2020 & Winter Olympics 2022,Micronesia
<- covid_df[-which(covid_df$Country.Region %in% c("Summer Olympics 2020", "Winter Olympics 2022", "Micronesia")),]
covid_remove # Diamond Princess is a Japanese cruise ship, so it counts as Japan
$Country.Region[covid_remove$Country.Region == "Diamond Princess"] <- "Japan"
covid_remove# MS Zaandam is a Netherlands cruise ship, so it counts as Netherlands
$Country.Region[covid_remove$Country.Region == "MS Zaandam"] <- "Netherlands"
covid_remove
# adding a column called "country.name" to be the standard country names, and when mismatch then the country must be one of our adjusted countries, so just keep it as it is.
<- covid_remove %>%
covid_match mutate(Country.Region = ifelse(is.na(countryname(covid_remove$Country.Region)),covid_remove$Country.Region,countryname(covid_remove$Country.Region)))
3.3 GDP time series data
# Load the GDP raw data
= read.csv("data/world_bank_gdp.csv") GDP_temp
The country name of this data set contains two parts, which is the names of difference countries and different regions. We only consider about the time series data of the country so we delete the country name of different regions. Also, the name of time series variables is not beautiful so we also change the variable name that makes it look more properly.
# We only need the country name, country code and time series data
<- GDP_temp[c(1:which(GDP$Country.Name == "Zimbabwe")),-c(1,2)] GDP
The missing value of this data is not represented by NULL or NA, but by “..”. This is not convenient for our subsequent data reference and processing, so we directly use Excel to convert all “..” into NULL.
# Load the GDP with missing value expressed by NULL
= read.csv("data/GDP.csv") GDP
We want to add a location information to this data so we need to unify the country names.
<- GDP %>%
GDP mutate(Country.Name = ifelse(is.na(countryname(GDP$Country.Name)),GDP$Country.Name,countryname(GDP$Country.Name)))
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Curaçao, São Tomé & PrÃncipe
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Curaçao, São Tomé & PrÃncipe
3.4 latitude longtitude data
Although the country name of this data is sufficiently standardized, in order for all the data to match each other, we also need to modify the country name.
# Load the data
<- map_data("world")
latitude.longtitude.data
# # adding a column called "Country" to be the standard country names, and when mismatch then the country must be one of our adjusted countries, so just keep it as it is.
<- latitude.longtitude.data %>%
map_match mutate(Country = ifelse(is.na(countryname(latitude.longtitude.data$region)),latitude.longtitude.data$region,countryname(latitude.longtitude.data$region))) %>%
select(long,lat,group,order,Country,subregion)