TidyTusdays are a weekly R Community event where people learn about RStats by practising with a different data set each week. Last week the Cardiff R User group worked on the volcanoes data-set :volcano:, and something in it really tripped me up. We’ve done this a few times now, and are building up our work in this GitHub repo.
<- readr::
volcano
Let’s have a look at what the data is
volcano %>%
## tibble [958 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ volcano_number : num [1:958] 283001 355096 342080 213004 321040 ...
## $ volcano_name : chr [1:958] "Abu" "Acamarachi" "Acatenango" "Acigol-Nevsehir" ...
## $ primary_volcano_type : chr [1:958] "Shield(s)" "Stratovolcano" "Stratovolcano(es)" "Caldera" ...
## $ last_eruption_year : chr [1:958] "-6850" "Unknown" "1972" "-2080" ...
## $ country : chr [1:958] "Japan" "Chile" "Guatemala" "Turkey" ...
## $ region : chr [1:958] "Japan, Taiwan, Marianas" "South America" "México and Central America" "Mediterranean and Western Asia" ...
## $ subregion : chr [1:958] "Honshu" "Northern Chile, Bolivia and Argentina" "Guatemala" "Turkey" ...
## $ latitude : num [1:958] 34.5 -23.3 14.5 38.5 46.2 ...
## $ longitude : num [1:958] 131.6 -67.6 -90.9 34.6 -121.5 ...
## $ elevation : num [1:958] 641 6023 3976 1683 3742 ...
## $ tectonic_settings : chr [1:958] "Subduction zone / Continental crust (>25 km)" "Subduction zone / Continental crust (>25 km)" "Subduction zone / Continental crust (>25 km)" "Intraplate / Continental crust (>25 km)" ...
## $ evidence_category : chr [1:958] "Eruption Dated" "Evidence Credible" "Eruption Observed" "Eruption Dated" ...
## $ major_rock_1 : chr [1:958] "Andesite / Basaltic Andesite" "Dacite" "Andesite / Basaltic Andesite" "Rhyolite" ...
## $ major_rock_2 : chr [1:958] "Basalt / Picro-Basalt" "Andesite / Basaltic Andesite" "Dacite" "Dacite" ...
## $ major_rock_3 : chr [1:958] "Dacite" " " " " "Basalt / Picro-Basalt" ...
## $ major_rock_4 : chr [1:958] " " " " " " "Andesite / Basaltic Andesite" ...
## $ major_rock_5 : chr [1:958] " " " " " " " " ...
## $ minor_rock_1 : chr [1:958] " " " " "Basalt / Picro-Basalt" " " ...
## $ minor_rock_2 : chr [1:958] " " " " " " " " ...
## $ minor_rock_3 : chr [1:958] " " " " " " " " ...
## $ minor_rock_4 : chr [1:958] " " " " " " " " ...
## $ minor_rock_5 : chr [1:958] " " " " " " " " ...
## $ population_within_5_km : num [1:958] 3597 0 4329 127863 0 ...
## $ population_within_10_km : num [1:958] 9594 7 60730 127863 70 ...
## $ population_within_30_km : num [1:958] 117805 294 1042836 218469 4019 ...
## $ population_within_100_km: num [1:958] 4071152 9092 7634778 2253483 393303 ...
## - attr(*, "spec")=
## .. cols(
## .. volcano_number = col_double(),
## .. volcano_name = col_character(),
## .. primary_volcano_type = col_character(),
## .. last_eruption_year = col_character(),
## .. country = col_character(),
## .. region = col_character(),
## .. subregion = col_character(),
## .. latitude = col_double(),
## .. longitude = col_double(),
## .. elevation = col_double(),
## .. tectonic_settings = col_character(),
## .. evidence_category = col_character(),
## .. major_rock_1 = col_character(),
## .. major_rock_2 = col_character(),
## .. major_rock_3 = col_character(),
## .. major_rock_4 = col_character(),
## .. major_rock_5 = col_character(),
## .. minor_rock_1 = col_character(),
## .. minor_rock_2 = col_character(),
## .. minor_rock_3 = col_character(),
## .. minor_rock_4 = col_character(),
## .. minor_rock_5 = col_character(),
## .. population_within_5_km = col_double(),
## .. population_within_10_km = col_double(),
## .. population_within_30_km = col_double(),
## .. population_within_100_km = col_double()
## .. )
Lots of character columns, and some with some slightly funky formatting, such as /
and variations on a theme with (s)
and (es)
. We’ve also got a bunch of ‘sparse’ data in the columns that start with major_rock
or minor_rock
that look like spaces. R has a rich set of tools for dealing with missing data a little more effectively, so lets clean this up by setting the missing data to be explicit NA
. In this case, as the column is a character type, we need to NA_character
to fill it up.
volcano %>%
%>%
%>%
%>%
knitr::
major_rock_1 | major_rock_2 | major_rock_3 | major_rock_4 | major_rock_5 | minor_rock_1 | minor_rock_2 | minor_rock_3 | minor_rock_4 | minor_rock_5 |
---|---|---|---|---|---|---|---|---|---|
Andesite / Basaltic Andesite | Basalt / Picro-Basalt | Dacite | |||||||
Dacite | Andesite / Basaltic Andesite | ||||||||
Andesite / Basaltic Andesite | Dacite | Basalt / Picro-Basalt | |||||||
Rhyolite | Dacite | Basalt / Picro-Basalt | Andesite / Basaltic Andesite | ||||||
Andesite / Basaltic Andesite | Basalt / Picro-Basalt | Dacite | |||||||
Andesite / Basaltic Andesite | Dacite | Basalt / Picro-Basalt |
Well, that doesn’t quite work. What I want is to have the blank spots filled up with NA
. Is it my code? It’s not the most basic solution, using some of the tidyeval concepts such as vars
and funs
. Lets make it as simple as possible.
volcano %>%
%>%
knitr::
volcano_number | volcano_name | primary_volcano_type | last_eruption_year | country | region | subregion | latitude | longitude | elevation | tectonic_settings | evidence_category | major_rock_1 | major_rock_2 | major_rock_3 | major_rock_4 | major_rock_5 | minor_rock_1 | minor_rock_2 | minor_rock_3 | minor_rock_4 | minor_rock_5 | population_within_5_km | population_within_10_km | population_within_30_km | population_within_100_km |
---|
Odd. I can’t find any values that are just spaces, even though they are printed out that way! I know they are there, I can see them! Luckily, the point of these projects is to learn, and to learn from each other in the group :school:.
My buddy Heather was I think a little surprised when I demonstrated this, but within a few minutes she’d worked it out. It’s encoded as a non_breaking space. She linked this blog in our chat about non-braking spaces and offered us the cryptic solution:
"\u00A0"
The blog goes into detail about what this is but the tl;dr is: “It looks like a space but it’s not and it’s designed that way”.
So a quick tweak to the code and…
volcano %>%
%>%
%>%
%>%
knitr::
major_rock_1 | major_rock_2 | major_rock_3 | major_rock_4 | major_rock_5 | minor_rock_1 | minor_rock_2 | minor_rock_3 | minor_rock_4 | minor_rock_5 |
---|---|---|---|---|---|---|---|---|---|
Andesite / Basaltic Andesite | Basalt / Picro-Basalt | Dacite | NA | NA | NA | NA | NA | NA | NA |
Dacite | Andesite / Basaltic Andesite | NA | NA | NA | NA | NA | NA | NA | NA |
Andesite / Basaltic Andesite | Dacite | NA | NA | NA | Basalt / Picro-Basalt | NA | NA | NA | NA |
Rhyolite | Dacite | Basalt / Picro-Basalt | Andesite / Basaltic Andesite | NA | NA | NA | NA | NA | NA |
Andesite / Basaltic Andesite | Basalt / Picro-Basalt | NA | NA | NA | Dacite | NA | NA | NA | NA |
Andesite / Basaltic Andesite | NA | NA | NA | NA | Dacite | Basalt / Picro-Basalt | NA | NA | NA |
Success! After digging around a little, I discovered str_trim
and str_squish
can be used for this as well to make a perfectly tidy solution!
volcano %>%
%>%
%>%
%>%
knitr::
major_rock_1 | major_rock_2 | major_rock_3 | major_rock_4 | major_rock_5 | minor_rock_1 | minor_rock_2 | minor_rock_3 | minor_rock_4 | minor_rock_5 |
---|---|---|---|---|---|---|---|---|---|
Andesite / Basaltic Andesite | Basalt / Picro-Basalt | Dacite | NA | NA | NA | NA | NA | NA | NA |
Dacite | Andesite / Basaltic Andesite | NA | NA | NA | NA | NA | NA | NA | NA |
Andesite / Basaltic Andesite | Dacite | NA | NA | NA | Basalt / Picro-Basalt | NA | NA | NA | NA |
Rhyolite | Dacite | Basalt / Picro-Basalt | Andesite / Basaltic Andesite | NA | NA | NA | NA | NA | NA |
Andesite / Basaltic Andesite | Basalt / Picro-Basalt | NA | NA | NA | Dacite | NA | NA | NA | NA |
Andesite / Basaltic Andesite | NA | NA | NA | NA | Dacite | Basalt / Picro-Basalt | NA | NA | NA |
Let’s try and breakdown what is happening in this solution by concept, and then outline the routine in human language to finish.
stringr::str_trim
and stringr::str_squish
removes space from either the ends or all the way through a character string depending on what else is happening in the string and what you need from the solution.mutate_at
is a buddy of mutate
, where you use functional programming style to apply a function over a collection of columns.""
, where as previously we were seeing " "
which was actually encoded as "\u00A0"
starts_with
is a select helper that is designed for cases when a related value is stretch wide across multiple columns with similar names, and returns a vector of column names filtered to your criteria.vars
automatically quotes the names of the columns to evaluate later in context and is almost always used as a wrapper to the .var =
argument when it’s supported by a function..funs =
argument, like .var =
, has a counterpart funs()
, but this is being deprecated in favour of the expression notation.case_when
is an alternative version of the more common if else
operation.~
is a special operator that is key to the expression notation. It effectively separates the Left Hand Side (LHS) of an expression from the Right Hand Side (RHS). It’s used in two ways in this code. case_when
uses full expressions to represent what should happen on the RHS when the criteria of the LHS is met. .funs =
uses it to make a lambda style formula. This is sometimes referred to as a quosure..
is also a special operator, and I recommend reading the documentation of pipe %>%
. The idea of it here is to reference the data being operated on itself. In this specific case, it’s each value from each of the selected columns for equivalence to an empty space.Did you follow all that? It’s a minefield I know, but it allows us to do something very powerful in only a few lines. As an alternative way of understanding what this does, here’s the step by step:
"major_rock'
or "minor_rock"
NA
value for charactersSo do you have to program this way? No, not really. You could manually create the list of columns you want to modify, but that would be prone to human error and what if you end up with "minor_rock_6"
or "major_rock_100"
? You could always make a traditional if else
structure, but that can get long fast if there are multiple conditions to check for. How about using the character string "\u00A0"
to test for equivalence? Well, that would work now, but how do you pick up if they change suddenly to actual spaces? Or another whitespace encoding? Programming like this keeps code readable, maintainable, and robust.
Is this overkill for tidy Tuesdays? Yes, absolutely. Are the problems that you solve with this approach purely academic? No, not at all. Plus, doing all that work in 129 characters is pretty neat. Excluding spaces.