Rectangling

Introduction

Rectangling is the art and craft of taking a deeply nested list (often sourced from wild caught JSON or XML) and taming it into a tidy data set of rows and columns. There are three functions from tidyr that are particularly useful for rectangling:

A very large number of data rectangling problems can be solved by combining these functions with a splash of dplyr (largely eliminating prior approaches that combined mutate() with multiple purrr::map()s).

To illustrate these techniques, we’ll use the repurrrsive package, which provides a number deeply nested lists originally mostly captured from web APIs.

library(tidyr)
library(dplyr)
library(repurrrsive)

GitHub users

We’ll start with gh_users, a list which contains information about six GitHub users. To begin, we put the gh_users list into a data frame:

users <- tibble(user = gh_users)

This seems a bit counter-intuitive: why is the first step in making a list simpler to make it more complicated? But a data frame has a big advantage: it bundles together multiple vectors so that everything is tracked together in a single object.

Each user is a named list, where each element represents a column.

names(users$user[[1]])
#>  [1] "login"               "id"                  "avatar_url"         
#>  [4] "gravatar_id"         "url"                 "html_url"           
#>  [7] "followers_url"       "following_url"       "gists_url"          
#> [10] "starred_url"         "subscriptions_url"   "organizations_url"  
#> [13] "repos_url"           "events_url"          "received_events_url"
#> [16] "type"                "site_admin"          "name"               
#> [19] "company"             "blog"                "location"           
#> [22] "email"               "hireable"            "bio"                
#> [25] "public_repos"        "public_gists"        "followers"          
#> [28] "following"           "created_at"          "updated_at"

There are two ways to turn the list components into columns. unnest_wider() takes every component and makes a new column:

users %>% unnest_wider(user)
#> # A tibble: 6 x 30
#>   login     id avatar_url gravatar_id url   html_url followers_url following_url
#>   <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>         <chr>        
#> 1 gabo… 6.60e5 https://a… ""          http… https:/… https://api.… https://api.…
#> 2 jenn… 5.99e5 https://a… ""          http… https:/… https://api.… https://api.…
#> 3 jtle… 1.57e6 https://a… ""          http… https:/… https://api.… https://api.…
#> 4 juli… 1.25e7 https://a… ""          http… https:/… https://api.… https://api.…
#> 5 leep… 3.51e6 https://a… ""          http… https:/… https://api.… https://api.…
#> 6 masa… 8.36e6 https://a… ""          http… https:/… https://api.… https://api.…
#> # … with 22 more variables: gists_url <chr>, starred_url <chr>,
#> #   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
#> #   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
#> #   name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
#> #   public_repos <int>, public_gists <int>, followers <int>, following <int>,
#> #   created_at <chr>, updated_at <chr>, bio <chr>, hireable <lgl>

But in this case, there are many components and we don’t need most of them so we can instead use hoist(). hoist() allows us to pull out selected components using the same syntax as purrr::pluck():

users %>% hoist(user, 
  followers = "followers", 
  login = "login", 
  url = "html_url"
)
#> # A tibble: 6 x 4
#>   followers login       url                            user             
#>       <int> <chr>       <chr>                          <list>           
#> 1       303 gaborcsardi https://github.com/gaborcsardi <named list [27]>
#> 2       780 jennybc     https://github.com/jennybc     <named list [27]>
#> 3      3958 jtleek      https://github.com/jtleek      <named list [27]>
#> 4       115 juliasilge  https://github.com/juliasilge  <named list [27]>
#> 5       213 leeper      https://github.com/leeper      <named list [27]>
#> 6        34 masalmon    https://github.com/masalmon    <named list [27]>

hoist() removes the named components from the user list-column, so you can think of it as moving components out of the inner list into the top-level data frame.

GitHub repos

We start off gh_repos similarly, by putting it in a tibble:

repos <- tibble(repo = gh_repos)
repos
#> # A tibble: 6 x 1
#>   repo       
#>   <list>     
#> 1 <list [30]>
#> 2 <list [30]>
#> 3 <list [30]>
#> 4 <list [26]>
#> 5 <list [30]>
#> 6 <list [30]>

This time the elements of user are a list of repositories that belong to that user. These are observations, so should become new rows, so we use unnest_longer() rather than unnest_wider():

repos <- repos %>% unnest_longer(repo)
repos
#> # A tibble: 176 x 1
#>   repo             
#>   <list>           
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> 6 <named list [68]>
#> # … with 170 more rows

Then we can use unnest_wider() or hoist():

repos %>% hoist(repo, 
  login = c("owner", "login"), 
  name = "name",
  homepage = "homepage",
  watchers = "watchers_count"
)
#> # A tibble: 176 x 5
#>   login       name        homepage watchers repo             
#>   <chr>       <chr>       <chr>       <int> <list>           
#> 1 gaborcsardi after        <NA>           5 <named list [65]>
#> 2 gaborcsardi argufy       <NA>          19 <named list [65]>
#> 3 gaborcsardi ask          <NA>           5 <named list [65]>
#> 4 gaborcsardi baseimports  <NA>           0 <named list [65]>
#> 5 gaborcsardi citest       <NA>           0 <named list [65]>
#> 6 gaborcsardi clisymbols  ""             18 <named list [65]>
#> # … with 170 more rows

Note the use of c("owner", "login"): this allows us to reach two levels deep inside of a list. An alternative approach would be to pull out just owner and then put each element of it in a column:

repos %>% 
  hoist(repo, owner = "owner") %>% 
  unnest_wider(owner)
#> # A tibble: 176 x 18
#>   login     id avatar_url gravatar_id url   html_url followers_url following_url
#>   <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>         <chr>        
#> 1 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> 2 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> 3 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> 4 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> 5 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> 6 gabo… 660288 https://a… ""          http… https:/… https://api.… https://api.…
#> # … with 170 more rows, and 10 more variables: gists_url <chr>,
#> #   starred_url <chr>, subscriptions_url <chr>, organizations_url <chr>,
#> #   repos_url <chr>, events_url <chr>, received_events_url <chr>, type <chr>,
#> #   site_admin <lgl>, repo <list>

Instead of looking at the list and carefully thinking about whether it needs to become rows or columns, you can use unnest_auto(). It uses a handful of heuristics to figure out whether unnest_longer() or unnest_wider() is appropriate, and tells you about its reasoning.

tibble(repo = gh_repos) %>% 
  unnest_auto(repo) %>% 
  unnest_auto(repo)
#> Using `unnest_longer(repo)`; no element has names
#> Using `unnest_wider(repo)`; elements have 68 names in common
#> # A tibble: 176 x 67
#>        id name   full_name  owner  private html_url  description   fork  url    
#>     <int> <chr>  <chr>      <list> <lgl>   <chr>     <chr>         <lgl> <chr>  
#> 1  6.12e7 after  gaborcsar… <name… FALSE   https://… Run Code in … FALSE https:…
#> 2  4.05e7 argufy gaborcsar… <name… FALSE   https://… Declarative … FALSE https:…
#> 3  3.64e7 ask    gaborcsar… <name… FALSE   https://… Friendly CLI… FALSE https:…
#> 4  3.49e7 basei… gaborcsar… <name… FALSE   https://… Do we get wa… FALSE https:…
#> 5  6.16e7 citest gaborcsar… <name… FALSE   https://… Test R packa… TRUE  https:…
#> 6  3.39e7 clisy… gaborcsar… <name… FALSE   https://… Unicode symb… FALSE https:…
#> # … with 170 more rows, and 58 more variables: forks_url <chr>, keys_url <chr>,
#> #   collaborators_url <chr>, teams_url <chr>, hooks_url <chr>,
#> #   issue_events_url <chr>, events_url <chr>, assignees_url <chr>,
#> #   branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>,
#> #   git_refs_url <chr>, trees_url <chr>, statuses_url <chr>,
#> #   languages_url <chr>, stargazers_url <chr>, contributors_url <chr>,
#> #   subscribers_url <chr>, subscription_url <chr>, commits_url <chr>,
#> #   git_commits_url <chr>, comments_url <chr>, issue_comment_url <chr>,
#> #   contents_url <chr>, compare_url <chr>, merges_url <chr>, archive_url <chr>,
#> #   downloads_url <chr>, issues_url <chr>, pulls_url <chr>,
#> #   milestones_url <chr>, notifications_url <chr>, labels_url <chr>,
#> #   releases_url <chr>, deployments_url <chr>, created_at <chr>,
#> #   updated_at <chr>, pushed_at <chr>, git_url <chr>, ssh_url <chr>,
#> #   clone_url <chr>, svn_url <chr>, size <int>, stargazers_count <int>,
#> #   watchers_count <int>, language <chr>, has_issues <lgl>,
#> #   has_downloads <lgl>, has_wiki <lgl>, has_pages <lgl>, forks_count <int>,
#> #   open_issues_count <int>, forks <int>, open_issues <int>, watchers <int>,
#> #   default_branch <chr>, homepage <chr>

Game of Thrones characters

got_chars has a similar structure to gh_users: it’s a list of named lists, where each element of the inner list describes some attribute of a GoT character. We start in the same way, first by creating a data frame and then by unnesting each component into a column:

chars <- tibble(char = got_chars)
chars
#> # A tibble: 30 x 1
#>   char             
#>   <list>           
#> 1 <named list [18]>
#> 2 <named list [18]>
#> 3 <named list [18]>
#> 4 <named list [18]>
#> 5 <named list [18]>
#> 6 <named list [18]>
#> # … with 24 more rows

chars2 <- chars %>% unnest_wider(char)
chars2
#> # A tibble: 30 x 18
#>   url         id name   gender culture born   died   alive titles aliases father
#>   <chr>    <int> <chr>  <chr>  <chr>   <chr>  <chr>  <lgl> <list> <list>  <chr> 
#> 1 https:/…  1022 Theon… Male   "Ironb… "In 2… ""     TRUE  <chr … <chr [… ""    
#> 2 https:/…  1052 Tyrio… Male   ""      "In 2… ""     TRUE  <chr … <chr [… ""    
#> 3 https:/…  1074 Victa… Male   "Ironb… "In 2… ""     TRUE  <chr … <chr [… ""    
#> 4 https:/…  1109 Will   Male   ""      ""     "In 2… FALSE <chr … <chr [… ""    
#> 5 https:/…  1166 Areo … Male   "Norvo… "In 2… ""     TRUE  <chr … <chr [… ""    
#> 6 https:/…  1267 Chett  Male   ""      "At H… "In 2… FALSE <chr … <chr [… ""    
#> # … with 24 more rows, and 7 more variables: mother <chr>, spouse <chr>,
#> #   allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
#> #   playedBy <list>

This is more complex than gh_users because some component of char are themselves a list, giving us a collection of list-columns:

chars2 %>% select_if(is.list)
#> # A tibble: 30 x 7
#>   titles    aliases    allegiances books     povBooks  tvSeries  playedBy 
#>   <list>    <list>     <list>      <list>    <list>    <list>    <list>   
#> 1 <chr [3]> <chr [4]>  <chr [1]>   <chr [3]> <chr [2]> <chr [6]> <chr [1]>
#> 2 <chr [2]> <chr [11]> <chr [1]>   <chr [2]> <chr [4]> <chr [6]> <chr [1]>
#> 3 <chr [2]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [1]> <chr [1]>
#> 4 <chr [1]> <chr [1]>  <NULL>      <chr [1]> <chr [1]> <chr [1]> <chr [1]>
#> 5 <chr [1]> <chr [1]>  <chr [1]>   <chr [3]> <chr [2]> <chr [2]> <chr [1]>
#> 6 <chr [1]> <chr [1]>  <NULL>      <chr [2]> <chr [1]> <chr [1]> <chr [1]>
#> # … with 24 more rows

What you do next will depend on the purposes of the analysis. Maybe you want a row for every book and TV series that the character appears in:

chars2 %>% 
  select(name, books, tvSeries) %>% 
  pivot_longer(c(books, tvSeries), names_to = "media", values_to = "value") %>% 
  unnest_longer(value)
#> # A tibble: 180 x 3
#>   name          media    value            
#>   <chr>         <chr>    <chr>            
#> 1 Theon Greyjoy books    A Game of Thrones
#> 2 Theon Greyjoy books    A Storm of Swords
#> 3 Theon Greyjoy books    A Feast for Crows
#> 4 Theon Greyjoy tvSeries Season 1         
#> 5 Theon Greyjoy tvSeries Season 2         
#> 6 Theon Greyjoy tvSeries Season 3         
#> # … with 174 more rows

Or maybe you want to build a table that lets you match title to name:

chars2 %>% 
  select(name, title = titles) %>% 
  unnest_longer(title)
#> # A tibble: 60 x 2
#>   name              title                                               
#>   <chr>             <chr>                                               
#> 1 Theon Greyjoy     Prince of Winterfell                                
#> 2 Theon Greyjoy     Captain of Sea Bitch                                
#> 3 Theon Greyjoy     Lord of the Iron Islands (by law of the green lands)
#> 4 Tyrion Lannister  Acting Hand of the King (former)                    
#> 5 Tyrion Lannister  Master of Coin (former)                             
#> 6 Victarion Greyjoy Lord Captain of the Iron Fleet                      
#> # … with 54 more rows

(Note that the empty titles ("") are due to an infelicity in the input got_chars: ideally people without titles would have a title vector of length 0, not a title vector of length 1 containing an empty string.)

Again, we could rewrite using unnest_auto(). This is convenient for exploration, but I wouldn’t rely on it in the long term - unnest_auto() has the undesirable property that it will always succeed. That means if your data structure changes, unnest_auto() will continue to work, but might give very different output that causes cryptic failures from downstream functions.

tibble(char = got_chars) %>% 
  unnest_auto(char) %>% 
  select(name, title = titles) %>% 
  unnest_auto(title)
#> Using `unnest_wider(char)`; elements have 18 names in common
#> Using `unnest_longer(title)`; no element has names
#> # A tibble: 60 x 2
#>   name              title                                               
#>   <chr>             <chr>                                               
#> 1 Theon Greyjoy     Prince of Winterfell                                
#> 2 Theon Greyjoy     Captain of Sea Bitch                                
#> 3 Theon Greyjoy     Lord of the Iron Islands (by law of the green lands)
#> 4 Tyrion Lannister  Acting Hand of the King (former)                    
#> 5 Tyrion Lannister  Master of Coin (former)                             
#> 6 Victarion Greyjoy Lord Captain of the Iron Fleet                      
#> # … with 54 more rows

Geocoding with google

Next we’ll tackle a more complex form of data that comes from Google’s geocoding service. It’s against the terms of service to cache this data, so I first write a very simple wrapper around the API. This relies on having an Google maps API key stored in an environment; if that’s not available these code chunks won’t be run.

has_key <- !identical(Sys.getenv("GOOGLE_MAPS_API_KEY"), "")
if (!has_key) {
  message("No Google Maps API key found; code chunks will not be run")
}

# https://developers.google.com/maps/documentation/geocoding
geocode <- function(address, api_key = Sys.getenv("GOOGLE_MAPS_API_KEY")) {
  url <- "https://maps.googleapis.com/maps/api/geocode/json"
  url <- paste0(url, "?address=", URLencode(address), "&key=", api_key)

  jsonlite::read_json(url)
}

The list that this function returns is quite complex:

houston <- geocode("Houston TX")
str(houston)
#> List of 2
#>  $ results:List of 1
#>   ..$ :List of 5
#>   .. ..$ address_components:List of 4
#>   .. .. ..$ :List of 3
#>   .. .. .. ..$ long_name : chr "Houston"
#>   .. .. .. ..$ short_name: chr "Houston"
#>   .. .. .. ..$ types     :List of 2
#>   .. .. .. .. ..$ : chr "locality"
#>   .. .. .. .. ..$ : chr "political"
#>   .. .. ..$ :List of 3
#>   .. .. .. ..$ long_name : chr "Harris County"
#>   .. .. .. ..$ short_name: chr "Harris County"
#>   .. .. .. ..$ types     :List of 2
#>   .. .. .. .. ..$ : chr "administrative_area_level_2"
#>   .. .. .. .. ..$ : chr "political"
#>   .. .. ..$ :List of 3
#>   .. .. .. ..$ long_name : chr "Texas"
#>   .. .. .. ..$ short_name: chr "TX"
#>   .. .. .. ..$ types     :List of 2
#>   .. .. .. .. ..$ : chr "administrative_area_level_1"
#>   .. .. .. .. ..$ : chr "political"
#>   .. .. ..$ :List of 3
#>   .. .. .. ..$ long_name : chr "United States"
#>   .. .. .. ..$ short_name: chr "US"
#>   .. .. .. ..$ types     :List of 2
#>   .. .. .. .. ..$ : chr "country"
#>   .. .. .. .. ..$ : chr "political"
#>   .. ..$ formatted_address : chr "Houston, TX, USA"
#>   .. ..$ geometry          :List of 4
#>   .. .. ..$ bounds       :List of 2
#>   .. .. .. ..$ northeast:List of 2
#>   .. .. .. .. ..$ lat: num 30.1
#>   .. .. .. .. ..$ lng: num -95
#>   .. .. .. ..$ southwest:List of 2
#>   .. .. .. .. ..$ lat: num 29.5
#>   .. .. .. .. ..$ lng: num -95.8
#>   .. .. ..$ location     :List of 2
#>   .. .. .. ..$ lat: num 29.8
#>   .. .. .. ..$ lng: num -95.4
#>   .. .. ..$ location_type: chr "APPROXIMATE"
#>   .. .. ..$ viewport     :List of 2
#>   .. .. .. ..$ northeast:List of 2
#>   .. .. .. .. ..$ lat: num 30.1
#>   .. .. .. .. ..$ lng: num -95
#>   .. .. .. ..$ southwest:List of 2
#>   .. .. .. .. ..$ lat: num 29.5
#>   .. .. .. .. ..$ lng: num -95.8
#>   .. ..$ place_id          : chr "ChIJAYWNSLS4QIYROwVl894CDco"
#>   .. ..$ types             :List of 2
#>   .. .. ..$ : chr "locality"
#>   .. .. ..$ : chr "political"
#>  $ status : chr "OK"

Fortunately, we can attack the problem step by step with tidyr functions. To make the problem a bit harder (!) and more realistic, I’ll start by geocoding a few cities:

city <- c("Houston", "LA", "New York", "Chicago", "Springfield")
city_geo <- purrr::map(city, geocode)

I’ll put these results in a tibble, next to the original city name:

loc <- tibble(city = city, json = city_geo)
loc
#> # A tibble: 5 x 2
#>   city        json            
#>   <chr>       <list>          
#> 1 Houston     <named list [2]>
#> 2 LA          <named list [2]>
#> 3 New York    <named list [2]>
#> 4 Chicago     <named list [2]>
#> 5 Springfield <named list [2]>

The first level contains components status and result, which we can reveal with unnest_wider():

loc %>%
  unnest_wider(json)
#> # A tibble: 5 x 3
#>   city        results    status
#>   <chr>       <list>     <chr> 
#> 1 Houston     <list [1]> OK    
#> 2 LA          <list [1]> OK    
#> 3 New York    <list [1]> OK    
#> 4 Chicago     <list [1]> OK    
#> 5 Springfield <list [3]> OK

Notice that results is a list of lists. Most of the cities have 1 element (representing a unique match from the geocoding API), but Springfield has two. We can pull these out into separate rows with unnest_longer():

loc %>%
  unnest_wider(json) %>% 
  unnest_longer(results)
#> # A tibble: 7 x 3
#>   city        results          status
#>   <chr>       <list>           <chr> 
#> 1 Houston     <named list [5]> OK    
#> 2 LA          <named list [5]> OK    
#> 3 New York    <named list [5]> OK    
#> 4 Chicago     <named list [5]> OK    
#> 5 Springfield <named list [5]> OK    
#> 6 Springfield <named list [5]> OK    
#> # … with 1 more row

Now these all have the same components, as revealed by unnest_wider():

loc %>%
  unnest_wider(json) %>% 
  unnest_longer(results) %>% 
  unnest_wider(results)
#> # A tibble: 7 x 7
#>   city    address_componen… formatted_address geometry  place_id    types status
#>   <chr>   <list>            <chr>             <list>    <chr>       <lis> <chr> 
#> 1 Houston <list [4]>        Houston, TX, USA  <named l… ChIJAYWNSL… <lis… OK    
#> 2 LA      <list [4]>        Los Angeles, CA,… <named l… ChIJE9on3F… <lis… OK    
#> 3 New Yo… <list [3]>        New York, NY, USA <named l… ChIJOwg_06… <lis… OK    
#> 4 Chicago <list [4]>        Chicago, IL, USA  <named l… ChIJ7cv00D… <lis… OK    
#> 5 Spring… <list [4]>        Springfield, MA,… <named l… ChIJH7jkCC… <lis… OK    
#> 6 Spring… <list [5]>        Springfield, MO,… <named l… ChIJP5jIRf… <lis… OK    
#> # … with 1 more row

We can find the lat and lon coordinates by unnesting geometry:

loc %>%
  unnest_wider(json) %>% 
  unnest_longer(results) %>% 
  unnest_wider(results) %>% 
  unnest_wider(geometry)
#> # A tibble: 7 x 10
#>   city  address_compone… formatted_addre… bounds location location_type viewport
#>   <chr> <list>           <chr>            <list> <list>   <chr>         <list>  
#> 1 Hous… <list [4]>       Houston, TX, USA <name… <named … APPROXIMATE   <named …
#> 2 LA    <list [4]>       Los Angeles, CA… <name… <named … APPROXIMATE   <named …
#> 3 New … <list [3]>       New York, NY, U… <name… <named … APPROXIMATE   <named …
#> 4 Chic… <list [4]>       Chicago, IL, USA <name… <named … APPROXIMATE   <named …
#> 5 Spri… <list [4]>       Springfield, MA… <name… <named … APPROXIMATE   <named …
#> 6 Spri… <list [5]>       Springfield, MO… <name… <named … APPROXIMATE   <named …
#> # … with 1 more row, and 3 more variables: place_id <chr>, types <list>,
#> #   status <chr>

And then location:

loc %>%
  unnest_wider(json) %>%
  unnest_longer(results) %>%
  unnest_wider(results) %>%
  unnest_wider(geometry) %>%
  unnest_wider(location)
#> # A tibble: 7 x 11
#>   city    address_componen… formatted_addre… bounds     lat    lng location_type
#>   <chr>   <list>            <chr>            <list>   <dbl>  <dbl> <chr>        
#> 1 Houston <list [4]>        Houston, TX, USA <named …  29.8  -95.4 APPROXIMATE  
#> 2 LA      <list [4]>        Los Angeles, CA… <named …  34.1 -118.  APPROXIMATE  
#> 3 New Yo… <list [3]>        New York, NY, U… <named …  40.7  -74.0 APPROXIMATE  
#> 4 Chicago <list [4]>        Chicago, IL, USA <named …  41.9  -87.6 APPROXIMATE  
#> 5 Spring… <list [4]>        Springfield, MA… <named …  42.1  -72.6 APPROXIMATE  
#> 6 Spring… <list [5]>        Springfield, MO… <named …  37.2  -93.3 APPROXIMATE  
#> # … with 1 more row, and 4 more variables: viewport <list>, place_id <chr>,
#> #   types <list>, status <chr>

Again, unnest_auto() makes this simpler with the small risk of failing in unexpected ways if the input structure changes:

loc %>%
  unnest_auto(json) %>%
  unnest_auto(results) %>%
  unnest_auto(results) %>%
  unnest_auto(geometry) %>%
  unnest_auto(location)
#> Using `unnest_wider(json)`; elements have 2 names in common
#> Using `unnest_longer(results)`; no element has names
#> Using `unnest_wider(results)`; elements have 5 names in common
#> Using `unnest_wider(geometry)`; elements have 4 names in common
#> Using `unnest_wider(location)`; elements have 2 names in common
#> # A tibble: 7 x 11
#>   city    address_componen… formatted_addre… bounds     lat    lng location_type
#>   <chr>   <list>            <chr>            <list>   <dbl>  <dbl> <chr>        
#> 1 Houston <list [4]>        Houston, TX, USA <named …  29.8  -95.4 APPROXIMATE  
#> 2 LA      <list [4]>        Los Angeles, CA… <named …  34.1 -118.  APPROXIMATE  
#> 3 New Yo… <list [3]>        New York, NY, U… <named …  40.7  -74.0 APPROXIMATE  
#> 4 Chicago <list [4]>        Chicago, IL, USA <named …  41.9  -87.6 APPROXIMATE  
#> 5 Spring… <list [4]>        Springfield, MA… <named …  42.1  -72.6 APPROXIMATE  
#> 6 Spring… <list [5]>        Springfield, MO… <named …  37.2  -93.3 APPROXIMATE  
#> # … with 1 more row, and 4 more variables: viewport <list>, place_id <chr>,
#> #   types <list>, status <chr>

We could also just look at the first address for each city:

loc %>%
  unnest_wider(json) %>%
  hoist(results, first_result = 1) %>%
  unnest_wider(first_result) %>%
  unnest_wider(geometry) %>%
  unnest_wider(location)
#> # A tibble: 5 x 12
#>   city    address_componen… formatted_addre… bounds     lat    lng location_type
#>   <chr>   <list>            <chr>            <list>   <dbl>  <dbl> <chr>        
#> 1 Houston <list [4]>        Houston, TX, USA <named …  29.8  -95.4 APPROXIMATE  
#> 2 LA      <list [4]>        Los Angeles, CA… <named …  34.1 -118.  APPROXIMATE  
#> 3 New Yo… <list [3]>        New York, NY, U… <named …  40.7  -74.0 APPROXIMATE  
#> 4 Chicago <list [4]>        Chicago, IL, USA <named …  41.9  -87.6 APPROXIMATE  
#> 5 Spring… <list [4]>        Springfield, MA… <named …  42.1  -72.6 APPROXIMATE  
#> # … with 5 more variables: viewport <list>, place_id <chr>, types <list>,
#> #   results <list>, status <chr>

Or use hoist() to dive deeply to get directly to lat and lng:

loc %>%
  hoist(json,
    lat = list("results", 1, "geometry", "location", "lat"),
    lng = list("results", 1, "geometry", "location", "lng")
  )
#> # A tibble: 5 x 4
#>   city          lat    lng json            
#>   <chr>       <dbl>  <dbl> <list>          
#> 1 Houston      29.8  -95.4 <named list [2]>
#> 2 LA           34.1 -118.  <named list [2]>
#> 3 New York     40.7  -74.0 <named list [2]>
#> 4 Chicago      41.9  -87.6 <named list [2]>
#> 5 Springfield  42.1  -72.6 <named list [2]>

Sharla Gelfand’s discography

We’ll finish off with the most complex list, from Sharla Gelfand’s discography. We’ll start the usual way: putting the list into a single column data frame, and then widening so each component is a column. I also parse the date_added column into a real date-time1.

discs <- tibble(disc = discog) %>% 
  unnest_wider(disc) %>% 
  mutate(date_added = as.POSIXct(strptime(date_added, "%Y-%m-%dT%H:%M:%S"))) 
discs
#> # A tibble: 155 x 5
#>   instance_id date_added          basic_information       id rating
#>         <int> <dttm>              <list>               <int>  <int>
#> 1   354823933 2019-02-16 17:48:59 <named list [11]>  7496378      0
#> 2   354092601 2019-02-13 14:13:11 <named list [11]>  4490852      0
#> 3   354091476 2019-02-13 14:07:23 <named list [11]>  9827276      0
#> 4   351244906 2019-02-02 11:39:58 <named list [11]>  9769203      0
#> 5   351244801 2019-02-02 11:39:37 <named list [11]>  7237138      0
#> 6   351052065 2019-02-01 20:40:53 <named list [11]> 13117042      0
#> # … with 149 more rows

At this level, we see information about when each disc was added to Sharla’s discography, not any information about the disc itself. To do that we need to widen the basic_information column:

discs %>% unnest_wider(basic_information)
#> Error: Names must be unique.
#> x These names are duplicated:
#>   * "id" at locations 6 and 14.
#> ℹ Use argument `names_repair` to specify repair strategy.

Unfortunately that fails because there’s an id column inside basic_information. We can quickly see what’s going on by setting names_repair = "unique":

discs %>% unnest_wider(basic_information, names_repair = "unique")
#> New names:
#> * id -> id...6
#> * id -> id...14
#> # A tibble: 155 x 15
#>   instance_id date_added          labels   year artists  id...6 thumb     title 
#>         <int> <dttm>              <list>  <int> <list>    <int> <chr>     <chr> 
#> 1   354823933 2019-02-16 17:48:59 <list …  2015 <list [… 7.50e6 https://… Demo  
#> 2   354092601 2019-02-13 14:13:11 <list …  2013 <list [… 4.49e6 https://… Obser…
#> 3   354091476 2019-02-13 14:07:23 <list …  2017 <list [… 9.83e6 https://… I     
#> 4   351244906 2019-02-02 11:39:58 <list …  2017 <list [… 9.77e6 https://… Oído …
#> 5   351244801 2019-02-02 11:39:37 <list …  2015 <list [… 7.24e6 https://… A Cat…
#> 6   351052065 2019-02-01 20:40:53 <list …  2019 <list [… 1.31e7 https://… Tashme
#> # … with 149 more rows, and 7 more variables: formats <list>,
#> #   cover_image <chr>, resource_url <chr>, master_id <int>, master_url <chr>,
#> #   id...14 <int>, rating <int>

The problem is that basic_information repeats the id column that’s also stored at the top-level, so we can just drop that:

discs %>% 
  select(!id) %>% 
  unnest_wider(basic_information)
#> # A tibble: 155 x 14
#>   instance_id date_added          labels   year artists      id thumb     title 
#>         <int> <dttm>              <list>  <int> <list>    <int> <chr>     <chr> 
#> 1   354823933 2019-02-16 17:48:59 <list …  2015 <list [… 7.50e6 https://… Demo  
#> 2   354092601 2019-02-13 14:13:11 <list …  2013 <list [… 4.49e6 https://… Obser…
#> 3   354091476 2019-02-13 14:07:23 <list …  2017 <list [… 9.83e6 https://… I     
#> 4   351244906 2019-02-02 11:39:58 <list …  2017 <list [… 9.77e6 https://… Oído …
#> 5   351244801 2019-02-02 11:39:37 <list …  2015 <list [… 7.24e6 https://… A Cat…
#> 6   351052065 2019-02-01 20:40:53 <list …  2019 <list [… 1.31e7 https://… Tashme
#> # … with 149 more rows, and 6 more variables: formats <list>,
#> #   cover_image <chr>, resource_url <chr>, master_id <int>, master_url <chr>,
#> #   rating <int>

Alternatively, we could use hoist():

discs %>% 
  hoist(basic_information,
    title = "title",
    year = "year",
    label = list("labels", 1, "name"),
    artist = list("artists", 1, "name")
  )
#> # A tibble: 155 x 9
#>   instance_id date_added          title     year label  artist  basic_informati…
#>         <int> <dttm>              <chr>    <int> <chr>  <chr>   <list>          
#> 1   354823933 2019-02-16 17:48:59 Demo      2015 Tobi … Mollot  <named list [9]>
#> 2   354092601 2019-02-13 14:13:11 Observa…  2013 La Vi… Una Bè… <named list [9]>
#> 3   354091476 2019-02-13 14:07:23 I         2017 La Vi… S.H.I.… <named list [9]>
#> 4   351244906 2019-02-02 11:39:58 Oído Ab…  2017 La Vi… Rata N… <named list [9]>
#> 5   351244801 2019-02-02 11:39:37 A Cat's…  2015 Kator… Ivy (1… <named list [9]>
#> 6   351052065 2019-02-01 20:40:53 Tashme    2019 High … Tashme  <named list [9]>
#> # … with 149 more rows, and 2 more variables: id <int>, rating <int>

Here I quickly extract the name of the first label and artist by indexing deeply into the nested list.

A more systematic approach would be to create separate tables for artist and label:

discs %>% 
  hoist(basic_information, artist = "artists") %>% 
  select(disc_id = id, artist) %>% 
  unnest_longer(artist) %>% 
  unnest_wider(artist)
#> # A tibble: 167 x 8
#>    disc_id join  name          anv   tracks role  resource_url                id
#>      <int> <chr> <chr>         <chr> <chr>  <chr> <chr>                    <int>
#> 1  7496378 ""    Mollot        ""    ""     ""    https://api.discogs.co… 4.62e6
#> 2  4490852 ""    Una Bèstia I… ""    ""     ""    https://api.discogs.co… 3.19e6
#> 3  9827276 ""    S.H.I.T. (3)  ""    ""     ""    https://api.discogs.co… 2.77e6
#> 4  9769203 ""    Rata Negra    ""    ""     ""    https://api.discogs.co… 4.28e6
#> 5  7237138 ""    Ivy (18)      ""    ""     ""    https://api.discogs.co… 3.60e6
#> 6 13117042 ""    Tashme        ""    ""     ""    https://api.discogs.co… 5.21e6
#> # … with 161 more rows

discs %>% 
  hoist(basic_information, format = "formats") %>% 
  select(disc_id = id, format) %>% 
  unnest_longer(format) %>% 
  unnest_wider(format) %>% 
  unnest_longer(descriptions)
#> # A tibble: 281 x 5
#>   disc_id descriptions text  name     qty  
#>     <int> <chr>        <chr> <chr>    <chr>
#> 1 7496378 "Numbered"   Black Cassette 1    
#> 2 4490852 "LP"         <NA>  Vinyl    1    
#> 3 9827276 "7\""        <NA>  Vinyl    1    
#> 4 9827276 "45 RPM"     <NA>  Vinyl    1    
#> 5 9827276 "EP"         <NA>  Vinyl    1    
#> 6 9769203 "LP"         <NA>  Vinyl    1    
#> # … with 275 more rows

Then you could join these back on to the original dataset as needed.


  1. I’d normally use readr::parse_datetime() or lubridate::ymd_hms(), but I can’t here because it’s a vignette and I don’t want to add a dependency to tidyr just to simplify one example.↩︎