This vignette showcases how to utilize the metadata information by drilling into North American Industry Classification System (NAICS) categories and the hierarchy metadata that is downloaded with tables that have hierarchical data.

Accessing table data

We see that data set come with three different measures and 307 different NAICS values. Let’s load the data and focus on just “Chained (2012) dollars”.

Normalizing variables

There are several annoying characteristics to the raw data downloaded from Statistics Canada which we address with a convenience function called normalize_cansim_values(). By default, downloaded CANSIM data contains a VALUE and a SCALAR_FACTOR field indicating whether the value displayed is a number, a percentage, a dollar value, a million set, or something else. This can be annoying to work with so the normalize_cansim_values() function will apply the appropriate conversion based on the scale variable so that the VALUE value is more straight-forward to work with. The default data also includes a character string REF_DATE field, which the normalize_cansim_values() function will attempt to convert into a standard date-class variable.

data <- data %>% 
  normalize_cansim_values(strip_classification_code = TRUE)

Taking advantage of metadata

This table includes all the different levels of NAICS categories in one dimension. This makes working with the data at that level rather cumbersome when we are often only interested in specific sub-categories. The internal hierarchy can help with that. Let’s first get an overview of the data. We can also use this to easily compute shares instead of totals.

We can extract the hierarchy using the built-in convenience function categories_for_level which takes in a cansim-package retrieved data table object with metadata as input and requires a field from which to extract categories from as well as a level indicating the target depth level of the hierarchy that we wish to extract.

Hierarchies in more detail

There are hundreds of NAICS codes here which is too many to make sense of at the same time. We can use categories_for_level again to reduce our NAICS codes just to the first sub-level which represents the industry groups. We can call this subset cut_data. (Note that the NAICS data also includes composite groups of industries, something like a level 0.5 hierarchy, which are prefixed with a “T” and which we want to remove as well.)

cut_data <- data %>% filter(
  !grepl("T\\d+",`Classification Code for North American Industry Classification System (NAICS)`),
  `North American Industry Classification System (NAICS)` %in% 
    categories_for_level(.,"North American Industry Classification System (NAICS)",1))

# How many are NAICS categories left? 
n <- length(cut_data$`North American Industry Classification System (NAICS)` %>% unique)

There are still 22 level 1 categories, too many to sensibly visualize at the same time. We can use dplyr functions to identify the top categories and group the rest so that we can have a plot that is easier to understand.

# Specify which regions and period we want to look at 
regions = "British Columbia"
period = "2017-01-01"

# Select the top-8 categories for our reference region and period
top_categories <- cut_data %>% 
  filter(GEO %in% regions, Date == period) %>% 
  top_n(8,Share) %>% 
  pull("North American Industry Classification System (NAICS)")

# Group remaining categories together and prepare data for plot
plot_data <- cut_data %>% 
  mutate(NAICS = ifelse(`North American Industry Classification System (NAICS)` %in% top_categories,`North American Industry Classification System (NAICS)`,"Rest")) %>%
  select(Date, GEO, NAICS, VALUE, Share) %>%
  group_by(Date, GEO, NAICS) %>%
  summarise(VALUE = sum(VALUE, na.rm = TRUE),
            Share = sum(Share, na.rm = TRUE)) %>% 
  ungroup()

With the data prepared, the last step is putting it all together into a visualization using ggplot2. We can then see if any further adjustments are required.

library(ggplot2)

ggplot(plot_data %>% 
         filter(GEO %in% regions), 
       aes(x = Date, y = Share, fill = NAICS)) +
  geom_area(position="stack") +
  scale_y_continuous(labels = scales::percent) +
  theme_bw() +
  theme(legend.position = "bottom",legend.direction ="vertical") +
  guides(fill=guide_legend(ncol = 3)) +
  labs(title="Gross domestic product (GDP) at basic prices",
       subtitle=selected_value,
       x="", fill = "", 
       caption=paste0("CANSIM ", table_id))

Let’s have a closer look at the “Real estate and rental and leasing” and “Construction” categories. Once again we turn to the categories_for_level function to make grabbing the sub-categories an easier process.

real_construction <- c("Construction","Real estate and rental and leasing")

# Get the NAICS hierarchy codes just for these categories
rrl_hierarchy <- data %>% 
  filter(`North American Industry Classification System (NAICS)` %in% real_construction) %>% 
  pull("Hierarchy for North American Industry Classification System (NAICS)") %>% 
  unique

# Filter out all sub-categories for Real Estate. 
# The paste with | trick ensures that we grepl for all matches. 
rrl_data <- data %>% 
  filter(grepl(paste(rrl_hierarchy,collapse="|"),`Hierarchy for North American Industry Classification System (NAICS)`))

# Ensure we only retain the NAICS leaves and none of the aggregate subcategories
rrl_data <- rrl_data %>% 
  filter(
  `North American Industry Classification System (NAICS)` %in% 
    categories_for_level(.,"North American Industry Classification System (NAICS)")) %>%
  rename(NAICS=`North American Industry Classification System (NAICS)`)

# Plot with labels from our original selections
ggplot(rrl_data %>% filter(GEO %in% regions),
       aes(x = Date, y = Share, fill = NAICS)) +
  geom_area(position = "stack") +
  scale_y_continuous(labels = scales::percent) +
  theme_bw() +
  theme(legend.position = "bottom",legend.direction ="vertical") +
  guides(fill=guide_legend(ncol=1)) +
  labs(title="Gross domestic product (GDP) at basic prices",
       subtitle=paste0(regions,", ", selected_value),
       x="",
       caption=paste0("CANSIM ", table_id))

We observe in the resulting chart that the largest contributors to GDP in this sector in British Columbia are Owner-occupied dwellings (imputed rent) and Lessors of Real estate (rent), followed by Residential building construction.