Census_xtab.Rmd
To import a Statistics Canada Census cross tabulation, locate the cross tabulation you are interested in on the StatCan website and take note of the code for the table, as well as the url for the entire table download.
For example, for the cross tabulation of household income statistics by structural type of dwelling, we locate the corresponding webpage and read off the code as “98-400-X2016101”. Under the download tab we also copy the link for the entire table CSV download.
Calling the get_sqlite_xtab
function with these two
arguments will then download the data on the first invocation and sort
it into an sqlite database and open a connection to that database.
code <- "98-400-X2016101"
url <- "https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/CompDataDownload.cfm?LANG=E&PID=110210&OFT=CSV"
xtab_data <- get_sqlite_xtab(code=code,url = url, cache_dir = tempdir())
#> Downloading data ...
#> Unpacking data ...
#> Importing xtab ...
#> [1] "Indexing xtab..."
The data can then be accessed using the dbplyr
syntax.
For example to see all the available columns we can load in the first
row (don’t forget to call collect()
to execute the database
fetch) and list the column names.
xtab_data %>% head(1) %>% collect() %>% names()
#> [1] "CENSUS_YEAR"
#> [2] "GEO_CODE (POR)"
#> [3] "GEO_LEVEL"
#> [4] "GEO_NAME"
#> [5] "GNR"
#> [6] "DATA_QUALITY_FLAG"
#> [7] "ALT_GEO_CODE"
#> [8] "Structural type of dwelling"
#> [9] "Member ID: Structural type of dwelling"
#> [10] "Household type including census family structure"
#> [11] "Member ID: Household type including census family structure"
#> [12] "Dim: Household income statistics (3): Member ID: [1]: Total - Household income statistics"
#> [13] "Dim: Household income statistics (3): Member ID: [2]: Median total income of households ($)"
#> [14] "Dim: Household income statistics (3): Member ID: [3]: Median after-tax income of households ($)"
If we are only interested in people residing in Metro Vancouver in 2016 and want to learn about the labour force status of movers, we can filter out the other variables in our database call and just select the ones we are interested in.
The cross tabulations posted on the Statistics Canada website tend to be relatively clean (compared to custom requests, which are usually littered with inconsistent spelling and other problems), and we can infer the names of the “Total” categories by just prepending “Total -” to the column name. In this case that works for all but one of the filtered columns, which can be a bit of a pain to chase down.
Also, we usually want to pivot the last columns from wide to long
form, the standardize_xtab
function takes care of that.
vancouver_dwelling_income <- xtab_data %>%
filter(`GEO_CODE (POR)`=="933",
`Household type including census family structure` ==
"Total - Household type including census family structure") %>%
select(GeoUID=`GEO_CODE (POR)`, Name=GEO_NAME,`Structural type of dwelling`, matches("^Dim: ")) %>%
collect_and_normalize_xtab(disconnect=TRUE)
vancouver_dwelling_income
#> # A tibble: 30 × 5
#> GeoUID Name `Structural type of dwelling` House…¹ Value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 933 Vancouver Total - Structural type of dwelling Total … 960895
#> 2 933 Vancouver Single-detached house Total … 282360
#> 3 933 Vancouver Apartment in a building that has five or mor… Total … 160060
#> 4 933 Vancouver Other attached dwelling Total … 513900
#> 5 933 Vancouver Semi-detached house Total … 20530
#> 6 933 Vancouver Row house Total … 93415
#> 7 933 Vancouver Apartment or flat in a duplex Total … 156445
#> 8 933 Vancouver Apartment in a building that has fewer than … Total … 242200
#> 9 933 Vancouver Other single-attached house Total … 1310
#> 10 933 Vancouver Movable dwelling Total … 4580
#> # … with 20 more rows, and abbreviated variable name
#> # ¹`Household income statistics`
#> # ℹ Use `print(n = ...)` to see more rows
Armed with this we can look at the labour force status by Mobility status 5 years ago.
vancouver_dwelling_income %>%
filter(`Household income statistics`=="Median total income of households ($)",
`Structural type of dwelling` != "Other attached dwelling") %>%
ggplot(aes(x=`Structural type of dwelling`,y=Value)) +
geom_bar(stat="identity",fill="steelblue") +
scale_y_continuous(labels=scales::dollar) +
scale_fill_brewer(palette="Set1") +
coord_flip() +
labs(title="Metro Vancouver median household income",
y="Median household income",
caption="StatCan Census 2016, 98-400-X2016101")