Analyzing Canadian Demographic and Housing Data

Building skills and community to analyze Canadian demographic and housing data

Author: Jens von Bergmann
Affiliation: MountainMath Software and Analytics
Published: April 1, 2022

Preface

This book is intended for people interested in learning how to access, process, analyze, and visualize Canadian demographic, economic, and housing data using R. The target audience ranges from individuals wanting to understand their environment through data, to community activists and community groups seeking to introduce or solidify data-based approaches in their work, to journalists wanting to enrich their reporting with data or aiming to incorporate their own descriptive data analysis, to non-profits or people involved in policy who are looking for data-based answers to their questions.

The key prerequisite for this book is a keen interest in using data to help understand how demographics, economic indicators, housing, and transportation are reflected in and shape cities and rural areas in Canada. Prior knowledge of R is not necessary, but may be beneficial.

Canada has high-quality demographic, economic, and housing data. While significant data gaps exist, the available data often remains under-utilized in policy and planning analyses. This under-utilization is accentuated by analysis frequently being done in silos, relying on data that is already outdated at the time of release, and by a lack of transparency in the analysis.

This book aims to help close this gap in high-quality data analysis by

  • expanding the group of people doing analysis, thereby increasing the range of perspectives and interests brought to data analysis,

  • providing guidance on data analysis workflows to increase quality and clarity of data analysis, and

  • putting a high emphasis on reproducible and adaptable workflows to ensure the analysis is transparent, can easily be updated as new data becomes available, and can be tweaked or adapted to address related questions.

Under construction

This book provides a basic introduction to analyzing Canadian data in R. It can be used as a standalone resource covering basic data analysis and visualization workflows in R, as well as a comprehensive introduction to Canadian data sources. At the same time, we view this as a resource that requires continuous updating as we discover new gaps in data analysis and as the needs of the Canadian data community change.

We are planning to add to this book as we find time and come across good examples that are simple enough to slowly build skills, as well as interesting enough to be engaging and motivating to the reader. In this process the order of sections will change as we add new material, and we will come back and revise existing sections as we receive feedback from readers, which we encourage and which is ideally submitted as a GitHub issue. When referring to this book we recommend referring to sections by name rather than by number. Links in the online version of the book are based on section names and will be unaffected by re-numbering of sections.

Project based approach

We are taking a project-based approach to teach through examples, with one project per section. Each project will be broken up into distinct steps, with a minimal end-to-end sketch in R following the list below. This standardized workflow acts as scaffolding for data analysis and helps ensure key components of the analysis are adequately addressed.

  1. Formulating the question. What is the question we are interested in? Asking a clear question will help focus our efforts and ensure that we don’t aimlessly trawl through data. This also involves being clear about the quantities of interest, as well as the target population that we seek to understand.
  2. Identifying possible data sources. Here we try to identify data sources that can speak to our question. We will also take the time to read up on definitions and background concepts to better understand the data, prepare for the analysis, and understand how well the concepts in the data match our original question from step 1.
  3. Data acquisition. In this step we will import the data into our current working session. This could be as simple as an API call, or more complicated, like scraping a table from the web, or it may involve even more complex techniques to acquire the data.
  4. Data preparation. In this step we will reshape and filter the data to prepare it for analysis.
  5. Analysis. This step could be as simple as computing percentages, or it may require no work at all if the quantities we are interested in already come with the dataset and our question can be answered by simple descriptive analysis. In other cases, when our question is more complex, this step may be much more involved. The book will try to slowly build up analysis skills along the way, with increasing complexity of questions and required analysis.
  6. Visualization. The final step in the analysis process is to visualize and communicate the results. In some cases this can be done via a table or a couple of paragraphs of text explaining the results, but in most cases it is useful to produce graphs or maps or even interactive visualizations to effectively communicate the results.
  7. Interpretation. What’s left to wrap this up is to interpret the results. How does this answer our question, and where does it fall short? What does this mean in the real-world context? What new questions emerge from this?
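To make these steps concrete, here is a minimal end-to-end sketch in R, pulling a Statistics Canada table via the cansim package. The table number is real, but the column and category names (“Age group”, “Sex”, “All ages”, “Both sexes”) are assumptions about the table layout that may differ as Statistics Canada revises its tables; treat this as a sketch rather than a polished analysis.

```r
# A minimal pass through the workflow.
# Step 1, question: How has Canada's population changed over time?
library(cansim)   # access Statistics Canada data tables via their API
library(dplyr)
library(ggplot2)

# Step 3, data acquisition: population estimates (StatCan table 17-10-0005)
pop <- get_cansim("17-10-0005")

# Step 4, data preparation: filter to the overall total for Canada
# (column and category names are assumptions; inspect the table to confirm)
pop_canada <- pop |>
  filter(GEO == "Canada",
         `Age group` == "All ages",
         Sex == "Both sexes") |>
  mutate(Year = as.integer(REF_DATE))

# Steps 5 and 6, analysis and visualization: population over time
ggplot(pop_canada, aes(x = Year, y = VALUE)) +
  geom_line() +
  labs(title = "Population estimates, Canada",
       x = NULL, y = "Population",
       caption = "Data: Statistics Canada table 17-10-0005")
```

Step 7 then happens off-screen: we read the graph against our original question and note what it can and cannot tell us.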

While we won’t always follow this step-by-step process to the letter, it will be our guiding principle throughout the book. Sometimes things won’t go so cleanly: after the visualization step we may notice that something looks off or is unexpected, and we may jump back a couple of steps, add more data, and redo parts of the analysis to better understand our data and how it speaks to our initial questions. We might even come to understand that our initial question was not helpful or was ill-posed, and we will come back to refine it.

This approach to data analysis leans in parts on (Lundberg, Johnson, and Stewart 2021), who describe a more rigorous framework for how data analysis questions should be approached; their work is a great resource for expert users.

Goals

By taking this approach we have several goals in mind:

  • Provide guidance and guardrails for basic data analysis tasks.

  • Teach basic data literacy and an appreciation of definitions and quirks in the data.

  • Expose readers to the world of Canadian data and increase its accessibility.

  • Learn how data can be interpreted in different ways, and that data and analysis are not necessarily “neutral”.

  • Learn how to effectively communicate results.

  • Learn how to adapt and build on previous work to answer new questions.

  • Learn how to reproduce and critique data analysis.

  • Build a community around Canadian data, where people interested in similar questions, or people using the same data, can learn from each other.

  • Raise the level of understanding of Canadian data and data analysis so we are better equipped to tackle the problems Canada faces.

  • Stay motivated by using real-world, Canada-focused, and (hopefully) interesting examples.

These are very high goals for this book, and we are not sure we can achieve all of them. But we will try our best to be as accessible and interesting as possible.

Why use R?

Most people reading this book will not have used R before, or will only have used it peripherally, maybe during a college course many years in the past. Instead, readers may be familiar with working through housing and demographic data in Excel or similar tools, or with making maps in QGIS or similar tools when dealing with spatial data. The type of analysis outlined above that this book will teach can, in general terms, be accomplished using these tools.

But where tools like spreadsheets and desktop GIS fall short is in another important focus of this book: transparency, reproducibility, and adaptability.

An analysis in a spreadsheet or desktop GIS typically involves a lot of manual steps, and the work is not reproducible without repeating those steps. We can’t easily inspect how a result was derived, so the analysis lacks transparency. When we just compute a ratio or percentage this may not be problematic, but trying to understand how a more complex analysis was done in a spreadsheet easily turns into a nightmare. Analysis that involves many manual steps is not auditable without putting in the work to repeat those steps.

But why does this matter? It has always been this way: experts produce an analysis and publish a glossy paper presenting the results. One can argue whether this was an adequate modus operandi in the past, but we feel strongly that it isn’t in today’s world. The line between experts and non-experts has become blurred, and the value we place on lived experience has increased relative to more formal expertise. We argue this places different demands on policy-relevant analysis: it needs to be open and transparent, so that in principle anyone can understand how the analysis was done and how the conclusions were reached. That’s where reproducibility and transparency come in. Additionally, it requires building up data analysis skills in the broader population, so that the ability to reproduce and critique an analysis in principle can be realized in practice.

The remaining reason for using R, adaptability, has likewise become increasingly important. The amount of data available to us has increased tremendously, but our collective ability to analyze data and extract information has not kept up. Doing analysis in R allows us to efficiently reuse a previous analysis to perform a similar one, or to build on a previous analysis to deepen it. This turbocharges our ability to do analysis, covering more ground and going deeper.
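As a small illustration of what adaptability means in practice, the sketch from the project workflow above can be wrapped in a function (reusing the packages loaded there), so the same analysis can be re-run for any geography in the table with a single call. The column and category names remain assumptions about the table layout.

```r
# Parameterize the earlier population plot so it can be adapted in one call.
population_plot <- function(geo = "Canada") {
  get_cansim("17-10-0005") |>
    filter(GEO == geo,
           `Age group` == "All ages",
           Sex == "Both sexes") |>
    ggplot(aes(x = as.integer(REF_DATE), y = VALUE)) +
    geom_line() +
    labs(title = paste("Population estimates,", geo),
         x = NULL, y = "Population")
}

population_plot("British Columbia")  # same analysis, different region
```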

R is not the only framework in which to do this; there are other options like Python or Julia. But we believe that R is best suited for people transitioning into this space, and in R we can rely on an existing ecosystem of packages to access and process Canadian data. People already proficient in Python, Julia, Stata, SAS, or SPSS will have little difficulty translating what we do into their preferred framework, or dynamically switching back and forth between R, Python, Julia, or whatever other tools they prefer, as needed and convenient.
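Two examples from that ecosystem are the cansim package used above and the cancensus package for census data. The sketch below shows what data access looks like; cancensus requires a free API key from censusmapper.ca, and the region identifier (the Vancouver census metropolitan area) and table number are just illustrative choices.

```r
# A glimpse of the Canadian data package ecosystem (illustrative calls).
library(cancensus)  # Canadian census data via the CensusMapper API
library(cansim)     # Statistics Canada data tables

# cancensus needs a (free) API key from censusmapper.ca, e.g.
# set_cancensus_api_key("<your key>", install = TRUE)

# 2021 census population counts for municipalities in Metro Vancouver
# ("59933" identifies the Vancouver census metropolitan area)
vancouver_csd <- get_census(dataset = "CA21",
                            regions = list(CMA = "59933"),
                            level = "CSD")

# A labour force characteristics table from Statistics Canada
lfs <- get_cansim("14-10-0287")
```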

Building a Canadian data community

Which brings us to our most ambitious goal: to help create a community around Canadian data analysis. When analysis is transparent, reproducible, and adaptable, people can piggyback on each other’s work, reusing parts of an analysis others have done and building and improving upon it, critiquing and correcting it, or taking it in a different direction. A community that grows in its understanding of data, and a community using a shared set of tools to access and process Canadian data, enabling discussions to move forward instead of in circles. A community that builds up expertise from the bottom up.

The book tries to address both of these requirements for building a Canadian data community: a principled approach to data and data analysis, and R as a common framework to work in. We hope the reader will come away with

  • better data literacy skills to understand and critique data analysis,

  • technical skills to reproduce and perform their own data analysis, and

  • a common tool set for acquiring, processing and analyzing Canadian data that facilitates collaborative practices.