For a **DuckDB-backed lazy table** (the typical case), bootstrap replicate weights are written directly into the DuckDB file as a separate table and exposed through a persistent VIEW that joins the main survey table with the BSW columns. The returned `tbl` references this view, so all downstream dplyr operations have access to every replicate.

Usage

add_bootstrap_weights(
  tbl,
  weight_col,
  id_col = NULL,
  strata_cols = NULL,
  n_replicates = 500L,
  prefix = "CPBSW",
  bsw_table = NULL,
  seed = NULL,
  overwrite = FALSE
)

Arguments

tbl

A lazy `dplyr::tbl()` returned by `get_pumf()`, **or** an in-memory `data.frame` / `tibble`.

weight_col

Name of the column holding the survey weights (string, e.g. `"PWEIGHT"`).

id_col

Optional name of a column that uniquely identifies each row (DuckDB path only). If `NULL` (default), the registry `bsw_join_key` is used when available; otherwise `pumf_row_id` is added to the main table.

strata_cols

Optional character vector of column names to stratify on. Resampling is performed independently within each unique combination of stratum values, preserving stratum sample sizes across replicates. For LFS, defaults to `c("SURVYEAR", "SURVMNTH")` so each month is resampled separately. For other surveys, use the registry `bsw_strata` field or pass explicitly (e.g. province, age group). Pass `character(0)` to suppress the LFS default and generate unstratified weights.

n_replicates

Number of bootstrap replicates to generate (default `500L`).

prefix

Column-name prefix for replicate columns (default `"CPBSW"`). Columns are named by appending the replicate number to the prefix: `CPBSW1`, `CPBSW2`, ... with the default.

bsw_table

Name of the DuckDB table that stores the replicate weights (DuckDB path only). Defaults to `NULL`, which auto-names it `paste0("pumf_bsw_", tolower(weight_col))` so separate calls with different weight columns do not overwrite each other.

seed

Optional integer seed for reproducibility.

overwrite

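If the `bsw_table` already exists in the DuckDB file, regenerate and overwrite it when `TRUE`. When `FALSE` (default), the existing table is reused silently and weights that are already stored are not recomputed; only the incremental work described under Details (additional replicates, or strata that gained rows) is performed.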

Value

* **DuckDB path:** a lazy `dplyr::tbl()` backed by a persistent DuckDB VIEW that contains all original survey columns plus the `n_replicates` bootstrap weight columns, with any input `filter()` operations re-applied.
* **In-memory path:** the input `data.frame` / `tibble` with bootstrap weight columns appended so that `n_replicates` replicates are present. If the input already carries replicate columns for `prefix`, only the additional ones are generated (existing columns are preserved); when it already has at least `n_replicates`, the data frame is returned unchanged.

Details

For an **in-memory `data.frame` or `tibble`**, bootstrap weights are generated entirely in memory and the augmented data frame is returned.
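The in-memory top-up rule from the Value section (generate only the replicate columns that are not yet present) can be illustrated with a small base R sketch. The helper `missing_replicates()` is hypothetical, and matching existing columns by the name pattern `<prefix><number>` is an assumption made for illustration; the package may detect them differently.

```r
# Hypothetical helper, not part of the package: list the replicate columns
# that would still have to be generated to reach `n_replicates`.
missing_replicates <- function(df, prefix = "CPBSW", n_replicates = 500L) {
  # Assumption: existing replicate columns are named <prefix><number>.
  pattern <- paste0("^", prefix, "[0-9]+$")
  existing <- grep(pattern, names(df), value = TRUE)
  setdiff(paste0(prefix, seq_len(n_replicates)), existing)
}

# Toy data frame that already carries two replicate columns.
df <- data.frame(PWEIGHT = c(1.5, 2), CPBSW1 = c(1.5, 2), CPBSW2 = c(3, 0))

todo_4 <- missing_replicates(df, n_replicates = 4L)  # two columns still missing
todo_2 <- missing_replicates(df, n_replicates = 2L)  # nothing left to generate
```

When the second call returns an empty vector, this corresponds to the case where the data frame is returned unchanged.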

Bootstrap weights are generated by the rescaled bootstrap: for each replicate a sample of \(n\) rows is drawn with replacement (independently within each stratum when `strata_cols` are in effect); the bootstrap weight for row \(i\) in replicate \(b\) is `original_weight[i] * count[i,b]`, where `count[i,b]` is the number of times row \(i\) appeared in the draw for replicate \(b\).
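The resampling rule above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: `bootstrap_weights_sketch()` and its arguments are hypothetical names, and the real function may differ in how it draws samples, handles strata, or stores results.

```r
# Hypothetical helper, for illustration only.
bootstrap_weights_sketch <- function(weights, n_replicates, strata = NULL,
                                     prefix = "CPBSW", seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- length(weights)
  # Without strata, all rows form a single resampling universe.
  if (is.null(strata)) strata <- rep(1L, n)
  groups <- split(seq_len(n), strata)

  out <- matrix(0, nrow = n, ncol = n_replicates,
                dimnames = list(NULL, paste0(prefix, seq_len(n_replicates))))
  for (b in seq_len(n_replicates)) {
    for (rows in groups) {
      # Draw as many rows as the stratum holds, with replacement.
      draw <- sample.int(length(rows), size = length(rows), replace = TRUE)
      # count[i, b]: number of times each row of the stratum was drawn.
      counts <- tabulate(draw, nbins = length(rows))
      out[rows, b] <- weights[rows] * counts
    }
  }
  out
}

w <- c(10, 20, 30, 40, 50, 60)
strata <- c("a", "a", "a", "b", "b", "b")
bsw <- bootstrap_weights_sketch(w, n_replicates = 4L, strata = strata, seed = 1L)
```

Because each stratum is resampled to its own size, the draw counts in every replicate sum to the stratum's row count, which is what "preserving stratum sample sizes across replicates" means in the `strata_cols` description.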

**Incremental re-runs (DuckDB path):** when a BSW table already exists the call only does the work needed to satisfy the request:

* **More replicates** than stored (and no new rows): the additional replicate columns are appended; existing columns are kept.
* **New rows** in the main table (some rows have no weights yet): because a bootstrap replicate resamples the full population, added rows invalidate the existing weights of their resampling universe, so those weights are deleted and regenerated. Unstratified, this regenerates every row; when `strata_cols` are in effect, only the strata that gained rows are regenerated and complete strata keep their existing weights.
* **Neither:** the stored weights are reused without recomputation.

Pass `overwrite = TRUE` to force a full fresh regeneration regardless.
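The incremental re-run rules can be restated as a small decision function. This is only a reading aid: `plan_bsw_update()` and its arguments are hypothetical and do not exist in the package. The documentation does not say what happens when new rows and additional replicates are requested together, so the sketch simply checks for new rows first.

```r
# Hypothetical decision helper mirroring the documented rules.
plan_bsw_update <- function(stored_replicates, n_replicates,
                            has_new_rows = FALSE,
                            strata_with_new_rows = NULL,
                            overwrite = FALSE) {
  if (overwrite) return("regenerate everything")
  if (has_new_rows) {
    # Unstratified: the whole table is one resampling universe.
    if (is.null(strata_with_new_rows)) return("regenerate every row")
    # Stratified: only strata that gained rows are affected.
    return(paste("regenerate strata:",
                 paste(strata_with_new_rows, collapse = ", ")))
  }
  if (n_replicates > stored_replicates) {
    return(sprintf("append replicates %d to %d",
                   stored_replicates + 1L, n_replicates))
  }
  "reuse stored weights"
}

plan_more     <- plan_bsw_update(200L, 500L)
plan_same     <- plan_bsw_update(500L, 500L)
plan_stratum  <- plan_bsw_update(500L, 500L, has_new_rows = TRUE,
                                 strata_with_new_rows = "2024-03")
plan_all_rows <- plan_bsw_update(500L, 500L, has_new_rows = TRUE)
plan_forced   <- plan_bsw_update(500L, 500L, overwrite = TRUE)
```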

**Multiple weight columns (hierarchical data):** by default `bsw_table` is named after `weight_col` (e.g. `"pumf_bsw_wstpwgt"`), so calling the function twice with different weight columns (e.g. household weight and person weight) produces two independent BSW tables and two separate views without any conflict.

**Connection note (DuckDB path):** calling this function fully shuts down the DuckDB in-process instance held by `tbl` (because a write connection requires exclusive access). The input `tbl` and any other lazy tables backed by the same DuckDB file become invalid after the call. Use the returned tbl instead.

**Filtered input tbls (DuckDB path):** bootstrap weights always cover the complete physical survey table. If `tbl` has dplyr `filter()` operations applied, they are captured and automatically re-applied to the returned VIEW tbl so the visible rows match the original subset. Other operations (`select()`, `mutate()`, etc.) are not replayed – they would interfere with the BSW columns – so apply them manually to the returned tbl if needed.

**ID column (DuckDB path):** a stable row identifier is needed to link the main table to the BSW table. If `id_col` is `NULL` (the default):

* The survey registry `bsw_join_key` is used when available (e.g. `"PEFAMID"` for SFS 2016-2023); no table modification is needed.
* Otherwise a `pumf_row_id` column (DuckDB `rowid`) is added to the main survey table. The `ALTER TABLE ADD COLUMN` is O(1); the `UPDATE` that fills the values is O(n).

See also

`bsw_info()`, `remove_bootstrap_weights()`, `get_pumf()`

Examples

if (FALSE) { # \dontrun{
sfs <- get_pumf("SFS", "2019")
sfs_bsw <- add_bootstrap_weights(sfs, weight_col = "PWEIGHT",
                                 n_replicates = 200L, seed = 42L)
bsw_info(sfs_bsw)
close_pumf(sfs_bsw)
} # }