ageutils provides a collection of efficient functions for working with individual ages and corresponding interval representations. These include:
cut_ages() for converting from an integer age to an interval range;
breaks_to_interval() for converting a set of breaks to the corresponding intervals;
reaggregate_counts() and reaggregate_rates() for the reaggregation of counts (and rates) from one interval range to another, optionally weighted by a user-specified age distribution.
library(ageutils)
cut_ages() provides categorisation of ages based on specified breaks, which represent the left-hand interval limits. It returns a tibble with an ordered factor column (interval), as well as columns corresponding to the resulting bounds (lower and upper). The resulting intervals span from the minimum break through to a specified max_upper (defaulting to Inf) and will always be closed on the left and open on the right.
cut_ages(ages = 0:9, breaks = c(0, 3, 5, 10))
#> # A tibble: 10 × 3
#> interval lower upper
#> <ord> <dbl> <dbl>
#> 1 [0, 3) 0 3
#> 2 [0, 3) 0 3
#> 3 [0, 3) 0 3
#> 4 [3, 5) 3 5
#> 5 [3, 5) 3 5
#> 6 [5, 10) 5 10
#> 7 [5, 10) 5 10
#> 8 [5, 10) 5 10
#> 9 [5, 10) 5 10
#> 10 [5, 10) 5 10
cut_ages(ages = 0:9, breaks = c(0, 5))
#> # A tibble: 10 × 3
#> interval lower upper
#> <ord> <dbl> <dbl>
#> 1 [0, 5) 0 5
#> 2 [0, 5) 0 5
#> 3 [0, 5) 0 5
#> 4 [0, 5) 0 5
#> 5 [0, 5) 0 5
#> 6 [5, Inf) 5 Inf
#> 7 [5, Inf) 5 Inf
#> 8 [5, Inf) 5 Inf
#> 9 [5, Inf) 5 Inf
#> 10 [5, Inf) 5 Inf
Ages at or above max_upper will be returned as NA, since the final interval is open on the right.
cut_ages(ages = 0:10, breaks = c(0, 5), max_upper = 7)
#> # A tibble: 11 × 3
#> interval lower upper
#> <ord> <dbl> <dbl>
#> 1 [0, 5) 0 5
#> 2 [0, 5) 0 5
#> 3 [0, 5) 0 5
#> 4 [0, 5) 0 5
#> 5 [0, 5) 0 5
#> 6 [5, 7) 5 7
#> 7 [5, 7) 5 7
#> 8 <NA> NA NA
#> 9 <NA> NA NA
#> 10 <NA> NA NA
#> 11 <NA> NA NA
The output is comparable to that of cut() with right = FALSE:
ages <- seq.int(from = 0, by = 10, length.out = 10)
breaks <- c(0, 1, 10, 30)
cut_ages(ages, breaks)
#> # A tibble: 10 × 3
#> interval lower upper
#> <ord> <dbl> <dbl>
#> 1 [0, 1) 0 1
#> 2 [10, 30) 10 30
#> 3 [10, 30) 10 30
#> 4 [30, Inf) 30 Inf
#> 5 [30, Inf) 30 Inf
#> 6 [30, Inf) 30 Inf
#> 7 [30, Inf) 30 Inf
#> 8 [30, Inf) 30 Inf
#> 9 [30, Inf) 30 Inf
#> 10 [30, Inf) 30 Inf
cut(ages, right = FALSE, breaks = c(breaks, Inf))
#> [1] [0,1) [10,30) [10,30) [30,Inf) [30,Inf) [30,Inf) [30,Inf) [30,Inf)
#> [9] [30,Inf) [30,Inf)
#> Levels: [0,1) [1,10) [10,30) [30,Inf)
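The bounds in the comparison above can also be reproduced with base R alone. The following is a minimal sketch, not the ageutils implementation: it assumes every age is at least the smallest break, and the variable names idx, lower and upper are chosen for illustration.

```r
# Sketch only: base-R reconstruction of the bounds, not the ageutils internals.
ages <- seq.int(from = 0, by = 10, length.out = 10)
breaks <- c(0, 1, 10, 30)
max_upper <- Inf

# findInterval() bins are closed on the left and open on the right by default,
# so idx[i] is the position of the largest break that is <= ages[i]
idx <- findInterval(ages, breaks)

lower <- breaks[idx]                       # left-hand limit of each age's interval
upper <- c(breaks[-1], max_upper)[idx]     # next break, or max_upper for the last

data.frame(age = ages, lower = lower, upper = upper)
```

Ages below the smallest break would get an index of 0 here and would need separate handling.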
Internally both bound columns are stored as double, but it can be taken as part of the function API that lower is coercible to integer without any coercion to NA_integer_. Similarly, all values of upper apart from those corresponding to max_upper can be assumed coercible to integer (max_upper may or may not be, depending on the given argument).
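To see why the max_upper caveat matters, consider what base R does when a double is coerced to integer. The vectors below are hypothetical stand-ins for the lower and upper columns, written out by hand rather than produced by ageutils.

```r
# Hypothetical bounds, stored as double as described above
lower <- c(0, 5)
upper <- c(5, Inf)    # the last value corresponds to max_upper = Inf

# Finite whole-number bounds coerce cleanly
lower_int <- as.integer(lower)

# Inf has no integer representation, so it becomes NA_integer_ (with a warning)
upper_int <- suppressWarnings(as.integer(upper))

lower_int
upper_int
```

A finite, whole-number max_upper (for example 100) would coerce without producing NA.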
breaks_to_interval() takes a specified set of breaks representing the left-hand limits of closed-open intervals, i.e. [x, y), and returns a tibble with an ordered factor column (interval), as well as columns corresponding to the explicit bounds (lower and upper). The resulting intervals span from the minimum break through to a specified max_upper (defaulting to Inf).
breaks_to_interval(breaks = c(0, 1, 5, 15, 25, 45, 65))
#> # A tibble: 7 × 3
#> interval lower upper
#> <ord> <dbl> <dbl>
#> 1 [0, 1) 0 1
#> 2 [1, 5) 1 5
#> 3 [5, 15) 5 15
#> 4 [15, 25) 15 25
#> 5 [25, 45) 25 45
#> 6 [45, 65) 45 65
#> 7 [65, Inf) 65 Inf
breaks_to_interval(
    breaks = c(0, 1, 5, 15, 25, 45, 65),
    max_upper = 100
)
#> # A tibble: 7 × 3
#> interval lower upper
#> <ord> <dbl> <dbl>
#> 1 [0, 1) 0 1
#> 2 [1, 5) 1 5
#> 3 [5, 15) 5 15
#> 4 [15, 25) 15 25
#> 5 [25, 45) 25 45
#> 6 [45, 65) 45 65
#> 7 [65, 100) 65 100
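The mapping from breaks to intervals is simple enough to sketch in base R. This is an illustration of the idea under the assumption that the breaks are sorted and unique; it is not the package's own code, and it returns a data.frame rather than a tibble.

```r
breaks <- c(0, 1, 5, 15, 25, 45, 65)
max_upper <- 100

# Each break is a lower bound; its upper bound is the next break,
# with max_upper closing the final interval
lower <- breaks
upper <- c(breaks[-1], max_upper)

# Build closed-open labels and keep them in break order as an ordered factor
labels <- sprintf("[%s, %s)", lower, upper)
interval <- factor(labels, levels = labels, ordered = TRUE)

data.frame(interval, lower, upper)
```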
reaggregate_counts() converts population counts over one interval range to a different, user-specified, range. It returns a tibble with an ordered factor column (interval), columns corresponding to the resulting bounds (lower and upper) and the associated count.
For a small illustration of the basic functionality we use data obtained from the 2021 UK census:
head(pop_dat, 20)
#> area_code area_name age_category value
#> 1 K04000001 England and Wales [0, 5) 3232100
#> 2 K04000001 England and Wales [5, 10) 3524600
#> 3 K04000001 England and Wales [10, 15) 3595900
#> 4 K04000001 England and Wales [15, 20) 3394700
#> 5 K04000001 England and Wales [20, 25) 3602100
#> 6 K04000001 England and Wales [25, 30) 3901800
#> 7 K04000001 England and Wales [30, 35) 4148800
#> 8 K04000001 England and Wales [35, 40) 3981600
#> 9 K04000001 England and Wales [40, 45) 3755700
#> 10 K04000001 England and Wales [45, 50) 3788700
#> 11 K04000001 England and Wales [50, 55) 4123400
#> 12 K04000001 England and Wales [55, 60) 4029000
#> 13 K04000001 England and Wales [60, 65) 3455700
#> 14 K04000001 England and Wales [65, 70) 2945100
#> 15 K04000001 England and Wales [70, 75) 2978000
#> 16 K04000001 England and Wales [75, 80) 2170300
#> 17 K04000001 England and Wales [80, 85) 1517000
#> 18 K04000001 England and Wales [85, 90) 925100
#> 19 K04000001 England and Wales [90, Inf) 527900
Here, each row of the data is for the same region so we drop some unwanted columns before proceeding to pull out the lower bounds.
dat <- subset(pop_dat, select = c(age_category, value))
# extract the lower bound from labels such as "[5, 10)"
dat <- transform(
    dat,
    lower_bound = as.integer(sub("\\[([0-9]+), .+\\)", "\\1", age_category))
)
Now we recategorise to the desired age intervals:
with(
    dat,
    reaggregate_counts(
        bounds = lower_bound,
        counts = value,
        new_bounds = c(0, 1, 5, 15, 25, 45, 65)
    )
)
#> # A tibble: 7 × 4
#> interval lower upper count
#> <ord> <dbl> <dbl> <dbl>
#> 1 [0, 1) 0 1 646420
#> 2 [1, 5) 1 5 2585680
#> 3 [5, 15) 5 15 7120500
#> 4 [15, 25) 15 25 6996800
#> 5 [25, 45) 25 45 15787900
#> 6 [45, 65) 45 65 15396800
#> 7 [65, Inf) 65 Inf 11063400
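The first few rows of this result can be checked by hand. The sketch below assumes that, when no population weights are supplied, a band's count is spread evenly over its single years; that assumption is inferred from the output above rather than taken from the package source.

```r
# Census counts for the bands [0, 5), [5, 10) and [10, 15)
counts <- c(3232100, 3524600, 3595900)

# Assumption: each single year of [0, 5) receives an equal share
per_year_0_5 <- counts[1] / 5

count_0_1  <- per_year_0_5             # one year of [0, 5)
count_1_5  <- 4 * per_year_0_5         # the remaining four years
count_5_15 <- counts[2] + counts[3]    # two whole bands, no splitting needed

c(count_0_1, count_1_5, count_5_15)
```

These agree with the first three rows of the table: 646420, 2585680 and 7120500.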
Similarly, let’s assume we have a population sample of 1000, with 600 known to be aged 60 or over and the rest younger. We can reaggregate these across 10-year intervals based on the weightings of the census:
reaggregate_counts(
    bounds = c(0, 60),
    counts = c(400, 600),
    new_bounds = seq(from = 0, to = 90, by = 10),
    population_bounds = dat$lower_bound,
    population_weights = dat$value
)
#> # A tibble: 10 × 4
#> interval lower upper count
#> <ord> <dbl> <dbl> <dbl>
#> 1 [0, 10) 0 10 60.0
#> 2 [10, 20) 10 20 62.0
#> 3 [20, 30) 20 30 66.6
#> 4 [30, 40) 30 40 72.1
#> 5 [40, 50) 40 50 66.9
#> 6 [50, 60) 50 60 72.3
#> 7 [60, 70) 60 70 265.
#> 8 [70, 80) 70 80 213.
#> 9 [80, 90) 80 90 101.
#> 10 [90, Inf) 90 Inf 21.8
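The weighted split can be reproduced with plain arithmetic for the 400 people aged under 60. This sketch assumes the counts are shared in proportion to the census totals within the original interval, which is consistent with the output above but is not confirmed against the package internals; the names w5, w10 and split_400 are illustrative.

```r
# Census counts for the five-year bands from [0, 5) to [55, 60)
w5 <- c(3232100, 3524600, 3595900, 3394700, 3602100, 3901800,
        4148800, 3981600, 3755700, 3788700, 4123400, 4029000)

# Combine consecutive pairs of five-year bands into ten-year bands
# (matrix() fills column-wise, so each column holds one pair)
w10 <- colSums(matrix(w5, nrow = 2))

# Share the 400 people in proportion to the census weights
split_400 <- 400 * w10 / sum(w10)
round(split_400, 1)
```

Rounded to one decimal place, these match rows 1 to 6 of the table, and they sum to 400 by construction.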
reaggregate_rates() works like reaggregate_counts() but is set up for rates.
reaggregate_rates(
    bounds = c(0, 5, 10),
    rates = c(0.1, 0.2, 0.3),
    new_bounds = c(0, 2, 7, 10),
    population_bounds = dat$lower_bound,
    population_weights = dat$value
)
#> # A tibble: 4 × 4
#> interval lower upper rate
#> <ord> <dbl> <dbl> <dbl>
#> 1 [0, 2) 0 2 0.1
#> 2 [2, 7) 2 7 0.142
#> 3 [7, 10) 7 10 0.2
#> 4 [10, Inf) 10 Inf 0.3
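The rate for [2, 7) is the only one that mixes two source intervals, and it can be checked as a population-weighted mean. The sketch assumes each census five-year count is spread evenly over its single years; this is inferred from the output, not taken from the package source.

```r
# Single-year weights, assuming an even spread within each census band
w_0_5  <- 3232100 / 5    # weight for each of ages 0 to 4
w_5_10 <- 3524600 / 5    # weight for each of ages 5 to 9

# [2, 7) covers ages 2, 3, 4 (rate 0.1) and ages 5, 6 (rate 0.2)
rates   <- c(0.1, 0.1, 0.1, 0.2, 0.2)
weights <- c(rep(w_0_5, 3), rep(w_5_10, 2))

rate_2_7 <- weighted.mean(rates, weights)
round(rate_2_7, 3)
```

The result rounds to 0.142, matching the second row. The other rows each fall inside a single source interval, so their rates carry over unchanged.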
reaggregate_rates(
    bounds = 0:99,
    rates = rep(seq(25, 5, -5), each = 20),
    new_bounds = c(0, 5, 15, 45, 65),
    population_bounds = dat$lower_bound,
    population_weights = dat$value
)
#> # A tibble: 5 × 4
#> interval lower upper rate
#> <ord> <dbl> <dbl> <dbl>
#> 1 [0, 5) 0 5 25
#> 2 [5, 15) 5 15 25
#> 3 [15, 45) 15 45 19.9
#> 4 [45, 65) 45 65 13.9
#> 5 [65, Inf) 65 Inf 8.66