Users of the r2dii.match package reported that their R session crashed
when they fed match_name() with big data. A recent post acknowledged
the issue and promised examples on how to handle big data. This article
shows one approach: feed match_name() with a sequence of small chunks
of the loanbook dataset.
Setup
This example uses r2dii.match plus a few optional but convenient packages, including r2dii.data for example datasets.
# Packages
library(dplyr, warn.conflicts = FALSE)
library(fs)
library(vroom)
library(r2dii.data)
library(r2dii.match)
# Example datasets from the r2dii.data package
loanbook <- loanbook_demo
abcd <- abcd_demo
If the entire loanbook is too large, feed match_name() with smaller
chunks, so that any call to match_name(this_chunk, abcd) fits in memory.
More chunks take longer to run but use less memory; you’ll need to
experiment to find the number of chunks that works best for you.
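One way to pick a starting point for that experiment is to decide how many rows you are comfortable matching at once and derive the number of chunks from it. The sketch below uses base R only; the helper n_chunks_for() and the limit of 100 rows are hypothetical and not part of r2dii.match.

```r
# Hypothetical helper (not part of r2dii.match): the smallest number of
# chunks such that an even split gives about `max_rows` rows per chunk
# or fewer.
n_chunks_for <- function(n_rows, max_rows) {
  stopifnot(n_rows >= 1, max_rows >= 1)
  as.integer(ceiling(n_rows / max_rows))
}

# The demo loanbook has 283 rows; assume about 100 rows per call is safe.
chunks <- n_chunks_for(283, 100)
chunks
```

If a run still exhausts memory, lower the assumed row limit and try again.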
Say you try three chunks. You can take the loanbook dataset and then
use mutate() to add the new column chunk, which assigns each row to one
of the chunks:
chunks <- 3
chunked <- loanbook %>% mutate(chunk = as.integer(cut(row_number(), chunks)))
The total number of rows in the entire loanbook equals the sum of the
rows across chunks.
count(loanbook)
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 283
count(chunked, chunk)
#> # A tibble: 3 × 2
#> chunk n
#> <int> <int>
#> 1 1 95
#> 2 2 94
#> 3 3 94
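To see how cut() spreads rows across chunks, it may help to try it on a small vector first. This is a minimal base R sketch; the ten-row input is made up for illustration.

```r
# Assign ten hypothetical row numbers to three chunks, the same way the
# mutate() call above does with row_number().
chunks <- 3
rows <- 1:10
chunk <- as.integer(cut(rows, chunks))

# Every row lands in exactly one chunk, and rows stay in their original order
data.frame(row = rows, chunk = chunk)

# Chunk sizes; they can differ by one when the rows do not divide evenly
table(chunk)
```

Because cut() splits the range of row numbers into equal-width intervals, the chunks are contiguous and nearly equal in size, which matches the 95, 94, and 94 rows shown above.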
For each chunk you need to repeat this process:
- Match this chunk against the entire abcd dataset.
- If this chunk matched nothing, move to the next chunk.
- Else, save the result to a .csv file.
# This "output" directory is temporary; you may use any folder on your computer
out <- path(tempdir(), "output")
if (!dir_exists(out)) dir_create(out)
for (i in unique(chunked$chunk)) {
  # 1. Match this chunk against the entire `abcd` dataset.
  this_chunk <- filter(chunked, chunk == i)
  this_result <- match_name(this_chunk, abcd)

  # 2. If this chunk matched nothing, move to the next chunk
  matched_nothing <- nrow(this_result) == 0L
  if (matched_nothing) next

  # 3. Else, save the result to a .csv file.
  vroom_write(this_result, path(out, paste0(i, ".csv")))
}
The result is one .csv file per chunk.
dir_ls(out)
#> /tmp/RtmpkKv0Mc/output/1.csv /tmp/RtmpkKv0Mc/output/2.csv
#> /tmp/RtmpkKv0Mc/output/3.csv
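Because each chunk is saved as soon as it is matched, a long run that gets interrupted does not have to start from scratch. The sketch below is one possible variation of the loop above, written in base R so it runs without extra packages. The helper match_chunks(), the toy data, and the stand-in matching function keep_even() are all hypothetical; with real data you would pass something like function(x) match_name(x, abcd) instead.

```r
# Hypothetical helper (not part of r2dii.match): apply `match_fun` to each
# chunk and save non-empty results, skipping chunks saved by an earlier run.
match_chunks <- function(chunked, match_fun, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

  for (i in unique(chunked$chunk)) {
    file <- file.path(out_dir, paste0(i, ".csv"))
    # Already saved by a previous run: nothing to do
    if (file.exists(file)) next

    this_result <- match_fun(chunked[chunked$chunk == i, , drop = FALSE])
    # Chunks that match nothing leave no file, so a rerun matches them again
    if (nrow(this_result) == 0L) next

    # Note: write.csv() writes comma-separated files
    write.csv(this_result, file, row.names = FALSE)
  }

  invisible(list.files(out_dir, full.names = TRUE))
}

# Toy stand-ins for the chunked loanbook and for the matching step
toy <- data.frame(id = 1:6, chunk = rep(1:3, each = 2))
keep_even <- function(x) x[x$id %% 2 == 0, , drop = FALSE]

out_dir <- file.path(tempdir(), "resumable-output")
files <- match_chunks(toy, keep_even, out_dir)
# Calling it again skips every chunk that already has a file
files_again <- match_chunks(toy, keep_even, out_dir)
```

Skipping by file name assumes the chunk assignment is the same between runs; if you change the number of chunks, start with an empty output directory.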
You can read and combine all files in one step with vroom().
matched <- vroom(dir_ls(out))
#> Rows: 326 Columns: 23
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (17): id_loan, id_direct_loantaker, name_direct_loantaker, id_ultimate_p...
#> dbl (4): loan_size_outstanding, loan_size_credit_limit, chunk, score
#> lgl (2): isin_direct_loantaker, borderline
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
matched
#> # A tibble: 326 × 23
#> id_loan id_direct_loantaker name_direct_loantaker id_ultimate_parent
#> <chr> <chr> <chr> <chr>
#> 1 L1 C294 Vitale Group UP15
#> 2 L3 C292 Rowe-Rowe UP288
#> 3 L5 C305 Ring AG & Co. KGaA UP104
#> 4 L6 C304 Kassulke-Kassulke UP83
#> 5 L6 C304 Kassulke-Kassulke UP83
#> 6 L7 C227 Morissette Group UP134
#> 7 L7 C227 Morissette Group UP134
#> 8 L8 C303 Barone s.r.l. UP163
#> 9 L9 C301 Werner Werner AG & Co. KGaA UP138
#> 10 L9 C301 Werner Werner AG & Co. KGaA UP138
#> # ℹ 316 more rows
#> # ℹ 19 more variables: name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> # loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> # sector_classification_direct_loantaker <chr>, lei_direct_loantaker <chr>,
#> # isin_direct_loantaker <lgl>, chunk <dbl>, id_2dii <chr>, level <chr>,
#> # sector <chr>, sector_abcd <chr>, name <chr>, name_abcd <chr>, …
The matched result should be similar to that of
match_name(loanbook, abcd). Your next steps are documented in the Home
and Get started sections of the package website.
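The reason chunking can reproduce the whole result is that each loanbook row is matched on its own, so splitting the rows and combining the pieces afterwards should give the same rows. The base R sketch below illustrates the idea with an exact join via merge() on made-up data. The data frames loans and companies are hypothetical stand-ins for loanbook and abcd, and merge() is only a simplified substitute for the fuzzy matching that match_name() performs.

```r
# Toy stand-ins for the loanbook and abcd datasets
loans <- data.frame(id = 1:6, name = c("a", "b", "c", "a", "d", "b"))
companies <- data.frame(name = c("a", "b"), sector = c("power", "steel"))

# Match everything at once
whole <- merge(loans, companies, by = "name")

# Match chunk by chunk, then combine the pieces
chunk <- as.integer(cut(seq_len(nrow(loans)), 3))
pieces <- lapply(split(loans, chunk), merge, y = companies, by = "name")
combined <- do.call(rbind, pieces)

# Compare after putting both results in the same row order
sort_rows <- function(x) {
  x <- x[order(x$id), ]
  rownames(x) <- NULL
  x
}
all.equal(sort_rows(whole), sort_rows(combined))
```

With real data, a similar check on a small sample of the loanbook is a cheap way to gain confidence before running all chunks.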
Anecdote
I tested match_name() with datasets whose size (on disk, as a .csv
file) was 20MB for the loanbook dataset and 100MB for the abcd dataset.
Feeding match_name() with the entire loanbook crashed my R session. But
feeding it with a sequence of 30 chunks ran successfully in about 25
minutes; the combined result had over 10 million rows:
  sector       data
  ---------------------------------
1 automotive   [2,644,628 × 15]
2 aviation     [377,200 × 15]
3 cement       [942,526 × 15]
4 oil and gas  [1,551,805 × 15]
5 power        [7,353,772 × 15]
6 shipping     [4,194,067 × 15]
7 steel        [15 × 15]