The package r2dii.match helps you to match counterparties from a loanbook to companies in a physical-asset database. Each section below shows you how.
Setup
We use the package r2dii.match to access the most important functions you’ll learn about. We also use example datasets from the package r2dii.data, and optional but convenient functions from the packages dplyr and readr.
library(dplyr, warn.conflicts = FALSE)
library(r2dii.data)
library(r2dii.match)
Format input data loanbook and asset-based company data (abcd)
We need two datasets: a “loanbook” and an “asset-based company
dataset” (abcd). These should be formatted like: loanbook_demo
and abcd_demo
(from the r2dii.data
package).
A note on sector classification: Matches are preferred when the
sector from the loanbook
matches the sector from the
abcd
. The loanbook
sector is determined
internally using the sector_classification_system
and
sector_classification_direct_loantaker
columns. Currently,
we only allow a couple specific values for
sector_classification_system
:
sector_classifications$code_system %>%
unique()
#> [1] "CNB" "GICS" "ISIC" "NACE" "NAICS" "PSIC" "SIC"
If you would like to use a different classification system, please raise an issue in r2dii.data and we can incorporate it.
loanbook_demo
#> # A tibble: 323 × 19
#> id_loan id_direct_loantaker name_direct_loantaker id_intermediate_pare…¹
#> <chr> <chr> <chr> <chr>
#> 1 L1 C294 Vitale Group NA
#> 2 L2 C293 Moen-Moen NA
#> 3 L3 C292 Rowe-Rowe IP5
#> 4 L4 C299 Fadel-Fadel NA
#> 5 L5 C305 Ring AG & Co. KGaA NA
#> 6 L6 C304 Kassulke-Kassulke NA
#> 7 L7 C227 Morissette Group NA
#> 8 L8 C303 Barone s.r.l. NA
#> 9 L9 C301 Werner Werner AG & Co. KG… IP10
#> 10 L10 C302 De rosa s.r.l. NA
#> # ℹ 313 more rows
#> # ℹ abbreviated name: ¹id_intermediate_parent_1
#> # ℹ 15 more variables: name_intermediate_parent_1 <chr>,
#> # id_ultimate_parent <chr>, name_ultimate_parent <chr>,
#> # loan_size_outstanding <dbl>, loan_size_outstanding_currency <chr>,
#> # loan_size_credit_limit <dbl>, loan_size_credit_limit_currency <chr>,
#> # sector_classification_system <chr>, …
abcd_demo
#> # A tibble: 4,972 × 14
#> company_id name_company lei sector technology production_unit year
#> <chr> <chr> <lgl> <chr> <chr> <chr> <int>
#> 1 181 Giordano, Giordano … NA power hydrocap MW 2020
#> 2 181 Giordano, Giordano … NA power hydrocap MW 2021
#> 3 181 Giordano, Giordano … NA power hydrocap MW 2022
#> 4 181 Giordano, Giordano … NA power hydrocap MW 2023
#> 5 181 Giordano, Giordano … NA power hydrocap MW 2024
#> 6 181 Giordano, Giordano … NA power hydrocap MW 2025
#> 7 525 Vitali, Vitali e Vi… NA cement integrate… tonnes per year 2020
#> 8 525 Vitali, Vitali e Vi… NA cement integrate… tonnes per year 2021
#> 9 525 Vitali, Vitali e Vi… NA cement integrate… tonnes per year 2022
#> 10 525 Vitali, Vitali e Vi… NA cement integrate… tonnes per year 2023
#> # ℹ 4,962 more rows
#> # ℹ 7 more variables: production <dbl>, emission_factor <dbl>,
#> # country_of_domicile <chr>, plant_location <chr>, is_ultimate_owner <lgl>,
#> # abcd_timestamp <chr>, emission_factor_unit <chr>
If you want to use loanbook_demo
and
abcd_demo
as template to create your own datasets, do
this:
- Write loanbook_demo.csv and abcd_demo.csv with:
# Writting to current working directory
loanbook_demo %>%
readr::write_csv(path = "loanbook_demo.csv")
abcd_demo %>%
readr::write_csv(path = "abcd_demo.csv")
- For each dataset, replace our demo data with your data.
- Save each dataset as, for example, your_loanbook.csv and your_abcd.csv.
- Read your datasets back into R with:
# Reading from current working directory
your_loanbook <- readr::read_csv("your_loanbook.csv")
your_abcd <- readr::read_csv("your_abcd.csv")
Here we continue to use the *_demo
datasets, pretending
they contain the data of your own.
# WARNING: Skip this to avoid overwriting your data with our demo data
your_loanbook <- loanbook_demo
your_abcd <- abcd_demo
Score the goodness of the match between the loanbook and abcd datasets
match_name()
scores the match between names in a
loanbook dataset (lbk) and names in an asset-based company dataset
(abcd). The names come from the columns
name_direct_loantaker
,
name_intermediate_parent_*
and
name_ultimate_parent
of the loanbook dataset, and from the
column name_company
of the a asset-based company dataset.
There can be any number of name_intermediate_parent_*
columns, where *
indicates the level up the corporate tree
from direct_loantaker
.
The raw names are internally transformed applying best-practices commonly used in name matching algorithms, such as:
- Remove special characters.
- Replace language specific characters.
- Abbreviate certain names to reduce their importance in the matching.
- Removing corporate suffixes when necessary.
- Spell out numbers to increase their importance.
The similarity is then scored between the internally-transformed
names of the loanbook against the abcd. (For more information on the
scoring algorithm used, see: stringdist::stringsim()
).
match_name(your_loanbook, your_abcd)
#> # A tibble: 307 × 28
#> id_loan id_direct_loantaker name_direct_loantaker id_intermediate_pare…¹
#> <chr> <chr> <chr> <chr>
#> 1 L1 C294 Vitale Group NA
#> 2 L3 C292 Rowe-Rowe IP5
#> 3 L5 C305 Ring AG & Co. KGaA NA
#> 4 L6 C304 Kassulke-Kassulke NA
#> 5 L6 C304 Kassulke-Kassulke NA
#> 6 L7 C227 Morissette Group NA
#> 7 L7 C227 Morissette Group NA
#> 8 L8 C303 Barone s.r.l. NA
#> 9 L9 C301 Werner Werner AG & Co. KG… IP10
#> 10 L9 C301 Werner Werner AG & Co. KG… IP10
#> # ℹ 297 more rows
#> # ℹ abbreviated name: ¹id_intermediate_parent_1
#> # ℹ 24 more variables: name_intermediate_parent_1 <chr>,
#> # id_ultimate_parent <chr>, name_ultimate_parent <chr>,
#> # loan_size_outstanding <dbl>, loan_size_outstanding_currency <chr>,
#> # loan_size_credit_limit <dbl>, loan_size_credit_limit_currency <chr>,
#> # sector_classification_system <chr>, …
match_name()
defaults to scoring matches between name
strings that belong to the same sector. Using
by_sector = FALSE
removes this limitation – increasing
computation time, and the number of potentially incorrect matches to
manually validate.
match_name(your_loanbook, your_abcd, by_sector = FALSE) %>%
nrow()
#> [1] 801
# Compare
match_name(your_loanbook, your_abcd, by_sector = TRUE) %>%
nrow()
#> [1] 307
min_score
allows you to minimum threshold
score
.
matched <- match_name(your_loanbook, your_abcd, min_score = 0.9)
range(matched$score)
#> [1] 0.9007692 1.0000000
Maybe overwrite matches
If you are happy with the matching coverage achieved, proceed to the
next step. Otherwise, you can manually add matches, not found
automatically by match_name()
. To do this, manually inspect
the abcd
and find a company you would like to match to your
loanbook. Once a match is found, use excel to write a .csv file similar
to overwrite_demo
,
where:
-
level
indicates the level that the manual match should be added to (e.g.direct_loantaker
) -
id_2dii
is the id of the loanbook company you would like to match (from the output ofmatch_name()
) -
name
is the abcd company you would like to manually link to -
sector
optionally you can also overwrite the sector. -
source
this can be used later to determine where all manual matches came from.
matched <- match_name(
your_loanbook, your_abcd,
min_score = 0.9, overwrite = overwrite_demo
)
#> Warning: You should only overwrite a sector at the level of the 'direct
#> loantaker' (DL). If you overwrite a sector at the level of the 'ultimate
#> parent' (UP) you consequently overwrite all children of that sector,
#> which most likely is a mistake.
Notice the warning.
Validate matches
Write the output of match_name()
into a .csv file
with:
# Writting to current working directory
matched %>%
readr::write_csv("matched.csv")
Compare, edit, and save the data manually:
- Open matched.csv with any spreadsheet editor (Excel, Google Sheets, etc.).
- Compare the columns
name
andname_abcd
manually to determine if the match is valid. Other information can be used in conjunction with just the names to ensure the two entities match (sector, internal information on the company structure, etc.) - Edit the data:
- If you are happy with the match, set the
score
value to1
. - Otherwise set or leave the
score
value to anything other than1
.
- If you are happy with the match, set the
- Save the edited file as, say, valid_matches.csv.
Re-read the edited file (validated) with:
# Reading from current working directory
valid_matches <- readr::read_csv("valid_matches.csv")
Prioritize validated matches by level
The validated dataset may have multiple matches per loan. Consider
the case where a loan is given to “Acme Power USA”, a subsidiary of
“Acme Power Co.”. There may be both “Acme Power USA” and “Acme Power
Co.” in the abcd
, and so there could be two valid matches
for this loan. To get the best match only, use prioritize()
– it picks rows where score
is 1 and level
per
loan is of highest priority()
:
# Pretend we validated the matched dataset
valid_matches <- matched
some_interesting_columns <- c("id_2dii", "level", "score")
valid_matches %>%
prioritize() %>%
select(all_of(some_interesting_columns))
#> # A tibble: 168 × 3
#> id_2dii level score
#> <chr> <chr> <dbl>
#> 1 DL147 direct_loantaker 1
#> 2 DL165 direct_loantaker 1
#> 3 DL310 direct_loantaker 1
#> 4 DL100 direct_loantaker 1
#> 5 DL5 direct_loantaker 1
#> 6 DL92 direct_loantaker 1
#> 7 DL172 direct_loantaker 1
#> 8 DL3 direct_loantaker 1
#> 9 DL74 direct_loantaker 1
#> 10 DL91 direct_loantaker 1
#> # ℹ 158 more rows
By default, highest priority refers to the most granular match
(direct_loantaker
). The default priority is set internally
via prioritize_levels()
.
prioritize_level(matched)
#> [1] "direct_loantaker" "intermediate_parent_1" "ultimate_parent"
You may use a different priority. One way to do that is to pass a
function to priority
. For example, use rev
to
reverse the default priority.
matched %>%
prioritize(priority = rev) %>%
select(all_of(some_interesting_columns))
#> # A tibble: 168 × 3
#> id_2dii level score
#> <chr> <chr> <dbl>
#> 1 UP220 ultimate_parent 1
#> 2 UP115 ultimate_parent 1
#> 3 UP43 ultimate_parent 1
#> 4 UP74 ultimate_parent 1
#> 5 UP258 ultimate_parent 1
#> 6 UP154 ultimate_parent 1
#> 7 UP13 ultimate_parent 1
#> 8 UP21 ultimate_parent 1
#> 9 UP156 ultimate_parent 1
#> 10 UP149 ultimate_parent 1
#> # ℹ 158 more rows