Skip to contents

The package r2dii.match helps you to match counterparties from a loanbook to companies in a physical-asset database. Each section below shows you how.

Setup

We use the package r2dii.match to access the most important functions you’ll learn about. We also use example datasets from the package r2dii.data, and optional but convenient functions from the packages dplyr and readr.

library(dplyr, warn.conflicts = FALSE)
library(r2dii.data)
library(r2dii.match)

Format input data loanbook and asset-based company data (abcd)

We need two datasets: a “loanbook” and an “asset-based company dataset” (abcd). These should be formatted like: loanbook_demo and abcd_demo (from the r2dii.data package).

A note on sector classification: Matches are preferred when the sector from the loanbook matches the sector from the abcd. The loanbook sector is determined internally using the sector_classification_system and sector_classification_direct_loantaker columns. Currently, we only allow a couple specific values for sector_classification_system:

sector_classifications$code_system %>%
  unique()
#> [1] "CNB"   "GICS"  "ISIC"  "NACE"  "NAICS" "PSIC"  "SIC"

If you would like to use a different classification system, please raise an issue in r2dii.data and we can incorporate it.

loanbook_demo
#> # A tibble: 323 × 19
#>    id_loan id_direct_loantaker name_direct_loantaker      id_intermediate_pare…¹
#>    <chr>   <chr>               <chr>                      <chr>                 
#>  1 L1      C294                Vitale Group               NA                    
#>  2 L2      C293                Moen-Moen                  NA                    
#>  3 L3      C292                Rowe-Rowe                  IP5                   
#>  4 L4      C299                Fadel-Fadel                NA                    
#>  5 L5      C305                Ring AG & Co. KGaA         NA                    
#>  6 L6      C304                Kassulke-Kassulke          NA                    
#>  7 L7      C227                Morissette Group           NA                    
#>  8 L8      C303                Barone s.r.l.              NA                    
#>  9 L9      C301                Werner Werner AG & Co. KG… IP10                  
#> 10 L10     C302                De rosa s.r.l.             NA                    
#> # ℹ 313 more rows
#> # ℹ abbreviated name: ¹​id_intermediate_parent_1
#> # ℹ 15 more variables: name_intermediate_parent_1 <chr>,
#> #   id_ultimate_parent <chr>, name_ultimate_parent <chr>,
#> #   loan_size_outstanding <dbl>, loan_size_outstanding_currency <chr>,
#> #   loan_size_credit_limit <dbl>, loan_size_credit_limit_currency <chr>,
#> #   sector_classification_system <chr>, …

abcd_demo
#> # A tibble: 4,972 × 14
#>    company_id name_company         lei   sector technology production_unit  year
#>    <chr>      <chr>                <lgl> <chr>  <chr>      <chr>           <int>
#>  1 181        Giordano, Giordano … NA    power  hydrocap   MW               2020
#>  2 181        Giordano, Giordano … NA    power  hydrocap   MW               2021
#>  3 181        Giordano, Giordano … NA    power  hydrocap   MW               2022
#>  4 181        Giordano, Giordano … NA    power  hydrocap   MW               2023
#>  5 181        Giordano, Giordano … NA    power  hydrocap   MW               2024
#>  6 181        Giordano, Giordano … NA    power  hydrocap   MW               2025
#>  7 525        Vitali, Vitali e Vi… NA    cement integrate… tonnes per year  2020
#>  8 525        Vitali, Vitali e Vi… NA    cement integrate… tonnes per year  2021
#>  9 525        Vitali, Vitali e Vi… NA    cement integrate… tonnes per year  2022
#> 10 525        Vitali, Vitali e Vi… NA    cement integrate… tonnes per year  2023
#> # ℹ 4,962 more rows
#> # ℹ 7 more variables: production <dbl>, emission_factor <dbl>,
#> #   country_of_domicile <chr>, plant_location <chr>, is_ultimate_owner <lgl>,
#> #   abcd_timestamp <chr>, emission_factor_unit <chr>

If you want to use loanbook_demo and abcd_demo as template to create your own datasets, do this:

  • Write loanbook_demo.csv and abcd_demo.csv with:
# Writting to current working directory 
loanbook_demo %>% 
  readr::write_csv(path = "loanbook_demo.csv")

abcd_demo %>% 
  readr::write_csv(path = "abcd_demo.csv")
  • For each dataset, replace our demo data with your data.
  • Save each dataset as, for example, your_loanbook.csv and your_abcd.csv.
  • Read your datasets back into R with:
# Reading from current working directory 
your_loanbook <- readr::read_csv("your_loanbook.csv")
your_abcd <- readr::read_csv("your_abcd.csv")

Here we continue to use the *_demo datasets, pretending they contain the data of your own.

# WARNING: Skip this to avoid overwriting your data with our demo data
your_loanbook <- loanbook_demo
your_abcd <- abcd_demo

Score the goodness of the match between the loanbook and abcd datasets

match_name() scores the match between names in a loanbook dataset (lbk) and names in an asset-based company dataset (abcd). The names come from the columns name_direct_loantaker, name_intermediate_parent_* and name_ultimate_parent of the loanbook dataset, and from the column name_company of the a asset-based company dataset. There can be any number of name_intermediate_parent_* columns, where * indicates the level up the corporate tree from direct_loantaker.

The raw names are internally transformed applying best-practices commonly used in name matching algorithms, such as:

  • Remove special characters.
  • Replace language specific characters.
  • Abbreviate certain names to reduce their importance in the matching.
  • Removing corporate suffixes when necessary.
  • Spell out numbers to increase their importance.

The similarity is then scored between the internally-transformed names of the loanbook against the abcd. (For more information on the scoring algorithm used, see: stringdist::stringsim()).

match_name(your_loanbook, your_abcd)
#> # A tibble: 307 × 28
#>    id_loan id_direct_loantaker name_direct_loantaker      id_intermediate_pare…¹
#>    <chr>   <chr>               <chr>                      <chr>                 
#>  1 L1      C294                Vitale Group               NA                    
#>  2 L3      C292                Rowe-Rowe                  IP5                   
#>  3 L5      C305                Ring AG & Co. KGaA         NA                    
#>  4 L6      C304                Kassulke-Kassulke          NA                    
#>  5 L6      C304                Kassulke-Kassulke          NA                    
#>  6 L7      C227                Morissette Group           NA                    
#>  7 L7      C227                Morissette Group           NA                    
#>  8 L8      C303                Barone s.r.l.              NA                    
#>  9 L9      C301                Werner Werner AG & Co. KG… IP10                  
#> 10 L9      C301                Werner Werner AG & Co. KG… IP10                  
#> # ℹ 297 more rows
#> # ℹ abbreviated name: ¹​id_intermediate_parent_1
#> # ℹ 24 more variables: name_intermediate_parent_1 <chr>,
#> #   id_ultimate_parent <chr>, name_ultimate_parent <chr>,
#> #   loan_size_outstanding <dbl>, loan_size_outstanding_currency <chr>,
#> #   loan_size_credit_limit <dbl>, loan_size_credit_limit_currency <chr>,
#> #   sector_classification_system <chr>, …

match_name() defaults to scoring matches between name strings that belong to the same sector. Using by_sector = FALSE removes this limitation – increasing computation time, and the number of potentially incorrect matches to manually validate.

match_name(your_loanbook, your_abcd, by_sector = FALSE) %>%
  nrow()
#> [1] 801

# Compare
match_name(your_loanbook, your_abcd, by_sector = TRUE) %>%
  nrow()
#> [1] 307

min_score allows you to minimum threshold score.

matched <- match_name(your_loanbook, your_abcd, min_score = 0.9)
range(matched$score)
#> [1] 0.9007692 1.0000000

Maybe overwrite matches

If you are happy with the matching coverage achieved, proceed to the next step. Otherwise, you can manually add matches, not found automatically by match_name(). To do this, manually inspect the abcd and find a company you would like to match to your loanbook. Once a match is found, use excel to write a .csv file similar to overwrite_demo, where:

  • level indicates the level that the manual match should be added to (e.g. direct_loantaker)
  • id_2dii is the id of the loanbook company you would like to match (from the output of match_name())
  • name is the abcd company you would like to manually link to
  • sector optionally you can also overwrite the sector.
  • source this can be used later to determine where all manual matches came from.
matched <- match_name(
  your_loanbook, your_abcd,
  min_score = 0.9, overwrite = overwrite_demo
)
#> Warning: You should only overwrite a sector at the level of the 'direct
#> loantaker' (DL). If you overwrite a sector at the level of the 'ultimate
#> parent' (UP) you consequently overwrite all children of that sector,
#> which most likely is a mistake.

Notice the warning.

Validate matches

Write the output of match_name() into a .csv file with:

# Writting to current working directory
matched %>%
  readr::write_csv("matched.csv")

Compare, edit, and save the data manually:

  • Open matched.csv with any spreadsheet editor (Excel, Google Sheets, etc.).
  • Compare the columns name and name_abcd manually to determine if the match is valid. Other information can be used in conjunction with just the names to ensure the two entities match (sector, internal information on the company structure, etc.)
  • Edit the data:
    • If you are happy with the match, set the score value to 1.
    • Otherwise set or leave the score value to anything other than 1.
  • Save the edited file as, say, valid_matches.csv.

Re-read the edited file (validated) with:

# Reading from current working directory
valid_matches <- readr::read_csv("valid_matches.csv")

Prioritize validated matches by level

The validated dataset may have multiple matches per loan. Consider the case where a loan is given to “Acme Power USA”, a subsidiary of “Acme Power Co.”. There may be both “Acme Power USA” and “Acme Power Co.” in the abcd, and so there could be two valid matches for this loan. To get the best match only, use prioritize() – it picks rows where score is 1 and level per loan is of highest priority():

# Pretend we validated the matched dataset
valid_matches <- matched

some_interesting_columns <- c("id_2dii", "level", "score")

valid_matches %>%
  prioritize() %>%
  select(all_of(some_interesting_columns))
#> # A tibble: 168 × 3
#>    id_2dii level            score
#>    <chr>   <chr>            <dbl>
#>  1 DL147   direct_loantaker     1
#>  2 DL165   direct_loantaker     1
#>  3 DL310   direct_loantaker     1
#>  4 DL100   direct_loantaker     1
#>  5 DL5     direct_loantaker     1
#>  6 DL92    direct_loantaker     1
#>  7 DL172   direct_loantaker     1
#>  8 DL3     direct_loantaker     1
#>  9 DL74    direct_loantaker     1
#> 10 DL91    direct_loantaker     1
#> # ℹ 158 more rows

By default, highest priority refers to the most granular match (direct_loantaker). The default priority is set internally via prioritize_levels().

prioritize_level(matched)
#> [1] "direct_loantaker"      "intermediate_parent_1" "ultimate_parent"

You may use a different priority. One way to do that is to pass a function to priority. For example, use rev to reverse the default priority.

matched %>%
  prioritize(priority = rev) %>%
  select(all_of(some_interesting_columns))
#> # A tibble: 168 × 3
#>    id_2dii level           score
#>    <chr>   <chr>           <dbl>
#>  1 UP220   ultimate_parent     1
#>  2 UP115   ultimate_parent     1
#>  3 UP43    ultimate_parent     1
#>  4 UP74    ultimate_parent     1
#>  5 UP258   ultimate_parent     1
#>  6 UP154   ultimate_parent     1
#>  7 UP13    ultimate_parent     1
#>  8 UP21    ultimate_parent     1
#>  9 UP156   ultimate_parent     1
#> 10 UP149   ultimate_parent     1
#> # ℹ 158 more rows