
Match a loanbook to asset-based company data (abcd) by the name_* columns
Source: R/match_name.R
match_name() scores the match between names in a loanbook dataset (columns can be name_direct_loantaker, name_intermediate_parent* and name_ultimate_parent) with names in an asset-based company dataset (column name_company). The raw names are first internally transformed, and aliases are assigned. The similarity between aliases in each of the loanbook and abcd is scored using stringdist::stringsim().
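As a rough illustration of the scoring step, the sketch below calls stringdist::stringsim() directly on a few made-up aliases, with the same method and p values that appear as defaults under Usage. The alias strings and the 0.8 cut-off are invented for this example; match_name() derives its own aliases internally.

```r
library(stringdist)

# Hypothetical aliases, invented for this sketch.
loanbook_alias <- "alpineknitsindiapvtltd"
abcd_alias <- c(
  "alpineknitsindiapvtltd",         # identical alias
  "alpineknitsindialtd",            # close variant
  "yuamenxinnengthermalpowercoltd"  # unrelated company
)

# Similarity lies in [0, 1]; 1 means the two strings are identical.
scores <- stringsim(loanbook_alias, abcd_alias, method = "jw", p = 0.1)

# Which candidates would pass a threshold of 0.8?
data.frame(abcd_alias, score = scores, keep = scores >= 0.8)
```

Only the identical alias is guaranteed to score exactly 1; the other values depend on the chosen method and p.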
Usage
match_name(
  loanbook,
  abcd,
  by_sector = TRUE,
  min_score = 0.8,
  method = "jw",
  p = 0.1,
  overwrite = NULL,
  ald = deprecated(),
  ...
)
Arguments
- loanbook, abcd
Data frames structured like r2dii.data::loanbook_demo and r2dii.data::abcd_demo.
- by_sector
Should names only be compared if companies belong to the same sector?
- min_score
A number between 0 and 1 that sets the minimum score threshold. A score of 1 is a perfect match.
- method
Method for distance calculation. One of c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). See stringdist::stringdist-metrics.
- p
Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p = 0, the Jaro distance is returned. Applies only to method = "jw". Note that match_name() defaults to p = 0.1 (see Usage), whereas stringdist itself defaults to p = 0.
- overwrite
A data frame used to overwrite the sector and/or name columns of a particular direct loantaker or ultimate parent. To overwrite only sector, the value in the name column should be NA, and vice versa. This data frame can be used to manually match loanbook companies to abcd.
- ald
Deprecated; use abcd instead.
- ...
Arguments passed on to stringdist::stringsim().
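To make the role of p more concrete, the following sketch compares the plain Jaro similarity (p = 0) with the Jaro-Winkler similarity (p = 0.1) for two strings that share a prefix. The two strings are invented for illustration and are not taken from the package data.

```r
library(stringdist)

# Two hypothetical aliases that share the prefix "powerplant".
a <- "powerplantnorth"
b <- "powerplantsouth"

# p = 0 gives the plain Jaro similarity.
jaro <- stringsim(a, b, method = "jw", p = 0)

# p = 0.1 rewards the shared prefix, so the similarity goes up.
jaro_winkler <- stringsim(a, b, method = "jw", p = 0.1)

c(jaro = jaro, jaro_winkler = jaro_winkler)
```

A larger p makes names that start the same way score higher, which can help with company names that differ mainly in a suffix such as a legal form.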
Value
A data frame with the same groups (if any) and columns as loanbook, and the additional columns:
- id_2dii: an id used internally by match_name() to distinguish companies
- level: the level of granularity that the loan was matched at (e.g. direct_loantaker or ultimate_parent)
- sector: the sector of the loanbook company
- sector_abcd: the sector of the abcd company
- name: the name of the loanbook company
- name_abcd: the name of the abcd company
- score: the score of the match (manually set this to 1 prior to calling prioritize() to validate the match)
- source: determines the source of the match (equal to loanbook unless the match is from overwrite)
The returned rows depend on the argument min_score and the result of the column score for each loan:
- If any row has score equal to 1, match_name() returns all rows where score equals 1, dropping all other rows.
- If no row has score equal to 1, match_name() returns all rows where score is equal to or greater than min_score.
- If there is no match, the output is a 0-row tibble with the expected column names, for type stability.
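The row-selection rule described above can be sketched in base R on a toy table. This is an illustration only, not the package's implementation: pick_rows() is a hypothetical helper, the data are invented, and the sketch assumes that "each loan" means grouping by an id_loan column.

```r
# Hypothetical helper illustrating the rule; not part of r2dii.match.
pick_rows <- function(matches, min_score = 0.8) {
  groups <- split(matches, matches$id_loan)
  kept <- lapply(groups, function(g) {
    if (any(g$score == 1)) {
      # A perfect match exists: keep only the perfect matches.
      g[g$score == 1, , drop = FALSE]
    } else {
      # Otherwise keep every candidate at or above the threshold.
      g[g$score >= min_score, , drop = FALSE]
    }
  })
  out <- do.call(rbind, kept)
  rownames(out) <- NULL
  out
}

# Invented candidate matches for three loans.
matches <- data.frame(
  id_loan = c("L1", "L1", "L2", "L2", "L3"),
  name_abcd = c("a", "b", "c", "d", "e"),
  score = c(1, 0.95, 0.85, 0.7, 0.5)
)

selected <- pick_rows(matches)
selected
```

In this toy data, loan L1 keeps only candidate "a" (score 1), loan L2 keeps candidate "c" (score 0.85), and loan L3 contributes no rows.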
Package options
r2dii.match.sector_classifications: Allows you to use your own sector_classifications instead of the default. This feature is experimental and may be dropped and/or become a new argument to match_name().
Assigning aliases
The transformation process used to compare names between the loanbook and abcd datasets applies best practices commonly used in name-matching algorithms:
- Remove special characters.
- Replace language-specific characters.
- Abbreviate certain names to reduce their importance in the matching.
- Spell out numbers to increase their importance.
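The four steps above can be sketched in a few lines of base R. The function to_alias_sketch() is hypothetical and its replacement rules (which characters, which abbreviations) are assumptions chosen for illustration; the rules actually applied by match_name() may differ.

```r
# Hypothetical sketch of an alias transformation; not the package's rules.
to_alias_sketch <- function(x) {
  out <- tolower(x)

  # Replace a few language-specific characters (assumed mapping).
  out <- chartr("\u00e4\u00f6\u00fc\u00e9", "aoue", out)

  # Spell out digits so that numbers carry more weight in the comparison.
  digits <- c(
    "0" = "zero", "1" = "one", "2" = "two", "3" = "three", "4" = "four",
    "5" = "five", "6" = "six", "7" = "seven", "8" = "eight", "9" = "nine"
  )
  for (d in names(digits)) {
    out <- gsub(d, digits[[d]], out, fixed = TRUE)
  }

  # Abbreviate common legal forms so that they carry less weight.
  out <- gsub("\\blimited\\b", "ltd", out, perl = TRUE)
  out <- gsub("\\bcompany\\b", "co", out, perl = TRUE)

  # Remove special characters, including spaces and punctuation.
  gsub("[^a-z]", "", out)
}

to_alias_sketch("Alpine Knits India Pvt. Limited")
# → "alpineknitsindiapvtltd"
to_alias_sketch("alpine knits india pvt. limited")
# → "alpineknitsindiapvtltd"
```

Because both spellings reduce to the same alias, a similarity score computed on the aliases is 1 even though the raw names differ in capitalization.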
See also
Other main functions: prioritize()
Examples
library(r2dii.data)
library(tibble)
# Small data for examples
loanbook <- head(loanbook_demo, 50)
abcd <- head(abcd_demo, 50)
match_name(loanbook, abcd)
#> # A tibble: 2 × 28
#> id_loan id_direct_lo…¹ name_…² id_in…³ name_…⁴ id_ul…⁵ name_…⁶ loan_…⁷ loan_…⁸
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 L14 C296 Yuasfn… NA NA UP3 Affini… 187577 EUR
#> 2 L15 C295 Yuanbs… NA NA UP196 Noshir… 192217 EUR
#> # … with 19 more variables: loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> # sector_classification_input_type <chr>,
#> # sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> # flag_project_finance_loan <chr>, name_project <lgl>,
#> # lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>, id_2dii <chr>,
#> # level <chr>, sector <chr>, sector_abcd <chr>, name <chr>, …
match_name(loanbook, abcd, min_score = 0.9)
#> # A tibble: 1 × 28
#> id_loan id_direct_lo…¹ name_…² id_in…³ name_…⁴ id_ul…⁵ name_…⁶ loan_…⁷ loan_…⁸
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 L14 C296 Yuasfn… NA NA UP3 Affini… 187577 EUR
#> # … with 19 more variables: loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> # sector_classification_input_type <chr>,
#> # sector_classification_direct_loantaker <dbl>, fi_type <chr>,
#> # flag_project_finance_loan <chr>, name_project <lgl>,
#> # lei_direct_loantaker <lgl>, isin_direct_loantaker <lgl>, id_2dii <chr>,
#> # level <chr>, sector <chr>, sector_abcd <chr>, name <chr>, …
# Use your own `sector_classifications`
your_classifications <- tibble(
  sector = "power",
  borderline = FALSE,
  code = "3511",
  code_system = "XYZ"
)
restore <- options(r2dii.match.sector_classifications = your_classifications)
loanbook <- tibble(
  sector_classification_system = "XYZ",
  sector_classification_direct_loantaker = "3511",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Alpine Knits India Pvt. Limited",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Yuamen Xinneng Thermal Power Co Ltd"
)
abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power"
)
match_name(loanbook, abcd)
#> # A tibble: 1 × 15
#> sector_…¹ secto…² id_ul…³ name_…⁴ id_di…⁵ name_…⁶ id_2dii level sector secto…⁷
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 XYZ 3511 UP15 Alpine… C294 Yuamen… UP1 ulti… power power
#> # … with 5 more variables: name <chr>, name_abcd <chr>, score <dbl>,
#> # source <chr>, borderline <lgl>, and abbreviated variable names
#> # ¹sector_classification_system, ²sector_classification_direct_loantaker,
#> # ³id_ultimate_parent, ⁴name_ultimate_parent, ⁵id_direct_loantaker,
#> # ⁶name_direct_loantaker, ⁷sector_abcd
# Cleanup
options(restore)
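The Value section notes that score should be set to 1 manually to validate a match before calling prioritize(). A minimal sketch of that workflow follows. The 0.9 rule is only a stand-in for a human review and is not a recommendation, and the sketch assumes that prioritize() works on the rows whose score is 1, as that note suggests.

```r
library(r2dii.data)
library(r2dii.match)

# Candidate matches at or above the default min_score.
matched <- match_name(loanbook_demo, abcd_demo)

# Stand-in for manual review: every candidate scoring 0.9 or more is
# treated as confirmed here. In practice a person makes this decision.
validated <- matched
validated$score[validated$score >= 0.9] <- 1

# Pick the validated matches to carry forward.
prioritized <- prioritize(validated)
prioritized
```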