
Match a loanbook to asset-based company data (abcd) by the name_* columns
Source: R/match_name.R
match_name() scores the match between names in a loanbook dataset (the columns
name_direct_loantaker, name_intermediate_parent* and
name_ultimate_parent) and names in an asset-based company dataset (column
name_company). The raw names are first transformed internally and assigned
aliases. The similarity between the loanbook aliases and the abcd aliases
is then scored using stringdist::stringsim().
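As a rough illustration of the scoring step only, the sketch below calls stringdist::stringsim() directly on a few alias-like strings. The strings are made up for illustration, and the method and p values mirror the defaults listed under Usage; this is not how match_name() builds or compares its aliases internally.

```r
# Illustrative only: score one made-up loanbook alias against made-up abcd aliases.
# Guarded so the snippet is skipped when stringdist is not installed.
if (requireNamespace("stringdist", quietly = TRUE)) {
  loanbook_alias <- "gallogroup"
  abcd_aliases <- c("gallogroup", "galloefigli", "hellwigag")

  # stringsim() returns a similarity between 0 and 1; 1 means identical strings.
  scores <- stringdist::stringsim(
    loanbook_alias,
    abcd_aliases,
    method = "jw", # Jaro-Winkler
    p = 0.1        # prefix factor, only used by method = "jw"
  )

  # Name the scores by candidate to see which alias scored what.
  print(setNames(round(scores, 3), abcd_aliases))
}
```

Only the first candidate is identical to the loanbook alias, so it is the only one that can reach a score of 1.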
Usage
match_name(
  loanbook,
  abcd,
  by_sector = TRUE,
  min_score = 0.8,
  method = "jw",
  p = 0.1,
  overwrite = NULL,
  join_id = NULL,
  sector_classification = default_sector_classification(),
  ...
)
Arguments
- loanbook, abcd
data frames structured like r2dii.data::loanbook_demo and r2dii.data::abcd_demo.
- by_sector
Should names only be compared if companies belong to the same sector?
- min_score
A number between 0 and 1 that sets the minimum score threshold. A score of 1 is a perfect match.
- method
Method for distance calculation. One of
c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). See stringdist::stringdist-metrics.
- p
Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p = 0, the Jaro distance is returned. Applies only to method = "jw". The default in match_name() is 0.1, as shown in Usage.
- overwrite
A data frame used to overwrite the sector and/or name columns of a particular direct loantaker or ultimate parent. To overwrite only sector, the value in the name column should be NA, and vice versa. This file can be used to manually match loanbook companies to abcd.
- join_id
A join specification passed to dplyr::inner_join(). If a character string, it assumes identical join columns between loanbook and abcd. If a named character vector, it uses the name as the join column of loanbook and the value as the join column of abcd.
- sector_classification
A data frame containing sector classifications in the same format as r2dii.data::sector_classifications. The default value is r2dii.data::sector_classifications.
- ...
Arguments passed on to stringdist::stringsim().
Value
A data frame with the same groups (if any) and columns as loanbook,
and the additional columns:
- id_2dii: an id used internally by match_name() to distinguish companies
- level: the level of granularity that the loan was matched at (e.g. direct_loantaker or ultimate_parent)
- sector: the sector of the loanbook company
- sector_abcd: the sector of the abcd company
- name: the name of the loanbook company
- name_abcd: the name of the abcd company
- score: the score of the match (manually set this to 1 prior to calling prioritize() to validate the match)
- source: determines the source of the match (equal to loanbook unless the match is from overwrite)
The returned rows depend on the argument min_score and the result of the
column score for each loan:
- If any row has score equal to 1, match_name() returns all rows where score equals 1, dropping all other rows.
- If no row has score equal to 1, match_name() returns all rows where score is equal to or greater than min_score.
- If there is no match, the output is a 0-row tibble with the expected column names, for type stability.
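The row-selection rule above can be sketched in base R as follows. This is a minimal sketch, not the package's implementation: pick_rows() and the candidates data frame are hypothetical, and grouping by id_loan is an assumption based on the phrase "for each loan".

```r
# Hypothetical helper illustrating the documented rule; NOT the r2dii.match code.
pick_rows <- function(matched, min_score = 0.8) {
  if (nrow(matched) == 0) {
    return(matched) # nothing to filter; keep the column structure
  }

  # Assumption: "for each loan" means grouping by id_loan.
  groups <- split(matched, matched$id_loan)

  kept <- lapply(groups, function(g) {
    if (any(g$score == 1)) {
      # Perfect matches exist: keep only those rows.
      g[g$score == 1, , drop = FALSE]
    } else {
      # Otherwise keep every row at or above the threshold.
      g[g$score >= min_score, , drop = FALSE]
    }
  })

  out <- do.call(rbind, kept)
  rownames(out) <- NULL
  out
}

# Made-up candidate matches for three loans.
candidates <- data.frame(
  id_loan = c("L1", "L1", "L2", "L2", "L3"),
  name_abcd = c("a", "b", "c", "d", "e"),
  score = c(1, 0.9, 0.85, 0.5, 0.3),
  stringsAsFactors = FALSE
)

result <- pick_rows(candidates, min_score = 0.8)
print(result)
```

In this made-up data, loan L1 keeps only its perfect match, L2 keeps the single row above the threshold, and L3 contributes no rows.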
Assigning aliases
The transformation process used to compare names between loanbook and abcd datasets applies best practices commonly used in name matching algorithms:
- Remove special characters.
- Replace language-specific characters.
- Abbreviate certain names to reduce their importance in the matching.
- Spell out numbers to increase their importance.
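To make these steps concrete, here is a minimal base R sketch of such a transformation. The helper sketch_alias(), its character mapping, its abbreviation list and the order of the steps are all assumptions chosen for illustration; the aliases that match_name() actually produces may differ.

```r
# Hypothetical helper for illustration; NOT the r2dii.match implementation.
sketch_alias <- function(x) {
  out <- tolower(x)

  # Replace language-specific characters (a tiny, assumed mapping).
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00e9", "\u00f1")
  to <- c("a", "o", "u", "e", "n")
  for (i in seq_along(from)) {
    out <- gsub(from[i], to[i], out, fixed = TRUE)
  }

  # Abbreviate common ownership terms so they weigh less in the comparison.
  out <- gsub("\\blimited\\b", "ltd", out, perl = TRUE)
  out <- gsub("\\bcompany\\b", "co", out, perl = TRUE)
  out <- gsub("\\bcorporation\\b", "corp", out, perl = TRUE)

  # Spell out digits so numbers weigh more in the comparison.
  words <- c("zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine")
  for (i in seq_along(words)) {
    out <- gsub(as.character(i - 1), words[i], out, fixed = TRUE)
  }

  # Remove special characters: everything that is not a lowercase ASCII letter.
  gsub("[^a-z]", "", out, perl = TRUE)
}

aliases <- sketch_alias(c("M\u00fcller Company Limited 3", "Gallo & Figli S.p.A."))
print(aliases)
```

Shortening terms such as "limited" means that two otherwise different names share fewer characters merely because of their legal form, while spelling out digits makes a differing number cost several characters instead of one.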
See also
Other main functions:
prioritize()
Examples
library(r2dii.data)
#>
#> Attaching package: ‘r2dii.data’
#> The following object is masked from ‘package:r2dii.match’:
#>
#> data_dictionary
library(tibble)
# Small data for examples
loanbook <- head(loanbook_demo, 50)
abcd <- head(abcd_demo, 50)
match_name(loanbook, abcd)
#> # A tibble: 6 × 22
#> id_loan id_direct_loantaker name_direct_loantaker id_ultimate_parent
#> <chr> <chr> <chr> <chr>
#> 1 L14 C296 Gallo e figli UP3
#> 2 L17 C290 De luca, De luca e De luca e f… UP6
#> 3 L21 C286 Gallo Group UP63
#> 4 L30 C274 Sanna SPA UP33
#> 5 L31 C273 Caputo SPA UP5
#> 6 L48 C256 Hellwig AG UP201
#> # ℹ 18 more variables: name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> # loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> # sector_classification_direct_loantaker <chr>, lei_direct_loantaker <chr>,
#> # isin_direct_loantaker <chr>, id_2dii <chr>, level <chr>, sector <chr>,
#> # sector_abcd <chr>, name <chr>, name_abcd <chr>, score <dbl>, source <chr>,
#> # borderline <lgl>
match_name(loanbook, abcd, min_score = 0.9)
#> # A tibble: 3 × 22
#> id_loan id_direct_loantaker name_direct_loantaker id_ultimate_parent
#> <chr> <chr> <chr> <chr>
#> 1 L14 C296 Gallo e figli UP3
#> 2 L17 C290 De luca, De luca e De luca e f… UP6
#> 3 L31 C273 Caputo SPA UP5
#> # ℹ 18 more variables: name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> # loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> # sector_classification_direct_loantaker <chr>, lei_direct_loantaker <chr>,
#> # isin_direct_loantaker <chr>, id_2dii <chr>, level <chr>, sector <chr>,
#> # sector_abcd <chr>, name <chr>, name_abcd <chr>, score <dbl>, source <chr>,
#> # borderline <lgl>
# match on LEI
loanbook <- tibble(
  sector_classification_system = "NACE",
  sector_classification_direct_loantaker = "D35.11",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Won't fuzzy match",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Won't fuzzy match",
  lei_direct_loantaker = "LEI123"
)
abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power",
  lei = "LEI123"
)
match_name(loanbook, abcd, join_id = c(lei_direct_loantaker = "lei"))
#> # A tibble: 1 × 15
#> sector_classification_system sector_classification_direct…¹ id_ultimate_parent
#> <chr> <chr> <chr>
#> 1 NACE D35.11 UP15
#> # ℹ abbreviated name: ¹sector_classification_direct_loantaker
#> # ℹ 12 more variables: name_ultimate_parent <chr>, id_direct_loantaker <chr>,
#> # name_direct_loantaker <chr>, lei_direct_loantaker <chr>, sector <chr>,
#> # borderline <lgl>, name_abcd <chr>, sector_abcd <chr>, score <dbl>,
#> # source <chr>, level <chr>, name <chr>
# Use your own `sector_classifications`
your_classifications <- tibble(
  sector = "power",
  borderline = FALSE,
  code = "D35.11",
  code_system = "XYZ"
)
loanbook <- tibble(
  sector_classification_system = "XYZ",
  sector_classification_direct_loantaker = "D35.11",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Alpine Knits India Pvt. Limited",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Yuamen Xinneng Thermal Power Co Ltd"
)
abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power"
)
match_name(loanbook, abcd, sector_classification = your_classifications)
#> # A tibble: 1 × 15
#> sector_classification_system sector_classification_direct…¹ id_ultimate_parent
#> <chr> <chr> <chr>
#> 1 XYZ D35.11 UP15
#> # ℹ abbreviated name: ¹sector_classification_direct_loantaker
#> # ℹ 12 more variables: name_ultimate_parent <chr>, id_direct_loantaker <chr>,
#> # name_direct_loantaker <chr>, id_2dii <chr>, level <chr>, sector <chr>,
#> # sector_abcd <chr>, name <chr>, name_abcd <chr>, score <dbl>, source <chr>,
#> # borderline <lgl>