Match a loanbook to asset-based company data (abcd) by the name_* columns
Source: R/match_name.R
match_name.Rd
match_name() scores the match between names in a loanbook dataset (the columns can be name_direct_loantaker, name_intermediate_parent* and name_ultimate_parent) and names in an asset-based company dataset (abcd; column name_company). The raw names are first transformed internally, and aliases are assigned. The similarity between the aliases in the loanbook and those in the abcd is then scored using stringdist::stringsim().
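To make the scoring idea concrete, here is a minimal base R sketch of an alias similarity score. It is not the package's implementation: match_name() delegates to stringdist::stringsim() with method = "jw" by default, whereas this sketch swaps in a Levenshtein-based similarity built on utils::adist(), so it runs without extra packages. The helper name lv_similarity() and the example aliases are hypothetical.

```r
# Hypothetical helper (not part of r2dii.match): a Levenshtein-based
# similarity in [0, 1], computed as 1 - distance / length of the longer string.
lv_similarity <- function(a, b) {
  distance <- mapply(
    function(x, y) utils::adist(x, y)[1, 1],
    a, b,
    USE.NAMES = FALSE
  )
  longest <- pmax(nchar(a), nchar(b))
  # Two empty strings are treated as identical
  ifelse(longest == 0, 1, 1 - distance / longest)
}

# Illustrative aliases; real aliases are produced internally by match_name()
loanbook_alias <- c("alpine knits", "alpine knits")
abcd_alias <- c("alpine knits", "alpine knit")

scores <- lv_similarity(loanbook_alias, abcd_alias)

# Compare against a threshold, in the spirit of min_score
scores >= 0.8
```

Identical aliases score 1, and the score falls as more edits are needed to turn one alias into the other.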
Usage
match_name(
  loanbook,
  abcd,
  by_sector = TRUE,
  min_score = 0.8,
  method = "jw",
  p = 0.1,
  overwrite = NULL,
  join_id = NULL,
  sector_classification = default_sector_classification(),
  ...
)
Arguments
- loanbook, abcd
Data frames structured like r2dii.data::loanbook_demo and r2dii.data::abcd_demo.
- by_sector
Should names only be compared if companies belong to the same sector?
- min_score
A number between 0 and 1 that sets the minimum score threshold. A score of 1 is a perfect match.
- method
Method for distance calculation. One of c("osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw", "soundex"). See stringdist::stringdist-metrics.
- p
Prefix factor for Jaro-Winkler distance. The valid range for p is 0 <= p <= 0.25. If p = 0, the Jaro distance is returned; match_name() defaults to p = 0.1. Applies only to method = "jw".
- overwrite
A data frame used to overwrite the sector and/or name columns of a particular direct loantaker or ultimate parent. To overwrite only sector, the value in the name column should be NA, and vice versa. This data frame can be used to manually match loanbook companies to abcd.
- join_id
A join specification passed to dplyr::inner_join(). If a character string, it assumes identical join columns between loanbook and abcd. If a named character vector, it uses the name as the join column of loanbook and the value as the join column of abcd.
- sector_classification
A data frame containing sector classifications in the same format as r2dii.data::sector_classifications. The default value is r2dii.data::sector_classifications.
- ...
Arguments passed on to stringdist::stringsim().
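The two forms of join_id described above can be illustrated with a small base R sketch. This is an assumption-laden illustration, not the package's code: match_name() passes the specification to dplyr::inner_join(), while the hypothetical helper join_by_id() below uses base merge(), which also performs an inner join by default. The data mirror the LEI example in the Examples section.

```r
# Hypothetical helper (not part of r2dii.match): resolve a join_id
# specification into the join columns of each data frame.
join_by_id <- function(loanbook, abcd, join_id) {
  # Unnamed string: same column name on both sides.
  # Named vector: the name is the loanbook column, the value the abcd column.
  by_loanbook <- if (is.null(names(join_id))) join_id else names(join_id)
  by_abcd <- unname(join_id)
  merge(loanbook, abcd, by.x = by_loanbook, by.y = by_abcd)
}

loanbook <- data.frame(
  name_direct_loantaker = "Won't fuzzy match",
  lei_direct_loantaker = "LEI123",
  stringsAsFactors = FALSE
)
abcd <- data.frame(
  name_company = "alpine knits india pvt. limited",
  sector = "power",
  lei = "LEI123",
  stringsAsFactors = FALSE
)

# Named vector: loanbook$lei_direct_loantaker is joined to abcd$lei
joined <- join_by_id(loanbook, abcd, c(lei_direct_loantaker = "lei"))
joined
```

Because the join is on an identifier, the two names do not need to resemble each other for the rows to pair up.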
Value
A data frame with the same groups (if any) and columns as loanbook, and the additional columns:
- id_2dii: an id used internally by match_name() to distinguish companies
- level: the level of granularity that the loan was matched at (e.g. direct_loantaker or ultimate_parent)
- sector: the sector of the loanbook company
- sector_abcd: the sector of the abcd company
- name: the name of the loanbook company
- name_abcd: the name of the abcd company
- score: the score of the match (manually set this to 1 prior to calling prioritize() to validate the match)
- source: the source of the match (equal to loanbook unless the match is from overwrite)
The returned rows depend on the argument min_score and the result of the column score for each loan:
- If any row has score equal to 1, match_name() returns all rows where score equals 1, dropping all other rows.
- If no row has score equal to 1, match_name() returns all rows where score is equal to or greater than min_score.
- If there is no match, the output is a 0-row tibble with the expected column names, for type stability.
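The row-selection rules above can be sketched in base R as follows. This is only an illustration of the stated rules, not the package's code; the helper select_rows(), the id_loan column and the scores are all hypothetical.

```r
# Hypothetical helper (not part of r2dii.match): apply the row-selection
# rules per loan to a data frame of candidate matches.
select_rows <- function(matches, min_score = 0.8) {
  # Does this row's loan have at least one perfect match?
  has_perfect <- ave(matches$score, matches$id_loan, FUN = max) == 1
  # With a perfect match, keep only perfect rows; otherwise apply min_score
  keep <- ifelse(has_perfect, matches$score == 1, matches$score >= min_score)
  matches[keep, , drop = FALSE]
}

# Illustrative candidate matches for three loans
matches <- data.frame(
  id_loan = c("L1", "L1", "L2", "L2", "L3"),
  name_abcd = c("a", "b", "c", "d", "e"),
  score = c(1, 0.95, 0.9, 0.7, 0.5),
  stringsAsFactors = FALSE
)

selected <- select_rows(matches, min_score = 0.8)
selected

# No candidate passes: the result has 0 rows but keeps the column names
none <- select_rows(matches[matches$id_loan == "L3", ], min_score = 0.8)
names(none)
```

Loan L1 keeps only its perfect match even though its second candidate is above the threshold, L2 keeps the single candidate at or above min_score, and L3 contributes no rows.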
Assigning aliases
The transformation process used to compare names between loanbook and abcd datasets applies best practices commonly used in name matching algorithms:
- Remove special characters.
- Replace language-specific characters.
- Abbreviate certain names to reduce their importance in the matching.
- Spell out numbers to increase their importance.
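The steps listed above can be sketched in base R as shown here. This is a rough illustration under assumptions: the helper to_alias_sketch() and its replacement tables (characters, abbreviations, number words) are hypothetical and much smaller than whatever match_name() uses internally, so its output will not reproduce the package's actual aliases.

```r
# Hypothetical helper (not part of r2dii.match): a simplified alias
# transformation with illustrative replacement tables.
to_alias_sketch <- function(x) {
  replace_fixed <- function(x, map) {
    for (from in names(map)) {
      x <- gsub(from, map[[from]], x, fixed = TRUE)
    }
    x
  }

  x <- tolower(x)

  # Replace language-specific characters (tiny illustrative subset)
  language_chars <- setNames(
    c("a", "o", "u", "e"),
    c("\u00e4", "\u00f6", "\u00fc", "\u00e9")
  )
  x <- replace_fixed(x, language_chars)

  # Spell out digits, padded with spaces so they become separate words
  digit_words <- setNames(
    paste0(" ", c("zero", "one", "two", "three", "four",
                  "five", "six", "seven", "eight", "nine"), " "),
    as.character(0:9)
  )
  x <- replace_fixed(x, digit_words)

  # Remove special characters: keep only lowercase letters and spaces
  x <- gsub("[^a-z ]", "", x)

  # Abbreviate common words (illustrative subset) on word boundaries
  abbreviations <- c(limited = "ltd", company = "co")
  for (word in names(abbreviations)) {
    x <- gsub(paste0("\\b", word, "\\b"), abbreviations[[word]], x, perl = TRUE)
  }

  # Collapse repeated whitespace and trim the ends
  trimws(gsub("\\s+", " ", x))
}

to_alias_sketch("Alpine Knits India Pvt. Limited")
to_alias_sketch("M\u00fcller & S\u00f6hne 3 Company")
```

Abbreviating frequent words such as "limited" shortens the part of the string that many unrelated companies share, so it contributes less to the similarity score, while spelling out a digit turns a single character into several and so gives it more weight.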
See also
Other main functions:
prioritize()
Examples
if (FALSE) { # \dontrun{
library(r2dii.data)
library(tibble)

# Small data for examples
loanbook <- head(loanbook_demo, 50)
abcd <- head(abcd_demo, 50)

match_name(loanbook, abcd)
match_name(loanbook, abcd, min_score = 0.9)

# match on LEI
loanbook <- tibble(
  sector_classification_system = "NACE",
  sector_classification_direct_loantaker = "D35.11",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Won't fuzzy match",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Won't fuzzy match",
  lei_direct_loantaker = "LEI123"
)

abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power",
  lei = "LEI123"
)

match_name(loanbook, abcd, join_id = c(lei_direct_loantaker = "lei"))

# Use your own `sector_classifications`
your_classifications <- tibble(
  sector = "power",
  borderline = FALSE,
  code = "D35.11",
  code_system = "XYZ"
)

loanbook <- tibble(
  sector_classification_system = "XYZ",
  sector_classification_direct_loantaker = "D35.11",
  id_ultimate_parent = "UP15",
  name_ultimate_parent = "Alpine Knits India Pvt. Limited",
  id_direct_loantaker = "C294",
  name_direct_loantaker = "Yuamen Xinneng Thermal Power Co Ltd"
)

abcd <- tibble(
  name_company = "alpine knits india pvt. limited",
  sector = "power"
)

match_name(loanbook, abcd, sector_classification = your_classifications)
# No cleanup needed: these examples do not change any options
} # }