Skip to contents

Setup

We use the r2dii.* suite of packages to access the most important functions you’ll learn about. We also use example datasets from the package r2dii.data, and optional but convenient functions from the packages dplyr and readxl.

Import data

We need two datasets to begin: a loanbook and an asset-based company dataset (abcd).

To simulate having these data files prepared, we will use the demo data included in the r2dii.data package to create example XLS files using the writexl package.

library(writexl)

writexl::write_xlsx(
  x = r2dii.data::loanbook_demo,
  path = file.path(tempdir(), "loanbook.xlsx")
)

writexl::write_xlsx(
  x = r2dii.data::abcd_demo,
  path = file.path(tempdir(), "abcd.xlsx")
)

Typically, these data files are stored in XLS files on your computer. As an example, they could be imported as in the code below using the readxl package, however, if you use custom data that does not strictly follow the format as defined in the previous section, you may need to adjust the import code and/or do some data manipulation in R after importing.

loanbook <- readxl::read_excel(path = file.path(tempdir(), "loanbook.xlsx"))

loanbook
#> # A tibble: 283 × 13
#>    id_loan id_direct_loantaker name_direct_loantaker       id_ultimate_parent
#>    <chr>   <chr>               <chr>                       <chr>             
#>  1 L1      C294                Vitale Group                UP15              
#>  2 L2      C293                Moen-Moen                   UP84              
#>  3 L3      C292                Rowe-Rowe                   UP288             
#>  4 L4      C299                Fadel-Fadel                 UP54              
#>  5 L5      C305                Ring AG & Co. KGaA          UP104             
#>  6 L6      C304                Kassulke-Kassulke           UP83              
#>  7 L7      C227                Morissette Group            UP134             
#>  8 L8      C303                Barone s.r.l.               UP163             
#>  9 L9      C301                Werner Werner AG & Co. KGaA UP138             
#> 10 L10     C302                De rosa s.r.l.              UP32              
#> # ℹ 273 more rows
#> # ℹ 9 more variables: name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> #   loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_direct_loantaker <chr>, lei_direct_loantaker <chr>,
#> #   isin_direct_loantaker <lgl>

abcd <- readxl::read_excel(path = file.path(tempdir(), "abcd.xlsx"))

abcd
#> # A tibble: 4,972 × 12
#>    company_id name_company         lei   sector technology production_unit  year
#>    <chr>      <chr>                <chr> <chr>  <chr>      <chr>           <dbl>
#>  1 175        Giordano, Giordano … 1850… power  hydrocap   MW               2020
#>  2 175        Giordano, Giordano … 1850… power  hydrocap   MW               2021
#>  3 175        Giordano, Giordano … 1850… power  hydrocap   MW               2022
#>  4 175        Giordano, Giordano … 1850… power  hydrocap   MW               2023
#>  5 175        Giordano, Giordano … 1850… power  hydrocap   MW               2024
#>  6 175        Giordano, Giordano … 1850… power  hydrocap   MW               2025
#>  7 526        Vitali, Vitali e Vi… 8871… cement integrate… tonnes per year  2020
#>  8 526        Vitali, Vitali e Vi… 8871… cement integrate… tonnes per year  2021
#>  9 526        Vitali, Vitali e Vi… 8871… cement integrate… tonnes per year  2022
#> 10 526        Vitali, Vitali e Vi… 8871… cement integrate… tonnes per year  2023
#> # ℹ 4,962 more rows
#> # ℹ 5 more variables: production <dbl>, emission_factor <dbl>,
#> #   plant_location <chr>, is_ultimate_owner <lgl>, emission_factor_unit <chr>

Score the match precision between the loanbook and abcd datasets

match_name() scores the match between names in a loanbook dataset and names in an asset-based company dataset. The names come from the columns name_direct_loantaker and name_ultimate_parent of the loanbook dataset, and from the column name_company of the asset-based company dataset. In the loan book data set, it is possible to optionally add any number of name_intermediate_parent_* columns, where * indicates the level up the corporate tree from direct_loantaker.

The raw names are internally transformed applying best-practices commonly used in name matching algorithms, such as:

  • Remove special characters
  • Replace language specific characters
  • Abbreviate certain names to reduce their importance in the matching
  • Remove corporate suffixes when necessary
  • Spell out numbers to increase their importance

The similarity is then scored between the internally-transformed names of the loanbook against the names in the abcd. For more information on the scoring algorithm used, see stringdist::stringsim().

match_name(loanbook, abcd)
#> # A tibble: 326 × 22
#>    id_loan id_direct_loantaker name_direct_loantaker       id_ultimate_parent
#>    <chr>   <chr>               <chr>                       <chr>             
#>  1 L1      C294                Vitale Group                UP15              
#>  2 L3      C292                Rowe-Rowe                   UP288             
#>  3 L5      C305                Ring AG & Co. KGaA          UP104             
#>  4 L6      C304                Kassulke-Kassulke           UP83              
#>  5 L6      C304                Kassulke-Kassulke           UP83              
#>  6 L7      C227                Morissette Group            UP134             
#>  7 L7      C227                Morissette Group            UP134             
#>  8 L8      C303                Barone s.r.l.               UP163             
#>  9 L9      C301                Werner Werner AG & Co. KGaA UP138             
#> 10 L9      C301                Werner Werner AG & Co. KGaA UP138             
#> # ℹ 316 more rows
#> # ℹ 18 more variables: name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> #   loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> #   loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> #   sector_classification_direct_loantaker <chr>, lei_direct_loantaker <chr>,
#> #   isin_direct_loantaker <lgl>, id_2dii <chr>, level <chr>, sector <chr>,
#> #   sector_abcd <chr>, name <chr>, name_abcd <chr>, score <dbl>, …

match_name() defaults to scoring matches between name strings that belong to the same sector. Using by_sector = FALSE removes this limitation – increasing computation time, and the number of potentially incorrect matches to manually validate.

match_name(loanbook, abcd, by_sector = FALSE) %>% nrow()
#> [1] 656

# Compare to
match_name(loanbook, abcd, by_sector = TRUE) %>% nrow()
#> [1] 326

The min_score argument allows you to set a minimum threshold score.

match_name(loanbook, abcd, min_score = 0.9) %>%
  pull(score) %>%
  range()
#> [1] 0.9007692 1.0000000

Maybe overwrite matches

If you are happy with the matching coverage achieved, proceed to the next step. Otherwise, you can manually add matches, not found automatically by match_name(). To do this, manually inspect the abcd and find a company you would like to match to your loanbook. Once a match is found, use excel to write a .csv file similar to r2dii.data::overwrite_demo, where:

  • level indicates the level that the manual match should be added to (e.g. direct_loantaker)
  • id_2dii is the id of the loanbook company you would like to match (from the output of match_name())
  • name is the abcd company you would like to manually link to
  • sector optionally you can also overwrite the sector.
  • source this can be used later to determine where all manual matches came from.
matched <-
  match_name(
    loanbook = loanbook,
    abcd = abcd,
    min_score = 0.9,
    overwrite = r2dii.data::overwrite_demo
  )
#> Warning: You should only overwrite a sector at the level of the 'direct
#> loantaker' (DL). If you overwrite a sector at the level of the 'ultimate
#> parent' (UP) you consequently overwrite all children of that sector,
#> which most likely is a mistake.

Notice the warning.

Validate matches

Write the output of match_name() into a .csv file, for example using readr to save a CSV file:

readr::write_csv(x = matched, file = file.path(tempdir(), "matched.csv"))

Compare, edit, and save the data manually:

  • Open matched.csv with any spreadsheet editor (Excel, Google Sheets, etc.).
  • Compare the columns name and name_abcd manually to determine if the match is valid. Other information can be used in conjunction with just the names to ensure the two entities match (sector, internal information on the company structure, etc.)
  • Edit the data:
    • If you are happy with the match, set the score value to 1.
    • Otherwise set or leave the score value to anything other than 1.
  • Save the edited file as, for example, valid_matches.csv.

Re-import the edited file (validated), for example using readr to read a CSV file:

valid_matches <- readr::read_csv(file = file.path(tempdir(), "valid_matches.csv"), show_col_types = FALSE)

For more information on validating matches, please see the documentation for prioritize().

Prioritize validated matches by level

The validated dataset may have multiple matches per loan. Consider the case where a loan is given to “Acme Power USA”, a subsidiary of “Acme Power Co.”. There may be both “Acme Power USA” and “Acme Power Co.” in the abcd, and so there could be two valid matches for this loan. To get the best match only, use prioritize() – it picks rows where score is 1 and level per loan is of highest priority:

some_interesting_columns <- c("id_2dii", "level", "score")

prioritized_matches <-
  valid_matches %>%
  prioritize()

prioritized_matches %>% select(all_of(some_interesting_columns))
#> # A tibble: 175 × 3
#>    id_2dii level            score
#>    <chr>   <chr>            <dbl>
#>  1 DL129   direct_loantaker     1
#>  2 DL144   direct_loantaker     1
#>  3 DL270   direct_loantaker     1
#>  4 DL86    direct_loantaker     1
#>  5 DL5     direct_loantaker     1
#>  6 DL80    direct_loantaker     1
#>  7 DL150   direct_loantaker     1
#>  8 DL3     direct_loantaker     1
#>  9 DL65    direct_loantaker     1
#> 10 DL79    direct_loantaker     1
#> # ℹ 165 more rows

By default, highest priority refers to the most granular match (direct_loantaker). The default priority is set internally via prioritize_level().

prioritize_level(valid_matches)
#> [1] "direct_loantaker" "ultimate_parent"

You may use a different priority. One way to do that is to pass a function to priority. For example, use rev() to reverse the default priority.

valid_matches %>%
  prioritize(priority = rev) %>%
  select(all_of(some_interesting_columns))
#> # A tibble: 175 × 3
#>    id_2dii level           score
#>    <chr>   <chr>           <dbl>
#>  1 UP190   ultimate_parent     1
#>  2 UP101   ultimate_parent     1
#>  3 UP39    ultimate_parent     1
#>  4 UP63    ultimate_parent     1
#>  5 UP224   ultimate_parent     1
#>  6 UP132   ultimate_parent     1
#>  7 UP12    ultimate_parent     1
#>  8 UP20    ultimate_parent     1
#>  9 UP134   ultimate_parent     1
#> 10 UP127   ultimate_parent     1
#> # ℹ 165 more rows

A Note on Sector Classifications and the borderline Flag

A note on sector classification: Matches are preferred when the sector from the loanbook matches the sector from the abcd. The loanbook sector is determined internally using the sector_classification_system and sector_classification_direct_loantaker columns. Currently, we only allow a couple specific values for sector_classification_system:

GICS, ISIC, NACE, NAICS, PSIC, SIC

If you would like to use a different classification system, please raise an issue in r2dii.data and we can incorporate it.

There are a zoo of sector classification code systems out there. Some are granular, some are not. Since we currently cover a particular portion of the supply chain (i.e. production), it is important we try to only match the ABCD with companies that are actually active in this portion of the supply chain.

An issue arises when, for example, a company is classified in the “power transmission” sector. In a perfect world, these companies would produce no electricity, and we would not try to match them. In practice, however, we find there is often overlap. For this reason, we introduced the borderline flag.

In the example below, we see two classification codes coming from the SIC classification standard:

r2dii.data::nace_classification %>% 
  filter(code %in% c("D35.11", "D35.14"))
#> # A tibble: 2 × 6
#>   original_code description                      code  sector borderline version
#>   <chr>         <chr>                            <chr> <chr>  <lgl>      <chr>  
#> 1 35.11         35.11 Production of electricity… D35.… power  FALSE      2.1    
#> 2 35.14         35.14 Distribution of electrici… D35.… power  TRUE       2.1

Notice that the code D35.11 corresponds to power generation. This is an identical match to PACTA’s power sector, and thus the borderline flag is set to FALSE. In contrast, code D35.14 corresponds to the distribution of electricity. In a perfect world, we would set this code to not in scope, however there is still a chance that these companies produce electricity. For this reason, we have mapped it to power with borderline = TRUE.

In practice, if a company has a borderline of TRUE and is matched, then consider the company in scope. If it has a borderline of TRUE and isn’t matched, then consider it out of scope.

Calculate targets

You can calculate scenario targets using two different approaches: Market Share Approach, or Sectoral Decarbonization Approach.

Market Share Approach

The Market Share Approach is used to calculate scenario targets for the production of a technology in a sector. For example, we can use this approach to set targets for the production of electric vehicles in the automotive sector. This approach is recommended for sectors where a granular technology scenario roadmap exists.

Targets can be set at the portfolio level:

# Use these datasets to practice but eventually you should use your own data.
scenario <- r2dii.data::scenario_demo_2020
regions <- r2dii.data::region_isos_demo

market_share_targets_portfolio <-
  target_market_share(
    data = prioritized_matches,
    abcd = abcd,
    scenario = scenario,
    region_isos = regions
  )

market_share_targets_portfolio
#> # A tibble: 1,076 × 10
#>    sector     technology  year region scenario_source metric     production
#>    <chr>      <chr>      <dbl> <chr>  <chr>           <chr>           <dbl>
#>  1 automotive electric    2020 global demo_2020       projected     139052.
#>  2 automotive electric    2020 global demo_2020       target_cps    139052.
#>  3 automotive electric    2020 global demo_2020       target_sds    139052.
#>  4 automotive electric    2020 global demo_2020       target_sps    139052.
#>  5 automotive electric    2021 global demo_2020       projected     140395.
#>  6 automotive electric    2021 global demo_2020       target_cps    140195.
#>  7 automotive electric    2021 global demo_2020       target_sds    145990.
#>  8 automotive electric    2021 global demo_2020       target_sps    140505.
#>  9 automotive electric    2022 global demo_2020       projected     141737.
#> 10 automotive electric    2022 global demo_2020       target_cps    141314.
#> # ℹ 1,066 more rows
#> # ℹ 3 more variables: technology_share <dbl>, scope <chr>,
#> #   percentage_of_initial_production_by_scope <dbl>

Or at the company level:

market_share_targets_company <-
  target_market_share(
    data = prioritized_matches,
    abcd = abcd,
    scenario = scenario,
    region_isos = regions,
    by_company = TRUE, # Output results at company-level
    weight_production = FALSE
  )

market_share_targets_company
#> # A tibble: 14,505 × 11
#>    sector    technology  year region scenario_source name_abcd metric production
#>    <chr>     <chr>      <dbl> <chr>  <chr>           <chr>     <chr>       <dbl>
#>  1 automoti… electric    2020 global demo_2020       Bernardi… proje…    324592.
#>  2 automoti… electric    2020 global demo_2020       Bernardi… targe…    324592.
#>  3 automoti… electric    2020 global demo_2020       Bernardi… targe…    324592.
#>  4 automoti… electric    2020 global demo_2020       Bernardi… targe…    324592.
#>  5 automoti… electric    2020 global demo_2020       Christia… proje…    512438.
#>  6 automoti… electric    2020 global demo_2020       Christia… targe…    512438.
#>  7 automoti… electric    2020 global demo_2020       Christia… targe…    512438.
#>  8 automoti… electric    2020 global demo_2020       Christia… targe…    512438.
#>  9 automoti… electric    2020 global demo_2020       Donati, … proje…    277214.
#> 10 automoti… electric    2020 global demo_2020       Donati, … targe…    277214.
#> # ℹ 14,495 more rows
#> # ℹ 3 more variables: technology_share <dbl>, scope <chr>,
#> #   percentage_of_initial_production_by_scope <dbl>

Sectoral Decarbonization Approach

The Sectoral Decarbonization Approach is used to calculate scenario targets for the emission_factor of a sector. For example, you can use this approach to set targets for the average emission factor of the cement sector. This approach is recommended for sectors lacking technology roadmaps.

# Use this dataset to practice but eventually you should use your own data.
co2 <- r2dii.data::co2_intensity_scenario_demo

sda_targets <-
  target_sda(
    data = prioritized_matches,
    abcd = abcd,
    co2_intensity_scenario = co2,
    region_isos = regions
  ) %>% 
  filter(sector == "cement", year >= 2020)
#> Warning: Removing rows in abcd where `emission_factor` is NA

sda_targets
#> # A tibble: 110 × 6
#>    sector  year region             scenario_source emission_factor_metric
#>    <chr>  <dbl> <chr>              <chr>           <chr>                 
#>  1 cement  2020 advanced economies demo_2020       projected             
#>  2 cement  2020 developing asia    demo_2020       projected             
#>  3 cement  2020 global             demo_2020       projected             
#>  4 cement  2021 advanced economies demo_2020       projected             
#>  5 cement  2021 developing asia    demo_2020       projected             
#>  6 cement  2021 global             demo_2020       projected             
#>  7 cement  2022 advanced economies demo_2020       projected             
#>  8 cement  2022 developing asia    demo_2020       projected             
#>  9 cement  2022 global             demo_2020       projected             
#> 10 cement  2023 advanced economies demo_2020       projected             
#> # ℹ 100 more rows
#> # ℹ 1 more variable: emission_factor_value <dbl>

PREVIOUS CHAPTER: Preparatory Steps

NEXT CHAPTER: Interpretation of Results