Running the Analysis
Source:vignettes/cookbook_running_the_analysis.Rmd
cookbook_running_the_analysis.Rmd
Setup
We use the r2dii.*
suite of packages to access the most
important functions you’ll learn about. We also use example datasets
from the package r2dii.data, and optional but convenient
functions from the packages dplyr and
readxl.
library(r2dii.match)
library(r2dii.analysis)
library(r2dii.plot)
library(r2dii.data)
library(dplyr, warn.conflicts = FALSE)
library(readxl)
Import data
We need two datasets to begin: a loanbook and an asset-based company dataset (abcd).
To simulate having these data files prepared, we will use the demo data included in the r2dii.data package to create example XLS files using the writexl package.
library(writexl)
writexl::write_xlsx(
x = r2dii.data::loanbook_demo,
path = file.path(tempdir(), "loanbook.xlsx")
)
writexl::write_xlsx(
x = r2dii.data::abcd_demo,
path = file.path(tempdir(), "abcd.xlsx")
)
Typically, these data files are stored in XLS files on your computer. As an example, they could be imported as in the code below using the readxl package, however, if you use custom data that does not strictly follow the format as defined in the previous section, you may need to adjust the import code and/or do some data manipulation in R after importing.
loanbook <- readxl::read_excel(path = file.path(tempdir(), "loanbook.xlsx"))
loanbook
#> # A tibble: 283 × 13
#> id_loan id_direct_loantaker name_direct_loantaker id_ultimate_parent
#> <chr> <chr> <chr> <chr>
#> 1 L1 C294 Vitale Group UP15
#> 2 L2 C293 Moen-Moen UP84
#> 3 L3 C292 Rowe-Rowe UP288
#> 4 L4 C299 Fadel-Fadel UP54
#> 5 L5 C305 Ring AG & Co. KGaA UP104
#> 6 L6 C304 Kassulke-Kassulke UP83
#> 7 L7 C227 Morissette Group UP134
#> 8 L8 C303 Barone s.r.l. UP163
#> 9 L9 C301 Werner Werner AG & Co. KGaA UP138
#> 10 L10 C302 De rosa s.r.l. UP32
#> # ℹ 273 more rows
#> # ℹ 9 more variables: name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> # loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> # sector_classification_direct_loantaker <chr>, lei_direct_loantaker <chr>,
#> # isin_direct_loantaker <lgl>
abcd <- readxl::read_excel(path = file.path(tempdir(), "abcd.xlsx"))
abcd
#> # A tibble: 4,972 × 12
#> company_id name_company lei sector technology production_unit year
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 175 Giordano, Giordano … 1850… power hydrocap MW 2020
#> 2 175 Giordano, Giordano … 1850… power hydrocap MW 2021
#> 3 175 Giordano, Giordano … 1850… power hydrocap MW 2022
#> 4 175 Giordano, Giordano … 1850… power hydrocap MW 2023
#> 5 175 Giordano, Giordano … 1850… power hydrocap MW 2024
#> 6 175 Giordano, Giordano … 1850… power hydrocap MW 2025
#> 7 526 Vitali, Vitali e Vi… 8871… cement integrate… tonnes per year 2020
#> 8 526 Vitali, Vitali e Vi… 8871… cement integrate… tonnes per year 2021
#> 9 526 Vitali, Vitali e Vi… 8871… cement integrate… tonnes per year 2022
#> 10 526 Vitali, Vitali e Vi… 8871… cement integrate… tonnes per year 2023
#> # ℹ 4,962 more rows
#> # ℹ 5 more variables: production <dbl>, emission_factor <dbl>,
#> # plant_location <chr>, is_ultimate_owner <lgl>, emission_factor_unit <chr>
Score the match precision between the loanbook and abcd datasets
match_name()
scores the match between names in a
loanbook dataset and names in an asset-based company dataset. The names
come from the columns name_direct_loantaker
and
name_ultimate_parent
of the loanbook dataset, and from the
column name_company
of the asset-based company dataset. In
the loan book data set, it is possible to optionally add any number of
name_intermediate_parent_*
columns, where *
indicates the level up the corporate tree from
direct_loantaker
.
The raw names are internally transformed applying best-practices commonly used in name matching algorithms, such as:
- Remove special characters
- Replace language specific characters
- Abbreviate certain names to reduce their importance in the matching
- Remove corporate suffixes when necessary
- Spell out numbers to increase their importance
The similarity is then scored between the internally-transformed
names of the loanbook against the names in the abcd. For more
information on the scoring algorithm used, see
stringdist::stringsim()
.
match_name(loanbook, abcd)
#> # A tibble: 326 × 22
#> id_loan id_direct_loantaker name_direct_loantaker id_ultimate_parent
#> <chr> <chr> <chr> <chr>
#> 1 L1 C294 Vitale Group UP15
#> 2 L3 C292 Rowe-Rowe UP288
#> 3 L5 C305 Ring AG & Co. KGaA UP104
#> 4 L6 C304 Kassulke-Kassulke UP83
#> 5 L6 C304 Kassulke-Kassulke UP83
#> 6 L7 C227 Morissette Group UP134
#> 7 L7 C227 Morissette Group UP134
#> 8 L8 C303 Barone s.r.l. UP163
#> 9 L9 C301 Werner Werner AG & Co. KGaA UP138
#> 10 L9 C301 Werner Werner AG & Co. KGaA UP138
#> # ℹ 316 more rows
#> # ℹ 18 more variables: name_ultimate_parent <chr>, loan_size_outstanding <dbl>,
#> # loan_size_outstanding_currency <chr>, loan_size_credit_limit <dbl>,
#> # loan_size_credit_limit_currency <chr>, sector_classification_system <chr>,
#> # sector_classification_direct_loantaker <chr>, lei_direct_loantaker <chr>,
#> # isin_direct_loantaker <lgl>, id_2dii <chr>, level <chr>, sector <chr>,
#> # sector_abcd <chr>, name <chr>, name_abcd <chr>, score <dbl>, …
match_name()
defaults to scoring matches between name
strings that belong to the same sector. Using
by_sector = FALSE
removes this limitation – increasing
computation time, and the number of potentially incorrect matches to
manually validate.
match_name(loanbook, abcd, by_sector = FALSE) %>% nrow()
#> [1] 656
# Compare to
match_name(loanbook, abcd, by_sector = TRUE) %>% nrow()
#> [1] 326
The min_score
argument allows you to set a minimum
threshold score.
match_name(loanbook, abcd, min_score = 0.9) %>%
pull(score) %>%
range()
#> [1] 0.9007692 1.0000000
Maybe overwrite matches
If you are happy with the matching coverage achieved, proceed to the
next step. Otherwise, you can manually add matches, not found
automatically by match_name()
. To do this, manually inspect
the abcd
and find a company you would like to match to your
loanbook. Once a match is found, use excel to write a .csv file similar
to r2dii.data::overwrite_demo
,
where:
-
level
indicates the level that the manual match should be added to (e.g.direct_loantaker
) -
id_2dii
is the id of the loanbook company you would like to match (from the output ofmatch_name()
) -
name
is the abcd company you would like to manually link to -
sector
optionally you can also overwrite the sector. -
source
this can be used later to determine where all manual matches came from.
matched <-
match_name(
loanbook = loanbook,
abcd = abcd,
min_score = 0.9,
overwrite = r2dii.data::overwrite_demo
)
#> Warning: You should only overwrite a sector at the level of the 'direct
#> loantaker' (DL). If you overwrite a sector at the level of the 'ultimate
#> parent' (UP) you consequently overwrite all children of that sector,
#> which most likely is a mistake.
Notice the warning.
Validate matches
Write the output of match_name()
into a .csv file, for
example using readr to save a CSV file:
Compare, edit, and save the data manually:
- Open
matched.csv
with any spreadsheet editor (Excel, Google Sheets, etc.). - Compare the columns
name
andname_abcd
manually to determine if the match is valid. Other information can be used in conjunction with just the names to ensure the two entities match (sector, internal information on the company structure, etc.) - Edit the data:
- If you are happy with the match, set the
score
value to1
. - Otherwise set or leave the
score
value to anything other than1
.
- If you are happy with the match, set the
- Save the edited file as, for example,
valid_matches.csv
.
Re-import the edited file (validated), for example using readr to read a CSV file:
valid_matches <- readr::read_csv(file = file.path(tempdir(), "valid_matches.csv"), show_col_types = FALSE)
For more information on validating matches, please see the
documentation for prioritize()
.
Prioritize validated matches by level
The validated dataset may have multiple matches per loan. Consider
the case where a loan is given to “Acme Power USA”, a subsidiary of
“Acme Power Co.”. There may be both “Acme Power USA” and “Acme Power
Co.” in the abcd
, and so there could be two valid matches
for this loan. To get the best match only, use prioritize()
– it picks rows where score
is 1 and level
per
loan is of highest priority
:
some_interesting_columns <- c("id_2dii", "level", "score")
prioritized_matches <-
valid_matches %>%
prioritize()
prioritized_matches %>% select(all_of(some_interesting_columns))
#> # A tibble: 175 × 3
#> id_2dii level score
#> <chr> <chr> <dbl>
#> 1 DL129 direct_loantaker 1
#> 2 DL144 direct_loantaker 1
#> 3 DL270 direct_loantaker 1
#> 4 DL86 direct_loantaker 1
#> 5 DL5 direct_loantaker 1
#> 6 DL80 direct_loantaker 1
#> 7 DL150 direct_loantaker 1
#> 8 DL3 direct_loantaker 1
#> 9 DL65 direct_loantaker 1
#> 10 DL79 direct_loantaker 1
#> # ℹ 165 more rows
By default, highest priority refers to the most granular match
(direct_loantaker
). The default priority is set internally
via prioritize_level()
.
prioritize_level(valid_matches)
#> [1] "direct_loantaker" "ultimate_parent"
You may use a different priority. One way to do that is to pass a
function to priority
. For example, use rev()
to reverse the default priority.
valid_matches %>%
prioritize(priority = rev) %>%
select(all_of(some_interesting_columns))
#> # A tibble: 175 × 3
#> id_2dii level score
#> <chr> <chr> <dbl>
#> 1 UP190 ultimate_parent 1
#> 2 UP101 ultimate_parent 1
#> 3 UP39 ultimate_parent 1
#> 4 UP63 ultimate_parent 1
#> 5 UP224 ultimate_parent 1
#> 6 UP132 ultimate_parent 1
#> 7 UP12 ultimate_parent 1
#> 8 UP20 ultimate_parent 1
#> 9 UP134 ultimate_parent 1
#> 10 UP127 ultimate_parent 1
#> # ℹ 165 more rows
A Note on Sector Classifications and the borderline
Flag
A note on sector classification: Matches are preferred when the
sector from the loanbook
matches the sector from the
abcd
. The loanbook
sector is determined
internally using the sector_classification_system
and
sector_classification_direct_loantaker
columns. Currently,
we only allow a couple specific values for
sector_classification_system
:
GICS, ISIC, NACE, NAICS, PSIC, SIC
If you would like to use a different classification system, please raise an issue in r2dii.data and we can incorporate it.
There are a zoo of sector classification code systems out there. Some are granular, some are not. Since we currently cover a particular portion of the supply chain (i.e. production), it is important we try to only match the ABCD with companies that are actually active in this portion of the supply chain.
An issue arises when, for example, a company is classified in the
“power transmission” sector. In a perfect world, these companies would
produce no electricity, and we would not try to match them. In practice,
however, we find there is often overlap. For this reason, we introduced
the borderline
flag.
In the example below, we see two classification codes coming from the SIC classification standard:
r2dii.data::nace_classification %>%
filter(code %in% c("D35.11", "D35.14"))
#> # A tibble: 2 × 6
#> original_code description code sector borderline version
#> <chr> <chr> <chr> <chr> <lgl> <chr>
#> 1 35.11 35.11 Production of electricity… D35.… power FALSE 2.1
#> 2 35.14 35.14 Distribution of electrici… D35.… power TRUE 2.1
Notice that the code D35.11 corresponds to power generation. This is
an identical match to PACTA’s power
sector, and thus the
borderline
flag is set to FALSE
. In contrast,
code D35.14 corresponds to the distribution of electricity. In a perfect
world, we would set this code to not in scope
, however
there is still a chance that these companies produce electricity. For
this reason, we have mapped it to power
with
borderline = TRUE
.
In practice, if a company has a borderline
of
TRUE
and is matched, then consider the company in
scope. If it has a borderline
of TRUE
and
isn’t matched, then consider it out of scope.
Calculate targets
You can calculate scenario targets using two different approaches: Market Share Approach, or Sectoral Decarbonization Approach.
Market Share Approach
The Market
Share Approach is used to calculate scenario targets for the
production
of a technology in a sector. For example, we can
use this approach to set targets for the production of electric vehicles
in the automotive sector. This approach is recommended for sectors where
a granular technology scenario roadmap exists.
Targets can be set at the portfolio level:
# Use these datasets to practice but eventually you should use your own data.
scenario <- r2dii.data::scenario_demo_2020
regions <- r2dii.data::region_isos_demo
market_share_targets_portfolio <-
target_market_share(
data = prioritized_matches,
abcd = abcd,
scenario = scenario,
region_isos = regions
)
market_share_targets_portfolio
#> # A tibble: 1,076 × 10
#> sector technology year region scenario_source metric production
#> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
#> 1 automotive electric 2020 global demo_2020 projected 139052.
#> 2 automotive electric 2020 global demo_2020 target_cps 139052.
#> 3 automotive electric 2020 global demo_2020 target_sds 139052.
#> 4 automotive electric 2020 global demo_2020 target_sps 139052.
#> 5 automotive electric 2021 global demo_2020 projected 140395.
#> 6 automotive electric 2021 global demo_2020 target_cps 140195.
#> 7 automotive electric 2021 global demo_2020 target_sds 145990.
#> 8 automotive electric 2021 global demo_2020 target_sps 140505.
#> 9 automotive electric 2022 global demo_2020 projected 141737.
#> 10 automotive electric 2022 global demo_2020 target_cps 141314.
#> # ℹ 1,066 more rows
#> # ℹ 3 more variables: technology_share <dbl>, scope <chr>,
#> # percentage_of_initial_production_by_scope <dbl>
Or at the company level:
market_share_targets_company <-
target_market_share(
data = prioritized_matches,
abcd = abcd,
scenario = scenario,
region_isos = regions,
by_company = TRUE, # Output results at company-level
weight_production = FALSE
)
market_share_targets_company
#> # A tibble: 14,505 × 11
#> sector technology year region scenario_source name_abcd metric production
#> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
#> 1 automoti… electric 2020 global demo_2020 Bernardi… proje… 324592.
#> 2 automoti… electric 2020 global demo_2020 Bernardi… targe… 324592.
#> 3 automoti… electric 2020 global demo_2020 Bernardi… targe… 324592.
#> 4 automoti… electric 2020 global demo_2020 Bernardi… targe… 324592.
#> 5 automoti… electric 2020 global demo_2020 Christia… proje… 512438.
#> 6 automoti… electric 2020 global demo_2020 Christia… targe… 512438.
#> 7 automoti… electric 2020 global demo_2020 Christia… targe… 512438.
#> 8 automoti… electric 2020 global demo_2020 Christia… targe… 512438.
#> 9 automoti… electric 2020 global demo_2020 Donati, … proje… 277214.
#> 10 automoti… electric 2020 global demo_2020 Donati, … targe… 277214.
#> # ℹ 14,495 more rows
#> # ℹ 3 more variables: technology_share <dbl>, scope <chr>,
#> # percentage_of_initial_production_by_scope <dbl>
Sectoral Decarbonization Approach
The Sectoral
Decarbonization Approach is used to calculate scenario targets for
the emission_factor
of a sector. For example, you can use
this approach to set targets for the average emission factor of the
cement sector. This approach is recommended for sectors lacking
technology roadmaps.
# Use this dataset to practice but eventually you should use your own data.
co2 <- r2dii.data::co2_intensity_scenario_demo
sda_targets <-
target_sda(
data = prioritized_matches,
abcd = abcd,
co2_intensity_scenario = co2,
region_isos = regions
) %>%
filter(sector == "cement", year >= 2020)
#> Warning: Removing rows in abcd where `emission_factor` is NA
sda_targets
#> # A tibble: 110 × 6
#> sector year region scenario_source emission_factor_metric
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 cement 2020 advanced economies demo_2020 projected
#> 2 cement 2020 developing asia demo_2020 projected
#> 3 cement 2020 global demo_2020 projected
#> 4 cement 2021 advanced economies demo_2020 projected
#> 5 cement 2021 developing asia demo_2020 projected
#> 6 cement 2021 global demo_2020 projected
#> 7 cement 2022 advanced economies demo_2020 projected
#> 8 cement 2022 developing asia demo_2020 projected
#> 9 cement 2022 global demo_2020 projected
#> 10 cement 2023 advanced economies demo_2020 projected
#> # ℹ 100 more rows
#> # ℹ 1 more variable: emission_factor_value <dbl>
PREVIOUS CHAPTER: Preparatory Steps
NEXT CHAPTER: Interpretation of Results