Resource for First rotation Vinnova

Use case: Energy classification

Every year, Vinnova must report to the International Energy Agency (IEA) which energy-related projects in that year's portfolio were granted funding, where "energy-related" means classifiable under the IEA's guide. The task therefore has two steps: first, extract the energy-related projects from the entire year portfolio (binary classification); then, assign each selected project the labels specified by the IEA (multi-label classification). Previously, this was done by filtering the year portfolio in Qlik Sense on certain energy-related tags, followed by manual revision, cleaning and classification according to the IEA key. That process was tedious and not very precise, so we worked on fine-tuning models to perform the task.
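The two-step setup can be sketched roughly as follows. This is only a minimal illustration, not the code we used: the `Project` type, the function names and the label strings are all hypothetical, and the keyword-based functions stand in for the fine-tuned models so that the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Project:
    """Hypothetical container for one portfolio entry."""
    project_id: str
    text: str
    iea_labels: List[str] = field(default_factory=list)


def classify_portfolio(
    projects: Iterable[Project],
    is_energy: Callable[[str], bool],
    assign_labels: Callable[[str], List[str]],
) -> List[Project]:
    """Step 1: keep energy-related projects. Step 2: label the ones kept."""
    selected = []
    for project in projects:
        if not is_energy(project.text):
            continue  # not energy-related, so it is dropped before step 2
        labels = assign_labels(project.text)
        selected.append(Project(project.project_id, project.text, labels))
    return selected


# Toy stand-ins for the two models (assumption: real models would go here).
def keyword_is_energy(text: str) -> bool:
    return any(word in text.lower() for word in ("solar", "battery", "wind"))


def keyword_labels(text: str) -> List[str]:
    # Placeholder label names, not actual IEA codes.
    lowered = text.lower()
    labels = []
    if "solar" in lowered:
        labels.append("solar")
    if "battery" in lowered:
        labels.append("storage")
    if "wind" in lowered:
        labels.append("wind")
    return labels


# Made-up example texts, not real Vinnova projects.
portfolio = [
    Project("P1", "Solar cells with integrated battery storage"),
    Project("P2", "Digital tools for elderly care"),
    Project("P3", "Offshore wind turbine maintenance"),
]

energy_projects = classify_portfolio(portfolio, keyword_is_energy, keyword_labels)
for project in energy_projects:
    print(project.project_id, project.iea_labels)
```

Passing the two classifiers in as plain callables keeps the steps independent, so either step can be swapped for a different model or for a manual pass without touching the other.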

For the 2022 classification, a Swedish BERT model was fine-tuned to perform the binary classification. However, because there was too little representative data for multi-label classification (only 422 texts, obtained from the manual classification of the years 2018-2021), the multi-label step was done manually.

With the 2022 report, we realized that the Qlik Sense method was not sufficient and that a corrigendum would be needed for the years 2018-2021. With the additional data, another Swedish BERT model was fine-tuned and a GPT-SW3 model was prompt-tuned to perform binary classification; a further Swedish BERT model and a further GPT-SW3 model were then used for the multi-label classification. Finally, manual validation and revision were required. Although the multi-label results usually match the gold labels, they should currently be taken as a guide rather than as the truth.
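These notes do not record how agreement with the gold labels was measured. One simple way to quantify it, shown here only as a sketch with hypothetical function names and made-up label sets, is to compute the share of exact matches and the mean Jaccard overlap per project.

```python
from typing import List, Set


def exact_match_rate(predicted: List[Set[str]], gold: List[Set[str]]) -> float:
    """Share of projects whose predicted label set equals the gold set."""
    matches = sum(1 for p, g in zip(predicted, gold) if p == g)
    return matches / len(gold)


def mean_jaccard(predicted: List[Set[str]], gold: List[Set[str]]) -> float:
    """Average overlap (intersection over union) between predicted and gold sets."""
    scores = []
    for p, g in zip(predicted, gold):
        union = p | g
        # Two empty sets agree completely, so score them as 1.0.
        scores.append(len(p & g) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Made-up label sets for three projects (placeholder names, not IEA codes).
predicted = [{"solar"}, {"wind", "storage"}, {"hydrogen"}]
gold = [{"solar"}, {"wind"}, {"bioenergy"}]

exact = exact_match_rate(predicted, gold)
overlap = mean_jaccard(predicted, gold)
print(f"exact match: {exact:.2f}, mean Jaccard: {overlap:.2f}")
```

Exact match is strict and counts a partially correct label set as wrong, while the Jaccard score gives partial credit, which is closer to how a reviewer would use the predictions as a guide.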

Attributes

Textual Data