First rotation Vinnova

Project owned by Vinnova and part of NLU Talent Program

During the first rotation (9 months), we helped Vinnova (the Swedish innovation agency) automate tasks related to processing text and finding trends in its project portfolio. More broadly, we tried to map Vinnova’s AI needs at the start of its digitalization journey.

The data we worked with consisted mainly of project abstracts that had already been converted to a machine-readable format. As for tools and models, we used GPT-SW3 (the largest model size used being 6.7B) as well as other NLP resources (e.g. BERT and SentenceTransformers).

The NLP use cases we encountered fall into four groups: classification, topic modelling, keyword extraction and semantic similarity. In addition, we did some work on network visualization, as well as some experimental development on code generation and open-book question answering.

Resources

GitHub - K2triinK/NLU_Talent_Program_first_rotation

2023-10-25 12:53 Code / Framework

Part of the code created during the first rotation of the NLU Talent Program.

Use case: Food classification

2023-07-10 12:07 Post

Vinnova needs to identify all food-related projects in its project portfolio. Today this is done manually, using labels that applicants themselves have assigned to their projects, and that data is extremely noisy.

We had access to some labeled data and trained binary classifiers to identify food-related projects, both in the demo app (which uses SetFit as the classifier) and in Azure Language Studio. Both reached a similar F1 score of around 90%. The next step was to classify all food-related projects into 16 categories, but because many of the classes had very few labeled examples, the results were not good.
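For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score for a binary classifier can be computed from gold and predicted labels. The labels are invented toy data, not project data, and the function name is only illustrative.

```python
def f1_score(gold, predicted, positive=1):
    """F1 for the positive class, given parallel lists of gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = food-related, 0 = not food-related); invented for illustration.
gold = [1, 1, 1, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 1, 1, 0, 0]
score = f1_score(gold, predicted)
```

Because F1 combines precision and recall, it is less easily inflated than accuracy when food-related projects are only a small share of the portfolio.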

Use case: ICT classification and topic modelling

2023-07-10 12:03 Post

At the beginning of the rotation, there was an interest in identifying ICT-related projects in the Horizon 2020 health cluster. There was no labeled data, so we took a keyword-based approach and created a list of keywords for finding ICT-related projects. We later improved the method by taking into account whether each project's topic was ICT-related and how many keyword matches its abstract contained.
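A keyword-based filter of this kind could look roughly like the sketch below. The keyword list, the threshold and the exact decision rule are assumptions made for illustration; the text does not specify the ones used in the project.

```python
import re

# Hypothetical keyword list; the actual list is not given in the text.
ICT_KEYWORDS = ["software", "machine learning", "digital", "sensor", "data platform"]


def count_keyword_matches(abstract, keywords=ICT_KEYWORDS):
    """Count case-insensitive whole-word occurrences of all keywords in an abstract."""
    text = abstract.lower()
    total = 0
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
        total += len(re.findall(pattern, text))
    return total


def is_ict_related(abstract, topic_is_ict, min_matches=2):
    """Assumed rule: one match suffices if the topic is ICT-related,
    otherwise the abstract needs at least `min_matches` keyword matches."""
    matches = count_keyword_matches(abstract)
    if topic_is_ict:
        return matches >= 1
    return matches >= min_matches


abstract_a = "A digital platform using machine learning and sensor data for remote care."
abstract_b = "A clinical trial of a new digital consent form."
flag_a = is_ict_related(abstract_a, topic_is_ict=False)
flag_b = is_ict_related(abstract_b, topic_is_ict=False)
flag_b_ict_topic = is_ict_related(abstract_b, topic_is_ict=True)
```

Requiring more matches when the topic is not ICT-related is one way to reduce false positives from abstracts that mention a single keyword in passing.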

The next step was topic modelling, using both LDA and BERTopic, to visualize what the projects were about and which topic clusters could be identified. Lastly, we worked with network visualization to examine the relationships between the actors taking part in these projects.
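The data behind an actor network can be sketched as a weighted edge list, where two actors are linked each time they take part in the same project. This is an assumed construction shown on made-up project and actor names; the text does not describe how the project's network was built or which visualization tool was used.

```python
from collections import Counter
from itertools import combinations


def co_participation_edges(project_actors):
    """Map each unordered actor pair to the number of projects they share."""
    edges = Counter()
    for actors in project_actors.values():
        # Sort and deduplicate so each pair has one canonical (a, b) form.
        for a, b in combinations(sorted(set(actors)), 2):
            edges[(a, b)] += 1
    return edges


# Made-up projects and actors, for illustration only.
projects = {
    "P1": ["Uni A", "Company B", "Hospital C"],
    "P2": ["Uni A", "Company B"],
    "P3": ["Hospital C", "Institute D"],
}
edges = co_participation_edges(projects)
```

The resulting pairs and weights can then be passed to a graph library for layout and drawing.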

Use case: Energy classification

2023-07-10 12:01 Post

Every year, Vinnova must report to the International Energy Agency (IEA) which energy-related projects in that year's portfolio (more specifically, projects classifiable under the IEA's guide) were granted funding. The task therefore has two steps: first, extract the energy-related projects from the entire year's portfolio (binary classification); then, classify those texts according to the labels specified by the IEA (multi-label classification). Previously, this was done by filtering the year's portfolio in Qlik Sense on certain energy-related tags, followed by manual revision, cleaning and classification according to the IEA key. This was tedious manual work and not very precise, so we worked on fine-tuning models to perform the task.
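The two-step structure can be sketched as a small pipeline that takes the two classifiers as functions. The stand-in classifiers below are simple string checks and the labels are placeholders, not IEA categories; in the project, these steps were performed by fine-tuned models.

```python
def classify_portfolio(projects, is_energy, energy_labels):
    """Step 1: keep energy-related projects. Step 2: assign labels to those kept."""
    results = {}
    for project_id, abstract in projects.items():
        if is_energy(abstract):  # binary classification
            results[project_id] = energy_labels(abstract)  # multi-label classification
    return results


# Stand-in classifiers (hypothetical); real models would replace these.
def toy_is_energy(abstract):
    text = abstract.lower()
    return "energy" in text or "solar" in text


def toy_energy_labels(abstract):
    text = abstract.lower()
    labels = []
    if "solar" in text:
        labels.append("solar")  # placeholder label, not an IEA category
    if "storage" in text or "battery" in text:
        labels.append("storage")  # placeholder label, not an IEA category
    return labels


portfolio = {
    "A": "Solar cells with battery storage",
    "B": "New cheese ripening methods",
    "C": "Energy efficiency in buildings",
}
result = classify_portfolio(portfolio, toy_is_energy, toy_energy_labels)
```

Keeping the two steps separate means the binary model can be trained on the whole portfolio while the multi-label model only needs examples of energy projects.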

For the 2022 classification, a Swedish BERT model was fine-tuned to perform the binary classification. However, because there was not enough representative data for multi-label classification (only 422 texts, obtained from the manual classification for the years 2018-2021), the multi-label step was done manually.

With the 2022 report, we realized that the Qlik Sense method was not sufficient and that a corrigendum would be needed for the years 2018-2021. With the additional data, we fine-tuned another Swedish BERT model and prompt-tuned a GPT-SW3 model for binary classification, and then used a further Swedish BERT model and a further GPT-SW3 model for multi-label classification. Finally, manual validation and revision were required. The multi-label results usually match the gold labels, but should currently be treated as a guide rather than as ground truth.
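One way to quantify how often multi-label predictions agree with gold labels is to report both the exact-match rate and the mean Jaccard overlap per project. The sketch below uses invented labels and is not the evaluation used in the project, which the text does not describe.

```python
def jaccard(predicted, gold):
    """Overlap between two label sets: |intersection| / |union|."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # both empty: treat as full agreement
    return len(predicted & gold) / len(predicted | gold)


def agreement(predictions, gold_labels):
    """Return (exact-match rate, mean Jaccard) over the projects in gold_labels."""
    ids = list(gold_labels)
    exact = sum(1 for i in ids if set(predictions[i]) == set(gold_labels[i]))
    mean_jaccard = sum(jaccard(predictions[i], gold_labels[i]) for i in ids) / len(ids)
    return exact / len(ids), mean_jaccard


# Invented labels, for illustration only.
gold_labels = {"P1": {"solar"}, "P2": {"wind", "storage"}, "P3": {"hydrogen"}}
predictions = {"P1": {"solar"}, "P2": {"wind"}, "P3": {"hydrogen"}}
exact_rate, mean_jaccard = agreement(predictions, gold_labels)
```

Projects whose predicted and gold label sets differ are natural candidates for the manual revision step.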

Demo app

2023-07-10 11:40 Post

To gather all the use cases and code in a single place, and to be able to show them to our colleagues at Vinnova, we built a demo app with the Gradio library. The functionalities showcased fall into two groups: those that use GPT-SW3 models and those that do not.

Using GPT-SW3, we: classify documents, discover topics, extract keywords, visualize the portfolio, ask questions about PDFs and perform zero-shot classification.

Using other models and functionalities, we: find documents by keywords, find the most relevant documents, find the documents most similar to a given document, map documents to topics and rank evaluators (bedömare).
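Finding the documents most similar to a given document is commonly done by comparing embedding vectors with cosine similarity. The sketch below assumes each document already has an embedding (for example from a SentenceTransformers model); the three-dimensional vectors and document names are made up, and the demo app's actual implementation may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


def most_similar(query_id, embeddings, top_k=3):
    """Rank all other documents by cosine similarity to the query document."""
    query = embeddings[query_id]
    scored = [
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in embeddings.items()
        if doc_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up embeddings; real ones would have hundreds of dimensions.
embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
ranking = most_similar("doc_a", embeddings, top_k=2)
```

The same ranking function also covers a relevance search if the query is a free-text prompt embedded with the same model.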

Attributes

AI Technology

NLP