Project Summary

Page by Elise Hammarström 119d ago update
Research & Reports

KnowledgeSeeker is a tool built at the request of the non-profit Mind to fetch new research with minimal effort, with the aim of sharing it with Mind’s internal team and external partners. The tool was built by three students from Uppsala University (Elise, Finn and Markus), enrolled in the talent program AI for Impact, jointly held by AI Sweden and Google.org. After a 10-week project during the summer of 2024, we can now present our final product! 🎉

Final product

Objectives of the platform

The platform has two objectives:

  1. Give research-based answers to user questions
  2. Enable users to continuously receive the latest research in their areas of interest directly in their email inbox

The goal of the first objective is to answer user questions in a manner that is easily understandable for non-experts, and to provide the sources that underpin the generated answer.

The goal of the second objective is to keep users well-informed in their chosen research areas. The user sets up a rule, which we call a “Lookout”, that scans research archives for new publications and delivers the most relevant ones directly to the user’s inbox.

Primary features

Based on these two objectives, the platform has two primary features in the user interface. 

The first is to ask a research question and get a generated answer along with its references.
The user can select which sources to use in answering their question. Sources can be either the user’s own uploaded research or research archives such as PubMed, ClinicalTrials.gov or arXiv.
 

The other major feature is the Lookouts. Here the user can specify their research area of interest as well as the time range the Lookout should take into account when searching for and ranking relevant research. The user can also specify the locations where the research was conducted. When a Lookout is created, relevant sources from ClinicalTrials.gov are fetched and ranked, and the top results, up to a number specified by the user, are shown in the interface. The user can read these results and their AI-generated summaries, as well as explore the original research source.

How did we create it? Our project development

After two weeks of coding, we developed our first prototype. 

This version allowed users to upload multiple PDFs and query them via a search bar. The system generated responses using Retrieval-Augmented Generation (RAG), leveraging various open-source large language models and embedding models. To find relevant articles, we used similarity search within a Postgres vector database that stored all the embeddings. We built the application using the Streamlit framework.
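As a rough illustration of the similarity-search step, here is a minimal sketch in plain Python. It is not the project's code: in the prototype the search ran inside the Postgres vector database, whereas this sketch ranks a small in-memory list. The three-dimensional vectors, the chunk texts, and the function names `cosine_similarity` and `top_k_chunks` are all made up for the example; real embeddings would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k_chunks(query_embedding, chunks, k=3):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), text)
        for text, embedding in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical chunks with toy embeddings (assumption: not real model output).
chunks = [
    ("Chunk about sleep and depression", [0.9, 0.1, 0.0]),
    ("Chunk about funding acknowledgements", [0.0, 0.2, 0.9]),
    ("Chunk about anxiety treatment", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]

results = top_k_chunks(query, chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

In a RAG pipeline, the texts returned by a step like this would then be passed to the language model as context for the answer.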

Our prototype successfully answered questions on topics explicitly mentioned in the documents. Additionally, we experimented with different embedding models to compare their performance. Since our queries were in Swedish while the papers were in English, we noticed that language often influenced the results more than context. For instance, text chunks containing Swedish words were automatically prioritized. Moreover, when generating responses, the answers were frequently returned in English.

To simplify integration with various LLMs, we decided to use the Command R large language model and its corresponding embedding model from Cohere. For the new functionality, we revamped our stack, using React and Next.js for the frontend and FastAPI for the backend, while keeping the Postgres database. We introduced the ability to automatically retrieve new articles from research archives like arXiv, PubMed, and ClinicalTrials.gov, eliminating the need for users to manually find and upload research articles beforehand. Through the archives’ APIs, we extract abstracts and summaries, which are then used as input for the LLM instead of processing entire papers.
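To make the archive retrieval more concrete, the sketch below builds a search request for PubMed through the public NCBI E-utilities `esearch` endpoint. The text above only says that abstracts are fetched "through APIs", so the choice of endpoint, the function name `build_pubmed_search_url`, and the example search term are assumptions for illustration. The sketch only constructs the URL and makes no network call; `esearch` returns matching PubMed IDs, and retrieving the abstracts themselves would be a separate request that is not shown here.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, max_results=20, days_back=None):
    """Build an esearch URL for PubMed (hypothetical helper, no request sent)."""
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": term,            # free-text search term
        "retmax": max_results,   # maximum number of IDs to return
        "retmode": "json",       # ask for a JSON response
    }
    if days_back is not None:
        # Restrict to items whose publication date is within the last N days.
        params["reldate"] = days_back
        params["datetype"] = "pdat"
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_pubmed_search_url("adolescent mental health", max_results=5, days_back=30)
print(url)
```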

To prevent irrelevant text snippets, such as references or side notes, from being selected, we developed a custom classifier using Unstructured and spaCy to filter out unnecessary information. This classifier achieved over 99% accuracy on test data. To further enhance the quality of the selected text chunks, we used LLMs to extract meaningful content and re-rank the chunks based on their relevance to the query. To address language-related issues, we decided to work exclusively with English texts for this functionality.
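The idea of filtering out reference-like chunks can be illustrated with a much simpler stand-in. The sketch below is not the classifier described above and does not use Unstructured or spaCy. It is a plain regex heuristic that counts citation-style signals (a leading reference number, a DOI, "et al.", a publication year) and drops a chunk when at least two are present. The patterns, the threshold, and the sample chunks are assumptions chosen for the example, and no accuracy is claimed for it.

```python
import re

# Citation-style signals (assumed patterns, chosen for illustration only).
SIGNAL_PATTERNS = [
    re.compile(r"^\s*(\[\d+\]|\d+\.)\s"),          # leading "[12] " or "12. "
    re.compile(r"\bdoi:|doi\.org/", re.IGNORECASE),  # DOI marker
    re.compile(r"\bet al\."),                       # author list shorthand
    re.compile(r"\b(?:19|20)\d{2}\b"),              # four-digit year
]


def looks_like_reference(chunk, min_signals=2):
    """True if the chunk shows at least min_signals citation-style signals."""
    signals = sum(1 for pattern in SIGNAL_PATTERNS if pattern.search(chunk))
    return signals >= min_signals


def filter_chunks(chunks):
    """Keep only the chunks that do not look like bibliography entries."""
    return [chunk for chunk in chunks if not looks_like_reference(chunk)]


# Hypothetical sample chunks.
sample = [
    "[12] Smith J, et al. (2019). Sleep and mood. doi:10.1000/xyz",
    "Participants who slept fewer than six hours reported more anxiety.",
    "The 2019 cohort included 200 participants from Sweden.",
]

kept = filter_chunks(sample)
for chunk in kept:
    print(chunk)
```

Requiring two signals rather than one keeps ordinary sentences that merely mention a year, like the third sample chunk.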

Researchers provided highly positive feedback on our application, noting its accuracy, relevance, and potential utility.

  • Accurate and Insightful Responses: Lisa Wendby from Mind appreciated the tool’s ability to identify research gaps, calling it "helpful for understanding where research is lacking."
  • Valuable for Researchers: Researchers found the application "very valuable," especially for its transparency in showing which articles responses are based on. The researchers noted it could be beneficial for both researchers and staff at Mind.
  • Handling Complex Queries: The researchers were impressed with how the application addressed more abstract questions, offering insightful responses on complex topics like familial transmission of suicide risk.
  • Performance and Potential: Researchers observed that while the application’s performance varied with uploaded articles, it showed significant promise, particularly when integrated with PubMed. One researcher praised the application for its "very promising" potential and appreciated that responses are based on identifiable references.

In the final version of our application, we introduced a feature called "Lookout." 

This functionality allows users to create and maintain lists of relevant research articles on specific topics, which are automatically updated whenever new studies are published. Integrated with Zapier and Google Cloud Scheduler, "Lookout" can send email updates to subscribers, including summaries of the latest research. Users can also filter these updates by specific time windows, such as the last month or the last three months, and by regions like Sweden or Europe.
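The filtering and ranking behind a Lookout could look roughly like the sketch below. It is an assumption-laden illustration, not the project's implementation: the `Study` and `Lookout` classes, their field names, and the precomputed `relevance` scores are all hypothetical, and the scheduling and email delivery through Google Cloud Scheduler and Zapier are left out entirely.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Study:
    title: str
    published: date
    country: str
    relevance: float  # assumed to be computed elsewhere, e.g. by a ranking step


@dataclass
class Lookout:
    topic: str
    window_days: int        # e.g. 30 for "the last month"
    countries: frozenset    # empty set means "any location"
    max_results: int        # how many top results to show


def run_lookout(lookout, studies, today):
    """Filter studies by time window and location, then return the top results."""
    cutoff = today - timedelta(days=lookout.window_days)
    matches = [
        study
        for study in studies
        if study.published >= cutoff
        and (not lookout.countries or study.country in lookout.countries)
    ]
    matches.sort(key=lambda study: study.relevance, reverse=True)
    return matches[: lookout.max_results]


# Hypothetical data for illustration.
studies = [
    Study("Study A", date(2024, 7, 20), "Sweden", 0.80),
    Study("Study B", date(2024, 5, 1), "Sweden", 0.90),    # outside the window
    Study("Study C", date(2024, 7, 25), "Germany", 0.95),  # outside the region
    Study("Study D", date(2024, 7, 10), "Sweden", 0.60),
]
lookout = Lookout("youth mental health", 30, frozenset({"Sweden"}), 5)

selected = run_lookout(lookout, studies, today=date(2024, 8, 1))
for study in selected:
    print(study.published, study.title)
```

Passing `today` explicitly, rather than reading the system clock inside the function, keeps the sketch deterministic and easy to test.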

[Screenshot: Lookouts overview]

[Screenshot: Create a lookout]

The Lookout feature further strengthens the application’s utility, providing researchers and professionals with timely, targeted updates on their areas of interest.

Further improvements could include enhancing the design, integrating additional archives such as Google Scholar, and incorporating sources such as Wikipedia. Multilingual support, including Nordic languages, could also be added, and the service could be extended to support more organizations.

As we end our project within the talent program AI for Impact, we’re really proud of the application we’ve built. We’re happy that what started as a brief for an application to help Mind and other non-profits find the latest research has also resulted in an additional use case: helping the broader research community find research and identify research gaps.

You can find the recording of our final presentation here: https://www.youtube.com/watch?v=qFWhLo0sWfQ   

If you have any questions or advice, please feel free to reach out to us!

PS: We’re also looking for master’s thesis projects within AI and/or data engineering, so if you and your company have an opportunity for master’s thesis students this spring, we would gladly discuss it further.

Elise Hammarström
✉️Email: elise.hammarstrom@gmail.com
💬LinkedIn: https://www.linkedin.com/in/elise-hammarstrom/ 

Finn Vaughankraska
✉️Email: vaughankraska@gmail.com
💬LinkedIn: https://www.linkedin.com/in/finn-vaughankraska-a9520b1a2/ 

Markus Rupp 
✉️Email: markus2000rupp@gmail.com
💬LinkedIn: https://www.linkedin.com/in/markus-rupp-614154240/
