Week 1 - Week 3: Laying the Foundations

Post by Donor Retention, 147d ago

We started the project by getting familiar with the dataset provided by UNICEF, which includes donor data from the past 5 years and email interactions from the past 6 months. Unfortunately, we only have the first and last interaction events for each donor, not the intermediate ones. Despite this limitation, we will use the available data to build and assess our models. Our main goal is to evaluate both the quality of the predictions and the interpretability of the models, so we can understand which factors lead to donor churn.

Next, we dived into literature and market research on predicting non-profit donor churn. We arranged a few meetings with experts, which helped greatly and pointed us in the right direction. This research helped us identify key features and methodologies used in donor retention models. We also assessed different platforms for deploying our solution. While we haven't finalised this yet, we're getting closer. We know our models will be deployed to AWS, and that a solution directly integrated into Salesforce for a user-friendly interface won't be possible, so we are currently looking into alternatives with our client. The alternatives include Qlik and building our own platform.

As for the prediction models, our research pointed us towards more traditional approaches: decision trees, random forests, K-Nearest Neighbors, Gradient Boosting Machines, CatBoost, Support Vector Machines, and Extreme Gradient Boosting (XGBoost). We will start by implementing these. One challenge with such models is the need for manual feature engineering. In week 2, after thorough data cleaning (removing duplicates, handling missing values, and standardizing formats), we tackled this task. We combined donor data with their email interactions, creating features like interaction frequency, time since the last donation, and response rates (a sketch of this step is below). We also included an estimated salary per donor based on their postal code, which we scraped from the web.
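To make the feature-engineering step concrete, here is a minimal sketch of how features like these can be derived with pandas. It assumes two DataFrames, `donors` (one row per donor) and `emails` (one row per email event), plus a `salary_by_postcode` lookup table; all names are illustrative, not the actual UNICEF schema.

```python
import pandas as pd

# Aggregate email events into per-donor engagement features.
agg = emails.groupby("donor_id").agg(
    interaction_frequency=("sent_at", "count"),  # number of emails received
    open_rate=("opened", "mean"),                # fraction of emails opened
    response_rate=("responded", "mean"),         # fraction of emails responded to
)

# Recency feature: days since the donor's last donation.
donors["days_since_last_donation"] = (
    pd.Timestamp("today").normalize() - donors["last_donation_date"]
).dt.days

# Combine donor records with the engagement features and the
# postal-code salary estimate scraped from the web.
features = donors.merge(agg, on="donor_id", how="left")
features = features.merge(salary_by_postcode, on="postal_code", how="left")
```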

Other things we have considered and implemented: outlier removal using exploratory approaches and the Interquartile Range (IQR) method, and feature selection using Pearson and Kendall's tau correlations for numerical features and chi-squared tests for categorical features. We spent quite some time on this, as the quality of the input data directly affects the quality of our model predictions. Lastly, something interesting we addressed is the imbalanced nature of our data, where the number of non-churners is significantly higher than the number of churners. To balance the classes, we generated synthetic data using SMOTE and ADASYN, and will be experimenting with training models both on this new balanced dataset and on the original dataset with a weighted loss function. A sketch of these steps is below.
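For reference, below is a minimal sketch of these preprocessing and balancing steps using scikit-learn and imbalanced-learn. It assumes a single DataFrame `df` with numerical features and a binary `churned` label; the column names and the choice of `RandomForestClassifier` for the weighted-loss variant are illustrative assumptions, not our final pipeline.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from imblearn.over_sampling import SMOTE

def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Drop rows whose value in `column` falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

df = remove_iqr_outliers(df, "donation_amount")

# Feature selection: Kendall's tau between each numerical feature and the
# label (more robust to outliers than Pearson's r).
num_cols = ["donation_amount", "interaction_frequency", "days_since_last_donation"]
tau = df[num_cols].corrwith(df["churned"], method="kendall").abs().sort_values(ascending=False)

# Chi-squared scores for one-hot-encoded categorical features.
cats = pd.get_dummies(df[["channel", "region"]])
chi2_scores = SelectKBest(chi2, k="all").fit(cats, df["churned"]).scores_

X, y = df[num_cols], df["churned"]

# Option 1: oversample the minority (churner) class with SMOTE.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Option 2: keep the original data and weight the loss instead;
# most scikit-learn classifiers accept class_weight="balanced".
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X, y)
```

We plan to compare both options on the same validation split, since oversampling and loss weighting can behave quite differently depending on the model.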

Feel free to reach out if you have any questions, need more information, or have any advice for us!

Attributes

Other, Research & Reports