How much data do we need to start Machine Learning?

Thesis Project owned by Volvo Cars English
Authors: Terence Cheng & Alicia Rey Alonso

Universities: University of Tartu & Chalmers University of Technology

One of the most relevant challenges in contemporary Machine Learning concerns not the neural networks themselves but the data used to train them. With the appearance of deeper architectures, the quality and quantity of data have become key to ensuring a model's success, and the correct acquisition and handling of existing data is a prominent field of research.

The need for bigger and more varied data sets naturally raises the question of whether there is a threshold that determines how much data is enough. With growing concerns that there may eventually not be enough data to train state-of-the-art networks, it is necessary both to be economical with the data we have and to acquire new data intelligently. This becomes critical in fields where annotating data requires expertise found only among domain specialists, such as medical data or automotive data.

Studies have been conducted on how the amount of data in the training set affects the training of a Deep Neural Network on medical CT scan images, as well as on whether the quality of the data, that is, the correctness of its labels, can have a damaging impact on overall performance. In the same vein, this thesis project focuses on predicting the quality of our network given a certain data set to train it with, as well as potentially modelling the relation between the two.

This thesis work aims to determine whether a given data set will produce satisfactory results when used to train a neural network for a specific task. In brief, it addresses the following questions:

  • How much data do we need for a successful machine learning algorithm?
  • Which data signals are useful for data collection?
  • Is the data set good enough to allow for learning and generalisation?

By focusing on these key questions, we wish to answer whether we can derive a general set of functions or parameters that will allow us to predict the data requirements of a network prior to its training stage or, further still, before data collection is completed. This type of inference would give general guidelines on when a data set is good enough for training a network, before investing the time and resources needed to carry the training out.
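
To make the idea of predicting data requirements concrete, here is a minimal sketch of one common approach: fitting a learning curve to a few small pilot runs and extrapolating it. It is an illustration only, not the method used in this thesis. It assumes that validation error follows an inverse power law in the training set size n, error ≈ a * n^(-b). The function names `fit_power_law` and `samples_needed` are hypothetical, and the pilot measurements are synthetic.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit errors ~ a * sizes**(-b) by least squares in log-log space.

    Assumes the error decays as a pure power law with no irreducible
    error floor, which is a simplification.
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    # In log-log space the power law is a straight line:
    # log(e) = log(a) - b * log(n)
    slope, intercept = np.polyfit(log_n, log_e, 1)
    return float(np.exp(intercept)), float(-slope)


def samples_needed(a, b, target_error):
    """Solve a * n**(-b) = target_error for n."""
    return (a / target_error) ** (1.0 / b)


# Synthetic pilot measurements (not results from the thesis),
# generated from a = 2.0 and b = 0.5 purely for illustration.
pilot_sizes = [500, 1000, 2000, 4000]
pilot_errors = [2.0 * n ** -0.5 for n in pilot_sizes]

a, b = fit_power_law(pilot_sizes, pilot_errors)
n_required = samples_needed(a, b, target_error=0.01)
print(f"a={a:.3f}, b={b:.3f}, estimated samples needed: {n_required:.0f}")
```

In practice, measured learning curves are noisy and often flatten towards a non-zero error floor, so an extrapolation of this kind is at best a rough guideline and should be checked against further measurements as more data is collected.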


Data, Technology
Better Customer Experience, More Efficient, Saving Cost, Smarter Product or Service
Prediction, Optimization, Vision
CNN, DNN, Image Analysis, Machine Learning, Recurrent Neural Networks, Transformer