Data Readiness Assessment Method
This is a quick reference to the method for assessing data readiness levels, which was also used in the case studies reported in Databeredskap för språkteknologiska tillämpningar. The questions are also available in full in We need to talk about data.
Q1 Do you have programmatic access to the data? The data should be made accessible
to the people who are going to work with it, in a way that makes their work as easy as
possible. This usually means programmatic access via an API, database, or spreadsheet.
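As an illustration of what programmatic access can look like in its simplest form, the sketch below reads a spreadsheet export into Python records. It assumes the data has been exported as CSV; the column names and the sample rows are hypothetical and not part of the method.

```python
import csv
import io

# Hypothetical stand-in for a spreadsheet exported to CSV.
SAMPLE_CSV = """id,text,label
1,The delivery was late,complaint
2,Thanks for the quick reply,praise
"""


def load_records(source):
    """Read CSV rows from a file-like object into a list of dicts."""
    return list(csv.DictReader(source))


# In a real project, `source` would be an open file or an API response body.
records = load_records(io.StringIO(SAMPLE_CSV))
print(len(records), "records loaded; first text:", records[0]["text"])
```

If everyone on the project can run something like this without asking for manual exports, Q1 is likely satisfied.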
Q2 Are your licenses in order? If you plan to use data from a third-party
provider, whether commercial or open access, ensure that the licenses for the data permit the kind of usage that the current project needs. Furthermore, make sure you follow the Terms of Service set out by the provider.
Q3 Do you have lawful access to the data? Make sure you involve the appropriate legal
competence early on in your project. Matters regarding, e.g., personally identifiable information and GDPR have to be handled correctly. Failing to do so may result in project failure, even if all technical aspects of the project are perfectly sound.
Q4 Has there been an ethics assessment of the data? In some use cases, such as when
dealing with individuals’ medical information, the objectives of the project require an ethics assessment. Such a probe into the data is governed by strict rules, and you should consult appropriate legal advisors to make sure your project adheres to them.
Q5 Is the data converted to an appropriate format? Apart from being accessible
programmatically, and assessed with respect to licenses, laws, and ethics, the data should also be converted to a format appropriate for the potential technical solutions to the problem at hand. One particular challenge we have encountered numerous times is that the data comes in the form of PDF files. PDF is an excellent output format for rendering contents on screen or in print, but it is a terrible input format for data-driven automated processes.
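What counts as an appropriate format depends on the solution. As one hedged example, the sketch below writes records as JSON Lines (one JSON object per line), a format many text-processing tools can read directly. The choice of JSON Lines, the function names, and the sample record are assumptions made for illustration only.

```python
import json


def to_jsonl(records):
    """Serialize a list of dicts as JSON Lines, one object per line."""
    return "\n".join(
        json.dumps(record, ensure_ascii=False, sort_keys=True)
        for record in records
    )


def from_jsonl(text):
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    # Newlines inside strings are escaped by json.dumps, so splitting on
    # "\n" never cuts a record in half.
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Hypothetical record; the round trip should return it unchanged.
example = [{"id": 1, "text": "Line one\nLine two", "label": "note"}]
serialized = to_jsonl(example)
restored = from_jsonl(serialized)
```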
Q6 Are the characteristics of the data known? Are the typical traits and features of the
data known? Perform an exploratory data analysis, and run it by all stakeholders in the
project. Make sure to exemplify typical and extreme values in the data, and encourage the project participants to manually look into the data.
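A minimal sketch of one exploratory step for text data: summarizing document lengths and pulling out the shortest and longest examples to show stakeholders. Measuring length in whitespace-separated tokens is an assumption; a real analysis would cover many more traits.

```python
import statistics


def describe_lengths(texts):
    """Summarize token counts and return the extreme examples."""
    def token_count(text):
        return len(text.split())

    lengths = [token_count(t) for t in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": statistics.median(lengths),
        "max": max(lengths),
        # Extreme examples are worth reading manually.
        "shortest": min(texts, key=token_count),
        "longest": max(texts, key=token_count),
    }


# Hypothetical sample texts.
sample_texts = [
    "ok",
    "the delivery was late",
    "thanks for the quick reply today",
]
summary = describe_lengths(sample_texts)
for name, value in summary.items():
    print(f"{name}: {value}")
```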
Q7 Is the data validated? Ensure that the traits and features of the data make sense and that, e.g., records are deduplicated, noise is accounted for, and null values are handled.
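The checks named above can be sketched as a small validation pass. The field names, the choice to treat empty strings as null, and the case-insensitive duplicate test are all assumptions for illustration; the right rules depend on the data at hand.

```python
def validate(records, required=("id", "text")):
    """Split records into clean ones and a list of (index, problem) pairs."""
    seen_texts = set()
    clean, problems = [], []
    for index, record in enumerate(records):
        # Treat both None and the empty string as null values.
        missing = [f for f in required if record.get(f) in (None, "")]
        if missing:
            problems.append((index, "missing " + ", ".join(missing)))
            continue
        # Assumption: texts that differ only in case or surrounding
        # whitespace count as duplicates.
        key = record["text"].strip().lower()
        if key in seen_texts:
            problems.append((index, "duplicate text"))
            continue
        seen_texts.add(key)
        clean.append(record)
    return clean, problems


# Hypothetical records with one duplicate and one null value.
raw = [
    {"id": 1, "text": "Hello"},
    {"id": 2, "text": "hello "},
    {"id": 3, "text": None},
    {"id": 4, "text": "Bye"},
]
clean, problems = validate(raw)
```

Returning the problems instead of silently dropping records makes it easier to discuss the findings with the people responsible for the data source.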
Q8 Do stakeholders agree on the objective of the current use case? What problem are
you trying to solve? The problem formulation should be intimately tied to a tangible business value or research hypothesis. When specifying the problem, make sure to focus on the actual need instead of a potentially interesting technology. The characteristics of the problem dictate the requirements on the data. Thus, the specification is crucial for understanding the requirements on the data in terms of, e.g., training data, and the need for manual labelling of evaluation or validation data. Only when you know the characteristics of the data will it be possible to come up with a candidate technological approach to solve the problem.
Q9 Is the purpose of using the data clear to all stakeholders? Ensure that all people
involved in the project understand the role and importance of the data to be used. This is to solidify the efforts made by the people responsible for relevant data sources to produce data that is appropriate for the project’s objective and for the potential technical solution that addresses it.
Q10 Is the data sufficient for the current use case? Given the insight into what data is
available, consider the questions: What data is needed to solve the problem? Is that a
subset of the data that is already available? If not: is there a way of getting all the data
needed? If there is a discrepancy between the data available and the data required to solve the problem, that discrepancy has to be mitigated. If it is not possible to align the data available with what is needed, then this is a cue to go back to the drawing board and either iterate on the problem specification or collect suitable data.
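The comparison between required and available data can be sketched as simple set arithmetic over field names. The field names below are hypothetical, and real sufficiency checks would also consider volume and coverage, not only which fields exist.

```python
def find_gaps(required_fields, available_fields):
    """Compare the fields a problem needs with the fields that exist."""
    required, available = set(required_fields), set(available_fields)
    return {
        "covered": sorted(required & available),
        "missing": sorted(required - available),
        "unused": sorted(available - required),
    }


# Hypothetical example: the problem needs timestamps that are not stored.
gaps = find_gaps(
    required_fields=["text", "label", "timestamp"],
    available_fields=["id", "text", "label"],
)
if gaps["missing"]:
    print("Data is not yet sufficient; missing:", gaps["missing"])
```

A non-empty "missing" list is the cue described above: revisit the problem specification or collect the data.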
Q11 Are the steps required to evaluate a potential solution clear? How do you know if
you have succeeded? The type of data required to evaluate a solution is often tightly
connected to the way the solution is implemented: if the solution is based on supervised
machine learning, i.e., requiring labelled examples, then the evaluation of the solution will
also require labelled data. If the solution depends on labelled training data, the process of annotation usually also results in the appropriate evaluation data. Any annotation effort should take into account the quality of the annotations, e.g., the inter-annotator agreement; temporal aspects of the data characteristics, e.g., information on when we need to obtain newly annotated data to mitigate model drift; and the representativeness of the data.
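Inter-annotator agreement can be quantified in several ways. As one common option, and not necessarily the measure used in the case studies, the sketch below computes Cohen's kappa for two annotators: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The labels in the example are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: share of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four items by two annotators.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.50"
```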
Q12 Is your organization prepared to handle more data like this beyond the scope of
the project? Even if the data processing in your organization is not perfect with respect to the requirements of machine learning, each project you pursue has the opportunity to
articulate improvements to your organization’s data storage processes. Ask yourself the
questions: How does my organization store incoming data? Is that process a good fit for
automatic processing of the data in the context of an NLP project, that is, is the data stored in a format that brings it beyond Band C (accessibility) of the Data Readiness Levels? If not, what changes would need to be made to make the storage better?
Q13 Is the data secured? Ensure that the data used in the project is secured in such a way that it is accessible only to authorized people. Depending on the sensitivity of the project, and thus the data, there might be a need to classify the data according to the security standards of your organization (e.g., ISO 27001), and implement the appropriate mechanisms to protect the data and project outcome.
Q14 Is it safe to share the data with others? In case the project aims to share its data with
others, the risks of leaking sensitive data about, e.g., your organization’s business plans or abilities have to be addressed prior to sharing it.
Q15 Are you allowed to share the data with others? In case the project wishes to share
its data, make sure you are allowed to do so according to the licenses, laws, and ethics
previously addressed in the project.