In my most recent role at Microsoft, one of the aspects I loved most was consulting with clients on their toughest data science problems. Initially I was skeptical about the impact one could make on a challenging data science problem in 5+ days without any prior knowledge of the domain, the data, or the problem to be solved. It wasn’t until I did my first few projects that I realized how powerful a solid workshop playbook is for adding value to a customer’s data science problems.
It starts with a solid question
It seems intuitive that every good data science outcome probably started with a good data science question. The truth is that AI, ML, and data science are practiced more by the marketing team than the engineering team in industry, which leads to a lot of ill-defined data science projects and, in turn, ill-fated data science outcomes.
For me, a good question is a simple statement derived from a formal process that includes:
- Who is asking this question and what is the impact of answering it? Think of this as question/answer fit; if you don’t know who needs the answer or why it matters, it’s most likely not worth answering.
- Will the answer add insight, rather than refine existing understanding? Why does this matter? A meaningful data science project should create new insights. If the insights are already available via existing means, the overall ROI on the project will be diminished. Seek to discover new meaning rather than incrementally add to existing knowledge.
- Is it meaningful? Just because you have a great question, and the answer doesn’t already exist, doesn’t mean it’s a good candidate for a data science expedition. Make sure you can tie the result to business outcomes.
- Is it tractable? Intuitively, before you start exploring the model, does it seem there is an answer somewhere in the exercise? If you’re chasing a wild goose, it will be hard to bound the problem and establish a measure of completeness.
- Is it bounded? This is one of the most important aspects. If the question is too vague or open-ended, it will be difficult to select and tune a model that can achieve your goals. The best way to think about this is in terms of inputs and outputs: if you can’t specify the output clearly, working backwards through the model and the inputs will be difficult.
What does a good question look like?
What measurements can the maintenance team use to predict a possible motor failure on a shop floor machine with greater than 80% confidence using existing telemetry data before an outage occurs?
This question has:
- An owner: the maintenance team
- An insight: Predict possible motor failure
- Meaning: Before an outage occurs
- Tractability: Intuitively, telemetry data on prior machine operation and failures should give us the answer we seek
- Bounds: An existing data source and a confidence rating help us bound the problem and know when we’ve finished
Wallow in the data
I love this quote by Sir Arthur Conan Doyle from Sherlock Holmes:
‘Data! Data! Data!’ he cried impatiently. ‘I can’t make bricks without clay.’
Too many data science projects rush head first into model selection and don’t spend enough time truly understanding the relationship between the underlying model and the data. You must determine:
- What data will I need to answer the question?
- Where will this data come from? Does it already exist or will I have to create new data through feature engineering? Is data missing, and will it need to be imputed? Will we need to create new categories from continuous values?
- Is the data “clean”? Data preparation can take up to 50% of the project time. De-duplication, removal of extraneous data, and reformatting are all cleaning tasks that may be required to prepare the data for modelling.
- Column analysis. What is the distribution of the data? What kind of values form the population? Will new incoming data distort the distribution/population in the future?
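The cleaning and column-analysis checks in the list above can be sketched in a few lines. This is only an illustration using the Python standard library: the telemetry rows, the column layout, and the helper names (`deduplicate`, `impute_median`, `profile`) are all hypothetical, and median imputation is just one of several reasonable choices.

```python
import statistics

# Hypothetical telemetry rows: (machine_id, insulation_resistance, failed).
# None marks a missing reading. All values are invented for illustration.
rows = [
    ("m1", 120.0, 0),
    ("m2", 95.0, 0),
    ("m2", 95.0, 0),   # exact duplicate of the previous row
    ("m3", None, 1),   # missing reading
    ("m4", 20.0, 1),
    ("m5", 110.0, 0),
    ("m6", 15.0, 1),
]

def deduplicate(records):
    """Drop exact duplicate rows while keeping the original order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique

def impute_median(records, column):
    """Replace None in one column with the median of the observed values."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = statistics.median(observed)
    return [
        tuple(fill if (i == column and v is None) else v for i, v in enumerate(r))
        for r in records
    ]

def profile(values):
    """Summarise a numeric column so its distribution can be eyeballed."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

clean = impute_median(deduplicate(rows), column=1)
summary = profile([r[1] for r in clean])
print(summary)
```

Re-running `profile` on each new batch of incoming data and comparing it with the training summary is one simple way to spot the distribution drift the last bullet warns about.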
Once you have a candidate data set, you need to revisit your question and ask, “Given the data we have, are we still able to answer the question?” The relationship between the question and the data is iterative, and before exiting this loop you should feel confident the question and the data set are aligned.
In data science it is common for teams to miss this step and go straight to the algorithms. Model selection is crucial to the success of your data science project. So what is a model? Put simply, a model helps explain the relationship between your question and your data. Let’s go back to our original question. Here we’re trying to predict an outcome based on an existing data set. A model will help us make assumptions about the telemetry data and its ability to predict failure of a machine. There are many models, and they fall into groups. For example, one potential model we could use is Linear Regression, where we might assume that there is a relationship between a measure of a motor’s insulation resistance and potential failure.
Once we have a set of candidate models that help us create a relationship between our question and answer, we can move towards fitting the data to the model.
Let’s revisit our example. We think Linear Regression is a good model to describe our question and answer. The base Linear Regression model has two parameters, an intercept and a slope. What we want, though, is a function that we can use to predict future events, so we need to create a new function from the existing data that helps us answer our question. At a naive level, we could simply work out a basic straight-line function that, given the value of the motor’s insulation resistance, outputs a value “fail/no-fail”. But there are many ways to create a linear model, and this is where frameworks like scikit-learn come into play, as they can help us fit a model to our data, with enough control over the parameters to ensure we can meet our goal, in our case, 80% confidence. scikit-learn has an awesome chart that helps visually explain this.
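To make the naive straight-line version concrete, here is a minimal sketch that fits an intercept and a slope by ordinary least squares in plain Python and then thresholds the line to get “fail/no-fail”. The readings, the 0.5 cut-off, and the function names `fit_line` and `predict_failure` are assumptions made for illustration; in a real project a framework such as scikit-learn would replace the hand-rolled fit.

```python
# Hypothetical training data: insulation resistance readings and whether
# the motor later failed (1) or not (0). Values are invented.
resistance = [120.0, 110.0, 95.0, 90.0, 30.0, 20.0, 15.0, 10.0]
failed = [0, 0, 0, 0, 1, 1, 1, 1]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_failure(reading, intercept, slope, threshold=0.5):
    """Turn the fitted line into fail/no-fail with an assumed cut-off."""
    return intercept + slope * reading >= threshold

intercept, slope = fit_line(resistance, failed)

# A low reading should be flagged, a high reading should not.
print(predict_failure(12.0, intercept, slope))
print(predict_failure(100.0, intercept, slope))
```

The slope comes out negative on this data, which matches the assumption that lower insulation resistance goes with failure. The threshold is the kind of parameter you would tune while working towards the 80% target.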
At this point in the process, we have a data set we are using for training, we’ve arrived at a model that we think represents our question and answer relationship well, and we’re using algorithms to help us fit our data to our model. Now we want to be highly empirical about our process, and this is where evaluation is crucial. It is also crucial to be open to iteration: data science is very iterative, and learnings from the data -> algo -> model cycle can help us refine the process.
Key measures you should consider are:
- Performance: As you work through different models and different parameters/hyper-parameters, you must always measure the performance of each iteration. Once you reach your desired threshold, it is good discipline to baseline that experiment and move forward. Likewise, candidates that reduce your performance should be rejected and documented.
- Explainability: Black-box models are problematic, as they can deliver superior results, but if you cannot describe why, or reason as to the relationship between the question and the answer, then you should treat these models with suspicion, and continue searching for a model with equal performance that is easy to reason about.
- KISS: As you explore multiple models and parameters/hyper-parameters, always favor simpler candidates. This speaks to the first two points; a model that is easy to reason about and performs to the standard we desire is better than a model that is hard to explain but performs beyond our requirements.
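The three measures above can be combined into a simple selection rule: score every candidate on held-out data, reject anything below the target, and among the rest prefer the simplest. The sketch below assumes accuracy as the performance measure and a hand-assigned `complexity` rank as a stand-in for explainability; the candidates, data, and names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Candidate:
    name: str
    predict: Callable[[float], int]
    complexity: int  # assumed ranking: lower means easier to explain

def accuracy(candidate, xs, ys):
    """Fraction of held-out examples the candidate labels correctly."""
    hits = sum(candidate.predict(x) == y for x, y in zip(xs, ys))
    return hits / len(xs)

def select(candidates, xs, ys, target=0.80):
    """Return the simplest candidate meeting the target, plus all scores."""
    scores = {c.name: accuracy(c, xs, ys) for c in candidates}
    passing = [c for c in candidates if scores[c.name] >= target]
    if not passing:
        return None, scores
    # Simplest first; break ties in favour of the higher score.
    chosen = min(passing, key=lambda c: (c.complexity, -scores[c.name]))
    return chosen, scores

# Hypothetical held-out readings and outcomes (1 = failed).
held_out_x = [125.0, 100.0, 85.0, 60.0, 40.0, 25.0, 18.0, 12.0, 55.0, 35.0]
held_out_y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

candidates = [
    Candidate("always_ok", lambda x: 0, complexity=0),
    Candidate("threshold_50", lambda x: int(x < 50), complexity=1),
    Candidate("band_rule", lambda x: int(x < 50 or 52 <= x <= 58), complexity=3),
]

chosen, scores = select(candidates, held_out_x, held_out_y)
print(chosen.name, scores)
```

Here the more complex rule scores higher, but the single threshold already clears the 80% target, so it is the one kept. The `scores` dictionary doubles as the record of rejected and baselined experiments.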
Ah, shipping. One of the biggest challenges to any data science project. What does shipping even mean? Have we shipped once we have a winning candidate model and parameters/hyper-parameters? Well, this is somewhat subjective, and it really belongs as a criterion in its own right.
At the start of every data science project, clearly define what it means to ship. If the model is going to be used in an offline business process, then shipping might mean wrapping it in a lightweight web page and exposing that to the business. If the model is going to be used in your product, in a production environment, then you’re going to have to think about operational concerns, monitoring, life-cycle, collecting model telemetry, retraining and re-deploying new versions, etc.
In general, shipping a model is measured by use of the model by the owners of the question. You know you’ve shipped when those owners begin to receive answers, and even more important, you should be able to measure the ROI on those answers. Like every good feature, if the model provides no value, maintaining it over time does not make sense.
Running a data science workshop before each project is, in my mind, a must. It helps identify all the relevant stakeholders, forces everyone through a methodical process, and ensures we’re using objective measures to define success or failure. The most important aspect of a workshop is to determine whether a data science project is worthwhile, before setting off into the great unknown.
Need help running a workshop? Drop me a line!