AI Feature Engineering for Pros
My first experience with AI was in 2010, on an NLP project at Microsoft. The toolchain, framework and overall process were rudimentary and did not lend themselves to rapid iteration or follow any particular engineering workflow. As with most AI projects, the goal is to get to a minimum viable model (MVM), so it’s understandable that automation and tools are deferred until the basics of feature and model selection are complete.
As in most feature engineering scenarios, however, deferring these concerns actually slows the iterative process, with engineers spending more time on scaffolding tasks than on data science tasks. This also speaks to the difference between building AI features and building non-AI features. With non-AI features, it’s more crucial to “true up” your stack during development to accelerate iteration. AI features, however, require a more exploratory approach, where feature engineering and model selection are more valuable than truing up the application with each change.
So how do you set yourself up to engineer AI features like a pro? Easy!
To build AI features quickly and with confidence, especially in a team setting, you need a deterministic environment that spans development and production. Whether you’re trying to freeze framework versions or easily share experiments, it’s crucial to reduce the friction for other engineers and environments to spin up an AI feature.
Docker has a number of benefits for AI engineering teams:
- Sharing environments between engineers
- Rapid framework evaluation
- Deterministic deployments through environments
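As a rough illustration of what "deterministic" can mean in practice, here is a minimal sketch that checks whether the packages installed in an environment (a container or otherwise) match the versions a team has agreed to freeze. The `check_pins` helper and the pins passed to it are hypothetical names for this example; only the standard library's `importlib.metadata` is used.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(pins):
    """Compare installed package versions against the expected pins.

    Returns {package: (expected, installed)} for every mismatch;
    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches
```

Running something like `check_pins({"scikit-learn": "1.4.2"})` (the version number is only an example) at container start-up or in a notebook's first cell would surface drift between environments early; an empty result means everything matches.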
The feature engineering lifecycle for AI features is bifurcated between design and maintenance. During the design phase, the ability to tinker, hack and visualize is critical, and there is no better environment than Jupyter for that.
The best way to think of Jupyter is as an online document editor where paragraphs can be code that is executed by an interpreter. Out of the box, Jupyter supports Python, which is perfect for AI engineering given the extensive support provided by libraries like scikit-learn and TensorFlow.
For example, before creating Git repos, scaffolding an app, building unit tests, etc., it’s much easier and more useful to simply start exploring the AI problem. Take the very useful task of calculating the similarity of text, say, for a deduplication activity. With Jupyter, you create a new notebook and start hacking and iterating, without the need to actually spin up a program.
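To make the deduplication example concrete, here is the kind of cell you might hack together in a first notebook. It is only a sketch under simple assumptions: similarity is measured as the cosine of raw word counts, the 0.8 threshold is arbitrary, and the function names are made up for this example. A real project would likely move on to something like TF-IDF from scikit-learn once the idea proves out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the words the two texts share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is treated as similar to nothing.
    return dot / (norm_a * norm_b)


def find_duplicates(docs, threshold=0.8):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if cosine_similarity(docs[i], docs[j]) >= threshold
    ]
```

Calling `find_duplicates` on a small list of strings and eyeballing the pairs it returns is exactly the sort of quick feedback loop a notebook is good for; the pairwise comparison is quadratic, which is fine for exploration but would need rethinking for large corpora.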
The great thing about Jupyter is that it supports not only Python but also, through additional kernels, other languages like Go and Bash! This means you can iterate independently of the dev process until you’ve fleshed out your concept, and then simply migrate the working code to an IDE via copy and paste or Jupyter’s export capabilities.
Oh, and Docker + Jupyter means you can get started with the leading data science stacks like scikit-learn, TensorFlow, Spark and R. This is a huge boost as it means you can start exploring and vetting these AI frameworks and platforms without having to waste time setting them up.
Yes! Jenkins! Now, I’m always in danger of being called out for using Jenkins for just about everything, but let’s face it, when it comes to doing stuff when stuff changes, Jenkins is awesome.
So how do you leverage Jenkins as part of your AI feature workflow? After you’ve moved your feature from iteration to mainstream development, you must monitor its performance. Like the rest of your stack, AI features have specific attributes that must be measured before you deploy them. For example, you might have a classifier that is trained nightly by a content team. Jenkins can run a job that computes a confusion matrix for the latest model checked into source control. If its accuracy is lower than that of the currently deployed version, the job simply withholds it from deployment; if the accuracy is better, the job deploys it. This is a great example of using CI as part of your AI feature pipeline.
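The deploy-or-withhold check described above could be sketched roughly as follows. This is an illustration rather than a real Jenkins integration: the function names are hypothetical, the labels and predictions are assumed to come from a held-out test set, and treating a tie as "withhold" is my own choice since only the better and worse cases are spelled out.

```python
def confusion_matrix(actual, predicted, labels):
    """Build a nested dict: matrix[actual_label][predicted_label] -> count."""
    matrix = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of examples on the diagonal (actual == predicted)."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in matrix)
    return correct / total if total else 0.0


def gate(actual, predicted, labels, deployed_accuracy):
    """Return a process exit code: 0 to deploy, 1 to withhold.

    The candidate is deployed only when it is strictly more accurate
    than the currently deployed model (ties are withheld by assumption).
    """
    candidate = accuracy(confusion_matrix(actual, predicted, labels))
    return 0 if candidate > deployed_accuracy else 1
```

A script run from a Jenkins shell step could end with `sys.exit(gate(...))`; a non-zero exit code fails the step, which keeps the later deployment stage from running.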
The key takeaway here is to stay lightweight and iterative during the design phase, then use proven automation platforms and techniques to ensure you manage the lifecycle of your AI features. Time spent on learning technologies like Docker and Jupyter will not only accelerate your AI feature development but make it easier to move them from development to production.