SageMaker Studio was announced at AWS re:Invent 2019, as a web-based IDE for working with Machine Learning models. It incorporates the other SageMaker components also released at the same time: AutoPilot, Debugger, Model Monitor, and Notebooks. It’s intended to address some of the complexities of getting started with machine learning, which is a vastly complex domain of knowledge that’s traditionally required a lot of time and effort to get involved. I’m keen to see how easy it is to get going.
THE USE CASE
I’d like to try predicting my home’s power consumption over the next year based on historical data. I have a year’s data, at a fairly ridiculous level of granularity (think, about every 7 seconds) that I’ve captured from my smart meter, already in S3.
PREREQUISITES
AWS SSO
SageMaker Studio works best with AWS SSO (Single Sign On) enabled, and for users to be logged in via AWS SSO. If your organisation doesn’t already have AWS SSO enabled, you can use IAM, though the console warns that some things might not work properly. This post is written from the point of view of the SSO experience.
OHIO
The second prerequisite is regional – SageMaker Studio is only available in the Ohio (us-east-2
) region at the time of writing.
Bear in mind that AWS SSO can only exist in one region at present, and therefore you MUST have created your SSO configuration in us-east-2
or the rest of this process won’t work.
I got this wrong the first time around, and had to delete my AWS SSO configuration from Sydney, and recreate it in Ohio which was a bit of a pain.
SAGEMAKER STUDIO SETUP
IAM AND NETWORK CONFIGURATION
Once I’d gone through the shenanigans above, I was able to log in as my SSO user, switch to the console of the target account, change region to Ohio, and launch SageMaker Studio.
On first launch, SageMaker Studio takes you through the process of setting up IAM Roles, configurations for VPCs and whatnot. I had SageMaker create an IAM role for me, and then I set up the default VPC as my access environment:
Once this step is complete, SageMaker Studio takes a little while to provision some things; for me, this took about 5 minutes.
USER PROFILES
Once your SageMaker Studio instance has provisioned, you need to assign users to it. Eventually, SageMaker Studio appears in the list of SSO apps on the console but if users want to get started before that eventual consistency kicks in, you can send them the address of the SageMaker Studio instance so that they can go direct.
SIGNING IN
After I’ve signed in as my SSO user, I navigate to the address of the SageMaker Studio instance (from above). There’s a bit of “loading things” splash screen animation, and I’m in!
The console looks good – dark-themed, with a “get started” link that I fully intend to click:
CLEANING THE DATA
I’m going to play around with some collected power consumption data from my house. In Australia, “smart meters” are becoming more common and you can have an in-home device which “binds” to the meter and can give you instant power consumption details as granular as every 5 seconds. I’ve got about a year’s worth.
One record looks like this:
{'timestamp': '2019-01-03T12:53:08Z', 'raw_demand': 760, 'multiplier': 1, 'divisor': 1000, 'demand': 0.76}
and records are split across thousands of separate JSON files, each timestamped with when they were saved to S3. I’m not going to start with all of the data, there’s a ridiculous amount, so I’m just going to try training the model with about a month’s worth.
I need to turn this into CSV for AutoPilot to play with, which I can do with the following bash
incantation (note: Linux)
( echo timestamp,raw_demand; find . -type f | xargs grep raw_demand | awk '{ print $2, $4 }' | tr -d "[',]" | tr ' ' , ) > /tmp/data.csv
This gives me a CSV with two columns – the first, the timestamp of the reading, and the second the “raw demand” or figure in watts that’s being drawn through the meter at that point.
timestamp,raw_demand
2018-05-27T06:32:06Z,3789
2018-05-27T06:32:14Z,3971
2018-05-27T06:32:22Z,3824
2018-05-27T06:32:30Z,4011
...
This gives me about 116,000 lines of CSV. Is that too much, not enough, or just right? I don’t know, so I’m just going to go with it.
EXPERIMENTING WITH AUTOPILOT
Perhaps the biggest claim of SageMaker Studio is its integration with the newly-announced AutoPilot, which (as per the AWS blurb): “can be used by people without machine learning experience to easily produce a model or it can be used by experienced developers to quickly develop a baseline model on which teams can further iterate” (from https://aws.amazon.com/sagemaker/ at the time of writing).
KICKING THE TYRES
AutoPilot is an “Experiment”, so you’ll find it under the “Experiments” tab of SageMaker Studio:
- Start up the IDE by logging in as your SSO user and navigating to the SageMaker Studio instance URL provided when you created it (above)
- Click the “Experiments” icon
- Fill in the details for the name, target attribute, input and output S3 locations. I set my “problem type” to “Auto” because I don’t know any better, and I said “Yes” to running a complete experiment
- I click “Create Experiment” and we’re off
The initial screen is the state of the experiment – a little “refresh” button at the top-left allows you to update the progress.
OOOPS …
Well, that went bang – the experiment failed, with the following error message: “ClientError: It is unclear whether the problem type should be MulticlassClassification or Regression. Please specify the problem type manually and retry.”
So, that seems easy enough. A bit of googling and I’m led to a page which tells me “fundamentally, classification is about predicting a label and regression is about predicting a quantity”. Well, predicting my power usage is what I’d like to do, and that’s a quantity, so regression it is!
Hmm, now I have to choose what “objective metric” I want. The pulldown only has “MSE” though, so I go with that.
The second experiment fires up and starts “Analyzing Data”. Other than the fact that the console says it’s “Analyzing Data”, there’s no feedback on what’s going on behind the scenes. If I navigate away from this page, I can get back to it by reloading the SageMaker Studio console, selecting “Experiments”, right-clicking the job I created and choosing “Describe AutoML Job”
This time, the AutoML job appears happy with my choice of problem type – I get a green check next to the “Analyzing Data” arrow.
If you get interested, you can see the various training jobs in the SageMaker console, under Training > Training jobs
and the models under Inference > Models
. It probably won’t surprise you that you don’t get all the fun “spinny brains” graphics from the launch events – those are just CGI meant to wow the audience, but we’re now serious data scientists and don’t need such fluff.
Clicking through the hyperparameter tuning jobs page reveals the frankly colossal amount of … stuff … that goes on under the covers. Mere screenshots can’t do it justice. By default, AutoPilot runs 250 tuning jobs, 10 in parallel. For my data set, each tuning job takes about 5 minutes including instance startup time, so it should take about 2 hours to complete.
After a couple of hours (and a bit), I have a “best” model that I can deploy:
Deploying it is as easy as choosing an endpoint name and instance type, and clicking a button:
CONCLUSION
Yes, AutoPilot does what it says on the tin – it takes the previously mind-bogglingly complex process of building and training a machine learning model and, frankly, makes it pretty straightforward.
What it doesn’t do is help you use the model to do anything; that’s a topic for a separate post though, which will follow this one shortly.