Away we go.
Another 3:30am start for us, prising the eyelids open to see what shiny new things AWS has for us, this time in the AI/ML/Data space, and they don’t disappoint. Dr. Swami Sivasubramanian emerges from the shadows to the dance of about a thousand spotlights, to bring us on a journey from data, through analysis, to innovation.
The message is clear: innovation only results from connecting the dots of data, sometimes massive amounts of data, gathered over time; you can’t have the spark unless you’ve got the background. It’s almost a call-back to a previous key phrase they’ve used, “there’s no compression algorithm for experience”.
Challenges faced by modern organisations in the data space often come down to four things: data doesn’t naturally flow, it isn’t automatically processed, it isn’t centralised, and it isn’t easy to visualise. Addressing these challenges can best be done by starting from a strong data strategy, built on three pillars: foundations, connective tissue, and democratised access to data, all delivered through the AWS principle of removing undifferentiated heavy lifting.
Future-proof data foundations
According to Gartner, 94% of the top 1,000 AWS customers use more than 10 different types of data store; the end-to-end data strategy has to account for this, and it’s here that we get a sneak peek of one of the new offerings (to be discussed later), called Amazon DataZone.
A future-proof data foundation is built on four corners: tools for every workload, performance at scale, removal of heavy lifting, and reliability and security.
Tools for Every Workload
To deal with complicated query structures and go beyond plain SQL, Athena gets a new query engine with Amazon Athena for Apache Spark – you can create Jupyter notebooks and use them to build your queries and dig deep into the data. It’s Generally Available now, but not in all regions (Sydney doesn’t have it yet, for example). This goes along with yesterday’s announcement of Spark support in Redshift as well. AWS claims that Spark runs three times faster in their environment, with their engine, than the open-source version, so you get a nice performance boost along with your serverless query engine.
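For a rough feel of what this looks like programmatically (a sketch, not from the keynote – the session ID, bucket, and PySpark snippet below are invented placeholders), Spark-enabled Athena workgroups expose a StartCalculationExecution API that takes a block of PySpark code to run in a session:

```python
# Illustrative sketch only: the session ID and the PySpark snippet are
# made-up placeholders, and nothing here actually calls AWS.

PYSPARK_SNIPPET = """
df = spark.read.json("s3://example-bucket/events/")   # hypothetical bucket
df.groupBy("event_type").count().show()
"""

def spark_calculation_request(session_id: str, code: str) -> dict:
    """Build the parameters you'd pass to Athena's
    StartCalculationExecution API in a Spark-enabled workgroup."""
    return {
        "SessionId": session_id,
        "CodeBlock": code,
        "Description": "Example aggregation over raw event data",
    }

request = spark_calculation_request("example-session-id", PYSPARK_SNIPPET)
```

The notebook experience hides this plumbing, of course – each cell you run becomes a calculation against the session.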
Performance at Scale
Reading data is kind of a “solved problem” – read replicas, caching, and so on – but the challenge of scaling a single point of write hasn’t been solved, apparently until today. AWS have had Amazon DocumentDB (with MongoDB compatibility), and today they appear to be dropping the parenthesised bit and adding autoscaling clusters to give us Amazon DocumentDB Elastic Clusters, which scale writes as well as reads for JSON document storage without you having to manage sharding yourself.
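To illustrate what’s being managed away for you (a toy sketch, not the DocumentDB implementation), elastic clusters distribute writes by hashing a shard key – which is exactly the routing and rebalancing you’d otherwise have to hand-roll:

```python
# Toy illustration only: hash-based shard routing of the kind a
# sharded document store performs on every write.
import hashlib

def shard_for(shard_key_value: str, num_shards: int) -> int:
    """Map a document's shard-key value to one of num_shards partitions."""
    digest = hashlib.sha256(shard_key_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# The same key always routes to the same shard, so related documents
# stay together while writes spread across the cluster.
home_shard = shard_for("customer-42", 4)
```

The point of the managed service is that you pick the key and AWS handles the partitioning, rebalancing, and scaling underneath.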
Remove Heavy Lifting
We re-visit Amazon DevOps Guru, perched on the mountain top, with the reminder that it can detect and help remediate database issues in RDS. We’re reminded that with S3 Intelligent-Tiering we can save on storage costs automatically, and that SageMaker can do similarly clever things with ML data using SageMaker Ground Truth and Ground Truth Plus. We’re advised that Dow Jones have used SageMaker to “improve customer engagement by more than 2x”, but we aren’t told what that means or how it’s measured.
Gartner is brought out again, with a statistic that apparently 80% of new data is unstructured or semi-structured, including image data and especially geospatial data, which is why they’re announcing Amazon SageMaker support for Geospatial ML, which should make finding and incorporating geo data into your ML models considerably simpler. We’re treated to a canned demo of how to predict which roads might be subject to flooding based on satellite photos from multiple sources, which is pretty swish.
Reliability and Security
The existing underlying services give you a lot already: S3 offers 11 9’s of durability, Lake Formation allows you to apply security governance across all sorts of stored data, and RDS is already Multi-AZ for resilience – but until now Redshift hasn’t had this capability, and so today they announce Amazon Redshift Multi-AZ for automatic failover with guaranteed capacity.
One of the big problems teams have when migrating into a fully-managed PostgreSQL database like RDS is that there’s a heap of really useful extensions, but getting them integrated into the managed experience is difficult because of the risk to the stability and security of the database engine. Until now, AWS have only allowed their own pre-approved extensions, which of course takes a lot of time. So today they’re announcing Trusted Language Extensions for PostgreSQL (including Perl, which is odd), which is open source and can be found at https://github.com/aws/pg_tle/ – this should go a LONG way towards letting customers bring along those custom extensions that they haven’t been able to get away from.
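As a sketch of the workflow based on my reading of the pg_tle README (the extension name, version, and body below are invented for illustration), you register the extension’s SQL with pgtle.install_extension, after which a plain CREATE EXTENSION loads it like any other:

```python
# Sketch only: the SQL below is what you'd run against a pg_tle-enabled
# PostgreSQL instance; the extension name and contents are hypothetical.
INSTALL_SQL = """
SELECT pgtle.install_extension(
  'my_math_helpers',            -- hypothetical extension name
  '1.0',                        -- version
  'Small helper functions',     -- description
  $_tle_$
    CREATE FUNCTION add_one(x integer) RETURNS integer
      AS $$ SELECT x + 1 $$ LANGUAGE sql;
  $_tle_$
);
"""

# Once installed, loading it looks like any built-in extension:
CREATE_SQL = "CREATE EXTENSION my_math_helpers;"
```

The trusted-language sandbox is what makes this safe in a managed environment – the extension body can only use languages the engine can contain.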
For the security of data sitting in places like S3, and for running compute environments like EC2, AWS has already provided GuardDuty; but what about databases? Well, with the preview launch of Amazon GuardDuty RDS Protection, you get a fully-managed machine-learning engine which watches your RDS resources to identify suspicious activity and alert you when something looks a bit off. GuardDuty has been a lifesaver for many organisations, and adding RDS into the mix as a protected target is even better.
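Enabling it should be a matter of flipping a feature on your existing detector. As a sketch (the detector ID is a placeholder, and treat the RDS_LOGIN_EVENTS feature name as my assumption about the GuardDuty API rather than gospel), this builds the parameters for an UpdateDetector call without actually making it:

```python
# Sketch only: constructs request parameters for GuardDuty's
# UpdateDetector API. The detector ID is a placeholder and the
# feature name is an assumption, not confirmed from the keynote.
def enable_rds_protection_params(detector_id: str) -> dict:
    """Parameters to switch on RDS Protection for an existing detector."""
    return {
        "DetectorId": detector_id,
        "Features": [
            # RDS Protection watches database login activity for anomalies
            {"Name": "RDS_LOGIN_EVENTS", "Status": "ENABLED"},
        ],
    }

params = enable_rds_protection_params("example-detector-id")
```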
Weaving Connective Tissue
This time, three corners underpin the principle: quality tools and data to drive future growth; governance that connects siloed teams instead of separating them; and connected data stores, which are critical for survival.
Quality Tools and Data
A big problem with data lakes is that they’re source-aligned; it’s a bit of a case of “collect all the data and then work out what to do with it”. That means a lot of data cleansing and hunting for bad records before you can trust anything. The announcement of AWS Glue Data Quality should go a long way towards making this easier, with support for automatic generation of data quality rules, plus detection and notification of low-quality data in your existing environment.
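As a toy illustration of the kind of checks involved (this is plain Python standing in for Glue’s engine; the DQDL-style ruleset string and the sample rows are invented), rules like “the column is complete” and “values are positive” boil down to:

```python
# Toy illustration: plain-Python stand-ins for data quality rules of
# the kind Glue Data Quality's DQDL expresses. Not Glue's engine.
RULESET = 'Rules = [ IsComplete "order_id", ColumnValues "price" > 0 ]'

def is_complete(rows, column):
    """True when no row is missing a value for the column."""
    return all(r.get(column) is not None for r in rows)

def values_positive(rows, column):
    """True when every present value in the column is greater than zero."""
    return all(r[column] > 0 for r in rows if r.get(column) is not None)

rows = [
    {"order_id": "a1", "price": 12.5},
    {"order_id": "a2", "price": 3.0},
]
passed = is_complete(rows, "order_id") and values_positive(rows, "price")
```

The value of the managed service is that it can generate rules like these for you from the data it sees, then alert when new data stops conforming.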
Using governance to connect siloed teams
In most cases, “governance” means “you’ve gotta keep ‘em separated”, but the AWS take on it is similar to the modern perspective on Security teams: the role is to enable, through secure practice, not to be the “department of no”. This is challenging, though: it takes time to identify what the correct controls are, and it’s time-consuming to create the right mix of roles and permissions and manage user assignments to those roles. Up to now, Lake Formation has been a good step in the right direction, allowing row-, column-, and individual cell-based permissions based on external tagging, but it hadn’t applied to Redshift; well, breathe easy, because the preview of Centralised Access Controls for Redshift Data Sharing has just dropped.
That’s governance over structured data, though – what about ML data? Yes, you can have some governance for that as well with the GA announcement of SageMaker ML Governance which provides a Role Manager, Model Cards, and a Model Dashboard to identify, classify, and manage access to all different kinds of ML data.
Perhaps most exciting though is the launch of Amazon DataZone – you could think of this as a fully-managed “data mart” type approach, where data producers can register, tag, identify, and apply metadata to datasets (eg how often it’s updated, what classification it has); data consumers in the organisation can then use DataZone to find data that could be useful for them, and build analytics, visualisations, ML models, and whatever their hearts desire on top of it.
Weird, internally-meaningful table and column names can have ML-generated business-meaning names applied to them, which will make discovery so much easier.
The demo is pretty swish, if you like a lot of clicking and pointing and then typing SQL queries into Athena, but at the very least it does go a long way towards solving the problem of “what data do we even have?”
Another big plus is that DataZone handles delegation of roles across org boundaries – if you collect and manage the data, but don’t know what it means, you can appoint someone else to be the “Steward” of the data, to manage the metadata without needing to worry about how the data is collected and stored. That’s a nice division of labour.
Connecting Across Silos
A problem with having data in a lot of different data stores is that you have to do all sorts of extract-transform-load activities to get it from where it is to where you can query it. Federated queries in Athena have been one way, and there’s further to go there as well – but if your data is in Aurora, for example, how can you analyse it in Redshift without building an ETL pipeline or invoking Athena? Now, with Amazon Aurora Zero-ETL Integration with Amazon Redshift, you can – and you also get a product with two products named in it!
Loading data into Redshift has traditionally meant setting up schedules, performing the ETL part, and then validating that the data has loaded correctly via some reconciliation process. AWS aim to make this simpler now with the release of Amazon Redshift Auto-Copy from S3, where you just plop data into an S3 bucket and it appears as if by magic in a Redshift table. What fun!
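As I understand the preview (the bucket, table, IAM role, and job name below are placeholders, and the exact syntax may shift before GA), you define a COPY job once and Redshift then ingests newly-arriving objects automatically:

```python
# Sketch of the auto-copy statement as I understand the preview; all
# identifiers (table, bucket, role ARN, job name) are placeholders.
AUTO_COPY_SQL = """
COPY example_table
FROM 's3://example-bucket/incoming/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
FORMAT AS CSV
JOB CREATE example_auto_copy_job
AUTO ON;
"""
```

After that, anything landing under the prefix gets loaded without a scheduler or reconciliation step in sight.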
Data silos aren’t just internal to an organisation though; think about all the third parties that your business has data stuck in: Salesforce, for example. AppFlow has been around for a while, but the number of connectors has been fairly limited; today, the announcement is that Amazon AppFlow now offers more than 50 connectors including things like Datadog, Slack, Jira Cloud, Google Analytics, LinkedIn Ads, and more.
AppFlow isn’t the only connector to get more though – SageMaker Data Wrangler gets some love as well with more than 40 new data sources added, including many of the same sources that AppFlow supports.
Democratised Access to Data
Access to data isn’t just about being able to find it – it’s also about knowing how to use it. Data literacy is critical, which is why AWS has been leaning heavily into the training side of things, with the launch of educator training through AWS Machine Learning University; there are colleges in the USA offering Bachelor’s Degrees based on this.
If that learning path isn’t your style, the existing AWS Training suite includes 150+ courses for self-paced learning, including getting started on DeepRacer which is “the fastest way” to get hands-on with Machine Learning.
Low-code and no-code tools
Another barrier to entry is the requirement to learn all sorts of high-tech stuffs: SQL, table joins, 3rd Normal Form, whaaaaaat? Hence the advent of low-code and no-code tools to allow less-technical business folk to build integrations and gain insights without having to go through a computer science degree first. There are still challenges with integrating this approach into a modern software engineering business, of course, with questions around audit trails, version control, and repeatability going completely out of the window, but from the perspective of prototyping and ad-hoc querying, it’s a good thing.
We hear again about Amazon QuickSight Q, which allows you to do natural-language queries – but this was announced last year, so it feels a bit like a “we have to have a low-code/no-code slide in here as well” attempt.
That’s All, Folks!
And that’s the wrap. From my personal interest, the announcements I’m most keen to explore are:
- Trusted Language Extensions for PostgreSQL
- Amazon DataZone
- AWS Glue Data Quality
- Amazon GuardDuty RDS Protection; and
- Spark Jupyter Notebooks in Athena
What’s your favourite? If any of this sounds amazing, interesting, challenging, problem-solving, or a complete transformation opportunity for you or your business, give Cevo a buzz and we can help you navigate these sometimes deep waters.
Onward to the next keynote!