Power of Artificial Intelligence (AI) with Amazon Web Services (AWS)
The cost and efficiency of the cloud puts machine learning and artificial intelligence (AI) within the grasp of enterprises big and small. Help your organization tap into their power with Amazon Web Services. This article is a practical approach to leveraging AWS for AI-based applications across a variety of industries, including healthcare, finance, law enforcement, manufacturing, and education. Instructor David Linthicum introduces SageMaker, Amazon’s AI platform, and presents a variety of use cases that demonstrate current best practices, tools, and techniques. He shows how to build and train machine learning models with SageMaker, and integrate them into real-world apps. David also dispels some concerns around AI, such as cost and security, by showcasing real AWS solutions.
Learning Objectives:
- AI basics
- AI use cases
- AI application walk-through
- AI costs
- AI security
- AI governance
Nothing has been more life changing than the use of machine learning and the rise of cloud computing. The cost and efficiency of the cloud make machine learning possible, and enterprises big and small are now tapping into the power of machine learning and artificial intelligence, turning technology into a force multiplier for winning at business. I’m going to teach this guide using very pragmatic approaches, based on current best practices and real machine learning technology you can use in AWS today. This won’t be another abstract, high-level concept, but a true solution that will teach you the basic concepts of machine learning and how to apply them using AWS technologies. Hi, my name’s Dave Linthicum. I wish you a pleasant journey in learning about the world of machine learning on Amazon Web Services and what it can do for you and your cloud-enabled enterprise.
AI on AWS
What does artificial intelligence mean to the world of Amazon Web Services, or cloud computing in general? Well, AI systems, sometimes called machine learning and deep learning systems as well, make your system think. Ultimately, you’re able to gather knowledge over time. So we’re able to bind learning systems to traditional applications that typically didn’t leverage artificially intelligent systems, and we’re also able to find patterns in massive amounts of data: the ability to think through what’s occurring within the data and come to some conclusions that we should be aware of. And ultimately, the ability to automate things that aren’t currently automated. We may have well-understood skills today that rely on people who have those skills, and we’re able to hand off some of that work to artificially intelligent systems. We’re looking at two categories of artificially intelligent cloud services on AWS: native and non-native. The non-native services are systems that are used both within Amazon Web Services and outside it. Those include TensorFlow, PyTorch, and Apache MXNet. In this tutorial we’re going to focus on Amazon Web Services native cloud services, and specifically that’s going to be Amazon SageMaker.
What you should know about AI, ML and AWS
So what skills do you need to learn about AI on Amazon Web Services? Well, first and foremost, we’re assuming that artificially intelligent systems are going to be a huge deal on cloud-based systems for the next several years, and these include systems that are built from scratch, net new, or systems that are migrated. Skills you really should have before reading this article include a basic understanding of cloud computing and public clouds in general. What is Amazon Web Services? What is Microsoft Azure? What is Google Cloud Platform? You’ll also want a basic understanding of how data works, either raw storage of information or systems that exist in databases; of the processes around application development, including emerging best practices; of how security works, basically; of how governance works; and ultimately a good understanding of the business side, including the requirements to leverage new systems and new technologies, such as artificial intelligence.
AI Basics
AI Processing
Why do machines need to think? Well, first and foremost, it makes our lives better. We can do things and operate in more effective ways, such as having a ride-sharing app figure out where we’re going to need a ride. Or the ability for a streaming service to provide recommendations as to movies that we’re going to like. Or finally, and more importantly, the ability for a hospital to spot a fatal disease early and suggest an effective treatment, ultimately saving a life. So, where did this come from? Well, we’ve had artificial intelligence around for decades; it’s something that has been a part of computing since shortly after there was computing. It’s the idea of having computing systems that think like people, that are able to gain knowledge through experience, and that are able to react to that knowledge. Machine learning is really a business instance of that: the ability to deal with transaction-oriented business applications and do so in very sophisticated and thinking ways.
Machine learning is a sub-component of artificial intelligence, but ultimately it’s more related to the way in which we use learning-based systems in business. And that’s going to be the primary focus of this tutorial. Then we have the concept of deep learning: the ability to have a very deep understanding of something, to analyze petabytes of information, and to have insights into that information that humans typically can’t have, because we just don’t possess the capacity for absorbing and dealing with that much data. And then where do we go from here? Obviously, the jumping-off point for deep learning, machine learning, and artificial intelligence is the need to provide business with the ability to do things better and faster, and the ability to serve the human race with better compute services that are able to adapt to our needs.
We’re going to have a lot of technology that’s going to spring from this. Keep in mind, no matter what we talk about in this article in terms of machine learning, there is always going to be a basic process. We have data that’s inputted into a learning system, and that could be petabytes of information from a structured and unstructured database, it could be information being inputted directly from a user interface, but some way, somehow, patterns of data, or data itself, is getting into the learning system. Then we come out with a conclusion, or in other words, the ability to look at the learning-based system that we have and look at the information that’s fed into it, then make conclusions based upon the information. You know, such as the ability to recommend movies from your favorite streaming service, or the ability to recommend discounts from your favorite online shopping place. The ability to, in essence, look at all kinds of anecdotal information, lots of details, and discern conclusions that we can use in a business process, and that’s called the output.
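The basic process described above (data goes in, a learning system finds patterns, a conclusion comes out) can be sketched in a few lines of Python. This is only a minimal illustration and not how any real streaming service works: the viewing histories, the function names `learn` and `recommend`, and the simple co-occurrence counting are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical input: each set holds the titles one viewer watched.
HISTORIES = [
    {"Alien", "Blade Runner", "Arrival"},
    {"Alien", "Blade Runner"},
    {"Arrival", "Blade Runner", "Interstellar"},
    {"Notting Hill", "Amelie"},
]

def learn(histories):
    """Learning step: count how often two titles share a viewer."""
    seen_together = defaultdict(Counter)
    for titles in histories:
        for a, b in permutations(titles, 2):
            seen_together[a][b] += 1
    return seen_together

def recommend(model, watched, top_n=2):
    """Output step: rank unseen titles by how often they co-occur with watched ones."""
    scores = Counter()
    for title in watched:
        for other, count in model[title].items():
            if other not in watched:
                scores[other] += count
    # Sort by score (highest first), then by title so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]

model = learn(HISTORIES)
print(recommend(model, {"Alien"}))  # ['Blade Runner', 'Arrival']
```

The input is the viewing data, the model is the table of co-occurrence counts, and the output is the ranked list of suggestions.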
Knowledge Creation
Part of understanding machine learning is understanding the building blocks and types of machine learning that’s out there. Ultimately, we have a few things that machine learning systems have in common. First, we’re ultimately looking at experiences or stimulus that are coming into the machine learning based system, that could be patterns of data, it could be interacting with humans, it could be anything else that’s causing us to discern patterns. Ultimately, we have to store these experiences in a knowledge base, and so once we create a model in terms of how we’re going to store the information around the knowledge we’re acquiring, the ability to take those experiences and store them in a structured way is a fundamental capability of machine learning based systems.
Then we have to find patterns there. Ultimately, we’re looking at lots of information to determine intelligence about the information, and typically, humans and business systems are going to ask machine learning based systems to find patterns that are meaningful in how we’re going to deal with the data. And then the ability to input the data, the ability to look at where the information resides, and the ability to hook up your machine learning based system to lots of different data sources, they could be data streaming, they could be massive databases, it could be real time systems that are connected to information, technology such as IoT, Internet of Things.
Ultimately, it’s about binding these things together so we’re able to take the data, find the patterns, draw on the experiences, and react to the experiences, which really provides the common capabilities of machine learning. The basic types of machine learning are supervised learning, which is the ability to have variables assigned to the data and to learn from those variables; unsupervised learning, where we’re determining clusters or patterns in the data without having the variables understood; and reinforcement learning, where we’re responding true or false, yes or no, to certain outcomes of the machine learning based system and having the system learn from that. It’s much like how we learn when we’re children: we take the feedback coming from people who are telling us whether we’re doing things right or wrong, and treat it as a learning experience that we can adjust for in the future. So, if we look at supervised learning, it’s task-driven: ultimately, what is the next value that we’re going to find within this system?
By assigning values to particular patterns of data, we can discern from those patterns what the next value’s going to be. Typically, the target variables are known in a supervised learning system. Unsupervised means that we’re data-driven: we’re identifying clusters or patterns within the data, and the target variables are typically not known. And honestly, unsupervised learning is going to be more valuable to businesses because we’re not requiring that every data point have known conclusions, known variables, in the system. It’s able to take lots of information and discern patterns that we can apply to particular business cases. Then there’s reinforcement learning, the ability to react to environments, such as learning from mistakes. As we mentioned before, it’s like a child reacting to training from supervisors or parents in terms of what constitutes good behavior and what constitutes bad behavior.
Another angle of this would be supervised learning is typically dealing with labeled data, or data that includes labeling information as well as direct feedback, in other words, we’re applying feedback based on the information, based on the outcomes, the true or false aspects of the data, and by doing that we’re able to predict future outcomes. Unsupervised learning typically has no labels, ultimately, we’re not giving it feedback, it’s typically a black box type system, and we’re able to find hidden structure within the data, hidden patterns that we’re able to discern and apply to certain situations. Reinforcement learning is a decision process, and ultimately, it’s dealing with a reward system, we’re either right or we’re wrong, and the ability to train a reinforcement learning system through this interaction is going to be its strong point, because in essence we’re dealing with it as if we’re dealing with a human, and we’re able to learn a series of actions or the ability to figure out correct outcomes based on information that’s provided.
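To make the difference between labeled and unlabeled data concrete, here is a small sketch covering the first two types only. The data values are invented for illustration, and the two techniques (a nearest-centroid classifier and a tiny one-dimensional k-means) are stand-ins chosen for brevity; they are assumptions of this example and not what SageMaker uses internally.

```python
from statistics import mean

# --- Supervised: every training example carries a known label. ---
# Hypothetical labeled data: (transaction amount, label).
LABELED = [(12.0, "normal"), (15.0, "normal"), (9.0, "normal"),
           (950.0, "fraud"), (1200.0, "fraud")]

def train_centroids(examples):
    """Learn one average value per label (a nearest-centroid classifier)."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def predict(centroids, value):
    """Predict the label whose centroid is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

# --- Unsupervised: the same kind of data, but with no labels at all. ---
UNLABELED = [12.0, 15.0, 9.0, 950.0, 1200.0, 11.0]

def two_means(values, rounds=10):
    """Split values into two clusters with a tiny 1-D k-means (k=2)."""
    low, high = min(values), max(values)  # initial cluster centers
    for _ in range(rounds):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(near_low), mean(near_high)
    return sorted(near_low), sorted(near_high)

centroids = train_centroids(LABELED)
print(predict(centroids, 20.0))   # normal
print(two_means(UNLABELED))       # ([9.0, 11.0, 12.0, 15.0], [950.0, 1200.0])
```

The supervised half can name its answer ("normal" or "fraud") because the labels were given up front. The unsupervised half only discovers that there are two groups; deciding what those groups mean is left to us.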
AI and ML Applications
Ultimately, when we’re building machine learning systems, we’re trying to meet common requirements. We’re looking at what patterns we’re looking for and how they can be leveraged. For example, the ability to determine market trends within marketing data or sales information that’s within the enterprise. What patterns are occurring there that we need to understand, and how can we automate the use of those patterns to deal with different data and different business systems, and to leverage the information as a strategic advantage when looking at how we’re investing in products, services, and people? Ultimately, it’s also the ability to deal with dynamic data and when that data’s going to be available.
Machine learning systems are valuable not because they’re looking at a static amount of information, but because they’re looking at data that’s always going to be changing, and because of their ability to adapt to the changes that are occurring in the data. Automation is a core capability that needs to be in here. Ultimately, what we’re doing is not only understanding information and coming to conclusions about the data that’s input into machine learning based systems, but also automating business processes in the back end so we can take corrective action or tune the system based on the intelligence that we have. We need to understand that data has to be available, and when it’s available to the right system at the right time, it can be invaluable. Ultimately, machine learning systems are good not only at making conclusions, but at delivering those conclusions to business processes that are able to leverage them as a strategic advantage. Then, ultimately, there’s the ability to externalize that to applications that are benefiting from this knowledge.
Not only benefiting from the data that’s there, which they all should have access to, but benefiting from the conclusions that the machine learning based systems are making based on that information. We have machine learning, and of course we have supervised learning, unsupervised learning, and reinforcement learning. Then under there, we have certain types of business applications, which line up with these three categories of machine learning. For example, for supervised learning: image classification, customer retention, fraud detection, and diagnostics. For reinforcement learning: robot navigation and game AI. And for unsupervised learning: recommendation systems, like you would see when you’re using a web browser and it recommends things to you that you may be interested in; targeted marketing, using a similar kind of technology; and customer segmentation, the ability to understand demographics and how to sell things in certain ways. So the idea is that we have machine learning, which is a base set of technology, the various applications that can be solved using machine learning based technology, and the different types of machine learning that you use to solve the business problem: supervised, unsupervised, and reinforcement.

AI and Cloud Computing
Ultimately every public cloud out there, including AWS (Amazon Web Services), Google, Microsoft, Alibaba, and IBM, has machine-learning components. This is a huge inflection point for growth going forward. And the ability of those clouds to support machine learning-based systems is going to go directly to the success of their users in leveraging the cloud service to reach a business end. So, going forward, we’ll see everybody support not just a single machine learning-based system, but typically multiple machine learning-based systems.
It’s a very complex array of technology dealing with deep learning, machine learning, AI, IoT-based systems, and edge computing-based systems; basically, it’s an explosion of technology currently. And the growth is going to be exploding as well. If you look at this graph, which is an estimation of the growth of machine learning versus deep learning versus other types of AI, we’re here in 2020, where we’re getting close to $10 billion in sales, going up to $20 billion in sales in 2021, and past $60 billion in sales in 2022, just for machine learning-based systems.
Ultimately what this is depicting is that everything we’re learning in this guide is going to be exploding in terms of its use: the ability to leverage machine learning as a strategic advantage for businesses, and leveraging cloud as the way to get to machine learning, because it makes things more efficient. We typically pay by the drink. It doesn’t cost that much in hardware and software, because we’re not paying for hardware and software; we’re just paying per use of the system. It’s on-demand and it can scale. We have all the capabilities that are in there, and that’s setting up the explosion that we’re seeing right now in the marketplace.
AI and AWS
Looking specifically at the Amazon Web Services offerings around just the machine learning based services, they separate things into three levels: Application Services, Platform Services, and Frameworks and Hardware. If you look at the top level, it’s really about dealing with Vision, Speech, and Language, or the ability to absorb information into machine learning-based systems. Chatbots, for example, or Amazon Echo and Alexa: the ability to communicate using natural language processing, or to communicate directly with your AI-based system using methods other than simple data entry. But obviously the ability to bring information or data into those systems is going to be critical as well. Platform Services such as Amazon SageMaker, AWS DeepLens, Amazon Machine Learning, Spark EMR, and Mechanical Turk are designed to provide you with the ability to create the learning models, the knowledge engines, and the core components of machine learning. Integration with the various systems you see in Frameworks and Hardware is also on the critical path, including the ability to deal with TensorFlow, Gluon, and Apache MXNet.
All the things you see on the bottom line are the base-level, primitive services that these Platform Services leverage in order to create the machines, store the systems, and get at the databases that they need to feed the machine learning systems with the knowledge and information they’re going to leverage to make these critical decisions. Let’s take the very complicated slide that we just saw and boil it down to its simple components. First and foremost, the major machine learning system on Amazon Web Services is SageMaker. While there are other systems, either outside software that runs on Amazon Web Services or supporting software that may support SageMaker and other subsystems, ultimately this is the primary way of creating machine learning-based systems on Amazon Web Services.
Ground Truth supports the ability to get data into the proper format and to get training data up and running, and you’ll find, as you get into machine learning based systems, that your ability to get the training data correct, and therefore to train the model, is going to be on the critical path. Additionally, there are frameworks, there’s infrastructure such as storage and compute, and there are even special services like IoT, DeepLens, or video analysis that are available to you as well.
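To give a feel for what asking SageMaker to train a model involves, here is a sketch that only assembles the request for a training job; it does not contact AWS. The field names follow the SageMaker `CreateTrainingJob` API as I understand it. The helper name `build_training_job_request`, the instance type, the volume size, the time limit, and every ARN, image URI, and bucket path are hypothetical placeholders rather than values from this course.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri,
                               instance_type="ml.m5.large"):
    """Assemble a request body shaped like SageMaker's CreateTrainingJob input."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,    # container that holds the algorithm
            "TrainingInputMode": "File",   # copy training data to the instance first
        },
        "RoleArn": role_arn,               # IAM role SageMaker assumes for the job
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

# Every value below is a hypothetical placeholder, not a real resource.
request = build_training_job_request(
    job_name="demo-training-job",
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/<algorithm-image>:latest",
    role_arn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    train_s3_uri="s3://<your-bucket>/train/",
    output_s3_uri="s3://<your-bucket>/output/",
)
print(sorted(request))
```

With AWS credentials configured and real values filled in, a request like this could be submitted with `boto3.client("sagemaker").create_training_job(**request)`. That call is deliberately left out here so the sketch runs without an AWS account.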

Real World AI Applications and Use Cases
Healthcare
Okay now let’s look at some machine learning use cases. And we’ll first look at healthcare. So consistent with the requirements among all the use cases we have the method of input, we have the model we’re going to leverage, and output. Input, we’re going to bring in patient data, or data around a patient much like yourself, the ability to deal with respiration, the ability to deal with heart rate, blood pressure, perhaps blood work, all kinds of information related to the current health of the patient. And also any treatment that that patient may have undergone. For the model, we’re going to look at patterns of success in terms of treatment, and patterns of failure in terms of what things are failing. The ability for the patient to get better or worse. Output, we’re going to look at diagnostics, and ultimately the likely outcomes of the system.
How can we do a better job at diagnosing the patient with the problems that he or she has, and recommend treatments that increase the likelihood of a successful outcome? The solution looks like this. We have data that’s ingested, typically patient data from the clinical systems, or perhaps from wearable devices like a Fitbit. We leverage an unsupervised machine learning system, with the ability to take massive amounts of data and discern patterns in them. There’s the ability to automate the use of the information. And there’s the ability to link actions to outcomes: to look at, for example, what’s succeeding and what’s not, and make sure that we follow the patterns of success and not the patterns of failure. And then there’s the ability to communicate with clinical applications and interfaces, both to receive information coming into the machine learning-based system and, more importantly, to communicate things out to the clinicians and doctors, so they can get back to the patient with recommended treatments that will likely solve any issues that he or she has.
The results of this are early detection: we’re able to reduce the amount of time it takes to find health issues in people by looking at lots of information and understanding how that information relates to other information. For example, tens of thousands of other patients who have suffered the same disease, with the same sort of characteristics that we’re seeing in this patient. There’s also ongoing outcome-based learning: the ability to consider success or failure, to look at what’s working and what’s not, and to constantly adjust the knowledge base in terms of what is a desirable outcome and which data is leading up to a desirable outcome. And the result of this was a 40% increase in survivorship. This is a real-world situation where we’re able to leverage machine learning as a force multiplier, in this case to do better at providing healthcare, which has a net benefit to individuals like me and you.
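The ongoing, outcome-based part of this use case can be illustrated with a very small feedback loop. This is only a sketch of the idea of linking actions to outcomes: it uses plain success counts rather than the unsupervised model described above, the class name `OutcomeTracker` and the condition and treatment labels are hypothetical, and nothing here is clinical guidance.

```python
from collections import defaultdict

class OutcomeTracker:
    """Keeps success counts per (condition, treatment) and updates them as outcomes arrive."""

    def __init__(self):
        # (condition, treatment) -> [successes, total cases]
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, condition, treatment, improved):
        """Feed one observed outcome back into the knowledge base."""
        entry = self.counts[(condition, treatment)]
        if improved:
            entry[0] += 1
        entry[1] += 1

    def success_rate(self, condition, treatment):
        """Observed share of cases that improved (0.0 if never seen)."""
        successes, total = self.counts.get((condition, treatment), (0, 0))
        return successes / total if total else 0.0

    def suggest(self, condition):
        """Return the treatment with the best observed success rate, if any."""
        options = [t for (c, t) in self.counts if c == condition]
        if not options:
            return None
        return max(options, key=lambda t: self.success_rate(condition, t))

tracker = OutcomeTracker()
# Hypothetical past outcomes: (condition, treatment, did the patient improve?)
for cond, treat, ok in [
    ("condition-x", "treatment-a", True),
    ("condition-x", "treatment-a", False),
    ("condition-x", "treatment-b", True),
    ("condition-x", "treatment-b", True),
    ("condition-x", "treatment-b", False),
]:
    tracker.record(cond, treat, ok)

print(tracker.suggest("condition-x"))  # treatment-b
```

Each new outcome recorded with `record` shifts the rates, so the suggestion can change over time. That is the ongoing adjustment the use case depends on.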
Finance
Let’s look at another use case, in this case finance. Again we have input, model, and output. The input is going to be transaction data and fraud data, or the ability to look at a massive amount of transaction data that occurs in the daily processing, within a bank for instance, and the ability to look at aspects of that data which are declared fraudulent. The model would be looking at patterns of fraud, and the output would be ultimately detection of fraud and the ability to prevent it in a proactive way.
The solution starts with ingesting the data that we need, in this case transactional data and fraud data. Here we’re going to use an unsupervised machine learning-based system, because we’re dealing with massive amounts of information that typically isn’t tagged as fraudulent or non-fraudulent, and we need to find the fraudulent data within the patterns of the information. Then there’s the ability to leverage automation, to put that data to work in proactive and useful ways; the ability to look at the outcomes of those situations and how we’re going to leverage them; and ultimately the ability to push them back into the applications, in this case banking applications, in whatever way we want to solve the fraud problem, both proactively and reactively. The results ultimately lead to the ability to detect fraud in a reactive and proactive way, so we know how to spot it before it’s going to happen, as well as while it’s happening.
Ultimately we’re dealing with outcome-based learning, and this has to be ongoing; it has to be something that’s systemic to what we do. The result is that 60% of the transactions that would typically result in fraud no longer do. That’s because we’re able to detect fraud as it occurs, and because we’re able to proactively avoid it by putting up governance or security limitations around people defrauding the system. That 60% reduction in fraud, in the case of banks, means millions and millions of dollars saved.
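Because the transactions here are not tagged, a simple way to picture the unsupervised step is outlier detection: let the data define what "typical" looks like and flag whatever sits far from it. The sketch below uses a modified z-score based on the median absolute deviation, with the commonly used 3.5 cutoff. The technique, the function name `flag_unusual`, and the sample amounts are assumptions made for illustration, not the method used in the case described above.

```python
from statistics import median

# Hypothetical, unlabeled card transactions (amounts in dollars).
AMOUNTS = [42.0, 38.5, 40.0, 41.2, 39.9, 43.1, 950.0, 40.7]

def flag_unusual(amounts, threshold=3.5):
    """Return (index, amount) pairs whose modified z-score exceeds the threshold."""
    center = median(amounts)
    deviations = [abs(a - center) for a in amounts]
    mad = median(deviations)  # median absolute deviation
    if mad == 0:
        return []  # at least half the values are identical; the score is undefined
    flagged = []
    for index, (amount, deviation) in enumerate(zip(amounts, deviations)):
        score = 0.6745 * deviation / mad
        if score > threshold:
            flagged.append((index, amount))
    return flagged

print(flag_unusual(AMOUNTS))  # [(6, 950.0)]
```

No transaction was ever labeled as fraud; the $950 charge stands out only because it does not fit the pattern of the rest. A real system would look at many more signals than the amount.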
Law Enforcement
Now let’s look at a machine learning use case for law enforcement. The requirements here are input, model, output. In this case the input would be crime data. In other words, what’s going on out there in terms of crime? Where? Who? How much? What type of crime? The model covers the patterns of crime, modeled in such a way that we’re able to deal with detection, typically proactive detection, and with preventing crime. That includes the ability to look at crime in progress, as a reaction to something, or to look proactively at how we’re going to approach the issue. So, the solution starts with data ingestion, or the ability to get the crime data into the system. And this time we’re using a supervised machine learning system, because the data is going to be tagged with the variables.
We’re going to know a lot more about the crime information, in terms of who carried it out, how it was done, and where it occurred, than we would with anecdotal information like the massive amounts of sales data or transaction data that would require an unsupervised machine learning system. Supervised is an option here, and we’re going to leverage that. Then there’s automation, or the ability to take the information and put it back to work for us. For example, if we’re figuring out patterns of crime that are occurring in certain areas versus other areas, we can make sure that the police cars are in those areas, proactively fighting crime there, and therefore the crime rate should go down. Then there’s the ability to link actions to outcomes.
Ultimately, we’re putting the things that we’re doing to prevent crime back into the system. How well is that actually carrying out the objective of reducing the crime rate? Then, ultimately, there’s the ability to get this back into the hands of the police and the people who are enforcing the law, so they can act upon the information in a proactive way. For example, mobile applications could tell police where to go and what to do, based on patterns in past data run through a machine learning based system that makes an educated guess as to where crimes are likely to occur.
The result is detection and proactive prevention of crime. Ultimately, we’re dealing with outcome-based learning, which has to be ongoing. As we get outcomes from this, they go back into the model, which looks at what worked and what didn’t when we ran that model, and it’s able to adjust accordingly. Ultimately, this resulted in a 30% reduction in crime, which is really what machine learning is there to do. It’s there to augment the traditional thinking that’s coming from human beings with thinking that’s coming from machine-based systems. And in this case, we’re able to reduce crime by 30% and save a lot of people a lot of trouble.
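The automation step in this use case, putting police cars where the patterns say crime is concentrated, can be sketched with simple arithmetic. The example below is a plain frequency count standing in for the supervised model, followed by a largest-remainder split of the available cars. The district names, incident records, and the function name `allocate_patrols` are hypothetical.

```python
from collections import Counter

# Hypothetical tagged incident records: (district, crime type).
INCIDENTS = [
    ("north", "burglary"), ("north", "theft"), ("north", "theft"),
    ("north", "assault"), ("north", "burglary"),
    ("east", "theft"), ("east", "vandalism"), ("east", "theft"),
    ("south", "burglary"), ("south", "theft"),
]

def allocate_patrols(incidents, cars):
    """Split patrol cars across districts in proportion to past incidents."""
    counts = Counter(district for district, _ in incidents)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Ideal (fractional) share of cars for each district.
    shares = {d: cars * n / total for d, n in counts.items()}
    # Give every district the whole-number part of its share first.
    plan = {d: int(share) for d, share in shares.items()}
    leftover = cars - sum(plan.values())
    # Hand remaining cars to the largest fractional remainders (ties by name).
    by_remainder = sorted(shares, key=lambda d: (-(shares[d] - plan[d]), d))
    for district in by_remainder[:leftover]:
        plan[district] += 1
    return plan

print(allocate_patrols(INCIDENTS, cars=4))  # {'north': 2, 'east': 1, 'south': 1}
```

In an outcome-based loop, the incidents recorded after each deployment would be appended to the data and the allocation recomputed.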
Manufacturing
OK, another machine-learning-based application, this time for manufacturing. Again we have input, model, and output. In this case the input would be factory data, or information coming from the factory in terms of production, perhaps information coming from robots that build products, anything that’s pertinent. There’s also reject data, the ability to figure out where there are quality issues that cause products to be rejected, either by the quality control inspectors or by the customers themselves. For the model, we’re looking holistically at factory behavior: how the factory is behaving, tracking that behavior, and figuring out patterns. And the output would be reject prevention, the ability to increase quality in production, and then proactive training and maintenance, or the ability to change the things that are leading up to the rejects so we can get fewer rejects and make more money.
For the solution, we’re ingesting the data, in this case from the factory, with reinforcement learning. We’re automating the various systems so we’re feeding this back directly to where it can be most effective. In other words, if it goes to a line manager, they understand where the rejects and the quality issues are occurring, perhaps down to a person, or perhaps down to a machine or a robot, and they have the ability to change those things proactively on an ongoing basis. There’s the ability to identify the rejects and the patterns of the rejects. And then finally there’s the ability to externalize this through manufacturing applications that can be of use to people who are working in the factory. For example, factory managers may carry around mobile devices and be alerted when the machine-learning-based system is finding rejects and patterns that are leading up to rejects, so they can proactively solve the issue before the rejects actually become a problem. Now notice we’re using reinforcement learning here, instead of unsupervised or supervised learning. The reason is that, if we’re going to figure out what’s causing the rejects, we’re going to be feeding that information back into the system.
In other words, if we’re producing a product, and that product ends up being a reject, then we’re telling the machine-learning-based system that this was a reject, and what led up to the reject. It’s learning by having that information put back into it: in this case, no, you made a mistake, versus yes, it’s not a mistake. And it’s able to adjust its machine-learning-based algorithms through this interactive process, very much like we would learn as human beings. The result is reject origin identification: in other words, we know what’s causing rejects to occur in terms of actual individuals, machines, and perhaps even patterns. We know what to look for in terms of things or behaviors that are going to lead up to an abnormal number of rejects, and therefore we have the ability to proactively solve those problems.
Then there’s outcome-based learning, and the ability to do this on an ongoing basis. As we’re feeding information back into the machine-learning-based system, it’s correcting the information that’s coming out of it, and that keeps flowing back into the system, so it’s always getting smarter and always able to adjust itself based on an ongoing learning experience that goes on for a long time. In this case there was a 75% reduction in rejects and returns. A successful outcome.
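The yes/no feedback loop described for the factory can be sketched as an epsilon-greedy bandit, one of the simplest forms of reinforcement learning. This is an assumed stand-in for illustration, not the system from the use case: the three machine settings, their good-part rates, the seed, and the function name `tune_line` are all hypothetical, and the simulated inspection replaces real factory feedback.

```python
import random

def tune_line(good_rate, trials=3000, epsilon=0.1, seed=0):
    """Epsilon-greedy loop: try machine settings and learn from accept/reject feedback."""
    rng = random.Random(seed)
    settings = list(good_rate)
    pulls = {s: 0 for s in settings}
    value = {s: 0.0 for s in settings}  # running share of accepted parts

    for _ in range(trials):
        if rng.random() < epsilon:
            choice = rng.choice(settings)          # explore: try any setting
        else:
            choice = max(settings, key=value.get)  # exploit: best setting so far
        # Simulated inspection feedback: 1 = part accepted, 0 = part rejected.
        reward = 1 if rng.random() < good_rate[choice] else 0
        pulls[choice] += 1
        # Incremental average keeps the estimate current without storing history.
        value[choice] += (reward - value[choice]) / pulls[choice]
    return value, pulls

# Hypothetical share of good parts produced under three machine settings.
GOOD_RATE = {"setting-a": 0.60, "setting-b": 0.80, "setting-c": 0.98}

value, pulls = tune_line(GOOD_RATE)
best = max(value, key=value.get)
print(best, pulls)
```

The loop is never told which setting is best. It only hears "accepted" or "rejected" after each part, and over many parts it shifts production toward the setting with the fewest rejects while still occasionally testing the others.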
Education
Now let’s look at an education use case for machine learning. Again we have input, model, and output. In this case the input would be student information, or information about a student and their grades. This would obviously have a one-to-many relationship: one student, with many grades that we’re tracking. The model covers student performance, and the output would be failure prevention: the ability to look at ways in which we’re going to reduce the failure rate of students, the drop-out rate and the flunk-out rate, and the ability to proactively select and engage in early intervention with students who are getting off track.
Again, we’re ingesting data. In this case we’re using the type of machine learning which is known as supervised learning, or the ability to, in essence, track information which is tagged with the appropriate conclusions or variables, in this case we know the grades that the students received. So we’re not tracking massive amounts of information where we have to deal with clusters, where we would leverage unsupervised learning, but we’re dealing with known data that’s fairly well structured and we know the patterns and we know what’s occurring. We’re dealing with automation of this information, the ability to feed it back into systems so we can take proactive and immediate action on students that are getting off track. Also, the ability to understand patterns of failure. What’s leading students to drop out? What’s leading students to flunk out? And the ability to link all this information to the admissions and notification to professors so they’re able to engage early in getting troubled students back on track, who may be going a bit off course.
In other words, increase the graduation rate, increase the effectiveness of the education system. The results are proactive student management. The ability to, in essence, leverage machine learning to figure out when students are likely to get in trouble. Typically before they get in trouble. The ability to look at variables like attendance. The ability to look at variables like financial issues. The ability to look at variables such as grade patterns that seem to be declining. And intervene and get with the student so they can improve their capability to learn, improve their study habits. And have people help them along the process of education, and do so before they get in trouble. By the time they get in trouble it may be too late.
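To make the supervised learning idea a bit more concrete, here is a minimal sketch that trains a tiny logistic regression on labeled student records and then scores two new students. Everything specific in it is an assumption made for illustration: the three features (attendance rate, grade trend, financial hold), the eight sample rows, and the learning rate. A real system would train on far more data, most likely with a managed algorithm rather than hand-written gradient descent.

```python
import math

# Hypothetical labeled records:
# (attendance_rate, grade_trend, has_financial_hold) -> dropped_out (1) or not (0).
# grade_trend is the change in GPA over the last term (negative means declining).
TRAINING_DATA = [
    ((0.95,  0.2, 0), 0),
    ((0.90,  0.0, 0), 0),
    ((0.85,  0.1, 1), 0),
    ((0.60, -0.5, 1), 1),
    ((0.55, -0.3, 0), 1),
    ((0.40, -0.8, 1), 1),
    ((0.92, -0.1, 0), 0),
    ((0.50, -0.6, 1), 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=2000, lr=0.5):
    """Fit logistic regression weights with plain batch gradient descent."""
    n_features = len(data[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in data:
            predicted = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            error = predicted - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - lr * g / len(data) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(data)
    return weights, bias

def dropout_risk(model, features):
    """Estimated probability (0 to 1) that a student leaves for academic reasons."""
    weights, bias = model
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

model = train(TRAINING_DATA)
at_risk = dropout_risk(model, (0.45, -0.7, 1))   # low attendance, falling grades
on_track = dropout_risk(model, (0.93, 0.1, 0))   # high attendance, stable grades
print(f"at-risk student: {at_risk:.2f}, on-track student: {on_track:.2f}")
```

The point of the sketch is the shape of the problem: known outcomes go in as labels, and the score that comes out is what would trigger an early intervention.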
Again, this is outcome-based learning, which is going to be ongoing in this system. All of the things that are succeeding or not succeeding are going back into the machine learning based system. So we may have intervened with a dozen or so students. We may have succeeded in keeping six of them from dropping out or flunking out. However, the other six did. The ability to look at the patterns of failure and the patterns of success within that population and then put that back into the machine learning system is imperative. Ultimately the results were pretty good: a 15% reduction in students leaving the university for academic reasons. And so we’re able to look at this education-based use case as the ability to, in essence, not necessarily get to the bottom line of profitability but get to the bottom line of the effectiveness of the education system.
AI Application
Requirement
Now let’s consider what it takes to build a machine learning based application. There are several steps: it’s around requirements, it’s around design, it’s around building, it’s around training the models, and then deploying the models into production. The first step is really about determining what the requirements are for your machine learning based application, and this is important because if we don’t understand the requirements, or what the application or the model is designed to do, then it is going to be very difficult for us to meet the needs of the business. So we have to ask ourselves a few questions. Why are we building this model? What problems are we looking to solve? What applications are going to use it? There’s typically going to be more than one application connecting to the machine learning based model.
How are we going to leverage the training data? Where’s it going to be generated? And then how are we going to operate the model in production? In other words, the long tail of this, or the success or failure of our machine learning based application, is the ability to operate these things long term. And then we also need to consider the very important aspects of security, the ability to make sure that our machine learning model isn’t vulnerable and isn’t going to be compromised in some way, as well as governance, meaning that we’re putting guard rails around the utilization of the machine learning model. The basics we’re dealing with are first the data. We have to define what that is and what it does.
Now this is separate and distinct from the ability to define the behavior. We have to remember that we have behavior on top of data or ways in which we’re interacting with the data. Remember, we’re in essence creating human like intelligence with the machine learning model which is going to be trained by the information. It’s going to determine the behavior of the model. We need to figure out the knowledge and how we’re going to build knowledge over time. In other words, our machine learning model is going to get experience from the training data as well as interaction with the core systems or any way you want to train it over time and it’s able to store that information to be retrieved, very much like we do with our own brain neurons. The ability to bring things up and process it based on experience. And then how we’re going to interact with applications. Obviously we’re building machine learning based systems to support machine learning based applications. Those can be business analytics, it could be inventory control applications, they could be mobile applications. Anything that needs to leverage machine learning system to in essence access knowledge and experience which is combined with the data.
Design
Once we understand the requirements, now we move on to design of our machine learning system. A couple of things to remember is this is ultimately us looking at the requirements, and then creating the design or the basic structure of how the system’s going to work. We need to understand that we’re dealing with data, we’re defining the learning model, we’re picking security systems, we’re picking governance systems, and the application of governance through policies, we’re looking at the technology that’s a best fit for what we’re doing, and ultimately coming up with a testing framework, whether that’s going to be real-time testing, penetration testing, security testing, performance testing. We have to test all aspects of our machine learning based system. Then figuring out how we’re going to push this into production. In other words, how deployment is going to occur.
Are we going to do it continuously, where we’re pushing out new versions of the knowledge base as they are updated, or is this something that’s going to be released in discrete versions? So the basics are, we focus initially on the data design. Keep in mind that all machine learning based systems are going to utilize either existing databases, from which we abstract training data, or training data that’s created specifically for the machine learning application. How is that data going to be designed? How are you going to populate that database with training data? You need to design the knowledge base. In other words, how we’re going to structure it, in terms of its ability to think about the things that we’re asking it to consider. Ultimately, this is about the ability to build experiences over time, and react to those experiences in real-time ways. The way in which we design the knowledge base, or design the brain of the system, is absolutely critical to achieving the goal of having the system think correctly.
We have to think about how we’re dealing with training, the ability to provide the training data that’s needed, but also different ways and approaches as to how the thing’s going to gather experience over time. How the knowledge base is going to be built over time. Then, what infrastructure are we going to leverage? If we’re running on the cloud, we’re going to need storage systems, we’re going to need compute systems, we’re going to need CPUs, GPUs. Other things that are really on the shopping list for us to run our machine learning based systems. What are those systems? How do they run? And finally, how we’re going to structure the applications for accessing the machine learning based systems. In other words, what interfaces are we going to leverage? If we’re building a machine learning based system that many different applications are going to utilize, then how are we interfacing, how are we maintaining security, how are we maintaining governance? Basically, how we’re allowing the application to interact with our machine learning brain, as well as the data, to return value to those applications directly.
Build
Once we define the requirements and have done the design, we need to determine how we’re going to build the application. So, ultimately this is asking yourself a few core questions. First and foremost, how we’re going to configure the system, what enabling technology we’re going to use, what’s our approach to be? We need to figure out how we’re going to provide behavior to the machine learning-based systems in terms of programming, perhaps programming languages. Are we going to leverage Python, JavaScript, something else? We need to figure out how the data is going to be physically implemented, how it’s going to be structured, and we need to figure out how we’re going to test the environment, how we’re going to test the machine learning-based systems and knowledge base as well as tune it.
The ability to adjust it or fine-tune it as to what it needs to do during operations. And then if we’re a modern enterprise, we typically have DevOps around. The ability to fit it into a DevOps toolchain needs to be done at this point. We have to deal with the core issue of the machine learning system architecture. What enabling technology are we going to leverage? The logical architecture and the physical architecture. The logical structure is how we’re going to set the thing up, and then we assign it to particular technologies, such as SageMaker, SageMaker Ground Truth, and other subsystems in the case of Amazon Web Services. We need to look at how we’re going to structure the information and how we’re going to design the system. In many instances, we’re extracting data from existing databases that have their own structure, but we’re also leveraging training databases that we need to create. We need to figure out how we’re going to build the actual knowledge base, the knowledge model.
How it’s going to be structured and how it’s going to function. Ultimately, how we’re going to deal with adding behavior. In other words, how we’re going to program the system to carry out certain tasks. Also, how we’re going to interact with outside applications. How these applications are going to access the services of our machine learning-based system and how we’re going to manage the access to those services. Then ultimately how we’re going to structure tests, either automated or not. The ability to test the knowledge base to ensure that it’s providing good knowledge training and the ability to externalize that knowledge to the attached applications.
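One way to picture the automated testing mentioned above is an acceptance check that a DevOps pipeline could run before a model is promoted. The sketch below is only an illustration: the held-out examples, the 80% accuracy bar, and the `model_predict` stand-in are all hypothetical, and a real test would call the trained model or its deployed service instead.

```python
# Hypothetical held-out examples: (input features, expected label).
HELD_OUT = [
    ((0.95, 0.2), "ok"),
    ((0.40, -0.8), "at_risk"),
    ((0.88, 0.0), "ok"),
    ((0.55, -0.4), "at_risk"),
    ((0.70, -0.1), "ok"),
]

def model_predict(features):
    """Stand-in for the trained model; a real test would invoke the actual model."""
    attendance, grade_trend = features
    return "at_risk" if attendance < 0.65 or grade_trend < -0.3 else "ok"

def evaluate(predict, examples):
    """Fraction of held-out examples that the model labels correctly."""
    correct = sum(predict(features) == expected for features, expected in examples)
    return correct / len(examples)

def check_model(predict, examples, min_accuracy=0.8):
    """Raise if accuracy falls below the bar, so a pipeline step can fail the build."""
    accuracy = evaluate(predict, examples)
    if accuracy < min_accuracy:
        raise AssertionError(
            f"accuracy {accuracy:.2f} is below the required {min_accuracy:.2f}"
        )
    return accuracy

print(check_model(model_predict, HELD_OUT))
```

Keeping the check as a plain function makes it easy to run the same test by hand during the build and automatically later in the toolchain.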
Train
The next step in building our machine learning application and one that’s unique to machine learning is the ability to train the models that we defined. Keep in mind that we’re typically dealing with raw data. We’re not creating net new information, we’re training the models with existing scads of information. Huge amounts of information that are already in the business that we’re able to leverage to train the knowledge base. So it has experience from existing data. It can discern patterns. Either supervised or unsupervised. Out of that raw data, we need to create training data. The ability to, in essence, bring information into the model that’ll allow the model to gain experience and either we may have to tag the data, in some instances, it may be predefined from some data services but a way in which we can get information into the knowledge base and have it understand more about the world around it based on the information that we feed it. Ultimately, it’s not just creating the model but also tuning the model.
We have to understand that we’re going to get erroneous information out of the machine learning-based knowledge base initially and that’s to be understood and so our ability to adjust certain parameters so it’s able to do its job better is a matter of tuning. And ultimately, we have to undergo testing as well. We have to do security testing, we have to do performance testing, we have the ability to, in essence, ensure that the machine learning-based system is able to live up to expectations. The basics look like this. We have the test data that goes into model training data. And ultimately, the test data is the raw data that we feed directly into whatever database is going to be leveraged to train the model.
That’s going to vary a great deal in the fact that the test data may be structured, it may be unstructured, it may be out of an existing database or it may be out of additional system such as information coming off a device in the case of IoT or Internet of Things. And that data in turn trains the model. The model training data where it’s tagged and specifically designed and in a format that will allow the model to be trained. And then obviously the information goes into the model. We’re getting data into the system. And that becomes the test data which becomes the model training data and which becomes the model. And the output is going to be the predictions that the machine learning system is able to do. So ultimately, if we’re looking at this as a core input/output system, it’s very simple in terms of its structure. Information goes in, and then we’re able to make predictions based on us asking questions to the machine learning system as to what we think the data means.
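The flow described above (raw data in, tagged training data, a trained model, predictions out) can be sketched in a few lines. This is a toy under stated assumptions: the pump readings are invented, the field names are hypothetical, and the "model" is a simple nearest-centroid classifier chosen because it fits on one screen, not because it is what SageMaker would use.

```python
from statistics import mean

# Raw data as it might arrive from IoT devices: not yet cleaned, with a gap.
raw_readings = [
    {"device": "pump-1", "temp_c": 61.0, "vibration": 0.20, "failed_within_week": False},
    {"device": "pump-2", "temp_c": 88.5, "vibration": 0.90, "failed_within_week": True},
    {"device": "pump-3", "temp_c": None, "vibration": 0.30, "failed_within_week": False},
    {"device": "pump-4", "temp_c": 64.2, "vibration": 0.25, "failed_within_week": False},
    {"device": "pump-5", "temp_c": 91.0, "vibration": 0.80, "failed_within_week": True},
]

def to_training_data(raw):
    """Step 1: turn raw records into tagged (features, label) pairs, dropping incomplete rows."""
    rows = []
    for record in raw:
        if record["temp_c"] is None or record["vibration"] is None:
            continue
        features = (record["temp_c"], record["vibration"])
        label = "fail" if record["failed_within_week"] else "ok"
        rows.append((features, label))
    return rows

def train_model(training_data):
    """Step 2: 'train' by computing the average feature vector (centroid) for each label."""
    centroids = {}
    for label in {lbl for _, lbl in training_data}:
        members = [features for features, lbl in training_data if lbl == label]
        centroids[label] = tuple(mean(column) for column in zip(*members))
    return centroids

def predict(model, features):
    """Step 3: predict the label whose centroid is closest to the new reading."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(model[label], features))
    return min(model, key=squared_distance)

training_data = to_training_data(raw_readings)
model = train_model(training_data)
print(predict(model, (90.0, 0.85)))   # a hot, vibrating pump
print(predict(model, (60.0, 0.22)))   # a cool, quiet pump
```

Each function corresponds to one arrow in the input/output picture, which is why the structure stays simple even when the model itself does not.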
Deploy
The final step in the process is deploying the machine learning application into production. So ultimately, this is typically going to be real-time. We’re going to move the machine learning instances out in production as they become available. We’re looking at other deployment capabilities, such as batch deployment, the ability to load systems up with data that may be analyzed using the machine learning system. So, we can do so in an interactive real-time way, or we can do so in a batch way. We have to look at operations that are innate to data. In other words, we’re dealing with databases, and we’re dealing with training data, and when you have to operate those systems, we have to back them up, we have to do business continuity disaster recovery systems on them, we have to secure them, other operations have to occur on that data. We have to operate our knowledge base, as well. We have to maintain the experience base over time, and basically, it is a base of information very much like a database. Even though it’s structured differently and maintained differently, we still have to back it up, we still have to maintain it to the directions from the software supplier. Ultimately, we have to deal with security, the ability to look at how security is going to relate to the operational thing, and maintain security operations in a proactive way. Keep in mind that your knowledge base that we’re building has very sensitive information in it. And ultimately, you need to ensure that that information is going to be secured from outside intrusion. Then the ability to place guard rails around how we’re going to leverage the knowledge base.
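Stepping back to the two deployment styles mentioned at the start of this section, the difference between real-time and batch can be shown with a small sketch. The `score` function is a hypothetical stand-in for a deployed model and the order records are invented; on SageMaker these two styles correspond roughly to real-time endpoints and batch transform jobs, which this sketch does not call.

```python
def score(record):
    """Stand-in for a deployed model: flags orders above a hypothetical amount."""
    return "review" if record["amount"] > 1000 else "approve"

def predict_realtime(record):
    """Real-time path: one request in, one prediction out, while the caller waits."""
    return score(record)

def predict_batch(records, chunk_size=2):
    """Batch path: work through a stored dataset in chunks and collect every result."""
    results = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        results.extend(score(record) for record in chunk)
    return results

orders = [{"amount": 50}, {"amount": 5000}, {"amount": 1000}]
print(predict_realtime(orders[1]))
print(predict_batch(orders))
```

The model is the same in both paths; what changes is who is waiting for the answer and how much data moves per call.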
Keep in mind that we can get in trouble with machine learning-based systems if others are allowed to teach it incorrect things. And so, it’s not about denying access to the system in many instances, but it’s putting guard rails around what individuals and what applications can do to the system. So production, we have to consider deployment, how we’re going to move the machine learning system from a staging area, or a development area, out into real production systems. You have to evaluate the system as it’s moving out, then ultimately we have to do kind of the operationally-oriented things, such as monitoring and management. The ability to look at what’s happening in the system, and ensuring that things are in good health and correcting them proactively if we find issues. And the ability to manage things such as software updates.
The ability to manage business continuity and disaster recovery, the ability to do the blocking and tackling it’s going to take to maintain a system in production. And ultimately, security and governance become core to what we think about to round-out how we’re going to deal with machine learning. So, security has to be proactively maintained, we have to monitor the system over time. If we detect intrusions, the ability to operate with countermeasures to make sure that people don’t get into those systems. And the more systems are out there, the more vulnerabilities there are going to be there to be exploited, and the more we need to be proactive with security. Governance is the ability to create policies around utilization of your machine learning-based systems. So, it puts guard rails around utilization of resources. Such as the training database, or the test database, or the raw database, or the knowledge engine unto itself. And ensuring that people who are accessing it can certainly do so, but have to do so around rules and regulations that we put into the policies in terms of what they can do when they access the knowledge base, the database, the test data, the training data.
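The point about not letting others teach the system incorrect things can be illustrated with a small gate in front of the training data. This is a sketch under assumptions: the trusted source names, the label set, and the record layout are all hypothetical, and a production system would enforce the same idea through its governance and access policies.

```python
# Hypothetical allowlist of systems and teams permitted to contribute feedback.
TRUSTED_SOURCES = {"quality-team", "returns-system"}
# Hypothetical set of labels the model is allowed to learn.
VALID_LABELS = {"defect", "no_defect"}

def accept_feedback(record):
    """Return True only for feedback that is safe to add to the training set."""
    features = record.get("features")
    return (
        record.get("source") in TRUSTED_SOURCES
        and record.get("label") in VALID_LABELS
        and isinstance(features, (list, tuple))
        and len(features) > 0
    )

def filter_feedback(records):
    """Split incoming feedback into accepted and rejected records."""
    accepted = [r for r in records if accept_feedback(r)]
    rejected = [r for r in records if not accept_feedback(r)]
    return accepted, rejected

incoming = [
    {"source": "quality-team", "label": "defect", "features": [0.9, 0.1]},
    {"source": "unknown-script", "label": "defect", "features": [0.2, 0.3]},
    {"source": "returns-system", "label": "maybe", "features": [0.5, 0.5]},
    {"source": "returns-system", "label": "no_defect"},
]
accepted, rejected = filter_feedback(incoming)
print(len(accepted), "accepted,", len(rejected), "rejected")
```

Note that nobody is denied access to the model here; the guard rail only limits what is allowed to flow back into training.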
Performance
Now let’s look at other considerations when looking at machine learning based systems. First is machine learning performance. Machine learning performance really consists of the ability to execute the models in a very efficient way. The ability to deal with information storage and retrieval, the ability to interact with whatever platform it’s running on, how we’re going to tune the models and tune the data so the system lives up to the expectations of the business, how we’re going to deal with security-based systems, such as encryption, how we’re going to put guard rails around our machine learning based system and data in terms of governance, and how we’re going to deal with specialized systems, such as Internet of Things or edge-based computing. So within the realm of performance, in terms of machine learning based systems, we have a performance chain. We have the interface to the model and the data. We have the output of the model and the data.
For example, our interface may ask a question, provide us a weather prediction, based on massive amounts of information and based on the Machine Learning model that we set out. The output of that would be a prediction and then we have to transmit that back to the application which would be another interface. So the latency, or the performance, is defined in the amount of time it takes us to invoke an interface, gather what’s needed from the model in the data base, output that information in terms of answering the question and then transmitting that information back to the application. What’s important here is that model execution has to be efficient. It’s typically where your latency is going to be.
We need to make sure that we’re assigning enough processing and I/O systems to execute the model. You need to make sure that the training data is in an efficient state, so it can be gathered quickly to train the model. Application interfaces are often a point of latency. You need to make sure that, whether we’re dealing with APIs or user interfaces, they’re able to communicate effectively and efficiently. Then there’s the ability to manage the various systems, so we can shut things down or speed things up, and allocate additional CPUs and I/O systems for performance, based on behavior. And the ability to monitor everything. Ultimately, if we’re going to look at performance, your ability to monitor to spot issues and your ability to leverage management layers to solve those issues is critical.
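The performance chain described above (invoke the interface, execute the model, output the answer) can be measured stage by stage. The following is a minimal sketch: the three stage functions are hypothetical stand-ins, and the "model" just averages its inputs, so the timings only show how the measurement works rather than what realistic latencies look like.

```python
import time

def timed(stage_name, func, timings, *args):
    """Run one stage and record how long it took, in seconds."""
    start = time.perf_counter()
    result = func(*args)
    timings[stage_name] = time.perf_counter() - start
    return result

def parse_request(payload):
    """Interface stage: turn the incoming request into model features."""
    return [float(x) for x in payload.split(",")]

def run_model(features):
    """Model stage: a trivial stand-in for model execution."""
    return sum(features) / len(features)

def format_response(prediction):
    """Output stage: package the prediction for the calling application."""
    return {"prediction": round(prediction, 2)}

def handle(payload):
    timings = {}
    features = timed("interface", parse_request, timings, payload)
    prediction = timed("model", run_model, timings, features)
    response = timed("output", format_response, timings, prediction)
    timings["total"] = sum(timings.values())
    return response, timings

response, timings = handle("21.5,22.0,23.5")
slowest = max(("interface", "model", "output"), key=timings.get)
print(response, "slowest stage:", slowest)
```

Recording each stage separately is what lets monitoring point at the stage that needs more CPU or I/O, instead of only reporting that the whole request was slow.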
Cost
Let’s now look at costs. The costs of running machine learning based systems are really several distributed costs that exist within a cloud-based system. The cost breakdown typically starts with how we’re storing the data. That could be within a database or a raw storage system like object storage on S3. Then there’s the ability to allocate the correct processing, including GPUs and CPUs. The ability to leverage the machine learning system itself. So in our case, that’s leveraging SageMaker, or perhaps another machine learning service that happens to run on the cloud provider we’re leveraging. And then other services that are needed. We have data analytics services, we have data loading services, we have the ability to provide data in a training state, based on services that exist in the cloud. And we have IoT services, and other things that may be coupled to our machine learning service that we have to add in, because they’re going to charge us for every service that we leverage within the cloud.
Keep in mind that this is confusing stuff. Ultimately, we’re thinking in many dimensions and that if we’re leveraging machine learning services, we’re typically leveraging premium features of the cloud. And that ultimately, it’s in turn, leveraging infrastructure services such as CPUs and storage systems. So, just the cost of leveraging the machine learning system is only part of it but it’s all the other services that are combined. And they’re both at different rates and sometimes in different increments and certainly in different ways. And so you have to keep track of your costs. Typically, cloud computing systems provide us the ability to do that but you’re going to have to come up with your own methodology to make sure that costs don’t get out of hand. This is a sample cost of SageMaker. We’re charged via GPU instances with a current generation and you can pick large, extra large, or really extra large and you can see the prices accordingly.
Note that the large instance that we leveraged for the demo is only $1.26 per hour. The eight extra large instance is $10.08 per hour, and the 16 extra large instance is $20.16 an hour. Obviously you’re going to need the larger instances to process more information, and in many instances enterprise applications are going to require lots of GPU instances to get to the scale that you need. Your job, as someone who’s building these systems, is to pick the most efficient and cost-effective way of processing your machine learning based system. So you have to consider cost, and there’s some cost modeling involved. And understand that picking the biggest instance, one that’s probably bigger than you need, is going to create cost inefficiencies. You have to pick the instance that’s right for your use case.
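The cost modeling mentioned above is mostly arithmetic, and it helps to write it down. The sketch below uses only the three hourly rates quoted in this section; the size labels are shorthand rather than official instance names, and the job lengths are hypothetical.

```python
from decimal import Decimal

# Hourly rates quoted in the text; the size labels are shorthand, not official names.
HOURLY_RATE = {
    "large": Decimal("1.26"),
    "8xlarge": Decimal("10.08"),
    "16xlarge": Decimal("20.16"),
}

def job_cost(size, hours, instance_count=1):
    """Cost in dollars of running `instance_count` instances of one size for `hours` hours."""
    return HOURLY_RATE[size] * Decimal(str(hours)) * instance_count

# A hypothetical 6-hour training job on each size.
for size in HOURLY_RATE:
    print(size, job_cost(size, 6))

# Leaving a single instance running for a 30-day month (720 hours).
print("large, one month:", job_cost("large", 720))
print("16xlarge, one month:", job_cost("16xlarge", 720))
```

Because the quoted rates scale linearly with size (8 × $1.26 = $10.08 and 16 × $1.26 = $20.16), a larger instance only saves money if it finishes the job proportionally faster, which is why picking an instance bigger than you need shows up directly as waste.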
Operations
Keep in mind that machine learning operations, or MLOps, is only part of the solution. Your ability to run this thing in an operational state longer term is where you’re going to be judged. So, ultimately, your ability to pay attention to the operational aspects of running a machine learning based system in the cloud is going to be key to success. So, keep in mind that we have certain components. We have to deal with security, we have to deal with other things, such as governance and the ability to deal with policies, we have to deal with information, including data, and we have to deal with platforms, the ability to run things on CPUs, whether it’s a Linux-based platform or a Windows NT-based platform. The ability to deal with APIs, service interfaces.
The ability to manage performance correctly. The ability to make sure that we’re running cost efficiently. And the ability to deal with other things that are going to pop up: IoT-based systems, edge computing, mobile computing, all these things that come into the management of a machine learning based system over a long period of time. So, understand that we’re dealing with many dimensions. Typically these things are going to be turned over to the CloudOps team, or cloud operations team, but ultimately, if we’re building these systems, we need to know how to define operations for the people who are going to operate them longer term. So keep in mind that CloudOps is about dealing with native cloud services: storage, compute, databases, security, things like that, usually through APIs and public cloud resources. We have to deal with security operations, we have to deal with governance, we have to deal with development, we have to deal with databases, we have to deal with performance, and now we have to deal with machine learning based operations, which are part of the stack. So understand that each of these responsibilities is going to be part of a CloudOps team, or monitoring infrastructure, and we need to tell cloud operations how to deal with everything from security all the way to machine learning.
When we turn our system over to the operations team, we typically have to give them a playbook, or the definition of how the thing needs to be operated. And we’re dealing with humans, and we’re dealing with machines that are interacting with the systems. Keep in mind that these systems are going to be set up in such a way that they have long-tail operational needs in dealing with other systems that are communicating with them, as well as human beings and users of attached applications. So, it’s part of a larger issue, and ultimately, we’re thinking in terms of the complexity of a typical enterprise. And if you look at all these various services that are on the screen, ultimately, all these really come into play with how we need to run an enterprise in the cloud longer term. And notice the cognitive aspect of it. We have to deal with the common learning framework, knowledge integration, and cognitive operations. The ability to run machine learning based systems becomes an integral part of how we run enterprise infrastructure and enterprise IT going forward. And it’s going to be part of the problems that we need to solve, but also part of the solution.
Security
Security is extremely important to machine learning based systems, considering that if the system is breached, the attackers not only have access to very sensitive information, but to sensitive knowledge engines as well. We need to be concerned with security really as something that’s systemic to the machine learning based system. Typically, at the heart of this is going to be directory services. It could be Active Directory or LDAP-based systems, or the ability to, in essence, deal with identities in some sort of a structured way. We have identity access management, or the ability to manage the identities; dealing with security issues, such as encryption; and the ability to deal with specialized security systems, such as multi-factor authentication. We always have to deal with compliance, dealing with rules and regulations and standards that are invoked, and the ability to deal with auditors, or whoever is auditing the system, either automated or human, and the ability to automate the security systems.
Keep in mind that security is only effective if it’s proactive in nature. And you have to automate the systems to be proactive. So directories, in terms of security and machine learning based systems, are key. We have Active Directory, we have LDAP, and there are other directories as well. Basically, a directory provides a common, well-structured database that’s going to track machines, applications, systems, data, and human beings, and how they interact with one another. So you’re able, through this identity access management system, to configure and reconfigure whether or not humans are authorized to access certain pieces of data, whether certain machines can access certain pieces of data, and whether certain humans can access some piece of the knowledge engine. And we’re able to turn things off and turn things on at a very fine-grained level.
So ultimately, we’re dealing with a stack here. We’re dealing with identity access management. We’re dealing with encryption services. We’re dealing with other core systems, such as multi-factor authentication all the way down to security automation. And so, the directories become the jumping off point with how we’re going to secure the machine learning based system because they’ll identify to us whether or not certain systems, applications, or human beings are authorized to get into the data and get into the knowledge engine.
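The role of the directory as the jumping-off point can be sketched as a simple lookup: identities map to roles, roles map to permissions, and anything not explicitly granted is denied. The identities, role names, and permissions below are all hypothetical, and a real deployment would rely on a directory service and the cloud provider’s identity access management instead of an in-memory dictionary.

```python
# Hypothetical directory: each identity (human or machine) maps to a set of roles.
DIRECTORY = {
    "alice@example.com": {"data-scientist"},
    "billing-app": {"inference-client"},
    "bob@example.com": set(),
}

# Hypothetical permissions per role, as (resource, action) pairs.
ROLE_PERMISSIONS = {
    "data-scientist": {("training-data", "read"), ("model", "train"), ("model", "invoke")},
    "inference-client": {("model", "invoke")},
}

def is_authorized(identity, resource, action):
    """Deny by default: allow only if one of the identity's roles grants the pair."""
    roles = DIRECTORY.get(identity, set())
    return any(
        (resource, action) in ROLE_PERMISSIONS.get(role, set())
        for role in roles
    )

print(is_authorized("alice@example.com", "training-data", "read"))
print(is_authorized("billing-app", "training-data", "read"))
print(is_authorized("stranger", "model", "invoke"))
```

Because access is expressed as data, turning a permission on or off for one person or one application is a change to the directory, not a change to the machine learning system.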
Governance
Cloud governance is the ability to put guardrails around resources. Either coarse-grain resources, such as CPUs and storage, or fine-grain resources, such as APIs and services. So, cloud governance provides us the ability to monitor and put policies around use of things that exist within the cloud, so keep that in mind. And everything that can be accessed by a human being should be able to be governed. So, we have the ability to deal with service governance and resource governance, as well as link security with governance, and the ability to put compliance policies around utilization of resources, which is tremendously important when you’re dealing with machine learning systems, because in some instances the data we’re storing, even the training data, may have some compliance rules and regulations that are applied to it. And the ability to deal with governance tools to monitor and manage and put policies around resources and services. And ultimately the ability to operate governance systems long term. So, ultimately we’re deploying the governance systems and operating them in such a way that they’re bound to the machine learning based system. So, not only do we have the security systems that are bound to it, but we have the ability to place guardrails around utilization of the data or the knowledge engine.
So, policies are useful in the fact that we’re dealing with governance tools and they allow us to write little, mini programs in terms of defining behavior, in terms of how things are going to be accessed, such as resources, such as a knowledge engine holistically, or a particular rule that we’ve created. And using these policies, we’re able to deal with resources through data or whatever thing that we need to put guardrails around, utilization in the particular cloud systems that we’re dealing with. And so, in dealing with machine learning systems, all of these really kind of come into play, both the machine learning systems consuming these resources, certainly data, applications, security, and services, or the machine learning system producing services that maybe need to be governed.
So an example would be an abstraction, a repository around a set of policies, through which we may govern services, some of those existing outside the machine learning system, some of them existing inside the machine learning system. And ultimately the service governance system is able to monitor and manage utilization of those services. When someone requests a service, such as a service from the machine learning system that we built, it’s going to validate that the person or calling application is authorized to invoke that service, and if so, that they’re doing so within policy. You know, perhaps time of day, perhaps level of access, perhaps bandwidth.
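A policy of the kind just described, checking time of day, level of access, and bandwidth, can be thought of as a small program. The sketch below is one hypothetical way to express it: the field names, the numeric access levels, and the use of calls per minute as a stand-in for bandwidth are all assumptions, and real governance tools have their own policy languages.

```python
from dataclasses import dataclass
from datetime import time

@dataclass(frozen=True)
class Policy:
    allowed_from: time          # earliest time of day the service may be called
    allowed_until: time         # latest time of day the service may be called
    min_access_level: int       # hypothetical numeric access level
    max_calls_per_minute: int   # stand-in for a bandwidth limit

@dataclass(frozen=True)
class Request:
    caller: str
    access_level: int
    at: time
    calls_this_minute: int

def check(policy, request):
    """Return (allowed, reason) for one call to a governed service."""
    if not (policy.allowed_from <= request.at <= policy.allowed_until):
        return False, "outside permitted hours"
    if request.access_level < policy.min_access_level:
        return False, "insufficient access level"
    if request.calls_this_minute >= policy.max_calls_per_minute:
        return False, "rate limit exceeded"
    return True, "ok"

prediction_policy = Policy(time(8, 0), time(18, 0), min_access_level=2, max_calls_per_minute=60)
print(check(prediction_policy, Request("reporting-app", 2, time(9, 30), 10)))
print(check(prediction_policy, Request("reporting-app", 2, time(22, 0), 10)))
```

Returning a reason along with the decision gives the monitoring side of governance something to log, which matters when auditors ask why a call was refused.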
Next steps
So a couple of resources out there you can leverage to understand more about machine learning and also more about SageMaker. First, we recommend that you go to the Amazon SageMaker site. It’s a very good place to figure out how you can get onto SageMaker, how you can build machine learning based applications and really updates around machine learning as it’s occurring on Amazon Web Services. Another great resource would be MIT News and there you can find information on the topic of machine learning. As you can see they have a lot of very insightful articles, such as automating artificial intelligence for medical decision making and other applications that are emerging around the world of machine learning.
These are great resources and you should bookmark them and make sure you’re going back at certain times to look at the updates. I typically check MIT News every other week. So where do you go from here? Well a few things we recommend is number one, get on Amazon Web Services. Trial accounts are typically free or almost free and try out machine learning based systems including SageMaker. You should get right into creating a SageMaker model as you saw within the demo and really kind of build an application based on a lot of the pre-built algorithms and models that they have in place and learning how it works, and learning how to implement. Make sure to run through the examples. There are many different examples that Amazon has out on their website. Make sure you understand the differences between supervised and unsupervised learning and the ability to teach your models in different ways in order to get to a successful use case for machine learning. Really, this is about doing something. So you’re going to learn machine learning, SageMaker as a product, and other machine learning based systems. But just trying it, getting out there and building the models and building the applications and seeing how it works for you.
Author:

- David Linthicum
- Chief Cloud Strategy Officer at Deloitte Consulting