AWS SageMaker Studio: statistical model building, training, and deployment, with hyperparameter tuning

Implementation runbook for SageMaker Studio build, train, and deploy.

While working on a project, I was asked to build a statistical model on SageMaker Studio. If you want to know how, or you are facing the same situation, don't worry: I am here to guide you through it, and by the end of this story you will be able to do the same. I assume you already know how to do statistical modelling; what IAM roles, VPCs, and S3 buckets are and how to create and use them; and which region you are working in on AWS. And please don’t mind my screenshots.

Step1. Sign in to the AWS console and click on “Amazon SageMaker”, or search for it in the search bar.

*The AWS console after sign-in. These are my most-used services, which is why so many options appear.*

Step2. Click on Amazon SageMaker Studio.

Step3. Add a user and give it a name, then click on the created user and choose ‘Open Studio’.

Step4. Create a project, or open the one you have already created.

Clone both the build and deploy repositories.

When creating the project, choose the last template in the list, the one for build, train, and deployment, and provide the relevant details, such as the name and description of the project you are working on.

Step5. Once the project is built, you can open ‘SageMaker Components and registries’ and the ‘File Browser’ on the far left to check your project.

Step6. Go to ‘File Browser’ and click on your project. There you will see the build and deploy modules created for your project. These are the AWS defaults for the ‘model build’ pipeline and the ‘model deploy’ pipeline. The defaults work as shipped (they build and deploy a model); you will replace their details with your own project's to get your project running on Studio.

Step7. Under the build module you will see a few files and a folder structure. Let's explore them one by one.

The codebuild yml file describes the sequence of execution. Here you will see the build commands and then “run-pipeline”, which runs your pipeline. The default name of the pipeline project is ‘abalone’, but you can change it to the name of the project you want to run, as shown below. After this come some of the roles, then the region, the pipeline, and the model artifacts that will be saved when CodeBuild runs.

The files ‘README’, ‘LICENSE’, ‘CONTRIBUTING’, the setup files (config and py), and tox.ini will be used later. (We will not discuss them here.)

Step8. Now let's explore the pipelines folder. Here you will see the details below.

In abalone you will see the files below. The untitled one was created by me to test specific pieces of code when they fail.

Let's talk about the Python files inside the abalone directory.

preprocess.py is responsible for preprocessing the data, and evaluate.py evaluates the abalone model's performance. pipeline.py is the backbone that ties these code files together so they run as one pipeline. These are the defaults and contain default code. What we will do is take a notebook that already covers everything from data ingestion, data wrangling, and data preprocessing to model building and training with hyperparameter tuning, and port that code from the notebook into these files to build the project on AWS.

Step9. Open preprocess.py and check the code there; the libraries needed for preprocessing the data are already imported. If you get an s3fs error while reading data from the S3 bucket, you can also add the code below.

Snapshot from preprocess.py
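If the s3fs error is a missing module, one common workaround is to install the package at the top of preprocess.py before anything tries to read an `s3://` path. The sketch below is an assumption about what such a fix can look like, not a copy of the snapshot; `ensure_package` is a hypothetical helper name.

```python
import importlib.util
import subprocess
import sys


def ensure_package(name, installer=subprocess.check_call):
    """Install `name` with pip if it cannot be imported.

    Returns True if an install was attempted, False if the package was already there.
    """
    if importlib.util.find_spec(name) is not None:
        return False  # already importable, nothing to do
    # Use the interpreter that is running this script so the package
    # lands in the same environment as the processing job.
    installer([sys.executable, "-m", "pip", "install", name])
    return True


# In preprocess.py this call would sit above the first read from S3:
# ensure_package("s3fs")
```

The `installer` argument exists only so the helper can be exercised without touching the network; in the pipeline you would leave it at its default.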

After this the preprocessing code starts with data ingestion. The input data below is the same data that is passed in from pipeline.py; we can pass multiple files to the preprocessing code if needed.

Snapshot from preprocess.py
Snapshot from pipeline.py
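The input data arrives in preprocess.py as an S3 address, so the script has to separate the bucket from the object key before it can download anything. Here is a minimal sketch of that step that also handles several input files. The helper names are hypothetical, and `download_inputs` assumes boto3 and AWS credentials are available in the processing container.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/file.csv' into ('bucket', 'path/to/file.csv')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_inputs(uris, local_dir):
    """Download each S3 object into local_dir and return the local file paths."""
    import boto3  # imported lazily so split_s3_uri works without it

    s3 = boto3.resource("s3")
    local_paths = []
    for uri in uris:
        bucket, key = split_s3_uri(uri)
        target = f"{local_dir}/{key.split('/')[-1]}"
        s3.Bucket(bucket).download_file(key, target)
        local_paths.append(target)
    return local_paths
```

Passing a list of addresses to `download_inputs` is one way to feed multiple files to the preprocessing code.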

Here, in pipeline.py, we can provide “ProcessingInstanceType” and “TrainingInstanceType” as the compute, and also provide the address of the S3 bucket where the data files are kept (InputDataUrl is the name assigned to it and default_value is the address). We assign them here and use them in preprocess.py.
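As a rough illustration of how these three values can be declared, the sketch below keeps the names and defaults in one mapping and turns them into pipeline parameters with `ParameterString` from the SageMaker Python SDK. The instance types and the S3 address are placeholders I made up, and `build_parameters` is a hypothetical helper; your pipeline.py may declare them one by one instead.

```python
# Placeholder defaults: replace the S3 address and instance types with your own.
PARAMETER_DEFAULTS = {
    "ProcessingInstanceType": "ml.m5.xlarge",
    "TrainingInstanceType": "ml.m5.xlarge",
    "InputDataUrl": "s3://your-bucket/path/to/data.csv",
}


def build_parameters(defaults=PARAMETER_DEFAULTS):
    """Turn the name -> default_value mapping into pipeline parameters."""
    # Requires the SageMaker Python SDK to be installed.
    from sagemaker.workflow.parameters import ParameterString

    return [
        ParameterString(name=name, default_value=value)
        for name, value in defaults.items()
    ]
```

The resulting list is what gets handed to the pipeline definition as its `parameters`, and each default can be overridden when an execution is started.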

Once the data is downloaded, you can start preprocessing according to your code and requirements. Carry on until you are done with preprocessing, feature engineering, and every other step needed to build your final dataset for model building.

Snapshot from preprocess.py

First the data is split into X_pre and y_pre, then concatenated, then split again into train, validation, and test sets and saved in the model artifact bucket on S3. The data is now split and saved and will be used for model training, but we need hyperparameter tuning before the trained model gets saved. I tried running hyperparameter tuning from “jobs” outside Studio as well, but that did not work because the pipeline has no way of knowing whether the tuning is still running. We could also pause the internal process for some time, but that is not a good solution for a running business: you cannot tell your client that the model will hang for a while every time it is trained. So we will tune the hyperparameters inside Studio only, with the code and file adjustments below.
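The concatenate-then-split step can be sketched without any libraries as follows. The 70/15/15 proportions, the fixed seed, and the toy data are assumptions for illustration; use whatever split your notebook used.

```python
import random


def split_rows(rows, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle rows and split them into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left over
    return train, validation, test


# Toy stand-ins for the preprocessed features and labels.
X_pre = [[i, i * 2] for i in range(100)]
y_pre = [i % 3 for i in range(100)]

# Concatenate label and features into one row so they stay aligned when shuffled.
rows = [[y] + x for x, y in zip(X_pre, y_pre)]
train, validation, test = split_rows(rows)
```

Each of the three lists would then be written to its own CSV file so the later steps can pick it up.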

Step10. Create a file with any name; I have chosen parameter-tuning.py. Then open your pipeline.py, find the code where the training starts, and join parameter-tuning.py into pipeline.py. Now, once the data has been split into train, test, and validation sets, we first tune the hyperparameters and then pass them to model training.

pipeline.py code where the training steps are defined
pipeline.py code where parameter-tuning.py is joined
Code for parameter-tuning.py

Step11. Import the libraries needed for parameter tuning. Define a function for hyperparameter tuning. The code below is for multiclass classification, but you can write your own as needed.

Specify the grid search details.
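To make the idea of a grid search concrete, here is a library-free sketch that tries every combination of values and keeps the best one. It is not the code from parameter-tuning.py, which may well rely on a library's own search utility. The grid, the parameter names, and `toy_score` are made up for illustration; in practice the scoring function would train on the training set and score on the validation set.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # higher is better, e.g. validation accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid for a tree-based classifier.
param_grid = {"max_depth": [3, 5, 7], "eta": [0.1, 0.3]}


def toy_score(params):
    """Stand-in for 'fit the model and return its validation score'."""
    return -abs(params["max_depth"] - 5) - abs(params["eta"] - 0.1)


best_params, best_score = grid_search(param_grid, toy_score)
```

The number of fits grows with the product of the list lengths, so keep the grid small when every fit runs inside the pipeline.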

Pull the data from S3. Read your training and validation data. Split them into X and y. Provide the parameters.
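Splitting a downloaded file into X and y might look like the sketch below. It assumes a header-less CSV with numeric values and the label in the first column; adjust `label_column` if your files are laid out differently. `read_xy` is a hypothetical helper, and the in-memory sample stands in for a file pulled from S3.

```python
import csv
import io


def read_xy(csv_file, label_column=0):
    """Read a header-less numeric CSV and split each row into features and label."""
    X, y = [], []
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row]
        y.append(values.pop(label_column))
        X.append(values)
    return X, y


# In the pipeline you would pass an open file handle for the downloaded
# training or validation CSV instead of this in-memory sample.
sample = io.StringIO("1,0.5,0.2\n0,0.1,0.9\n2,0.7,0.3\n")
X_train, y_train = read_xy(sample)
```

The same function can be called a second time on the validation file to get the validation X and y.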

Assign the best parameter values and write them out with logger.info, both as good practice and so you can check which parameters the tuning used.
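A small sketch of that logging step, using Python's standard logging module, is shown here. The parameter names and the score are invented values, and `log_best_params` is a hypothetical helper name.

```python
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())


def log_best_params(best_params, best_score):
    """Log the score and each tuned value on its own line; return the logged lines."""
    lines = [f"best validation score: {best_score}"]
    lines += [f"best {name}: {value}" for name, value in sorted(best_params.items())]
    for line in lines:
        logger.info(line)
    return lines


# Invented values, only to show the output format.
logged = log_best_params({"max_depth": 5, "eta": 0.1}, 0.93)
```

One line per parameter keeps the values easy to search for when you read the job logs later.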

Save the model to S3.
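One possible way to do this is to pickle the model, wrap it in a model.tar.gz archive (the packaging SageMaker uses for model artifacts), and upload that file. The sketch below assumes a picklable model; the helper names are hypothetical, and `upload_to_s3` needs boto3 and AWS credentials. Frameworks with their own save format would use that instead of pickle.

```python
import os
import pickle
import tarfile
import tempfile


def package_model(model, out_dir):
    """Pickle `model`, wrap it in model.tar.gz inside out_dir, return the archive path."""
    os.makedirs(out_dir, exist_ok=True)
    pickle_path = os.path.join(out_dir, "model.pkl")
    with open(pickle_path, "wb") as f:
        pickle.dump(model, f)
    archive_path = os.path.join(out_dir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(pickle_path, arcname="model.pkl")
    return archive_path


def upload_to_s3(local_path, bucket, key):
    """Upload a local file to s3://bucket/key."""
    import boto3  # imported lazily so packaging works without it

    boto3.client("s3").upload_file(local_path, bucket, key)


# A plain dict stands in for a trained model object.
archive = package_model({"max_depth": 5}, tempfile.mkdtemp())
```

If the code runs inside a SageMaker training job, writing the model under /opt/ml/model is usually enough, because SageMaker uploads that directory to S3 when the job ends.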

Once this is done, check the pipeline execution: it will evaluate the model that was built and show the evaluation score. The snapshot below is from Components and registries → Pipelines.

The screenshot from the execution where the model is evaluated and registered.

The logs for the successful execution can be checked in CloudWatch.

I will add more details here, upload all the code to GitHub, and share the link. Until then, please write to codelokiyt@gmail.com with any questions.

And yes, you can follow Julien Simon, Amazon Web Services, Stephane Maarek, and Emily Webber. However, they haven't included this part in any of their videos, so I thought I would share it.

I hope you like it. Thanks.

Lokesh
Data Scientist @Useready
