Truly Managed Spark Job Environments With Amazon EMR
On weekdays, I have a day job that pays my rent (hopefully I'll be buying a house soon), my internet, and the occasional visit to a nice restaurant (in the pre-social-distancing days). On weekends, however, I spend time speaking to people who want to get into the data and AI world, as well as figuring out ways to improve my engineering work, and sometimes that of other engineers too.
Ever since the beta release of Amazon EMR 6.0.0, I had wanted to simplify my work. This weekend everything changed: the first major release of Amazon EMR 6.0.0 was announced, and my Eureka moment happened. Ever since I started deploying Spark jobs on Amazon EMR, my goal had been to write my ETL jobs in self-contained environments without thinking about the networking details of my AWS Cloud environment. One could argue that AWS Glue ETL jobs provide this, but I argue otherwise. In my experience with AWS Glue jobs, I have to properly package external libraries as zipped files, save them to Amazon S3, and specify the S3 path when executing my Glue job. Can this be made better with containerized environments? Yes!
Amazon EMR 6.0.0 comes with Docker compatibility, allowing the deployment of Spark jobs via Docker containers. However, I still have to do the following:
- Create AWS networking resources, e.g. VPCs, security groups, and everything in between.
- Create an Amazon EMR cluster with the “right” cluster configurations.
- Write and dockerize my Spark job.
- Submit my dockerized Spark job (sketched just after this list).
- Terminate the cluster once job execution is done.
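To give a feel for step 4, submitting a dockerized Spark job on an EMR 6.0.0 cluster looks roughly like the sketch below, run against YARN from the master node. The image URI and script name are placeholders, so check the EMR documentation for the exact configuration.
$ DOCKER_IMAGE=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-spark-app:latest   # placeholder image URI
$ spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
    --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
    --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE \
    my_etl_job.py   # placeholder Spark script inside the image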
From a developer experience perspective, I care less about steps 1, 2 and 5. My goal is to get my “job done”, which I achieve with steps 3 and 4. Hence, welcome my new weekend project:
I have not found a name yet, so I will refer to the project as “it” from here on. I look forward to your comments on possible names for “it”.
What does “it” do?
“It” allows seasoned big data engineers and new entrants to the field to focus on what they love doing: writing and deploying Spark ETL jobs. I built “it” because I wanted to simplify my big data ETL development experience on AWS. I want to focus only on writing my Spark jobs in any language of my choice, dockerizing them, and snapping my fingers. You get the idea.
How does “it” work?
“It” is available in two ways: you can submit your dockerized Spark jobs through the web interface or via a REST API. To get started with “it”, you deploy the base infrastructure via AWS CloudFormation and, as an output, you get the URL of a secure web interface. Authentication is managed with Amazon Cognito. Once logged in, you see a web interface like the one below:
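Deploying the base stack and retrieving the web interface URL from the stack outputs can also be scripted. A rough sketch with a hypothetical template file and stack name:
$ aws cloudformation deploy --template-file it-base.yaml --stack-name it-base \
    --capabilities CAPABILITY_IAM    # hypothetical names; --capabilities is needed if the template creates IAM resources
$ aws cloudformation describe-stacks --stack-name it-base \
    --query "Stacks[0].Outputs"      # the web interface URL is among the stack outputs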
Create a Docker image in Amazon Elastic Container Registry (ECR), or use a publicly available image from Docker Hub. You can follow the steps in this AWS blog post to do so; I also sketch the ECR build-and-push commands right after the list below. Once you have done this, it's time to submit your first dockerized Spark application to “it” via the web interface with the following steps.
- Select the image from the dropdown list, which is populated from the Docker images in Amazon ECR in your AWS account.
- Specify the CPU and memory resources your Spark application might need.
- Specify your job schedule.
- Click deploy.
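As referenced above, building your image and pushing it to Amazon ECR with a recent AWS CLI might look roughly like this; the account ID, region, and repository name below are placeholders.
$ aws ecr create-repository --repository-name my-spark-app --region us-east-1
$ aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
$ docker build -t my-spark-app .    # run from the directory containing your Dockerfile
$ docker tag my-spark-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-spark-app:latest
$ docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-spark-app:latest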
Using the web UI every time you want to run your Spark applications is not fun. As an engineer, what if I want to add this to my CI/CD pipeline? (I strongly recommend AWS CodeBuild.) You would simply make an HTTP POST request to “it’s” REST API and, voila, you have automated deployments in place. See an example REST API request below:
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"dockerImage": "", "numberOfVCPUs": 32, "cronSchedule": "", "jobName": "myDockerSparkApplication"}' \
    http://it-rest-api/submit-app
How do I get started with “it”?
“It” is currently available for early access. For now, I am not making the source code freely available; this may change in the coming weeks. Contributors are also welcome, now that GitHub offers unlimited access for teams.