AWS Glue is a relatively new, Apache Spark-based, fully managed ETL service which can do a lot of heavy lifting and simplify the building and maintenance of your end-to-end Data Lake solution.
It is the combination of two major functionality groups:
- The Data Catalog gathers, maintains and publishes metadata about data stored in AWS or elsewhere. It is kept up to date automatically by crawlers: small executable programs which connect to the data stores, progress through a prioritized list of classifiers to determine the schema of your data, and then create metadata tables in your Data Catalog. Metadata, including the relationalized data structure, can be generated in a matter of minutes. This information is then made available for consumption and can be used to build ETL jobs, or to query the data through serverless services (Athena, Redshift Spectrum) immediately at its resting place.
- The ETL service is for building ETL jobs on top of the foundation the Data Catalog provides. It also pre-generates PySpark code, can provide provisioned developer endpoints, and sports a flexible job scheduler and job-chaining engine.

With the combination of possibilities provided by Glue and by the greater AWS ecosystem, the creation and maintenance of a Data Lake via event-triggered, smart, chained ETL jobs becomes not just a streamlined, sustainable solution, but a viable alternative to maintaining your own bespoke Data Catalog via APIs.
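Speaking of APIs: once crawlers have populated a Catalog database, that metadata can be consumed programmatically as well as through the console. A minimal boto3 sketch, where the database name datalake_catalog is just a placeholder for whatever your crawlers write into:

```python
import boto3

glue = boto3.client("glue")

# List the metadata tables the crawlers have created in a Catalog database,
# together with the column names they have inferred.
for table in glue.get_tables(DatabaseName="datalake_catalog")["TableList"]:
    columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], columns)
```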
Today's example is a relatively simple AWS Glue pipeline which loads semi-structured data from S3 into a relational database destination as soon as it arrives. A pattern like this can easily be used to maintain near real-time data marts in your DWH for storage or BI purposes.
This guide assumes that you have a working AWS account, an appropriate level of understanding of the basic AWS services, and access and privileges to the various resources involved.
- AWS S3 console: Create the data source S3 bucket and upload a file of your choice. Quite a few formats are supported natively by the built-in classifiers, which will be utilised by the crawlers. JSON, CSV, Avro, Parquet are all safe bets.
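If you prefer to script this step, a minimal boto3 sketch could look like the following; the bucket name, region and file are placeholders, and creating a bucket outside us-east-1 requires the LocationConstraint:

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# Create the landing bucket (bucket names must be globally unique).
s3.create_bucket(
    Bucket="my-datalake-landing",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
)

# Upload a sample semi-structured file for the crawler to classify.
s3.upload_file("sample.json", "my-datalake-landing", "incoming/sample.json")
```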
- AWS IAM Console: Create the appropriate policy and role for Glue to be able to access the resources in your pipeline. NOTE: if you are planning to use Redshift or an RDS instance, the AWS Managed Policy called AWSGlueServiceRole will not be sufficient; you will have to create an additional bespoke policy and attach it to the role to cover these resources as well. The first two steps at the link below will give you good guidance: https://docs.aws.amazon.com/glue/latest/dg/getting-started-access.html
- AWS Glue Console: Create a Table in the AWS Glue Data Catalog using a Crawler, and point it to your file from point 1. Don't forget to execute the Crawler! Verify that the crawler finished successfully and that you can see metadata like what is shown in the Data Catalog section image. The Crawler will require an IAM role; use the role from step 2.
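The same crawler setup can also be scripted. A rough boto3 sketch, using the hypothetical role, Catalog database and bucket names from the earlier steps:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler that classifies everything under the S3 prefix
# and writes the resulting metadata tables into the Catalog database.
glue.create_crawler(
    Name="s3-source-crawler",
    Role="AWSGlueServiceRole-datalake",   # IAM role from step 2
    DatabaseName="datalake_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-landing/incoming/"}]},
)

# Don't forget to execute it!
glue.start_crawler(Name="s3-source-crawler")

# The crawler's State moves from RUNNING through STOPPING back to READY.
print(glue.get_crawler(Name="s3-source-crawler")["Crawler"]["State"])
```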
- Database Workbench of your choice: Log in to your Redshift cluster (or to the target database you want to transfer the data to) and create a destination table.
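For reference, here is a sketch of a possible destination table, created from Python with psycopg2 instead of a workbench. The endpoint, credentials and table layout are entirely hypothetical; your columns should match whatever your source file contains:

```python
import psycopg2  # e.g. pip install psycopg2-binary

# Placeholder endpoint and credentials for the Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="glue_user",
    password="********",
)

# The connection context manager commits the transaction on success.
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS public.orders (
            order_id   INTEGER,
            customer   VARCHAR(256),
            amount     DOUBLE PRECISION,
            created_at TIMESTAMP
        );
    """)
conn.close()
```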
- AWS Glue Console: Create a Connection in Glue to the Redshift Cluster (or to the Database) from point 4 using either the built-in AWS connectors or the generic JDBC one. Make sure the user you are using to set up the Connection (if it is different from the one you used to create the destination table) has access to your destination database table.
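The Connection can be created from code as well. A hedged boto3 sketch with placeholder connection details; the subnet, security group and availability zone must be the ones your cluster actually lives in:

```python
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "redshift-destination",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://my-cluster.abc123xyz.eu-west-1.redshift.amazonaws.com:5439/dev",
            "USERNAME": "glue_user",
            "PASSWORD": "********",
        },
        # Glue needs network access to the cluster, so the connection carries VPC details.
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "eu-west-1a",
        },
    }
)
```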
- AWS Glue Console: Create another Table in the Data Catalog using a second Crawler, utilising the database connection created above. You can specify which database tables you want to gather metadata on using appropriate schema and/or table name prefixes. Execute the Crawler, and verify that your second metadata table entry is present.
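This second crawler only differs from the first one in its target: instead of an S3 path it points at the JDBC connection, with a path narrowing it down to the schema or tables you care about. A short sketch with the same placeholder names as before:

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="redshift-destination-crawler",
    Role="AWSGlueServiceRole-datalake",
    DatabaseName="datalake_catalog",
    Targets={"JdbcTargets": [{
        "ConnectionName": "redshift-destination",
        "Path": "dev/public/orders",   # database/schema/table to crawl
    }]},
)
glue.start_crawler(Name="redshift-destination-crawler")
```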
- AWS Glue Console: At this stage you have metadata available on both the source and the destination, so you can develop a job using the simple Drag&Drop interface. Follow the screens; you can leave most of the settings at their defaults. Apart from naming the different entities, make sure to:
++ pick the IAM role you have created in point 2
++ choose the option: “A proposed script generated by AWS Glue”
++ map the fields from the source to the fields in the destination, and preview the generated code if you are interested (a sketch of what it typically looks like follows below). The only remaining tasks are saving and executing the job!
- Optional: AWS Lambda Console: Develop a simple Lambda function which is triggered by the S3 Put event and kicks off your newly developed Glue ETL Job every time a new file arrives.
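For the curious, the proposed script that Glue generates for a job like this is plain PySpark built on the awsglue library. The sketch below is only indicative, with placeholder catalog, table and connection names and a guessed field mapping:

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source via its Data Catalog entry (created by the first crawler).
source = glueContext.create_dynamic_frame.from_catalog(
    database="datalake_catalog", table_name="incoming"
)

# Map / cast the source fields onto the destination columns.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "int"),
        ("customer", "string", "customer", "string"),
        ("amount", "double", "amount", "double"),
        ("created_at", "string", "created_at", "timestamp"),
    ],
)

# Write to Redshift through the Glue Connection; Glue stages the data in S3 first.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-destination",
    connection_options={"dbtable": "public.orders", "database": "dev"},
    redshift_tmp_dir="s3://my-datalake-landing/temp/",
)

job.commit()
```

And a minimal sketch of the optional Lambda trigger, assuming the Glue job created above was named load-s3-to-redshift; the function simply starts a job run for every object that lands in the bucket:

```python
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # One S3 Put event can carry several records; kick off a job run for each new file.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        response = glue.start_job_run(JobName="load-s3-to-redshift")
        print(f"Started run {response['JobRunId']} for s3://{bucket}/{key}")
```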