Discover how AWS Glue's serverless capabilities bridge the gap between complex data sources while eliminating the overhead of traditional ETL infrastructure.
Written by Patten Smith, Senior Analytics Engineer at QuantSpark
AWS Glue is a proprietary AWS service for creating and managing ETL and ELT data pipelines. It is serverless and fully managed by AWS.
Glue is essentially a high-level abstraction of Apache Spark. It takes care of a lot of the complexity and cluster management under the hood, whilst also bringing a few bells and whistles of its own.
This allows the user to write regular Python (or even make use of the no-code GUI) and still take advantage of the speed and efficiency of Spark in a fully managed and serverless environment.
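For readers who have not seen one, below is a minimal sketch of what a PySpark-based Glue job script looks like. The boilerplate is what Glue generates for a new script; the job name is passed in as a job argument.

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard Glue boilerplate: resolve job arguments and initialise the Spark and Glue contexts
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# ... ordinary Python / PySpark transformations go here ...

job.commit()
```

Everything beneath that boilerplate is ordinary Python, which is what makes the patterns later in this post possible.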
Because the pipelines are coded in Python (or Scala, for those so inclined), Glue is highly flexible.
You are not limited to SQL databases as your source: Glue can also be used against object storage, such as S3, and NoSQL databases. This means it plays well with modern data lake architectures.
It comes with a number of useful out-of-the-box features:
You can create crawlers that assess data sources and build a data catalogue accordingly. The catalogue holds details of every object or table in your source and the columns each contains, allowing you to stay on top of schema changes in your source.
It comes with pre-configured connection types for common data sources such as Snowflake and Redshift.
You can orchestrate and schedule multiple Glue jobs using workflows, as sketched below.
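As a rough illustration of that last point, a workflow and its triggers can be created through the Glue API as well as the console; the workflow name, job names and schedule below are purely illustrative.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical workflow chaining two jobs; names and schedule are illustrative only
glue.create_workflow(Name="daily-elt-workflow")

# A scheduled trigger kicks off the first job each morning
glue.create_trigger(
    Name="start-extract",
    WorkflowName="daily-elt-workflow",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "extract-job"}],
    StartOnCreation=True,
)

# A conditional trigger runs the load job once the extract job succeeds
glue.create_trigger(
    Name="start-load",
    WorkflowName="daily-elt-workflow",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "extract-job",
                "State": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "load-job"}],
    StartOnCreation=True,
)
```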
We recently deployed Glue to handle the integration of a new third-party Snowflake data source into a client’s Redshift database.
The client needed visibility of the data to report vital daily performance metrics. Glue allowed a seamless, highly flexible and scalable integration of the sources and provided a novel, lightweight solution for in-database transformation.
We employed the following architecture:
An AWS Glue job transfers data from the source database into S3. This is achieved by running a SQL query against the Snowflake source and loading the result into an AWS Glue dynamic frame for transfer.
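In outline, the extract step looks something like the snippet below, which assumes Glue's native Snowflake connection type and runs inside the job boilerplate shown earlier. The connection name, query and bucket are illustrative, and the exact option keys may vary with your Glue version and connector.

```python
# Sketch of the Snowflake -> S3 extract step (placeholders throughout)
source_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="snowflake",
    connection_options={
        "connectionName": "snowflake-connection",  # Glue connection holding the credentials
        "query": "SELECT * FROM analytics.daily_performance",
    },
)

# Land the result in S3 as Parquet for the crawler to pick up
glue_context.write_dynamic_frame.from_options(
    frame=source_frame,
    connection_type="s3",
    connection_options={"path": "s3://example-landing-bucket/daily_performance/"},
    format="parquet",
)
```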
Once in S3, the objects are assessed with a crawler. This returns all the columns of each table, allowing the pipeline to adapt to changes in the source schema.
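Creating the crawler is a one-off step, done either in the console or through the API. A sketch, with illustrative names, path and IAM role:

```python
import boto3

glue = boto3.client("glue")

# Crawl the S3 landing path and record the discovered tables and columns in the catalogue.
# Crawler name, role, database and path are illustrative placeholders.
glue.create_crawler(
    Name="landing-bucket-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="landing_catalogue",
    Targets={"S3Targets": [{"Path": "s3://example-landing-bucket/"}]},
)

glue.start_crawler(Name="landing-bucket-crawler")
```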
Having determined the columns of the objects, we used another Glue job to load them into Redshift. Before the import, the Glue job runs a SQL pre-action to drop the destination table and replace it with an empty one with the new column set.
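A sketch of the load step, again with illustrative names; in the real pipeline the CREATE TABLE statement is generated from the crawled schema, but it is hard-coded here for brevity.

```python
# Read the crawled table from the catalogue, then load it into Redshift.
# Database, table, connection and DDL are illustrative placeholders.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="landing_catalogue",
    table_name="daily_performance",
)

pre_sql = (
    "DROP TABLE IF EXISTS staging.daily_performance; "
    "CREATE TABLE staging.daily_performance (event_date DATE, revenue NUMERIC);"
)

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=frame,
    catalog_connection="redshift-connection",
    connection_options={
        "dbtable": "staging.daily_performance",
        "database": "analytics",
        "preactions": pre_sql,  # drop and recreate the destination with the new column set
    },
    redshift_tmp_dir="s3://example-temp-bucket/redshift-staging/",
)
```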
A final Glue job runs DBT to transform the data into a usable data model that drives insights for the client.
We made full use of Glue’s flexibility to deploy DBT in this way. By configuring the job parameters to accept additional Python modules, we installed DBT in the Glue environment and synced the ephemeral file system with a secure S3 bucket to import the models and the various YAML files DBT requires. The out-of-the-box Redshift data connector made it easy for DBT to access the client’s database without sacrificing any security.
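In outline, the DBT job looked something like the sketch below. The dbt-redshift adapter is supplied through the job's additional Python modules parameter rather than installed in the script, and the bucket, paths and dbtRunner entry point (dbt-core 1.5+) are assumptions for illustration.

```python
# Glue job that runs DBT. The dbt-redshift adapter is installed via the job
# parameter --additional-python-modules (e.g. "dbt-redshift==1.7.*").
# Bucket name and paths below are hypothetical.
import os

import boto3
from dbt.cli.main import dbtRunner  # programmatic entry point, dbt-core >= 1.5

s3 = boto3.resource("s3")
bucket = s3.Bucket("example-dbt-project-bucket")

# Sync the DBT project (models, dbt_project.yml, profiles.yml) from S3
# into the job's ephemeral file system.
for obj in bucket.objects.filter(Prefix="dbt_project/"):
    if obj.key.endswith("/"):
        continue  # skip folder placeholder keys
    local_path = f"/tmp/{obj.key}"
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    bucket.download_file(obj.key, local_path)

# Run the DBT models against Redshift using the synced project and profiles
result = dbtRunner().invoke([
    "run",
    "--project-dir", "/tmp/dbt_project",
    "--profiles-dir", "/tmp/dbt_project",
])

if not result.success:
    raise RuntimeError("DBT run failed")
```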
This allowed a very lightweight deployment of DBT that could sit in the same workflow as the rest of the Glue jobs. This meant Glue was both orchestrating and executing every aspect of our ELT pipeline.
Ready to unlock the full potential of your data architecture?
At QuantSpark, we specialise in designing and implementing efficient data pipelines that solve your most complex integration challenges. Whether you're struggling with disparate data sources, evolving schemas, or performance bottlenecks, our team of data experts can craft a tailored solution using AWS Glue and other cutting-edge technologies.
Contact QuantSpark today to transform your data operations and drive meaningful business insights from your most valuable asset—your data.