AWS Glue

by Amazon Web Services · Since 2006
No reviews yet
Active · 1+ countries · Cloud
Quick facts
Vendor: Amazon Web Services
Year launched: 2006
Status: Active
Location: 410 Terry Ave N, Seattle, WA 98109, US
Countries served: 1+
Languages: 10
Integrations
Free tier
Free trial
Contact sales: Yes

About AWS Glue

AWS Glue is a serverless data integration service from Amazon Web Services for discovering, preparing, integrating, and modernizing data. It provides data cataloging, ETL (extract, transform, load), and schema management, so users can manage and process large datasets efficiently. AWS Glue automates much of the data preparation process, letting developers and data engineers focus on analysis rather than data wrangling. It is particularly useful for building data lakes and preparing data for analytics and machine learning.

Key capabilities:
  • Data cataloging
  • ETL capabilities
  • Schema management
  • Serverless architecture
  • Integration with various AWS services

Best for: data engineers and analysts who need to manage and process large volumes of data efficiently.

AWS Glue by Amazon Web Services is a fully managed ETL (Extract, Transform, Load) service designed to simplify preparing and integrating data for analytics, machine learning, and application development. Its primary purpose is to automate the tedious and complex tasks involved in data preparation, allowing users to catalog, clean, enrich, and move data between data stores. With features like a data catalog, automated schema discovery, job scheduling, and serverless execution, AWS Glue reduces the effort required to set up and maintain ETL pipelines. It is designed for a broad range of users, including data engineers, analysts, data scientists, and business intelligence professionals working in cloud-first or hybrid data environments.

The user interface of AWS Glue is clean and functional, though it presents a moderate learning curve for users unfamiliar with the AWS ecosystem. The dashboard is accessible via the AWS Management Console and provides access to key components such as Jobs, Crawlers, the Data Catalog, and Triggers.
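
To make the automated schema discovery described above more concrete, here is a minimal Python sketch of the request a crawler definition might use. It only builds a dictionary shaped like the parameters of the Glue CreateCrawler API; it does not call AWS. The crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, and the helper `build_crawler_request` is an illustrative name, not part of any AWS library.

```python
from typing import Any, Dict


def build_crawler_request(
    name: str,
    role_arn: str,
    database: str,
    s3_path: str,
) -> Dict[str, Any]:
    """Build keyword arguments shaped for the Glue CreateCrawler API.

    Hypothetical helper for illustration; nothing is sent to AWS here.
    """
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must start with 's3://'")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update table definitions in the catalog when the schema changes,
        # and only log (rather than delete) tables whose data is gone.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "LOG",
        },
    }


# All values below are placeholders (assumptions), not real resources.
request = build_crawler_request(
    name="example-sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="example_sales_db",
    s3_path="s3://example-bucket/sales/",
)
print(request["Targets"])
```

With boto3 installed and AWS credentials configured, a dictionary like this could be passed as `boto3.client("glue").create_crawler(**request)`, after which `start_crawler(Name=...)` would run the crawler. Treat the policy values as one reasonable choice, not a recommendation.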

Pros & Cons

What users like
  • Generative AI Assistance: Speeds up ETL development and troubleshooting with AI-powered code authoring and debugging.
  • Centralized Data Catalog: Provides a unified metadata repository for all your data, making it easy to discover and govern.
  • Flexible ETL Development: Offers both visual (drag-and-drop) and code-based (Spark/Python/Scala) interfaces to suit different user preferences.
  • Strong AWS Ecosystem Integration: Seamlessly works with other AWS services like S3, Lake Formation, Redshift, and SageMaker.
What users flag
  • Complexity for Beginners: Although AWS Glue simplifies many tasks, the breadth of features and concepts still presents a learning curve for new users.
  • Debugging Challenges: Debugging Spark jobs in a serverless environment can be more complex than on traditional, persistent clusters.
  • Dependency on AWS Ecosystem: Primarily designed for users within the AWS cloud environment, so it can be less straightforward for multi-cloud or hybrid setups that are not heavily integrated with AWS.

Features

Key features

Serverless Data Integration
Automates infrastructure provisioning and management (workers, scaling), allowing users to focus solely on data integration logic without managing servers.
Comprehensive Data Catalog
Provides a centralized metadata repository to quickly discover and catalog data across AWS, on-premises, and other cloud environments, making it instantly available for querying and transformation.
Visual & Code-Based ETL Development
Offers both a drag-and-drop visual interface (AWS Glue Studio) for building ETL flows and the flexibility of code-based development for Apache Spark jobs (Python/Scala).
Built-in Generative AI Capabilities
Includes intelligent assistance for ETL authoring and Spark troubleshooting, helping to modernize Apache Spark jobs and accelerate development.
Scalability On-Demand
Automatically scales computational resources up or down based on workload needs, ensuring efficient processing of data at any scale.
Support for Diverse Data Sources & Formats
Connects to over 100 different data sources and supports various file types (e.g., CSV, JSON, Parquet, ORC, XML, Excel, Tableau Hyper), including compressed formats, as sources or targets depending on the format.
Flexible Data Processing Frameworks
Supports various data processing paradigms, including ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform), and handles different workloads such as batch, micro-batch, and streaming.
Interactive Sessions
Enables data engineers to interactively explore and prepare data using their preferred IDE or notebook environment, facilitating experimentation and development.
Integration with AWS Lake Formation
Enhances security and data governance for data lakes by supporting read and write operations on Lake Formation registered tables with full table access for Spark jobs.
Optimization for Apache Iceberg Tables
Supports sort and z-order compaction for Apache Iceberg tables in Amazon S3, improving query performance and reducing costs.

Additional features

Serverless Data Integration Service
Operates without requiring users to provision, manage, or scale servers, automatically handling infrastructure for data integration workloads.
Comprehensive Data Catalog
Provides a centralized, persistent metadata repository that stores schemas, table definitions, and locations of data across AWS, on-premises, and other clouds, making data easily discoverable.
Automated Schema Discovery (Crawlers)
Utilizes "crawlers" to automatically connect to data sources, infer schemas, and populate the Data Catalog, including detecting partitions and schema evolution.
Visual ETL Development (AWS Glue Studio)
Offers a drag-and-drop graphical interface for visually building, running, and monitoring ETL workflows without extensive coding.
Code-Based ETL Development
Allows data engineers to write and customize ETL scripts using Apache Spark (PySpark or Scala) in a flexible development environment.
Built-in Generative AI Capabilities
Features AI-powered assistance for generating ETL code from natural language descriptions, modernizing Apache Spark jobs (e.g., upgrading versions), and accelerating Spark troubleshooting with intelligent diagnostics and root cause analysis.
On-Demand Scalability
Automatically scales compute resources to process data from gigabytes to petabytes, adapting to workload demands without manual intervention.
Reduced Operational Complexity
Minimizes administrative overhead by providing serverless data pipelines with built-in scheduling, monitoring, and auto-scaling.
Broad Data Source Connectivity
Connects to over 100 diverse data sources, including various AWS services (S3, RDS, Redshift, DynamoDB, Kinesis, MSK), on-premises databases (MySQL, Oracle, SQL Server, PostgreSQL), and SaaS applications (Salesforce, SAP).
Flexible Data Processing Frameworks
Supports various data processing patterns, including Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT), for batch, micro-batch, and streaming workloads.
Interactive Sessions
Enables data engineers and scientists to interactively explore, experiment with, and prepare data using their preferred Integrated Development Environments (IDEs) or notebooks, powered by a scalable Apache Spark backend.
Integration with Amazon SageMaker
Seamlessly integrates data preparation and ETL capabilities directly into the Amazon SageMaker environment, including SageMaker Unified Studio and Visual ETL.
AWS Glue Data Catalog Usage Metrics (Amazon CloudWatch)
Provides detailed API usage metrics for the Data Catalog in CloudWatch, allowing users to monitor, troubleshoot, and optimize their lakehouse runtime API usage.
Enhanced Apache Spark Capabilities for AWS Lake Formation
Supports full read and write access (operations such as CREATE, ALTER, DELETE, UPDATE, and MERGE INTO) for AWS Glue 5.0 Apache Spark jobs on AWS Lake Formation registered tables when the job role has full table access.
Apache Iceberg Table Optimization
Offers managed compaction, sort, and z-order compaction for Apache Iceberg tables in Amazon S3 via the AWS Glue Data Catalog, automatically removing old data files and improving query performance.
Support for Additional File Types & Output Options
AWS Glue Studio now supports more compressed file types (LZ4, SNAPPY, DEFLATE, etc.), Excel files as sources, and XML/Tableau Hyper files as targets, along with the option to specify the number of output files (including single file output).
Unified Scheduling Experience (Amazon SageMaker)
Streamlines the scheduling of Visual ETL flows and queries directly from the SageMaker interface using Amazon EventBridge Scheduler.
Amazon Redshift Zero-ETL Integrations (History Mode)
Supports "history mode" for zero-ETL integrations with eight third-party SaaS applications and various AWS databases, allowing tracking of historical data changes for analytics without traditional ETL processes.
Machine Learning Transformations (FindMatches)
Provides ML-powered transformations like FindMatches for record deduplication and entity resolution.
Job Scheduling and Orchestration
Allows setting up time-based (cron expressions) or event-based triggers for ETL jobs and orchestrating complex workflows.
Data Quality and Cleansing
Provides tools and capabilities to clean and prepare data, including handling missing values, standardizing column names, and converting data types.
Data Profiling and Classification
Profiles datasets to support organization and classification, either manually or with machine learning.
Data Access Control and Governance
Integrates with AWS IAM and Lake Formation for fine-grained access control, data encryption, and auditing.
Development Endpoints
Provides a dedicated environment for developing and testing ETL scripts interactively using notebooks.
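
As an illustration of the time-based triggers listed under Job Scheduling and Orchestration, the following Python sketch builds a daily cron schedule and a trigger definition shaped like the parameters of the Glue CreateTrigger API. It runs without AWS access. The helper names `glue_cron` and `build_scheduled_trigger`, the trigger name, and the job name are hypothetical. The sketch assumes the six-field Glue cron format (minutes, hours, day-of-month, month, day-of-week, year), evaluated in UTC.

```python
from typing import Any, Dict


def glue_cron(minute: int, hour: int) -> str:
    """Return a Glue schedule expression that fires once a day (UTC).

    Fields: minutes, hours, day-of-month, month, day-of-week, year.
    One of the two day fields must be '?', so day-of-week is left as '?'.
    """
    if not (0 <= minute <= 59 and 0 <= hour <= 23):
        raise ValueError("minute must be 0-59 and hour must be 0-23")
    return f"cron({minute} {hour} * * ? *)"


def build_scheduled_trigger(name: str, job_name: str, schedule: str) -> Dict[str, Any]:
    """Build keyword arguments shaped for the Glue CreateTrigger API.

    Hypothetical helper for illustration; nothing is sent to AWS here.
    """
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": schedule,
        "Actions": [{"JobName": job_name}],
        # Activate the trigger immediately instead of creating it disabled.
        "StartOnCreation": True,
    }


# Placeholder names (assumptions): run a job every day at 02:30 UTC.
trigger = build_scheduled_trigger(
    name="example-nightly-trigger",
    job_name="example-etl-job",
    schedule=glue_cron(minute=30, hour=2),
)
print(trigger["Schedule"])  # cron(30 2 * * ? *)
```

With boto3 and credentials in place, the dictionary could be passed as `boto3.client("glue").create_trigger(**trigger)`. Event-based triggers use a different `Type` and additional parameters that this sketch does not cover.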

Pricing

Free trial
Free version
Request a quote
Promo Offer

Countries & Languages

Countries served: 1
Interface languages: 10
Billing currencies: 6

Available in

All Countries.

Interface languages

English, Spanish, French, Italian, German, Portuguese, Dutch, Chinese, Japanese, Korean

Billing currencies

🇺🇸 USD, 🇪🇺 EUR, 🇬🇧 GBP, 🇦🇺 AUD, 🇯🇵 JPY, 🇨🇦 CAD

Alternatives to AWS Glue

Synatic Data Integration Platform

Synatic Data Integration Platform is a data integration software from Synatic that provides a comprehensive…

Synatic

Synatic is a unified platform from Synatic that enables the business to integrate and automate…

Airbyte

Airbyte is a data integration platform that helps users move data from various sources like…

BOARD Connector

Board Connector is a specialized "power-bridge" for any organization using the Board platform alongside SAP.

Conecta HUB

Conecta HUB is a robust data integration and automation platform developed by Conecta Software, designed…

CozyRoc SSIS+ 1.5 Library

CozyRoc SSIS+ is the "Swiss Army Knife" for SQL Server professionals. It successfully bridges the…

Often compared with AWS Glue

Synatic Data Integration Platform · API Management · 0.0
Synatic · API Management · 0.0
Airbyte · ETL · 0.0
BOARD Connector · ETL · 0.0