AWS Glue is a fully managed extract, transform, and load (ETL) service for preparing and loading data for analytics. It automates time-consuming data preparation tasks so that data is easier to organize, understand, and analyze.
AWS Glue discovers and catalogs metadata about your data in a centralized metadata repository known as the AWS Glue Data Catalog. This makes your data readily searchable and available for ETL.
Key features of AWS Glue include:
- Automated ETL Jobs: AWS Glue generates Python or Scala code for your ETL jobs that you can further customize if necessary.
- Data Cataloging: AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations.
- Serverless Architecture: There's no infrastructure to set up or manage with AWS Glue. You only pay for the compute resources you use.
- Development Endpoints: These let you develop and test your ETL scripts interactively.
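As a rough illustration of the cataloging feature listed above, the following sketch defines a crawler over an Amazon S3 prefix using boto3. The crawler name, IAM role ARN, database name, and bucket path are hypothetical placeholders, and `boto3` is imported lazily so that the request can be inspected without AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue CreateCrawler API."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request):
    """Create the crawler and start one run (requires AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
```

Calling `create_and_start_crawler(request)` against a real account would register the crawler and start a run; the tables it discovers then appear in the named database of the Data Catalog.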
AWS Glue Data Catalog
The AWS Glue Data Catalog is a fully managed, Apache Hive Metastore compatible, metadata repository. It enables users to store, annotate, and share metadata in the AWS Cloud in a simple, scalable, and secure manner. Its key features include:
- Metadata Crawling: AWS Glue crawlers can automatically discover metadata from various data sources and record it in the Data Catalog, which saves time and effort when creating ETL jobs.
- Centralized Metadata Store: It provides a unified view of all your data sources, making it easy for your data consumers to discover and access data.
- Fully Managed: AWS Glue Data Catalog is serverless and fully managed by AWS, which reduces your operational overhead.
- Integration with AWS Services: It seamlessly integrates with various AWS services, making it a central metadata repository for your AWS data analytics stack.
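To make the idea of a centralized metadata store more concrete, here is a minimal sketch that reads a table definition from the Data Catalog with the boto3 `get_table` call and extracts its storage location and columns. The database and table names are hypothetical, and the sample dictionary is hand-written to mirror the shape of a `GetTable` response rather than taken from a real account.

```python
def summarize_table(response):
    """Return (location, [(column, type), ...]) from a GetTable response."""
    table = response["Table"]
    descriptor = table["StorageDescriptor"]
    columns = [(c["Name"], c["Type"]) for c in descriptor["Columns"]]
    # Partition keys are stored separately from the regular columns.
    columns += [(p["Name"], p["Type"]) for p in table.get("PartitionKeys", [])]
    return descriptor["Location"], columns


def fetch_table(database, table_name):
    """Fetch a table definition from the Data Catalog (requires AWS credentials)."""
    import boto3  # imported lazily so summarize_table works without boto3

    return boto3.client("glue").get_table(DatabaseName=database, Name=table_name)


# Hand-written sample shaped like a GetTable response (not real output).
sample = {
    "Table": {
        "Name": "orders",
        "StorageDescriptor": {
            "Location": "s3://example-bucket/orders/",
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "double"},
            ],
        },
        "PartitionKeys": [{"Name": "dt", "Type": "string"}],
    }
}

location, columns = summarize_table(sample)
```

With credentials configured, `summarize_table(fetch_table("sales_db", "orders"))` would apply the same logic to a live catalog entry.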
The AWS Glue Data Catalog integrates with Amazon Kinesis Data Firehose, allowing you to catalog streaming data as it arrives in real time. The schema of the incoming data stream can be inferred automatically and stored in the Data Catalog. Once the metadata is stored, it can be used to query the data in Amazon S3 with services such as Amazon Athena and Amazon Redshift Spectrum.
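Once a table exists in the Data Catalog, querying it from Amazon Athena could look roughly like the sketch below, which uses the boto3 `start_query_execution` call. The database name, table name, and results location are hypothetical, and the table is assumed to have already been cataloged.

```python
def build_athena_query(database, table, output_location, limit=10):
    """Build keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        # Athena resolves the table name through the Glue Data Catalog database.
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(params):
    """Submit the query and return its execution ID (requires AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**params)
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
params = build_athena_query(
    database="streaming_db",
    table="clickstream",
    output_location="s3://example-bucket/athena-results/",
)
```

The query runs asynchronously: `run_query(params)` returns an execution ID, and the results are written to the configured S3 output location when the query finishes.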
ML Transforms
Development Endpoint
Crawlers
DataBrew