Core ML Transforms
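- ApplyMapping: Maps source columns to target columns in a DynamicFrame using user-defined mappings, which lets you rename fields and change their data types. It is a general-purpose Glue transform rather than a machine learning one, but it is commonly used to prepare data for the transforms below.
- FindMatches: Identifies records that refer to the same entity, even when they share no unique identifier and their fields do not match exactly. It learns what counts as a match from example records that you label, and is useful for deduplication and entity resolution.
- Custom Transform: Lets you write your own transformation logic in PySpark, including machine learning steps that no pre-built transform covers.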
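To make the idea behind duplicate detection concrete, here is a minimal pure-Python sketch of near-duplicate matching. It is not the FindMatches algorithm: FindMatches trains a model from labeled examples, while this sketch only compares normalized strings with the standard library's difflib. The function names, the sample records, and the 0.85 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(record):
    """Join all field values, lower-case them, and collapse repeated whitespace."""
    joined = " ".join(str(value) for value in record.values())
    return " ".join(joined.lower().split())


def find_near_duplicates(records, threshold=0.85):
    """Return (i, j, score) for every pair of records whose similarity meets the threshold."""
    keys = [normalize(record) for record in records]
    matches = []
    for i, j in combinations(range(len(records)), 2):
        score = SequenceMatcher(None, keys[i], keys[j]).ratio()
        if score >= threshold:
            matches.append((i, j, round(score, 2)))
    return matches


# Hypothetical sample data: the first two rows differ only in case and spacing.
customers = [
    {"name": "Jane Doe", "city": "Seattle"},
    {"name": "jane  doe", "city": "Seattle"},
    {"name": "John Smith", "city": "Austin"},
]
pairs = find_near_duplicates(customers)
print(pairs)
```

Comparing every pair is quadratic in the number of records, so this approach only suits small samples; it is meant to show what "near-duplicate" means, not how to deduplicate at scale.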
Transforms for Common ML Tasks
The following operations are typically implemented with the Custom Transform and cover common ML preprocessing needs:
- Imputation: Handles missing values in datasets using strategies like mean, median, or constant imputation.
- Normalization/Standardization: Rescales features to a common range (normalization) or to zero mean and unit variance (standardization), which helps algorithms that are sensitive to feature scale.
- One-hot Encoding: Converts each categorical feature into a set of binary indicator columns, one per category, so that models requiring numerical input can use it.
- Principal Component Analysis (PCA): Reduces dimensionality by projecting the data onto the directions of greatest variance (the principal components) and keeping only the first few.
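The first three operations in this list can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, assuming small in-memory lists; inside a Glue Custom Transform the same logic would more likely be written against Spark DataFrames. The function names and sample values are made up for the example, and PCA is left out because it needs a linear algebra library.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace each None with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max(values):
    """Normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column, nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: shift to zero mean and scale to unit population std dev."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column, nothing to rescale
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted category list and one 0/1 indicator vector per value."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, vectors


# Hypothetical sample columns.
ages = impute_mean([20, None, 40])          # the gap is filled with the mean, 30
scaled = min_max(ages)                      # [0.0, 0.5, 1.0]
z_scores = standardize(ages)                # centered on 0
categories, encoded = one_hot(["red", "green", "red"])
print(ages, scaled, z_scores, categories, encoded)
```

Note that imputation runs before scaling here: the scaling functions assume no missing values, so the order of these steps matters in a real pipeline as well.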
Important Notes
- Availability: The set of available ML transforms may vary slightly by AWS Region.
- Customizability: Where no pre-built transform fits, the Custom Transform lets you implement your own machine learning logic within your data pipelines.
- Tutorials and Examples: AWS Glue provides tutorials and examples to help you get started using these transforms: https://docs.aws.amazon.com/glue/latest/dg/machine-learning-transform-tutorial.html
How to Use ML Transforms
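- Crawl your data: Create a Glue crawler to analyze your data source, infer its schema, and record it as a table in the Glue Data Catalog.
- Create an ML transform: Choose the desired transform and configure its parameters. For FindMatches, this step also includes teaching the transform with labeled example records.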
- Create a Glue job: Define a job that includes your data source, the ML transform, and a target for the transformed output.
- Run the job: Execute your Glue job to apply the transformation to your data.
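The four steps above can be sketched as a sequence of calls on a Glue client. The operation names and parameters follow the boto3 Glue client as far as I know it (create_crawler, start_crawler, create_ml_transform, create_job, start_job_run), but treat this as a sketch under assumptions rather than a reference: every resource name, ARN, path, and column name below is hypothetical, and the DryRunGlue class is a stand-in invented here so the flow can be run without AWS credentials.

```python
class DryRunGlue:
    """Hypothetical stand-in for boto3.client("glue"): records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, operation):
        def record(**kwargs):
            self.calls.append((operation, kwargs))
            # Echo back the identifiers that the pipeline reads from real responses.
            return {
                "TransformId": "tfm-dry-run",
                "Name": kwargs.get("Name", ""),
                "JobRunId": "jr-dry-run",
            }
        return record


def build_find_matches_pipeline(glue, *, role_arn, database, table,
                                s3_path, script_location, primary_key):
    """Issue the four steps in order and return the identifiers that come back."""
    # 1. Crawl the data so its schema lands in the Data Catalog.
    glue.create_crawler(
        Name="example-crawler",
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name="example-crawler")
    # A real script would wait here (for example by polling get_crawler)
    # until the crawl has finished and the table exists.

    # 2. Create the ML transform over the crawled table.
    transform = glue.create_ml_transform(
        Name="example-find-matches",
        Role=role_arn,
        InputRecordTables=[{"DatabaseName": database, "TableName": table}],
        Parameters={
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {"PrimaryKeyColumnName": primary_key},
        },
    )

    # 3. Create a job whose script reads the source, applies the transform,
    #    and writes the result to a target.
    job = glue.create_job(
        Name="example-dedupe-job",
        Role=role_arn,
        Command={"Name": "glueetl", "ScriptLocation": script_location},
    )

    # 4. Run the job.
    run = glue.start_job_run(JobName=job["Name"])
    return {
        "transform_id": transform["TransformId"],
        "job_name": job["Name"],
        "job_run_id": run["JobRunId"],
    }


dry = DryRunGlue()
result = build_find_matches_pipeline(
    dry,
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",  # hypothetical
    database="example_db",
    table="customers",
    s3_path="s3://example-bucket/customers/",
    script_location="s3://example-bucket/scripts/dedupe.py",
    primary_key="customer_id",
)
operations = [name for name, _ in dry.calls]
print(operations)
```

To try this against a real account, the assumption is that passing boto3.client("glue") in place of DryRunGlue() would work, given an IAM role with the required permissions. Passing the client in as an argument keeps the flow testable without touching AWS.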