SageMakerVariantInvocationsPerInstance is a predefined metric type used for target tracking auto-scaling of Amazon SageMaker endpoints; it corresponds to the InvocationsPerInstance CloudWatch metric that SageMaker publishes per endpoint variant. Let's break down what that means:
Understanding the Metric
- What it measures: The number of inference requests sent to an endpoint variant per minute, divided by the number of instances backing that variant; in other words, the average per-minute invocation load on each instance (see the sketch after this list).
- Instance-based scaling: Instead of looking at the entire endpoint's traffic, it focuses on the load at the individual instance level. This allows for more granular scaling adjustments.
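To see the raw numbers behind the metric, you can query it directly from CloudWatch. Here is a minimal sketch with boto3, assuming a hypothetical endpoint named my-endpoint with the default AllTraffic variant:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Pull the per-minute InvocationsPerInstance datapoints for one variant.
# "my-endpoint" and "AllTraffic" are placeholder names.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="InvocationsPerInstance",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=60,           # one datapoint per minute
    Statistics=["Sum"],  # Sum over a 1-minute period = invocations per instance that minute
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```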
Using it for Auto-Scaling
SageMakerVariantInvocationsPerInstance is typically used in Target Tracking Scaling Policies where you want your endpoint's capacity to adapt dynamically to incoming inference requests. Here's how:
- Target Value: You pick the average number of invocations per instance you consider optimal. Suppose your application works best when each instance handles around 50 requests per minute.
- Scaling policy: You create a target tracking policy (see the sketch after this list) that will:
  - Scale Out (add instances): If the average invocations per instance consistently exceeds your target value (e.g., stays above 50).
  - Scale In (remove instances): If the average invocations per instance consistently falls below your target value.
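Here is a minimal boto3 sketch of both steps: registering the variant's instance count as a scalable target, then attaching the target tracking policy. The endpoint and variant names, policy name, capacity bounds, and cooldown values are all placeholder assumptions:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Resource ID format for a SageMaker variant:
# endpoint/<endpoint-name>/variant/<variant-name>
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# 1. Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# 2. Attach a target tracking policy that holds average invocations
#    per instance near 50 per minute.
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleOutCooldown": 300,  # seconds to wait after adding instances
        "ScaleInCooldown": 600,   # longer cooldown to avoid flapping on scale-in
    },
)
```

Application Auto Scaling creates and manages the CloudWatch alarms behind a target tracking policy for you, so there is nothing else to wire up.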
Why Use This Metric
- Granular Control: Scales on the actual per-instance load rather than aggregate endpoint traffic, which leads to better resource utilization.
- Responsiveness: Keeps enough instances online to absorb traffic spikes while scaling in when demand drops, so you aren't paying for idle capacity.
- Ease of Use: As a predefined metric type, it spares you from defining a custom CloudWatch metric or managing alarms yourself.
Important Points
- Cooldown periods: Include scale-in and scale-out cooldowns in the policy (the ScaleInCooldown and ScaleOutCooldown settings in the sketch above) so that temporary traffic fluctuations don't trigger rapid, aggressive scaling.
- Combination with other metrics: Invocation count can be combined with resource metrics such as CPU or memory utilization for finer control (see the sketch after this list).
- Suitability: Because the metric averages load across all of a variant's instances, it works best when each request costs roughly the same to serve; for workloads with highly uneven per-request cost, pairing it with resource metrics gives a more faithful picture of load.
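As a sketch of that combination, you can attach a second target tracking policy to the same variant using a customized metric specification. This assumes the same hypothetical my-endpoint/AllTraffic names and uses CPUUtilization from the /aws/sagemaker/Endpoints namespace, where SageMaker publishes per-instance hardware metrics:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Second policy on the same variant, tracking CPU instead of invocations.
# Note: SageMaker's CPUUtilization is summed across cores, so it can
# exceed 100% on multi-core instances; the target below assumes a
# single-core-equivalent threshold.
autoscaling.put_scaling_policy(
    PolicyName="cpu-utilization-tracking",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "CustomizedMetricSpecification": {
            "MetricName": "CPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": "my-endpoint"},
                {"Name": "VariantName", "Value": "AllTraffic"},
            ],
            "Statistic": "Average",
        },
        "ScaleOutCooldown": 300,
        "ScaleInCooldown": 600,
    },
)
```

With multiple target tracking policies on one scalable target, Application Auto Scaling scales out if any policy calls for it, but scales in only when all policies agree, so the more conservative signal governs capacity removal.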