SageMakerVariantInvocationsPerInstance is a predefined metric type used for target tracking auto-scaling of Amazon SageMaker endpoints; it corresponds to the InvocationsPerInstance CloudWatch metric that SageMaker publishes per endpoint variant. Let's break down what that means:
Understanding the Metric
- What it measures: The number of inference requests sent to an endpoint variant per minute, divided by the number of instances backing that variant; in other words, the average per-minute invocation load on each instance (see the sketch after this list).
- Instance-based scaling: Instead of looking at the entire endpoint's traffic, it focuses on the load at the individual instance level. This allows for more granular scaling adjustments.
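To see the raw numbers behind the metric, you can query it directly from CloudWatch. Here is a minimal sketch with boto3, assuming a hypothetical endpoint named my-endpoint with the default AllTraffic variant:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Pull the per-minute InvocationsPerInstance datapoints for one variant.
# "my-endpoint" and "AllTraffic" are placeholder names.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="InvocationsPerInstance",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-endpoint"},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=60,           # one datapoint per minute
    Statistics=["Sum"],  # Sum over a 1-minute period = invocations per instance that minute
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```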
Using it for Auto-Scaling
SageMakerVariantInvocationsPerInstance is typically used in Target Tracking Scaling Policies where you want your endpoint's capacity to adapt dynamically to incoming inference requests. Here's how:
- Target Value: You pick the average number of invocations per instance you consider optimal. Suppose your application works best when each instance handles around 50 requests per minute.
- Scaling policy: You create a target tracking policy (see the sketch after this list) that will:
  - Scale Out (add instances): If the average invocations per instance consistently exceeds your target value (e.g., stays above 50).
  - Scale In (remove instances): If the average invocations per instance consistently falls below your target value.
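Here is a minimal boto3 sketch of both steps: registering the variant's instance count as a scalable target, then attaching the target tracking policy. The endpoint and variant names, policy name, capacity bounds, and cooldown values are all placeholder assumptions:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Resource ID format for a SageMaker variant:
# endpoint/<endpoint-name>/variant/<variant-name>
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

# 1. Register the variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# 2. Attach a target tracking policy that holds average invocations
#    per instance near 50 per minute.
autoscaling.put_scaling_policy(
    PolicyName="invocations-per-instance-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        "ScaleOutCooldown": 300,  # seconds to wait after adding instances
        "ScaleInCooldown": 600,   # longer cooldown to avoid flapping on scale-in
    },
)
```

Application Auto Scaling creates and manages the CloudWatch alarms behind a target tracking policy for you, so there is nothing else to wire up.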
Why Use This Metric
- Granular Control: Scales on the actual per-instance load rather than aggregate endpoint traffic, which leads to better resource utilization.
- Responsiveness: Keeps enough instances online to absorb traffic spikes while scaling in when demand drops, so you aren't paying for idle capacity.
- Ease of Use: As a predefined metric type, it spares you from defining a custom CloudWatch metric or managing alarms yourself.
Important Points
- Cooldown periods: Include scale-in and scale-out cooldowns in the policy (the ScaleInCooldown and ScaleOutCooldown settings in the sketch above) so that temporary traffic fluctuations don't trigger rapid, aggressive scaling.
- Combination with other metrics: Invocation count can be combined with resource metrics such as CPU or memory utilization for finer control (see the sketch after this list).
- Suitability: Because the metric averages load across all of a variant's instances, it works best when each request costs roughly the same to serve; for workloads with highly uneven per-request cost, pairing it with resource metrics gives a more faithful picture of load.
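As a sketch of that combination, you can attach a second target tracking policy to the same variant using a customized metric specification. This assumes the same hypothetical my-endpoint/AllTraffic names and uses CPUUtilization from the /aws/sagemaker/Endpoints namespace, where SageMaker publishes per-instance hardware metrics:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Second policy on the same variant, tracking CPU instead of invocations.
# Note: SageMaker's CPUUtilization is summed across cores, so it can
# exceed 100% on multi-core instances; the target below assumes a
# single-core-equivalent threshold.
autoscaling.put_scaling_policy(
    PolicyName="cpu-utilization-tracking",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "CustomizedMetricSpecification": {
            "MetricName": "CPUUtilization",
            "Namespace": "/aws/sagemaker/Endpoints",
            "Dimensions": [
                {"Name": "EndpointName", "Value": "my-endpoint"},
                {"Name": "VariantName", "Value": "AllTraffic"},
            ],
            "Statistic": "Average",
        },
        "ScaleOutCooldown": 300,
        "ScaleInCooldown": 600,
    },
)
```

With multiple target tracking policies on one scalable target, Application Auto Scaling scales out if any policy calls for it, but scales in only when all policies agree, so the more conservative signal governs capacity removal.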