Deploying Large Language Models with SageMaker Asynchronous Inference

LLMs continue to surge in popularity, and so do the number of ways to host and deploy them for inference. The challenges with LLM hosting are well documented, particularly due to the size of the models and the need to ensure optimal usage of the hardware they are deployed on. LLM use-cases also vary: some require real-time response times, while others have a more relaxed, near real-time latency requirement.

For the latter, and for more offline inference use-cases, SageMaker Asynchronous Inference is a great option. With Asynchronous Inference, as the name suggests, we focus on near real-time workloads where latency is not necessarily super strict, but which still require an active endpoint that can be invoked and scaled as necessary. Within LLMs specifically, these types of workloads are becoming more and more popular, with use-cases such as content editing/generation, summarization, and more. None of these workloads need sub-second responses, but they still require a timely inference that can be invoked as needed, as opposed to a fully offline option such as a SageMaker Batch Transform.

In this example, we'll take a look at how we can use the HuggingFace Text Generation Inference (TGI) server in conjunction with SageMaker Asynchronous Endpoints to host the Flan-T5-XXL model.

NOTE: This article assumes a basic understanding of Python, LLMs, and Amazon SageMaker. To get started with Amazon SageMaker Inference, I would recommend the following guide. We will cover the basics of SageMaker Asynchronous Inference, but for a deeper introduction refer to the starter example here, which we will be building on.

DISCLAIMER: I am a Machine Learning Architect at AWS and my opinions are my own.

Table of Contents

  1. When to use SageMaker Asynchronous Inference
  2. TGI Asynchronous Inference Implementation
     a. Setup & Endpoint Deployment
     b. Asynchronous Inference Invocation
     c. AutoScaling Setup

  3. Additional Resources & Conclusion

1. When to use SageMaker Asynchronous Inference

SageMaker Inference currently offers four different options depending on your use-case. There are three endpoint-driven options and one for completely offline inference:

  • Endpoint Driven Options:
  • SageMaker Real-Time Inference: For sub-second/millisecond latency and high throughput workloads. This endpoint can utilize CPU, GPU, or Inferentia Chips, and apply AutoScaling at a hardware level to scale up your infrastructure. Common use-cases include Ad-Tech based predictions, Real-Time Chatbots, and more.
  • SageMaker Serverless Inference: Best for spiky, intermittent workloads that can tolerate a cold start (which can be mitigated with Provisioned Concurrency). Here you don't manage any infrastructure behind your endpoint, and scaling is taken care of for you.
  • SageMaker Asynchronous Inference: This is the option we'll be considering today. With Asynchronous Inference you ideally have near real-time latency requirements and still define dedicated hardware for your endpoint. Unlike Real-Time Inference, however, Asynchronous Inference gives you the option to scale down to 0 instances behind an endpoint. It also provides a built-in queue to manage your requests, and you can scale depending on how full this queue is.
  • Offline Inference:
  • SageMaker Batch Transform: Best when you have a dataset and just need the outputs returned as a dataset. There is no persistent endpoint; this is completely offline inference. A common pattern is to run Batch Transform jobs on a schedule when you know you need inference conducted on your datasets at a certain cadence.

For this use-case we specifically focus on Asynchronous Inference, an option that serves as something of a marriage between Batch Transform and Real-Time Inference. Due to its near real-time capabilities and its ability to scale down to zero instances, it can serve as an efficient way to host LLMs for use-cases that might not require an immediate generation.

Examples of these use-cases include summarization, content generation, editing, and more. These use-cases may all require invocation at a variable time, which calls for a persistent endpoint, but they don't require the response time of Real-Time Inference. With Asynchronous Inference we can address these use-cases from both a performance and a cost angle.

For today's example, we'll take the popular Flan-T5-XXL model and adapt it to SageMaker Asynchronous Inference. Creating a SageMaker Asynchronous Inference endpoint is very similar to Real-Time Endpoint creation. The main difference is that invocation requires an S3 path to your input data rather than passing the payload in directly as you do with Real-Time Inference.
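
To make that difference concrete, here is a minimal Boto3 sketch contrasting the two invocation styles; the endpoint names and S3 path are hypothetical placeholders:

import boto3

runtime = boto3.client("sagemaker-runtime")

# Real-Time Inference: the payload travels in the request body
realtime_response = runtime.invoke_endpoint(
    EndpointName="my-realtime-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    Body='{"inputs": "What is the capital of the United States?"}',
)
print(realtime_response["Body"].read())

# Asynchronous Inference: the request points at an object already stored in S3
async_response = runtime.invoke_endpoint_async(
    EndpointName="my-async-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    InputLocation="s3://my-bucket/async-llm-input/input_1.jsonl",  # hypothetical input path
)
print(async_response["OutputLocation"])  # where the result will eventually be written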

General Asynchronous Architecture (By Author)

Note that within the Asynchronous Endpoint there is also an internal queue. For every inference, SageMaker queues the request and returns the output location in S3. When configuring AutoScaling for Asynchronous Endpoints you can scale depending on the number of requests within this queue. You can also optionally integrate SNS Topics to receive successful or erroneous inference notifications as opposed to directly polling from the output S3 path.
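
If you prefer notifications over polling, the asynchronous output config accepts SNS topics. Below is a minimal sketch of the equivalent low-level Boto3 call; all names and topic ARNs are placeholders, and later in this post we pass the same settings through the SageMaker Python SDK's AsyncInferenceConfig instead:

import boto3

sm_client = boto3.client("sagemaker")

# Sketch: low-level endpoint config with async output, SNS notifications, and client concurrency
sm_client.create_endpoint_config(
    EndpointConfigName="my-async-endpoint-config",  # hypothetical name
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-async-model",  # hypothetical SageMaker Model created beforehand
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            "S3OutputPath": "s3://my-bucket/async-llm-output/",  # placeholder bucket
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:us-east-1:123456789012:async-success",  # placeholder ARN
                "ErrorTopic": "arn:aws:sns:us-east-1:123456789012:async-error",  # placeholder ARN
            },
        },
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 10},
    },
)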

Now that we've got a better understanding of what Asynchronous Inference is, let's get to the implementation!

2. TGI Asynchronous Inference Implementation

We'll be working in the new SageMaker Studio environment on a Base Python 3 kernel with an ml.c5.xlarge instance. For working with Asynchronous Inference we will use the familiar Boto3 AWS Python SDK and the higher-level SageMaker Python SDK for orchestration.

a. Setup & Endpoint Deployment

With Asynchronous Inference we first need to define an output S3 path where our inference results will be stored.

import sagemaker

sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
bucket_prefix = "async-llm-output"
async_output_path = f"s3://{default_bucket}/{bucket_prefix}/output"
print(f"My model inference outputs will be stored at this S3 path: {async_output_path}")

We can then take this S3 path and specify an Asynchronous Inference configuration. In this case we don't specify an SNS topic, but you can optionally include one in case you want to be notified of your successful and erroneous invocations through that service.

from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig

async_config = AsyncInferenceConfig(
    output_path=async_output_path,
    max_concurrent_invocations_per_instance=10,
    # Optionally specify Amazon SNS topics
    # notification_config = {
    # "SuccessTopic": "arn:aws:sns:::",
    # "ErrorTopic": "arn:aws:sns:::",
    # }
)

Once this has been defined, we can grab the SageMaker deployment code directly from the HuggingFace Hub page for the Flan-T5-XXL model. This code utilizes the Text Generation Inference model server under the hood, which comes with built-in optimizations such as Tensor Parallelism.

# directly grab huggingface hub deploy code and add async config
import json
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

hub = {
   'HF_MODEL_ID':'google/flan-t5-xxl',
   'SM_NUM_GPUS': json.dumps(4)
}

huggingface_model = HuggingFaceModel(
   image_uri=get_huggingface_llm_image_uri("huggingface",version="1.1.0"),
   env=hub,
   role=role,
)

We then can deploy this SageMaker Model object along with our Asynchronous Config to create an endpoint.

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
 initial_instance_count=1,
 instance_type="ml.g5.12xlarge",
 container_startup_health_check_timeout=300,
 async_inference_config=async_config
)

In the Console/UI you can see that the Endpoint is specified to be an Asynchronous type.

Endpoint Created (Screenshot by Author)
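
If you'd rather verify this programmatically than in the console, a quick check with the Boto3 SDK (a minimal sketch, reusing the predictor from above) could look like this:

import boto3

sm_client = boto3.client("sagemaker")

# Look up the endpoint, then inspect its endpoint config for the async settings
endpoint_desc = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)
config_desc = sm_client.describe_endpoint_config(
    EndpointConfigName=endpoint_desc["EndpointConfigName"]
)
print(config_desc.get("AsyncInferenceConfig"))  # populated only for asynchronous endpoints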

b. Asynchronous Inference Invocation

To invoke the endpoint with a singular payload, you can use the same built-in ".predict" method as you would with Real-Time Inference.

# singular invocation

payload = "What is the capitol of the United States?"
input_data = {
    "inputs": payload,
    "parameters": {
        "early_stopping": True,
        "length_penalty": 2.0,
        "max_new_tokens": 50,
        "temperature": 1,
        "min_length": 10,
        "no_repeat_ngram_size": 3,
        },
}
predictor.predict(input_data)

If we want to scale Asynchronous Inference for realistic use-cases, however, we will want to invoke the endpoint with data from S3. The idea with Asynchronous Inference is that you can have multiple requests stored in your input S3 bucket, and each invocation will return a corresponding S3 output file with the result for that data point. Note once again that this is different from Batch Transform, where the entire dataset is processed at once and there is no endpoint to invoke on demand.

Here we create an artificial dataset by repeating the same data point and push it to S3 for the sake of this demo. The following code creates a local directory with your input files:

import json
import os

output_directory = 'inputs'
os.makedirs(output_directory, exist_ok=True)

for i in range(1, 20):
    json_data = [input_data.copy()]

    file_path = os.path.join(output_directory, f'input_{i}.jsonl')
    with open(file_path, 'w') as input_file:
        for line in json_data:
            json.dump(line, input_file)
            input_file.write('\n')

We use the utility function already provided in the intro notebook to upload these local files to S3 for inference.

def upload_file(input_location):
    prefix = f"{bucket_prefix}/input"
    return sagemaker_session.upload_data(
        input_location,
        bucket=default_bucket,
        key_prefix=prefix,
        extra_args={"ContentType": "application/json"} #make sure to specify
    )

sample_data_point = upload_file("inputs/input_1.jsonl")
print(f"Sample data point uploaded: {sample_data_point}")

We can then run a sample inference on this S3 path with the "invoke_endpoint_async" API call via the Boto3 SDK.

import boto3
runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint_async(
    EndpointName=predictor.endpoint_name,
    InputLocation=sample_data_point,
    Accept='application/json',
    ContentType="application/json"
)

output_location = response["OutputLocation"]
print(f"OutputLocation: {output_location}")

Once again we use a provided utility function to observe the contents of the output file once it has been generated. Note that it may take a little while for the LLM to actually perform inference and for the S3 output file to be created. Thus, in the provided function we poll the output location until there are contents present that we can display.

import urllib, time
from botocore.exceptions import ClientError

# function reference/credit: https://github.com/aws/amazon-sagemaker-examples/blob/main/async-inference/Async-Inference-Walkthrough-SageMaker-Python-SDK.ipynb
def get_output(output_location):
    output_url = urllib.parse.urlparse(output_location)
    bucket = output_url.netloc
    key = output_url.path[1:]
    while True:
        try:
            return sagemaker_session.read_s3_file(bucket=bucket, key_prefix=key)
        except ClientError as e:
            if e.response["Error"]["Code"] == "NoSuchKey":
                print("waiting for output...")
                time.sleep(60)
                continue
            raise

output = get_output(output_location)
print(f"Output: {output}")

We can then run through all our sample data points to conduct inference across all input files:

inferences = []
for i in range(1,20):
    input_file = f"inputs/input_{i}.jsonl"
    input_file_s3_location = upload_file(input_file)
    print(f"Invoking Endpoint with {input_file}")
    async_response = predictor.predict_async(input_path=input_file_s3_location)
    output_location = async_response.output_path
    print(output_location)
    inferences += [(input_file, output_location)]
    time.sleep(0.5)

for input_file, output_location in inferences:
    output = get_output(output_location)
    print(f"Input File: {input_file}, Output: {output}")

Inference (Screenshot by Author)

c. AutoScaling Setup

With Asynchronous Inference, AutoScaling is also set up via Application Auto Scaling, as it is with Real-Time Inference. The difference is that there are new metrics you can scale on.

As we've discussed, Asynchronous Inference already implements an internal queue. For AutoScaling, we can scale in and out depending on the number of items in this queue, which is captured by the CloudWatch metric "ApproximateBacklogSize". These requests are either currently being processed or yet to be processed.
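
Before wiring up the scaling policy, it can be helpful to look at this metric directly. A minimal sketch of pulling the backlog metric with the Boto3 CloudWatch client (assuming the endpoint has already received some traffic) could look like this:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Average backlog size over the last 30 minutes, in 5-minute buckets
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ApproximateBacklogSize",
    Dimensions=[{"Name": "EndpointName", "Value": predictor.endpoint_name}],
    StartTime=datetime.utcnow() - timedelta(minutes=30),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])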

We set up the policy very similarly to how we do with Real-Time Inference, once again using the Boto3 SDK. Notice that we set our minimum instance count to zero; this is only supported with Asynchronous Inference.

client = boto3.client(
    "application-autoscaling"
)  # Common class representing Application Auto Scaling for SageMaker amongst other services

resource_id = (
    "endpoint/" + predictor.endpoint_name + "/variant/" + "AllTraffic"
)  # This is the format in which application autoscaling references the endpoint

# Configure Autoscaling on asynchronous endpoint down to zero instances
response = client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=5,
)

Once we have specified the minimum and maximum instance count, we can then define the cooldown periods for both scaling out and in. Notice that here we specify the "MetricName" to be the "ApproximateBackLogSize" metric.

response = client.put_scaling_policy(
    PolicyName="Invocations-ScalingPolicy",
    ServiceNamespace="sagemaker",  # The namespace of the AWS service that provides the resource.
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
    PolicyType="TargetTrackingScaling",  # 'StepScaling'|'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # The target value for the metric. - here the metric is - SageMakerVariantInvocationsPerInstance
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": predictor.endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,  # The cooldown period helps you prevent your Auto Scaling group from launching or terminating
        # additional instances before the effects of previous activities are visible.
        # You can configure the length of time based on your instance startup time or other application needs.
        # ScaleInCooldown - The amount of time, in seconds, after a scale in activity completes before another scale in activity can start.
        "ScaleOutCooldown": 100  # ScaleOutCooldown - The amount of time, in seconds, after a scale out activity completes before another scale out activity can start.
        # 'DisableScaleIn': True|False - Indicates whether scale in by the target tracking policy is disabled.
        # If the value is true , scale in is disabled and the target tracking policy won't remove capacity from the scalable resource.
    },
)

To test AutoScaling we can send requests over a certain duration. Note that, according to our scaling policy, the target value is five requests per instance that are either being processed or waiting in the queue behind our endpoint.

request_duration = 60 * 15 # 15 minutes
end_time = time.time() + request_duration
print(f"test will run for {request_duration} seconds")
while time.time() < end_time:
    predictor.predict(input_data)

Note that after we've stopped sending requests for a while, the instance count will scale down to zero, with the queue being completely empty:

AutoScaling (Screenshot by Author)
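
A quick way to confirm the scale-in behavior without the console is to poll the endpoint description; here is a minimal sketch, reusing the predictor from above:

import boto3

sm_client = boto3.client("sagemaker")

# CurrentInstanceCount should eventually report 0 once the backlog drains
variant = sm_client.describe_endpoint(EndpointName=predictor.endpoint_name)["ProductionVariants"][0]
print(f"Current instance count: {variant.get('CurrentInstanceCount')}")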

3. Additional Resources & Conclusion

SageMaker-Deployment/LLM-Hosting/Async-Inference-LLM/async-llm-tgi.ipynb at master ·…

The code for the entire example can be found at the link above. SageMaker Asynchronous Inference is a feature that can be used for specific LLM use-cases that aren't fully Real-Time or Batch in nature. I hope this article was a useful introduction to another way to host and deploy LLMs for inference at scale. Stay tuned for more content in this area!

As always thank you for reading and feel free to leave any feedback.


If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter.
