Scalable OCR Pipelines using AWS
OCR (Optical Character Recognition) systems are often viewed as transforming a document into a computer-readable format by extracting its content. Strictly speaking, however, OCR is the process of identifying characters in an image that contains only text, so any image that contains other content, or that is rotated or skewed, requires pre- and post-processing to produce meaningful output. In practice, most images fed into an OCR pipeline are neither pure text nor correctly oriented. They often mix structured and unstructured information, with text and images appearing next to each other. The OCR pipeline must therefore be able to rotate incoming documents, identify regions of interest (RoI), detect text, perform OCR, and validate the results.
This article looks at how to structure your model pipeline to create a production-ready OCR pipeline for document text extraction.
5 Steps
The OCR process can be broken down into the following five steps:
- Rotation and Skew Correction – This step rotates and corrects the skew of images to create an upright and squared page. Classical computer vision techniques are often sufficient for this task.
- Region Detection – This step identifies regions of interest to provide deeper context and meaning to the detected text. This requires the use of an ML model.
- Text Detection – This step separates text from other images within the document to ensure that the OCR engine only processes sections containing text. This requires the use of an ML model.
- OCR – This step performs optical character recognition on the text-containing sections of the document. OCR accuracy can vary with the type of text (e.g., one-line, multi-line, numbers-only, character-only), so multiple runs using multiple OCR engines may be necessary. This requires a set of ML models.
- Validation – This step adds business context to the downstream services to provide end-users with insights into the performance of the OCR pipeline, potential issues, and where to focus additional attention. A basic validation process can be established using manual rules, but more advanced options exist using statistical analysis and ML models.
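As a minimal sketch of the rule-based validation mentioned in the last step, the snippet below checks extracted fields against manual rules. The field names (`total`, `date`), the regular expressions and the confidence threshold are all assumptions made for illustration; they do not come from any specific OCR engine.

```python
import re
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    text: str
    confidence: float  # assumed to be reported by the OCR engine, in [0, 1]


# Hypothetical manual rules: one regular expression per expected field.
RULES = {
    "total": re.compile(r"\d+\.\d{2}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}
MIN_CONFIDENCE = 0.80  # assumed threshold, to be tuned per use case


def validate(fields):
    """Return human-readable issues; an empty list means every rule passed."""
    issues = []
    for field in fields:
        rule = RULES.get(field.name)
        if rule is not None and not rule.fullmatch(field.text):
            issues.append(f"{field.name}: unexpected format {field.text!r}")
        if field.confidence < MIN_CONFIDENCE:
            issues.append(f"{field.name}: low confidence {field.confidence:.2f}")
    return issues
```

The returned issues could be attached to the pipeline output so that end-users know where to focus additional attention.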
With these five steps in place, a fully functioning pipeline can be packaged as a system component.
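The five steps can be sketched as a chain of functions that each take and return a document object. This is only an illustrative skeleton: the `Document` fields and the stub steps are assumptions, and each stub would be replaced by a real implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    image: bytes
    text: str = ""
    issues: List[str] = field(default_factory=list)
    trace: List[str] = field(default_factory=list)  # names of the steps that ran


Step = Callable[[Document], Document]


def make_stub(name: str) -> Step:
    """Create a placeholder step that only records that it ran."""
    def step(doc: Document) -> Document:
        doc.trace.append(name)
        return doc
    return step


# The five steps in the order described above.
STEP_NAMES = ("rotation", "region_detection", "text_detection", "ocr", "validation")
PIPELINE: List[Step] = [make_stub(name) for name in STEP_NAMES]


def run_pipeline(doc: Document, steps: List[Step] = PIPELINE) -> Document:
    """Pass the document through every step in order."""
    for step in steps:
        doc = step(doc)
    return doc
```

Keeping every step behind the same small interface is what makes it possible to swap or re-order individual steps later.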
OCR Pipeline Components
Before diving into the pipeline implementation, it is important to design the system so that it is decoupled from existing upstream and downstream systems. This lets engineers and researchers iterate on and adjust the pipeline quickly without depending on other components (or teams), which is very important in the initial stages of building out an OCR pipeline. The key is to separate the system into a component with a clear API for upstream and downstream services.
There are a few ways to do this, but one pattern that works well with other AWS services and architecture patterns is the asynchronous event-driven architecture. It allows for scalable, testable and maintainable components that are largely independent of upstream and downstream services. Teams can then operate independently, which greatly increases development speed and makes the system resilient from the start.

To trigger the OCR component efficiently, we use Amazon EventBridge rules to pick up events that upstream producers put on the event bus. The component then processes the requests and emits events, in a fan-out pattern, for downstream consumers. This event-driven pattern suits research-driven development such as OCR pipelines, because the internal architecture can change without impacting external services or forcing complex cross-team coordination during API migrations.
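A rough sketch of what the rule pattern and an emitted event could look like is shown below. The bus name, sources and detail types are hypothetical, and the `matches` helper is a simplified local stand-in for EventBridge's own pattern matching, which supports many more operators.

```python
import json

# Hypothetical names, chosen only for illustration.
EVENT_BUS = "documents-bus"

# Event pattern for the EventBridge rule that triggers the OCR component.
OCR_RULE_PATTERN = {
    "source": ["com.example.upload"],
    "detail-type": ["DocumentUploaded"],
}


def build_result_entry(document_id: str, text: str) -> dict:
    """Build one entry for the EventBridge PutEvents API announcing an OCR result."""
    return {
        "EventBusName": EVENT_BUS,
        "Source": "com.example.ocr",
        "DetailType": "OcrCompleted",
        "Detail": json.dumps({"documentId": document_id, "text": text}),
    }


def matches(pattern: dict, event: dict) -> bool:
    """Simplified check: every pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


# Publishing with boto3 would look roughly like this (not executed here):
#   boto3.client("events").put_events(Entries=[build_result_entry("doc-1", "...")])
```

Downstream consumers would each register their own rule on `OcrCompleted` events, which is what gives the fan-out behaviour.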
Using an event-driven architecture also makes it easy to add component tests to the OCR pipeline. By emitting "test" events, the component can be exercised even in production without affecting real production requests. The resulting test events can be picked up by existing test infrastructure and ignored by other downstream production systems. The results of a test run can then be compared against a ground-truth dataset, producing an analysis that gives engineers, researchers and stakeholders deep insight into model performance in any environment (dev, pre-prod, production, etc.). This is crucial for understanding whether the full OCR pipeline is operating as expected and for continuously iterating to improve its performance.
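One possible way to compare test-run output against ground truth is the character error rate (CER), sketched below together with a helper that recognises test events. The `isTest` flag is an assumed convention, not something the pipeline above prescribes, and CER is just one of several metrics that could be used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b), # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(predicted: str, truth: str) -> float:
    """Edit distance normalised by the length of the ground-truth text."""
    if not truth:
        return 0.0 if not predicted else 1.0
    return levenshtein(predicted, truth) / len(truth)


def is_test_event(event: dict) -> bool:
    """Assumed convention: test requests carry detail.isTest == True."""
    return bool(event.get("detail", {}).get("isTest", False))
```

Production consumers would skip any event for which `is_test_event` is true, while the test infrastructure would aggregate the error rate over the whole ground-truth dataset.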
Because the OCR component is expected to evolve, we have built the main component infrastructure around an AWS Step Functions state machine. This provides fine-grained control over the orchestration of the different services used in the OCR pipeline. The development team can A/B test, swap out or adjust internal services without impacting external services. Step Functions also comes with built-in retries and monitoring, making it easier to debug and monitor the OCR pipeline in both development and production environments.
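To make the orchestration more concrete, here is a sketch that generates an Amazon States Language definition with one task per pipeline step and a retry policy on each. The state names, account ID, region and function naming scheme are hypothetical, and the retry values are arbitrary examples.

```python
import json

STEPS = ["Rotation", "RegionDetection", "TextDetection", "Ocr", "Validation"]


def lambda_arn(step: str) -> str:
    # Hypothetical account, region and function naming scheme.
    return f"arn:aws:lambda:eu-west-1:123456789012:function:ocr-{step}"


def build_state_machine(steps=STEPS) -> dict:
    """Chain the steps as sequential Task states, each with a retry policy."""
    states = {}
    for index, step in enumerate(steps):
        state = {
            "Type": "Task",
            "Resource": lambda_arn(step),
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 2,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
        }
        if index + 1 < len(steps):
            state["Next"] = steps[index + 1]  # hand over to the following step
        else:
            state["End"] = True               # last step terminates the execution
        states[step] = state
    return {"Comment": "OCR pipeline sketch", "StartAt": steps[0], "States": states}


# JSON document that could be supplied as the state machine definition.
definition = json.dumps(build_state_machine(), indent=2)
```

Swapping out a service then amounts to changing the `Resource` of a single state, which leaves the component's external events untouched.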
1 – Lambda inference
Diving further into the OCR pipeline, the step function is responsible for orchestrating the pipeline micro-services. To allow multiple engineers and researchers to operate on the pipeline at the same time, it is sensible to separate the pipeline into the 5 steps presented earlier in this article (rotation, region detection, text detection, OCR, and validation). This makes it easy to simultaneously improve single operations of the service while sustaining an easy overview of the services.
To implement this, a simple first version of the pipeline is built using Lambdas that collectively perform the required OCR tasks (see diagram above). For someone with previous experience of running ML model inference, using Lambdas might seem counterintuitive. However, since AWS added the capability to run Docker container images in Lambda (allowing up to 10 GB of memory and any custom environment the ML inference framework needs), this has proven to be a reasonable option: it is easy to set up and has low running and maintenance costs.
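A containerized inference Lambda could be structured roughly as follows. The model class is a stub standing in for a real framework, and the `imageUri` event field is an assumption. The point of the sketch is the caching pattern: the model is loaded once per container and reused by later (warm) invocations.

```python
import json

_MODEL = None  # cached across invocations while the Lambda container stays warm


class StubTextDetector:
    """Stand-in for a real model that would be loaded from the container image."""

    def predict(self, image_uri: str) -> list:
        return [{"box": [0, 0, 10, 10], "source": image_uri}]


def get_model():
    """Load the model on first use only; cold starts pay this cost once."""
    global _MODEL
    if _MODEL is None:
        _MODEL = StubTextDetector()
    return _MODEL


def handler(event, context):
    """Lambda entry point; `imageUri` is an assumed field of the incoming event."""
    boxes = get_model().predict(event["imageUri"])
    return {"statusCode": 200, "body": json.dumps({"boxes": boxes})}
```

Loading the model lazily rather than inside the handler body keeps warm invocations fast, although the first request after a cold start still pays the full loading time.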
Cons
Using Lambdas for model inference gives a very scalable system design and lets engineers build a production-ready pipeline quickly. There are a few drawbacks to this design, though:
- Lambdas do not support GPUs
- It is hard to execute the full pipeline locally to explore new pipeline configurations
- Higher latency due to Lambda cold starts
- Models are limited by the 10 GB maximum container image size per Lambda
Pros
If none of these drawbacks is critical for the pipeline, this option is worth considering, as it has several advantages over the other options discussed later:
- Cost efficient
- No stand-by costs (i.e. the costs are directly related to the number of requests that occur to the pipeline)
- Cheap to build
- Easy to maintain
- Native resilience, traceability and reproducibility
- High throughput
So while this approach has a few (sometimes critical) drawbacks, it can be a powerful way to get an operational production pipeline in place quickly, one that can easily be improved and extended beyond the basics over time.
2 – SageMaker endpoints

If the limitations of the above design are not acceptable, one must dive deeper into the requirements to find a final solution. This might require self-managed instances, such as EC2, to gain access to GPU hardware, lower latency and larger model artefacts. This comes at a higher maintenance cost, however, as scaling, uptime and running costs become harder to manage and keep under control. Thankfully, some of the extra maintenance and complexity can be avoided by using SageMaker's inference services, of which three options are considered here:
- Real-time inference
- Async inference
- Serverless inference
While all three options are easier to maintain than a self-managed EC2 instance, they serve different use cases, and it is easy to start with one endpoint type and switch to another later. It is tempting to jump straight to the SageMaker serverless option, but it suffers from the same issues as the simpler Lambda container option (cold starts, CPU only and a maximum container image size of 10 GB) while being more expensive than a Lambda. Therefore, if only CPUs are needed, the Lambda option is still recommended. If GPUs are needed, one must choose between the async and real-time inference options. In terms of cost, the async option is often the better starting point where the use case allows it. More details on the endpoint types are available in the SageMaker inference docs.
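The guidance above can be condensed into a small decision helper. This is only a paraphrase of the reasoning in this article, with made-up option labels; real decisions would also weigh cost, payload sizes and traffic patterns.

```python
def choose_inference_option(needs_gpu: bool, needs_low_latency: bool) -> str:
    """Pick a starting point following the article's rule of thumb."""
    if needs_low_latency:
        # Cold starts rule out Lambda and SageMaker serverless.
        return "sagemaker-real-time"
    if needs_gpu:
        # GPU without strict latency needs: async is usually cheaper to start with.
        return "sagemaker-async"
    # CPU only and cold starts acceptable: the Lambda container option.
    return "lambda-container"
```

For example, the rotation service in this article (CPU only, no strict latency need) would map to `"lambda-container"`, while a GPU-bound OCR engine without strict latency needs would map to `"sagemaker-async"`.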
When designing the system with SageMaker endpoints in mind (see above diagram), one must carefully choose which services need GPUs, lower latency and/or a stable load, and move only those into this more expensive solution. The classical computer vision service "rotation" and the validation service can both continue to run inside Lambdas, as in this example they only require CPU. Region detection, text detection and the OCR engine are heavier processes that would benefit from GPUs and are therefore moved into separate SageMaker endpoints. Each endpoint (region detection, text detection, OCR engine) requires a container image and a model artefact S3 URI, which allows it to fetch the model artefact from S3 on startup.
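A sketch of how one such endpoint could be created from a container image and a model artefact URI is shown below. The `EndpointSpec` type, the default instance type and the naming scheme are assumptions. `sm_client` is expected to be a boto3 SageMaker client (`boto3.client("sagemaker")`), passed in so that the function can also be exercised with a fake client.

```python
from dataclasses import dataclass


@dataclass
class EndpointSpec:
    name: str
    image_uri: str        # container image holding the inference code
    model_data_url: str   # S3 URI of the model artefact
    instance_type: str = "ml.g4dn.xlarge"  # assumed GPU instance type


def deploy(sm_client, spec: EndpointSpec, role_arn: str) -> None:
    """Create model, endpoint config and endpoint for one pipeline service."""
    sm_client.create_model(
        ModelName=spec.name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={"Image": spec.image_uri, "ModelDataUrl": spec.model_data_url},
    )
    sm_client.create_endpoint_config(
        EndpointConfigName=f"{spec.name}-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": spec.name,
            "InstanceType": spec.instance_type,
            "InitialInstanceCount": 1,
        }],
    )
    sm_client.create_endpoint(
        EndpointName=spec.name,
        EndpointConfigName=f"{spec.name}-config",
    )
```

The same function would be called three times, once each for region detection, text detection and the OCR engine, so that each service gets its own endpoint.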
For this architecture, one must design and maintain the scaling of the services more carefully to balance throughput and latency against cost. To help with this, SageMaker offers autoscaling features that allow a level of fine-tuning that is not available for Lambdas. As GPUs can be extremely expensive, monitoring and fine-tuning these autoscaling policies over time can be crucial for meeting the service SLA while keeping costs down.
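As an illustration, the following sketch builds the two requests needed for a target-tracking autoscaling policy on an endpoint variant through Application Auto Scaling. The capacity limits, target value and cooldowns are placeholder numbers that would need tuning against real traffic.

```python
def autoscaling_requests(endpoint: str, variant: str = "AllTraffic",
                         min_instances: int = 1, max_instances: int = 4,
                         invocations_per_instance: float = 50.0):
    """Build the scalable-target and scaling-policy requests for one endpoint variant."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint}-invocations",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleOutCooldown": 60,   # seconds to wait before adding more instances
            "ScaleInCooldown": 300,   # seconds to wait before removing instances
        },
    }
    return target, policy


# Applying them with boto3 would look roughly like this (not executed here):
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**target)
#   client.put_scaling_policy(**policy)
```

A longer scale-in cooldown than scale-out cooldown is a common choice for GPU instances, since removing capacity too eagerly can cause repeated slow start-ups.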
Using SageMaker async or real-time endpoints solves two apparent issues with the Lambda inference approach, (1) GPU inference and (2) latency, but inherently adds complexity and cost compared to Lambdas. This option is valid when GPUs or low latency are required. SageMaker serverless, by contrast, behaves much like a wrapper around containerized Lambdas and is therefore best avoided here.
Pros
- GPU support
- No cold starts, allowing for lower latency
Cons
- Expensive (compared to CPUs)
- More complex to build, compared to lambdas
3 – SageMaker Triton Server
The primary challenge for researchers with the above examples is the difficulty of easily running the pipeline locally. This problem becomes increasingly significant as the pipeline matures, and a change in an ML model can have an impact on the whole OCR pipeline performance.
For instance, consider a scenario where one wants to re-train the rotation service. Such a change could improve the performance of that service in isolation, but could have adverse effects on the text recognition service, as the output of the rotation algorithm would be different. The hard part is detecting this kind of data drift in any of the services in the full pipeline. To do so, one must run the pipeline in its entirety, measuring the performance of each service as well as that of the full OCR component. There are multiple ways to do this, but the most convenient one is to let researchers perform some of these tests locally.
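A simple way to surface such side effects is to compare per-service and end-to-end metrics from a baseline run with those from a run using the re-trained model. The sketch below assumes that every metric is "higher is better"; the metric names and tolerance are hypothetical.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.01) -> list:
    """Return the names of metrics that dropped by more than `tolerance`.

    A metric missing from `candidate` is treated as 0.0 and therefore reported.
    """
    return sorted(
        name for name, value in baseline.items()
        if candidate.get(name, 0.0) < value - tolerance
    )
```

With such a check, an improved rotation score combined with lower OCR and end-to-end scores would be flagged, which is exactly the situation described above.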
To solve this problem, inference graphs are a great tool. On AWS, this is possibly best done via the NVIDIA Triton inference container image served via SageMaker. By using Triton's ability to run multiple inference frameworks and to build ensembles (pipelines) of different models, one can build a powerful system that can be executed both locally and in production. Triton also adds convenient features for batching, GPU optimization, etc. that can be useful in advanced production systems with complex throughput and latency requirements.
An ensemble step in Triton can contain both ML models and regular pre/post-processing code, allowing for powerful configurations in which business logic is embedded together with the ML models themselves. This allows our design to merge the three steps that require GPUs (region detection, text detection and OCR engine) into one. The same model artefacts used in the previous designs can continue to be used (assuming they use a framework supported by Triton), but we need to add an ensemble configuration that describes which models the Triton server should load on start and how to run each model. For more details, and for how to add Triton to SageMaker, see the AWS documentation.
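The way an ensemble wires models together can be mimicked in a few lines of plain Python. This is not Triton's API or configuration format: it only imitates the idea of mapping each model's named inputs and outputs to shared ensemble tensors, using stub models and made-up tensor names.

```python
# Stub "models": each maps named input tensors to named output tensors.
def region_detection(inputs):
    return {"REGIONS": [f"region-of-{inputs['IMAGE']}"]}


def text_detection(inputs):
    return {"BOXES": [f"box-in-{region}" for region in inputs["REGIONS"]]}


def ocr_engine(inputs):
    return {"TEXT": [f"text-from-{box}" for box in inputs["BOXES"]]}


# Each step: (model, input_map, output_map).
# input_map:  model input name  -> ensemble tensor name
# output_map: model output name -> ensemble tensor name
ENSEMBLE = [
    (region_detection, {"IMAGE": "image"}, {"REGIONS": "regions"}),
    (text_detection, {"REGIONS": "regions"}, {"BOXES": "boxes"}),
    (ocr_engine, {"BOXES": "boxes"}, {"TEXT": "text"}),
]


def run_ensemble(image, steps=ENSEMBLE):
    """Run the steps in order, passing data through shared ensemble tensors."""
    tensors = {"image": image}
    for model, input_map, output_map in steps:
        outputs = model({name: tensors[source] for name, source in input_map.items()})
        for name, destination in output_map.items():
            tensors[destination] = outputs[name]
    return tensors["text"]
```

Because the intermediate tensors never leave the server in a real ensemble, the three GPU-bound models can exchange data without extra network hops between endpoints.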

So far, the rotation and validation services have been kept separate. As they only require CPU (in this example), they are not required to be merged into the Triton ensemble. Instead, these services could be converted into sharable code packages that can be used locally to build a simple pipeline shell around the full pipeline.
The same autoscaling capabilities described in the endpoint design can be used to scale the Triton servers up and down, but one should bear in mind that the larger the models are, and the more models are placed on a single machine, the slower scaling becomes. This is because each new instance that is spun up has to download all three model artefacts from S3 rather than just one, whereas in the earlier designs each model server could scale independently, downloading only a single model from S3 at a time.
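A back-of-the-envelope calculation illustrates the effect. The artefact sizes and bandwidth below are invented for the example, and the estimate ignores container pull and model loading time, which add to the total in practice.

```python
def artefact_download_seconds(sizes_gb, bandwidth_gbit_per_s: float) -> float:
    """Time to fetch all artefacts one after another at a fixed bandwidth."""
    total_gbit = sum(sizes_gb) * 8  # gigabytes to gigabits
    return total_gbit / bandwidth_gbit_per_s


# Hypothetical artefact sizes in GB: region detection, text detection, OCR engine.
MODEL_SIZES_GB = [2.0, 1.0, 3.0]

# A Triton instance must fetch all three artefacts before it can serve traffic.
triton_seconds = artefact_download_seconds(MODEL_SIZES_GB, bandwidth_gbit_per_s=5.0)

# A dedicated endpoint only fetches its own artefact (here: the largest one).
single_seconds = artefact_download_seconds([3.0], bandwidth_gbit_per_s=5.0)
```

Under these assumed numbers, the combined server spends twice as long downloading artefacts as the slowest single-model endpoint, and the gap grows with every model added to the ensemble.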
Pros
- Optimized GPU support
- Fastest inference out of the 3 options
- Easier to run locally
Cons
- Most complex solution
- Expensive (compared to CPUs)
- Slower scaling capabilities
Conclusion
This article has gone through three example architecture designs that can be used separately or in combination to build a production-ready OCR pipeline. While Lambdas win on ease of use, they quickly stop being a viable option when GPUs are needed or when cold starts violate latency requirements. This is where SageMaker does a great job, allowing for more advanced infrastructure with GPU support, at the cost of extra complexity. SageMaker offers both real-time and async inference and allows custom container images to be executed, giving engineers the power to heavily customize both the infrastructure and the environment in which the pipeline runs. This makes it possible to use the mature NVIDIA Triton environment, which enables local execution of the pipeline while adding an extra layer of optimization on top of the GPU hardware.
None of these options is a silver bullet. Instead, they are meant as inspiration for a final system design, which might use one of these patterns as a whole or mix parts of several.

