Performing OCR Tasks with Claude 3 Haiku (Part 1)

This blog is a multi-part series covering OCR using Bedrock and Claude 3 Haiku. In part 1, we will explore a proof-of-concept OCR project that utilises the vision capabilities of Anthropic’s Claude 3 Haiku to extract key information from invoices. 

History

Back in March 2024, Anthropic announced the release of its next-generation AI model, Claude 3, which included three advanced models optimised for different use cases: Haiku, Sonnet and Opus. These models offered improved performance, accuracy and reliability, making them suitable for various applications, including enterprise use cases. 

Quick Overview of the Claude 3 Model Family

The Claude 3 models have been trained on a diverse range of data formats, including language, images, charts and diagrams, allowing them to understand and reason over multimodal content. This capability enables businesses to build generative AI applications that integrate diverse data sources and solve complex, cross-domain problems. 

The Claude 3 model family offers several features: 

  • Haiku: Fastest and most cost-effective model, ideal for near-instant responsiveness 
  • Sonnet: 2x faster than Claude 2 and 2.1, with higher intelligence, striking an ideal balance between intelligence and speed 
  • Opus: Most advanced model, with deep reasoning, advanced math and coding abilities, and top-level performance on complex tasks 
  • Vision capabilities: Understands structured and unstructured data across different formats, including language, images, charts and diagrams 

Claude 3 models are easily accessible to customers via Amazon Bedrock to build scalable, secure and personalised AI applications. 
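
As a quick sanity check before building anything, you can list the Anthropic models available to your account via the Bedrock control-plane API (a minimal sketch, assuming your AWS credentials and Bedrock model access are already set up):

import boto3

# Control-plane client ("bedrock"), distinct from the "bedrock-runtime"
# client used later to invoke models
bedrock = boto3.client("bedrock", region_name="us-east-1")

models = bedrock.list_foundation_models(byProvider="Anthropic")
for summary in models["modelSummaries"]:
    print(summary["modelId"])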

OCR Use Case

Since Large Language Models (LLMs) entered the mainstream, Optical Character Recognition (OCR) technology has made significant strides. This is where the Claude 3 model family shines, as these models can interpret complex visual data. 

Let us have a look at our sample invoice document saved as an image (*.jpg or *.png): 

Looking at the image above, we want to build an automated process to extract key information from our document such as: 

  • INVOICE_NUMBER 
  • CUSTOMER_NUMBER 
  • CUSTOMER_NAME 
  • VENDOR_NAME 
  • INVOICE_AMOUNT 
  • INVOICE_DATE 
  • VENDOR_ADDRESS 

A solution can be implemented as follows: 

import boto3
import re
import json
import base64
import pprint
from litellm import completion

s3 = boto3.client("s3")

# Create a Bedrock runtime client from a named AWS profile
dev_session = boto3.Session(profile_name="cevo-dev")
bedrock = dev_session.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",
)


def encode_image(image_path):
    # Read the image file and return its contents as a base64-encoded string
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


image_path = "./images/invoice_image.jpg"
base64_image = encode_image(image_path)
 

First, we establish a session with Amazon Bedrock using boto3 and configure a Bedrock runtime client; the litellm Python library will use this client to call the model. We also define a helper function that encodes the raw image into a base64 string, which is the format the API expects for image inputs. 

Next, we create the prompt template to be sent to Claude 3 Haiku. 

image_path = "./images/26_png_jpg.rf.2318e3f83a44413b5b855925e571a538.jpg"
base64_image = encode_image(image_path)
resp = completion(
    model="bedrock/anthropic.claude-3-haiku-20240307-v1:0",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": """
                    Your purpose is to analyze invoices.
                    Create a structured set of data in json format providing key information about an invoice.
                    Do not return any narrative language.
                    JSON fields must be labelled as
                    - INVOICE_NUMBER,
                    - CUSTOMER_NUMBER,
                    - CUSTOMER_NAME,
                    - VENDOR_NAME,
                    - INVOICE_AMOUNT (show currency),
                    - INVOICE_DATE (show in this format YYYY-MM-DD),
                    - VENDOR_ADDRESS.
                    Example json structure is:
                    <json>
                    {
                        "INVOICE_NUMBER": "Invoice number",
                        "CUSTOMER_NUMBER": "Customer Number",
                        "CUSTOMER_NAME": "Customer Name",
                        "VENDOR_NAME": "Vendor Name",
                        "INVOICE_AMOUNT": "Invoice Amount (show currency)",
                        "INVOICE_DATE": "Invoice Date",
                        "VENDOR_ADDRESS": "Vendor Address"
                    }
                    </json>

                    Output the json structure as a string starting with <json> and ending with </json> XML tags.
                    Do not return any narrative language. Look at the image in detail, looking for invoice number,
                    customer number, customer name, vendor name,
                    invoice amount, invoice date, and vendor address; where possible try to identify them.

                    IF YOU COULD NOT FIND THE RIGHT INFORMATION JUST RETURN THIS VALUE "UNK".
                """},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/jpeg;base64," + base64_image
                    },
                },
            ],
        }
    ],
    aws_bedrock_client=bedrock,
)

Let’s make some key observations from our prompt: 

  • We have specified the fields we want to extract. 
  • We have specified the format we expect for some of the keys (e.g. INVOICE_DATE). 
  • We have enforced outputs to be in JSON format. 
  • We have instructed the model to produce UNK in case it cannot return the requested information from the invoice. 

Running our code will produce: 

pp = pprint.PrettyPrinter(indent=4)
pp.pprint(str(resp))

("ModelResponse(id='chatcmpl-51ae277d-4cac-4494-b18b-ce0df4eaf0c6', "
 "choices=[Choices(finish_reason='stop', index=0, "
 'message=Message(content=\'<json>\\n{\\n "INVOICE_NUMBER": "13579",\\n '
 '"CUSTOMER_NUMBER": "UNK",\\n "CUSTOMER_NAME": "CHAD THE CLIENT",\\n '
 '"VENDOR_NAME": "THE WRITE LIFE",\\n "INVOICE_AMOUNT": "$650",\\n '
 '"INVOICE_DATE": "2020-07-01",\\n "VENDOR_ADDRESS": "P.O. Box 12345, '
 'Anywhere, US 12345"\\n}\\n</json>\', role=\'assistant\'))], '
 "created=1713790938, model='anthropic.claude-3-haiku-20240307-v1:0', "
 "object='chat.completion', system_fingerprint=None, "
 'usage=Usage(prompt_tokens=989, completion_tokens=122, total_tokens=1111), '
 "finish_reason='end_turn')")

The next step is to extract the JSON-shaped output from the result above, for which we can create a custom function: 

def extract_json_from_string(input_string):
    # Define a regular expression pattern to find the JSON object
    # between the <json> and </json> tags
    pattern = r"<json>(.*?)</json>"
    # Search for the JSON pattern in the input string
    match = re.search(pattern, input_string, re.DOTALL)
    if match:
        # Extract the JSON string from the matched pattern
        json_string = match.group(1)
        # Clean up escape sequences left over from the stringified response
        cleaned_string = json_string.strip().replace("\n", "").replace("/", "-")
        cleaned_string = cleaned_string.strip("'").replace("\\n", "").replace("\\", "")
        # Convert the string to a Python dictionary
        try:
            data_dict = json.loads(cleaned_string)
            return data_dict
        except json.JSONDecodeError:
            print("Error: Unable to convert the string to a dictionary.")
    else:
        return None

# Extract JSON object from the input string
json_dict = extract_json_from_string(str(resp))

# Print the extracted JSON object
print(json.dumps(json_dict, indent=4))

{
    "INVOICE_NUMBER": "13579",
    "CUSTOMER_NUMBER": "UNK",
    "CUSTOMER_NAME": "CHAD THE CLIENT",
    "VENDOR_NAME": "THE WRITE LIFE",
    "INVOICE_AMOUNT": "$650",
    "INVOICE_DATE": "2020-07-01",
    "VENDOR_ADDRESS": "P.O. Box 12345, Anywhere, US 12345"
}
Comparing our results to the actual image with highlighted fields: 

Claude 3 was able to extract the relevant fields and to flag, with UNK, where the information was missing or ambiguous (in this case, the customer number). 
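
Because the model is instructed to return UNK for anything it cannot find, it is easy to validate the extracted dictionary before passing it downstream. Here is a minimal sketch (the validate_invoice helper is ours, not part of the original solution):

EXPECTED_FIELDS = [
    "INVOICE_NUMBER", "CUSTOMER_NUMBER", "CUSTOMER_NAME",
    "VENDOR_NAME", "INVOICE_AMOUNT", "INVOICE_DATE", "VENDOR_ADDRESS",
]

def validate_invoice(data_dict):
    # Fields that are absent from the response entirely
    missing = [f for f in EXPECTED_FIELDS if f not in data_dict]
    # Fields the model explicitly could not find
    unknown = [f for f in EXPECTED_FIELDS if data_dict.get(f) == "UNK"]
    return missing, unknown

missing, unknown = validate_invoice(json_dict)
print("Missing fields:", missing)          # []
print("Fields returned as UNK:", unknown)  # ['CUSTOMER_NUMBER']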

How Much Does This Cost?

For a very long time, Amazon Textract was considered the de facto option for anything OCR. Using machine learning, it can extract text, handwriting, layout elements and data from scanned documents. It goes beyond simple optical character recognition to identify, understand and extract specific data from documents.  
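
For context, querying the same fields with Textract might look like the following sketch (the query texts are ours, and the image path is assumed from earlier):

import boto3

textract = boto3.client("textract", region_name="us-east-1")

# Read the same invoice image as raw bytes
with open("./images/invoice_image.jpg", "rb") as f:
    image_bytes = f.read()

# Ask Textract natural-language queries for the fields we care about
response = textract.analyze_document(
    Document={"Bytes": image_bytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [
            {"Text": "What is the invoice number?"},
            {"Text": "What is the invoice date?"},
            {"Text": "What is the total invoice amount?"},
        ]
    },
)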

One of the main challenges with Amazon Textract is the cost of using the service. So, for that matter, let us evaluate an estimated cost of our solution with Textract versus Claude 3 Haiku: 

Bedrock Claude 3 Haiku (pricing) 

Using the token counts from the usage block of our run above (989 input tokens and 122 output tokens): 

cost = 989/1000 * $0.00025 + 122/1000 * $0.00125 ≈ $0.0004 per invoice 

If we have 1,000 invoices, our total cost will be around $0.40. 

Amazon Textract (pricing) 

cost = price for a page with tables and queries ($0.020) 

If we have 1,000 invoices, our total cost will be $20. 
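
The same arithmetic as a short script (a sketch only; the per-unit prices are the published figures at the time of writing and should be verified for your region):

# Per-1K-token prices for Claude 3 Haiku and per-page price for Textract
HAIKU_INPUT_PRICE = 0.00025    # USD per 1K input tokens
HAIKU_OUTPUT_PRICE = 0.00125   # USD per 1K output tokens
TEXTRACT_PAGE_PRICE = 0.020    # USD per page with tables and queries

# Token counts taken from the usage block of our run above
input_tokens, output_tokens = 989, 122
haiku_per_invoice = (input_tokens / 1000) * HAIKU_INPUT_PRICE \
    + (output_tokens / 1000) * HAIKU_OUTPUT_PRICE

invoices = 1000
print(f"Claude 3 Haiku: ${haiku_per_invoice * invoices:.2f}")   # ~$0.40
print(f"Textract:       ${TEXTRACT_PAGE_PRICE * invoices:.2f}") # $20.00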

Closing Remarks

We are still in the early days of generative AI, but we can already see countless opportunities in these foundation models, and Claude 3 Haiku is a great example. Strong collaboration and a focus on innovation across industries will usher in a new era of generative AI. We cannot wait to see what customers build next. 

In part 2 of this series, we will explore a direct and deeper comparison between building an OCR solution using Claude 3 Haiku and Amazon Textract. 
