How to Build a B2B Document Extractor: Rule-Based vs. LLM Approaches

Introduction

Extracting structured data from B2B documents—such as purchase orders, invoices, or delivery notes—is a common challenge. Two primary approaches exist: a traditional rule-based method using pytesseract for OCR and regex for parsing, and a modern LLM-based method using Ollama with LLaMA 3. This guide walks you through building both versions of the same document extractor, comparing their strengths and tradeoffs using a realistic B2B order scenario. By the end, you'll be able to choose the right approach for your own projects.

Source: towardsdatascience.com

What You Need

- Python 3.x with the pytesseract, pdf2image, Pillow, and requests packages
- The Tesseract OCR engine installed on your system
- Ollama installed and running locally, with the llama3 model pulled
- A few scanned B2B PDFs (purchase orders, invoices, or delivery notes) to test with

Step-by-Step Instructions

Step 1: Set Up the Environment

Create a new Python virtual environment and install all required packages:

pip install pytesseract pdf2image Pillow requests

Ensure Tesseract OCR is installed globally (sudo apt install tesseract-ocr on Linux, or download the Windows installer). Also install and start Ollama, then pull the LLaMA 3 model:

ollama pull llama3

Step 2: Convert PDF to Images

B2B documents are often scanned PDFs. Use pdf2image to turn each page into a PNG image. Write a function that converts every page of a PDF to a PNG, saves the images to disk, and returns their file paths.

Step 3: Perform OCR with pytesseract

For each image, call pytesseract.image_to_string() to extract raw text. This step is identical for both rule-based and LLM approaches, as they both need the text first. Store the extracted text per page.

Step 4: Build the Rule-Based Extractor

Use regular expressions and string logic to locate fields like Order Number, Date, Client Name, and Line Items.
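A minimal rule-based extractor might look like the sketch below. The field labels ("Order Number", "Date", "Client") and the line-item layout ("item xQTY @ PRICE") are assumptions about the document format and will need adjusting to your actual templates.

```python
import re


def extract_fields(text):
    """Pull structured fields out of OCR text using fixed patterns."""
    fields = {}

    m = re.search(r"Order\s*(?:No\.?|Number)[:\s]+(\S+)", text, re.IGNORECASE)
    fields["order_number"] = m.group(1) if m else None

    m = re.search(r"Date[:\s]+(\d{4}-\d{2}-\d{2}|\d{2}[./]\d{2}[./]\d{4})", text)
    fields["date"] = m.group(1) if m else None

    m = re.search(r"Client(?:\s*Name)?[:\s]+(.+)", text)
    fields["client_name"] = m.group(1).strip() if m else None

    # Assumed line-item layout: "<item> x<quantity> @ <price>" on one line.
    fields["line_items"] = [
        {"item": item, "quantity": int(qty), "price": float(price)}
        for item, qty, price in re.findall(r"(.+?)\s+x(\d+)\s+@\s+([\d.]+)", text)
    ]
    return fields
```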

This method is fast and predictable, but fragile if the document format changes.


Step 5: Build the LLM-Based Extractor

Instead of writing rules, send the extracted text to LLaMA 3 via Ollama's API, using a structured prompt that asks the model to return specific fields as JSON:

prompt = f"""
Extract the following information from this purchase order:
- order_number
- date
- client_name
- line_items (array of objects with 'item', 'quantity', 'price')
Return only valid JSON.

Text:
{text}
"""

Use the requests library to call Ollama:

response = requests.post('http://localhost:11434/api/generate', json={'model':'llama3', 'prompt':prompt, 'stream':False})

Parse the JSON from the response.
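Ollama's /api/generate endpoint wraps the model's output in a JSON envelope whose "response" field holds the generated text, which should itself be the JSON we asked for. A parsing sketch (the brace-trimming trick is my assumption for handling models that wrap JSON in extra prose):

```python
import json


def parse_llm_output(raw_response_text):
    """Extract the model's JSON payload from Ollama's response body."""
    body = json.loads(raw_response_text)  # Ollama's envelope
    model_text = body["response"]
    # Models sometimes surround the JSON with extra prose; keep only the
    # outermost object between the first "{" and the last "}".
    start, end = model_text.find("{"), model_text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(model_text[start:end + 1])
```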

Step 6: Compare Outputs

Run both extractors on the same set of PDFs and compare their accuracy, their robustness to format changes, and their processing speed.

The original experiment showed that the rule-based approach failed on a slightly different document format, while the LLM gracefully adapted—but hallucinated one item.
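A small comparison harness can make the differences concrete. This sketch assumes both extractors return dictionaries with the same field names as the prompt above; the function name and the default field list are illustrative.

```python
def compare_extractions(rule_result, llm_result,
                        fields=("order_number", "date", "client_name", "line_items")):
    """Report, field by field, whether the two extractors agree."""
    report = {}
    for field in fields:
        report[field] = {
            "rule_based": rule_result.get(field),
            "llm": llm_result.get(field),
            "match": rule_result.get(field) == llm_result.get(field),
        }
    return report
```

Mismatched fields are exactly where to look for rule-based parsing failures on one side or LLM hallucinations on the other.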

Tips for Success

- Convert PDFs at a sufficiently high resolution; OCR quality depends heavily on it.
- Validate the LLM's JSON output before trusting it, since models can hallucinate fields or line items.
- Expect the rule-based extractor to break when document layouts change; keep its patterns small and easy to update.

By following these steps, you can build your own B2B document extractor and decide which approach best fits your needs. For a deep dive into the original comparison, see the full article.
