Getting Started with Document Digitization
Document digitization refers to converting documents into machine-readable formats, such as text, HTML, or Markdown. These digitized documents can then be used for various AI-driven tasks such as information retrieval, summarization, and data extraction.
Document Parse:
- Extracts and structures both text and layout elements (e.g., paragraphs, tables, images) into HTML or Markdown.
- Transforms documents into formats that are easily understandable by LLMs, supporting various downstream tasks.
- Includes OCR capabilities but goes beyond traditional OCR by preserving higher-level structural information.

Document OCR:
- Extracts only text and positional information, best suited for basic text recognition tasks.
👉 While Document Parse uses OCR under the hood, it also includes advanced layout detection and table/chart recognition, making it significantly more powerful. If you only need fast text extraction, use Document OCR; if you need structured data, use Document Parse.
Document automation goes beyond simply extracting text. Its goal is to structure documents so that AI can understand and process them.
While the console demo is excellent for testing and learning, workflows such as service integration, automation pipelines, and production applications require API use.
✅ Console Demo: Manual upload and testing
✅ API: Systematic, automated document ingestion and processing in production
Document Digitization is a key preprocessing step for LLM pipelines.
For LLM-based applications like chatbots, search, or summarization, documents must first be segmented into paragraphs, tables, images, etc., and their structure must be recognized. Document parsing is crucial to ensure the LLM can accurately interpret the document.
Most LLM pipelines operate in backend systems. API integration is essential for processing documents in real time.
Use Cases:
Building a RAG system that splits patent documents into paragraphs, then retrieves relevant content based on user queries.
Parsing academic papers to power apps with summarization and highlighting features.
Let’s dive deeper into Document Parse, the engine that powers such workflows.
Upstage Document Parse automatically converts various types of documents into structured HTML. It detects layout elements such as paragraphs, tables, images, formulas, and charts, and serializes them in a logical reading order for LLMs to consume.
Before we explore the technical details, check out this demo:
This chatbot uses Document Parse to convert a financial statement into HTML, enabling free-form Q&A based on the content.
📄 Financial Statement Q&A Chatbot
✨ Key Features
Converts financial documents into HTML using Document Parse API
Enables document-based Q&A using Solar LLM
🖥️ Example Code
Complete source code and explanation:
⚡️ Document Parse becomes the "eyes" of the LLM, enabling it to understand complex documents and generate accurate responses.
Supported file formats: JPEG, PNG, BMP, PDF, TIFF, HEIC, DOCX, PPTX, XLSX
Max file size: 50MB
Max page count:
Synchronous API: up to 100 pages
Asynchronous API: up to 1,000 pages
Max pixels per page: 100 million pixels (based on 150 DPI conversion)
Supported languages (for OCR): Korean, English, numbers, Chinese (Hanzi, beta), Japanese (Kanji, beta)
Documents include more than plain text: they contain headings, tables, charts, images, and more. Converting these elements into HTML preserves their structure, allowing LLMs to understand the content accurately.
▶️ Layout Categories & HTML Tags
Document Parse uses HTML tags to express the layout of each document component:
| Category | HTML representation |
| --- | --- |
| table | `<table>...</table>` |
| figure | `<figure><img>...</img></figure>` |
| chart | `<figure><img data-category="chart">...</img></figure>` |
| heading1 | `<h1>...</h1>` |
| paragraph | `<p data-category="paragraph">...</p>` |
| equation | `<p data-category="equation">$$...$$</p>` |
| list | `<p data-category="list">...</p>` |
Other elements such as headers, footers, captions, indexes, and footnotes are also recognized and given appropriate tags or data-category attributes.
▶️ Chart Recognition
Charts, which are often embedded as images, are interpreted and converted into a table format for further use.
Supported chart types: Bar, Line, Pie
Embedded in HTML as:
<figure data-category="chart"><table>...</table></figure>
▶️ Equation Recognition
Mathematical expressions are converted to LaTeX and wrapped in $$...$$ delimiters inside a <p data-category="equation"> element.
You can render these on web interfaces using libraries like MathJax.
▶️ Coordinates (Relative Position)
Each layout element includes position metadata using relative coordinates (0–1), which can be used for cropping or UI visualization.
For example:
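As a minimal sketch (the element structure and field names below are illustrative, based on the relative-coordinate description above), here is how such coordinates can be converted into a pixel box for cropping or overlay drawing:

```python
# Hypothetical layout element, mirroring the relative-coordinate format
# described above (corner points with values in the 0-1 range).
element = {
    "category": "table",
    "coordinates": [
        {"x": 0.12, "y": 0.30},  # top-left
        {"x": 0.88, "y": 0.30},  # top-right
        {"x": 0.88, "y": 0.55},  # bottom-right
        {"x": 0.12, "y": 0.55},  # bottom-left
    ],
}

def to_pixel_box(coords, page_width, page_height):
    """Convert relative corner points to an absolute (left, top, right, bottom) box."""
    xs = [pt["x"] * page_width for pt in coords]
    ys = [pt["y"] * page_height for pt in coords]
    return (min(xs), min(ys), max(xs), max(ys))

# Pixel box for a page rendered at 1240 x 1754 pixels (A4 at ~150 DPI)
print(to_pixel_box(element["coordinates"], 1240, 1754))
```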
Depending on your use case, the API can be called in two ways: synchronous or asynchronous.
In synchronous mode, the API waits for the process to finish and returns the result immediately.
Think of it as ordering food and waiting at the restaurant for your dish.
Key Characteristics:
Supports up to 100 pages
Real-time response
Ideal for testing or low-latency needs
Python Example:
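A minimal sketch of the synchronous call (the endpoint URL, model name, and form-field names are assumptions based on Upstage's public documentation; verify against the current API reference):

```python
import requests

# Assumed endpoint and model name for the synchronous Document Parse API
API_URL = "https://api.upstage.ai/v1/document-digitization"

def parse_document(path: str, api_key: str) -> dict:
    """Upload a file and return the parsed result as JSON (synchronous call)."""
    headers = {"Authorization": f"Bearer {api_key}"}
    with open(path, "rb") as f:
        response = requests.post(
            API_URL,
            headers=headers,
            files={"document": f},
            data={"model": "document-parse"},
        )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = parse_document("sample.pdf", "UPSTAGE_API_KEY")
    print(result["content"]["html"][:300])  # first part of the generated HTML
```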
Expected Output: a JSON response whose content field contains the parsed document as HTML, along with the list of detected layout elements.
In asynchronous mode, the API immediately returns a request_id, and the results can be retrieved later.
Think of this like ordering take-away food and receiving a notification when your food is ready.
Key Characteristics:
Supports up to 1,000 pages
Returns a request_id instantly
Ideal for large-scale batch processing
Step-by-Step Flow:
Send request → receive request_id
Use the status check API to monitor progress
Retrieve the final result from download_url
Python Example:
Sending an asynchronous request
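A sketch of submitting an asynchronous request (the async endpoint URL and field names are assumptions based on Upstage's public documentation; check the API reference before use):

```python
import requests

# Assumed endpoint for asynchronous submission
ASYNC_URL = "https://api.upstage.ai/v1/document-digitization/async"

def submit_async(path: str, api_key: str) -> str:
    """Submit a document for asynchronous parsing and return its request_id."""
    headers = {"Authorization": f"Bearer {api_key}"}
    with open(path, "rb") as f:
        response = requests.post(
            ASYNC_URL,
            headers=headers,
            files={"document": f},
            data={"model": "document-parse"},
        )
    response.raise_for_status()
    return response.json()["request_id"]

if __name__ == "__main__":
    request_id = submit_async("large_report.pdf", "UPSTAGE_API_KEY")
    print(request_id)  # keep this id to poll for the result later
```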
Check request results with request_id
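A sketch of polling for completion (the status URL and the status/download_url response fields are assumptions based on the step-by-step flow described above):

```python
import time
import requests

# Assumed status-check endpoint; verify against the API reference
STATUS_URL = "https://api.upstage.ai/v1/requests/{request_id}"

def wait_for_result(request_id: str, api_key: str, interval: float = 5.0) -> dict:
    """Poll the status endpoint until processing finishes, then return the status JSON."""
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        response = requests.get(
            STATUS_URL.format(request_id=request_id), headers=headers
        )
        response.raise_for_status()
        status = response.json()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)  # wait before polling again

if __name__ == "__main__":
    result = wait_for_result("YOUR_REQUEST_ID", "UPSTAGE_API_KEY")
    # Completed jobs list download URLs for the results; fetch each one.
    for batch in result.get("batches", []):
        print(batch.get("download_url"))
```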
Expected Output: a JSON status object; once processing is complete, it includes a download_url for retrieving the final result.
Document OCR (Optical Character Recognition) is a technology that extracts text from images of documents.
Upstage Document OCR delivers fast and accurate recognition across various document formats.
When to Use OCR?
When you need just the text, not the layout or structure
For scanned images or photos
For simple preprocessing in automation pipelines
Example Use Cases:
Extracting names and ID numbers from scanned identity cards
Extracting key points from whiteboard snapshots
In this demo, users can upload an image of a handwritten Korean letter. The system uses Upstage Document OCR API to extract the text, and Solar LLM to translate the content into English.
📩 Korean Handwriting Translator
✨ Key Features
Extracts handwritten text from images using Upstage Document OCR
Translates Korean to English using Solar LLM
🖥️ Example Code
Full implementation and details:
⚡️ Need quick text extraction from documents? Try Document OCR for lightweight and fast results!
Supported file formats: JPEG, PNG, BMP, PDF, TIFF, HEIC, DOCX, PPTX, XLSX
Max file size: 50MB
Max page count: 30 pages
Max pixels per page: 100 million (based on 150 DPI)
Supported languages: Korean, English, Chinese (Hanzi)
Text size condition: Optimized for text occupying ≤30% of the page. Larger text blocks may reduce accuracy.
Each recognized word includes:
- text: the extracted string
- confidence: recognition score (0.0–1.0)
- boundingBox: word position in pixel coordinates
▶️ OCR Robustness
Upstage OCR remains accurate even under challenging conditions:
Rotated or skewed text
Background watermarks or checkboxes
Low-quality scans or document noise
The model precisely detects text boxes and filters out irrelevant elements like watermarks.
▶️ Confidence Score
Each word includes a confidence score based on character-level recognition.
Use this to:
Filter out low-confidence outputs
Prompt users to verify uncertain sections
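The two uses above can be sketched as a simple filter (the word list and threshold below are illustrative; the per-word fields match those listed earlier):

```python
# Hypothetical OCR output, shaped like the per-word fields described earlier
words = [
    {"text": "Invoice", "confidence": 0.98},
    {"text": "I0taL", "confidence": 0.41},
    {"text": "2024-05-01", "confidence": 0.95},
]

THRESHOLD = 0.8  # illustrative cutoff; tune per use case

# Split words into trusted output and items needing human review
kept = [w["text"] for w in words if w["confidence"] >= THRESHOLD]
flagged = [w["text"] for w in words if w["confidence"] < THRESHOLD]

print(kept)     # high-confidence words to pass downstream
print(flagged)  # low-confidence words to surface for verification
```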
Here’s a simple example to get you started:
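A minimal sketch of an OCR call (assuming the same consolidated endpoint as Document Parse with the model set to "ocr"; verify the endpoint and model name against the current API reference):

```python
import requests

# Assumed: the consolidated endpoint accepts model="ocr" for plain text extraction
API_URL = "https://api.upstage.ai/v1/document-digitization"

def run_ocr(path: str, api_key: str) -> dict:
    """Run OCR on an image or document and return the JSON result."""
    headers = {"Authorization": f"Bearer {api_key}"}
    with open(path, "rb") as f:
        response = requests.post(
            API_URL,
            headers=headers,
            files={"document": f},
            data={"model": "ocr"},
        )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = run_ocr("scan.png", "UPSTAGE_API_KEY")
    print(result.get("text", ""))  # full extracted text, if present
```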
Sample response: JSON listing each recognized word with its text, confidence, and boundingBox fields.
Let’s summarize what we’ve learned:
🔹 What is Document Digitization?
The process of converting documents into machine-readable formats (HTML, Markdown, etc.) so AI systems can understand and use them for tasks like search, summarization, and Q&A. It’s the first step in document-based AI workflows.
🔹 Why is Document Digitization Important?
Most documents contain structured, visual elements like tables, charts, and headings.
LLMs struggle with unstructured content, so digitizing and structuring documents enables integration into automated pipelines and intelligent services.
🔹 OCR vs Document Parse API
| | Document OCR | Document Parse |
| --- | --- | --- |
| Goal | Extract text quickly | Structure layout for LLM input |
| Input Type | Scanned images, photos | PDFs, Office docs, scanned files |
| Output | Plain text with coordinates | HTML, Markdown, and structure |
| Strength | Fast preprocessing | Deep document understanding |
YoungHoon Jeon | AI Edu | Upstage
If you are participating in the AI Initiative program, you can access Document Parsing for free until March 31st, 2026. Apply here:
🔗 Upload your handwritten letters and try out Document OCR in action!
📩 The sample letter used in this demo is an actual handwritten note, recreated using generative AI based on the original from .