
Getting Started with Document Digitization



You must first obtain an API key to use the Document Digitization API. If you are unsure how to issue one, please refer to 01. Introduction to Upstage API.

1. What is Document Digitization?

Document digitization refers to converting documents into machine-readable formats, such as text, HTML, or Markdown. These digitized documents can then be used for various AI-driven tasks such as information retrieval, summarization, and data extraction.

Document Parsing

  • Extracts and structures both text and layout elements (e.g., paragraphs, tables, images) into HTML or Markdown.

  • Transforms documents into formats that are easily understandable by LLMs, supporting various downstream tasks.

  • Includes OCR capabilities but goes beyond traditional OCR by preserving higher-level structural information.

Document OCR

  • Extracts only text and positional information, best suited for basic text recognition tasks.

👉 While Document Parsing uses OCR under the hood, it also adds advanced layout detection and table/chart recognition, making it significantly more powerful. If you only need fast text extraction, use Document OCR; if you need structured output, use Document Parse.

2. When Should You Use the Document Digitization API?

Document automation goes beyond simply extracting text. Its goal is to structure documents so that AI can understand and process them.

While the console demo is excellent for testing and learning, workflows such as service integration, automation pipelines, and production applications require API use.

✅ Console Demo: Manual upload and testing

✅ API: Systematic, automated document ingestion and processing in production

👁️‍🗨️ Preprocessing for LLM Input

Document Digitization is a key preprocessing step for LLM pipelines.

For LLM-based applications like chatbots, search, or summarization, documents must first be segmented into paragraphs, tables, images, etc., and their structure must be recognized. Document parsing is crucial to ensure the LLM can accurately interpret the document.

Most LLM pipelines operate in backend systems. API integration is essential for processing documents in real time.

Use Cases:

  • Building a RAG system that splits patent documents into paragraphs, then retrieves relevant content based on user queries.

  • Parsing academic papers to power apps with summarization and highlighting features.
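The RAG use case above boils down to turning parse results into retrievable chunks. A minimal sketch, assuming a JSON response shaped like the Document Parse output described later on this page (`chunk_elements` and `parse_result` are illustrative names, not part of the API):

```python
# Turn a Document Parse result into text chunks for retrieval.
# Each entry in "elements" carries its own HTML fragment, so we can
# greedily pack consecutive elements into chunks of roughly max_chars.

def chunk_elements(parse_result, max_chars=1000):
    chunks, current = [], ""
    for element in parse_result.get("elements", []):
        text = element["content"]["html"]
        if current and len(current) + len(text) > max_chars:
            chunks.append(current)
            current = ""
        current += text + "\n"
    if current:
        chunks.append(current)
    return chunks

# Toy input standing in for a real API response:
sample = {"elements": [
    {"content": {"html": "<h1>Claims</h1>"}},
    {"content": {"html": "<p>The patent describes ...</p>"}},
]}
print(chunk_elements(sample, max_chars=50))
```

In a real pipeline each chunk would then be embedded and indexed; splitting on element boundaries keeps tables and headings intact instead of cutting mid-structure.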

Let’s dive deeper into Document Parse, the engine that powers such workflows.

3. What is Document Parse?

Upstage Document Parse automatically converts various types of documents into structured HTML. It detects layout elements such as paragraphs, tables, images, formulas, and charts, and serializes them in a logical reading order for LLMs to consume.

3.1. Demo: Financial Statement Analysis Chatbot

Before we explore the technical details, check out this demo:

This chatbot uses Document Parse to convert a financial statement into HTML, enabling free-form Q&A based on the content.

📄 Financial Statement Q&A Chatbot

✨ Key Features

  • Converts financial documents into HTML using Document Parse API

  • Enables document-based Q&A using Solar LLM

🖥️ Example Code

  • Complete source code and explanation:

⚡️ Document Parse becomes the "eyes" of the LLM, enabling it to understand complex documents and generate accurate responses

3.2. Document Parse - Input, Output Format

📥 Input Requirements

  • Supported file formats: JPEG, PNG, BMP, PDF, TIFF, HEIC, DOCX, PPTX, XLSX

  • Max file size: 50MB

  • Max page count:

    • Synchronous API: up to 100 pages

    • Asynchronous API: up to 1,000 pages

  • Max pixels per page: 100 million pixels (based on 150 DPI conversion)

  • Supported languages (for OCR): Korean, English, and numbers; Chinese (Hanzi, beta) and Japanese (Kanji, beta) are also supported.

📤 Output Structure

Documents include more than plain text—they contain headings, tables, charts, images, and more. The structure is preserved by converting these elements into HTML, allowing LLMs to understand the content accurately.

▶️ Layout Categories & HTML Tags

Document Parse uses HTML tags to express the layout of each document component:

| Layout category | HTML tag example |
| --- | --- |
| table | <table>...</table> |
| figure | <figure><img>...</img></figure> |
| chart | <figure><img data-category="chart">...</img></figure> |
| heading1 | <h1>...</h1> |
| paragraph | <p data-category="paragraph">...</p> |
| equation | <p data-category="equation">$$...$$</p> |
| list | <p data-category="list">...</p> |

Other elements such as headers, footers, captions, indexes, and footnotes are also recognized and tagged or assigned appropriate data-category attributes.

▶️ Chart Recognition

Charts, which are often embedded as images, are interpreted and converted into a table format for further use.

  • Supported chart types: Bar, Line, Pie

  • Embedded in HTML as:

    <figure data-category="chart"><table>...</table></figure>

▶️ Equation Recognition

Mathematical expressions are converted to LaTeX and wrapped with:

<p data-category="equation">$$...$$</p>

You can render these on web interfaces using libraries like MathJax.

▶️ Coordinates (Relative Position)

  • Each layout element includes position metadata using relative coordinates (0–1), which can be used for cropping or UI visualization.

  • For example:

    "coordinates": [
      { "x": 0.0276, "y": 0.0178 },
      { "x": 0.1755, "y": 0.0178 },
      { "x": 0.1755, "y": 0.0641 },
      { "x": 0.0276, "y": 0.0641 }
    ]
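The relative coordinates above can be converted to pixel values for cropping an element out of a rendered page image (e.g., with Pillow's Image.crop). A short sketch; the page size used below is an arbitrary example, not an API value:

```python
def to_pixel_box(coordinates, width, height):
    """Turn the four relative corner points into a (left, top, right, bottom) box."""
    xs = [pt["x"] * width for pt in coordinates]
    ys = [pt["y"] * height for pt in coordinates]
    return (min(xs), min(ys), max(xs), max(ys))

coords = [
    {"x": 0.0276, "y": 0.0178},
    {"x": 0.1755, "y": 0.0178},
    {"x": 0.1755, "y": 0.0641},
    {"x": 0.0276, "y": 0.0641},
]
box = to_pixel_box(coords, width=1240, height=1754)  # e.g., A4 page at ~150 DPI
print(box)
# With Pillow: Image.open("page1.png").crop(box)
```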

3.3. Getting Started with Document Parse API

Depending on your use case, the API can be used in two modes: synchronous and asynchronous.

🔁 Synchronous API

In synchronous mode, the API waits for the process to finish and returns the result immediately.

Think of it as ordering food and waiting at the restaurant for your dish.

Key Characteristics:

  • Supports up to 100 pages

  • Real-time response

  • Ideal for testing or low-latency needs

Python Example:

import requests

api_key = "UPSTAGE_API_KEY"  # e.g., up_xxxYYYzzzAAAbbbCCC
filename = "your_file.pdf"   # e.g., ./image.png

response = requests.post(
    "https://api.upstage.ai/v1/document-digitization",
    headers={"Authorization": f"Bearer {api_key}"},
    files={"document": open(filename, "rb")},
    data={
        "ocr": "force",  # Force OCR (with "auto", OCR runs on image documents only)
        "coordinates": True,  # Whether to return position information for each layout element
        "chart_recognition": True,  # Whether to recognize charts
        "output_formats": '["html"]',  # Return results in HTML (text, markdown also possible)
        "base64_encoding": '["table"]',  # Request base64 encoding for tables
        "model": "document-parse",  # Specify the model to use
    },
)
print(response.json())

Expected Output:

{"api": "2.0",
  "content": {
    "html": "<h1 id='0' style='font-size:22px'>INVOICE</h1>\\n<h1 id='1' style='font-size:20px'>Company<br>Upstage</h1>\\n<br><h1 id='2' style='font-size:18px'>Invoice ID</h1>\\n<br><h1 id='3' style='font-size:14px'>휴 INV-AJ355548</h1>\\n<h1 id='4' style='font-size:18px'>Invoice Date</h1>\\n<br><h1 id='5' style='font-size:18px'>9/7/1992</h1>\\n<h1 id='6' style='font-size:16px'>Mamo<br>Lucy Park</h1>\\n<h1 id='7' style='font-size:18px'>Address</h1>\\n<br><h1 id='8' style='font-size:16px'>7 Pepper Wood Street, 130 Stone Comer<br>Terrace<br>Wilkes Barre, Pennsylvania, 18768<br>United States</h1>\\n<h1 id='9' style='font-size:16px'>Email</h1>\\n<br><h1 id='10' style='font-size:16px'>Ikitchenman0@arizona.edu</h1>\\n<br><h1 id='11' style='font-size:20px'>Service Details Form</h1>\\n<h1 id='12' style='font-size:16px'>Name<br>Sung Kim</h1>\\n<h1 id='13' style='font-size:16px'>260 'ess<br>Gwangovolungang:co 338, Gyeongg do.<br>Sanghyeon-dong, Sui-gu<br>Yongin-si, South Korea</h1>\\n<h1 id='14' style='font-size:18px'>Additional Request</h1>\\n<br><p id='15' data-category='paragraph' style='font-size:14px'>Vivamus vestibulum sagittis sapien. Cum sociis natoque<br>penatibus 항목 magnis dfs parturient montes, nascetur ridiculus<br>mus.</p>\\n<h1 id='16' style='font-size:14px'>TERMS AND CONDITIONS</h1>\\n<p id='17' data-category='list' style='font-size:14px'>L TM Seir that not be lable 1층 the Buyer drectly indirectly for any loun or damage sufflered by 전액 Buyer<br>2. The 별 www. the product for ore 과 관한 from the date 설 shipment.<br>3. Any ourchase order received by ~ sele - be interpreted 추가 accepting the offer Ma the 18% offer writing The buyer may<br>purchase 15 The offer My the Terms and Conditions the Seller included The offer</p>",
    "markdown": "",
    "text": ""
  },
  "elements": [
    {
      "category": "heading1",
      "content": {
        "html": "<h1 id='0' style='font-size:22px'>INVOICE</h1>",
        "markdown": "",
        "text": ""
      },
      "coordinates": [
        {
          "x": 0.0648,
          "y": 0.0517
        }, ...
        }
      ],
      "id": 0,
      "page": 1
    },
    ...
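Beyond the raw HTML, the elements list in the response lends itself to simple post-processing. A small sketch run on a toy dict shaped like the response above (`summarize_elements` is an illustrative helper, not part of the API):

```python
from collections import Counter

def summarize_elements(result):
    """Count elements per layout category and collect table HTML for reuse."""
    elements = result.get("elements", [])
    counts = Counter(el["category"] for el in elements)
    tables = [el["content"]["html"] for el in elements if el["category"] == "table"]
    return counts, tables

# Toy input standing in for a real API response:
toy = {"elements": [
    {"category": "heading1", "content": {"html": "<h1>INVOICE</h1>"}},
    {"category": "table", "content": {"html": "<table>...</table>"}},
]}
counts, tables = summarize_elements(toy)
print(counts)   # e.g. Counter({'heading1': 1, 'table': 1})
print(tables)   # e.g. ['<table>...</table>']
```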

⏳ Asynchronous API

In asynchronous mode, the API immediately returns a request_id, and the results can be retrieved later.

Think of this like ordering take-away food and receiving a notification when your food is ready.

Key Characteristics:

  • Supports up to 1,000 pages

  • Returns request_id instantly

  • Ideal for large-scale batch processing

Step-by-Step Flow:

  1. Send request → receive request_id

  2. Use the status check API to monitor progress

  3. Retrieve the final result from download_url

Python Example:

  1. Sending an asynchronous request

# 1. Sending an asynchronous request
import requests

api_key = "UPSTAGE_API_KEY"  # e.g., up_xxxYYYzzzAAAbbbCCC
filename = "your_file.pdf"   # e.g., ./image.png

url = "https://api.upstage.ai/v1/document-digitization/async"
headers = {"Authorization": f"Bearer {api_key}"}
files = {"document": open(filename, "rb")}
data = {"model": "document-parse"}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.json())

# {"request_id": "e7b1..."}
  2. Check request results with request_id

import requests

api_key = "UPSTAGE_API_KEY"
request_id = "enter_request_id_you_received"  # e.g., e7b1b3b0-1b3b-...

url = f"https://api.upstage.ai/v1/document-digitization/requests/{request_id}"
headers = {"Authorization": f"Bearer {api_key}"}

response = requests.get(url, headers=headers)
result = response.json()
print(result)

Expected Output:

{
    "id": "e7b1b3b0-1b3b-4b3b-8b3b-1b3b3b3b3b3b",
    "status": "completed",
    "model": "document-parse",
    "failure_message": "",
    "total_pages": 28,
    "completed_pages": 28,
    "batches": [
        {
            "id": 0,
            "model": "document-parse-240910",
            "status": "completed",
            "failure_message": "",
            "download_url": "<https://download-url>",
            "start_page": 1,
            "end_page": 10,
            "requested_at": "2024-07-01T14:47:01.863880448Z",
            "updated_at": "2024-07-01T14:47:15.901662097Z"
        },
        ...
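The step-by-step flow above (request, poll, download) can be sketched as one polling helper. The interval and timeout below are arbitrary choices, not API requirements, and fetching download_url without an auth header assumes it is a pre-signed link:

```python
import time
import requests

def wait_and_download(api_key, request_id, interval=5, timeout=600):
    """Poll the status endpoint until completion, then fetch every batch result."""
    status_url = f"https://api.upstage.ai/v1/document-digitization/requests/{request_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = requests.get(status_url, headers=headers).json()
        if result["status"] == "completed":
            # Each batch carries its own download_url with a slice of the pages.
            return [requests.get(b["download_url"]).json()
                    for b in result["batches"]]
        if result["status"] == "failed":
            raise RuntimeError(result.get("failure_message", "request failed"))
        time.sleep(interval)
    raise TimeoutError("document parse request did not finish in time")
```

For large batch jobs you would typically run this in a background worker rather than blocking a request thread.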

4. What is Document OCR?

Document OCR (Optical Character Recognition) is a technology that extracts text from images of documents.

Upstage Document OCR delivers fast and accurate recognition across various document formats.

When to Use OCR?

  • When you need just the text, not the layout or structure

  • For scanned images or photos

  • For simple preprocessing in automation pipelines

Example Use Cases:

  • Extracting names and ID numbers from scanned identity cards

  • OCR key points from whiteboard snapshots

4.1. Demo: Handwritten letter translator using Document OCR

In this demo, users can upload an image of a handwritten Korean letter. The system uses the Upstage Document OCR API to extract the text and Solar LLM to translate the content into English.

📩 Korean Handwriting Translator

✨ Key Features

  • Extracts handwritten text from images using Upstage Document OCR

  • Translates Korean to English using Solar LLM

🖥️ Example Code

  • Full implementation and details:

⚡️ Need quick text extraction from documents? Try Document OCR for lightweight and fast results!

4.2. Document OCR - Input, Output Format

📥 Input Requirements

  • Supported file formats: JPEG, PNG, BMP, PDF, TIFF, HEIC, DOCX, PPTX, XLSX

  • Max file size: 50MB

  • Max page count: 30 pages

  • Max pixels per page: 100 million (based on 150 DPI)

  • Supported languages: Korean, English, Chinese (Hanzi)

  • Text size condition: Optimized for text occupying ≤30% of the page. Larger text blocks may reduce accuracy.

📤 Output Structure

{
  "apiVersion": "1.1",
  "modelVersion": "ocr-2.2.1",
  "pages": [
    { "page": 1,
      "text": "Print the words \\\\nhello, world",
      "confidence": 0.99, 
      "words": [
        {
          "text": "hello",
          "boundingBox": {
            "vertices": [
              { "x": 65, "y": 52 },
              { "x": 221, "y": 55 },
              { "x": 221, "y": 104 },
              { "x": 65, "y": 104 } ]}}]}]}

Each recognized word includes:

  • text: the extracted string

  • confidence: recognition score (0.0–1.0)

  • boundingBox: word position in pixel coordinates

▶️ OCR Robustness

Upstage OCR remains accurate even under challenging conditions:

  • Rotated or skewed text

  • Background watermarks or checkboxes

  • Low-quality scans or document noise

The model precisely detects text boxes and filters out irrelevant elements like watermarks.

▶️ Confidence Score

Each word includes a confidence score based on character-level recognition.

Use this to:

  • Filter out low-confidence outputs

  • Prompt users to verify uncertain sections
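Both uses can be sketched with a small filter over the OCR output fields described above. A toy example (`split_by_confidence` is an illustrative helper; the 0.9 threshold is an arbitrary choice):

```python
def split_by_confidence(ocr_result, threshold=0.9):
    """Separate OCR words into trusted output and words needing human review."""
    trusted, needs_review = [], []
    for page in ocr_result.get("pages", []):
        for word in page.get("words", []):
            bucket = trusted if word["confidence"] >= threshold else needs_review
            bucket.append(word["text"])
    return trusted, needs_review

# Toy input shaped like the OCR output structure above:
toy = {"pages": [{"words": [
    {"text": "hello", "confidence": 0.99},
    {"text": "w0rld", "confidence": 0.41},
]}]}
trusted, review = split_by_confidence(toy)
print(trusted)  # e.g. ['hello']
print(review)   # e.g. ['w0rld']
```

The right threshold depends on the document type and the cost of an error; for identity documents you would set it higher and route more words to review.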

4.3. Getting Started with Document OCR API

Here’s a simple example to get you started:

# pip install requests

import requests

api_key = "UPSTAGE_API_KEY"  # e.g., up_xxxYYYzzzAAAbbbCCC
filename = "your_file.pdf"   # e.g., ./image.png

url = "https://api.upstage.ai/v1/document-digitization"
headers = {"Authorization": f"Bearer {api_key}"}
files = {"document": open(filename, "rb")}
data = {"model": "ocr"}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.json())
  • Sample response:

    {
      "apiVersion": "1.1",
      "modelVersion": "ocr-2.2.1",
      "pages": [
        {
          "page": 1,
          "text": "Print the words \\nhello, world",
          "confidence": 0.99, 
          "words": [
            {
              "text": "hello",
              "boundingBox": {
                "vertices": [
                  { "x": 65, "y": 52 },
                  { "x": 221, "y": 55 },
                  { "x": 221, "y": 104 },
                  { "x": 65, "y": 104 } ]}}]}]}

WrapUp

Let’s summarize what we’ve learned:

🔹 What is Document Digitization?

The process of converting documents into machine-readable formats (HTML, Markdown, etc.) so AI systems can understand and use them for tasks like search, summarization, and Q&A. It’s the first step in document-based AI workflows.

🔹 Why is Document Digitization Important?

Most documents contain structured, visual elements like tables, charts, and headings.

LLMs struggle with unstructured content, so digitizing and structuring documents enables integration into automated pipelines and intelligent services.

🔹 OCR vs Document Parse API

| Feature | Document OCR | Document Parse |
| --- | --- | --- |
| Goal | Extract text quickly | Structure layout for LLM input |
| Input type | Scanned images, photos | PDFs, Office docs, scanned files |
| Output | Plain text with coordinates | HTML, Markdown, and structure |
| Strength | Fast preprocessing | Deep document understanding |

YoungHoon Jeon | AI Edu | Upstage

If you are participating in the AI Initiative program, you can access Document Parsing for free until March 31st, 2026. Apply here:
