# Getting Started with Document Digitization

{% hint style="info" %}
You must obtain an API key before using the Document Digitization API. To learn how to issue one, see **01. Introduction to Upstage API** ([LINK](/upstage-edustage/basics/editor-3/introduction-to-upstage-api.md)).
{% endhint %}

{% embed url="<https://youtu.be/bOQaCj-Bf28>" %}

## 1. What is Document Digitization?

<figure><img src="/files/Cr24PAxOLdbuyh8NWPJ7" alt=""><figcaption></figcaption></figure>

Document digitization refers to converting documents into machine-readable formats, such as text, HTML, or Markdown. These digitized documents can then be used for various AI-driven tasks such as **information retrieval, summarization, and data extraction**.

### [**Document Parsing**](https://console.upstage.ai/docs/capabilities/document-digitization/document-parsing)

{% hint style="success" %}
If you are participating in the **AI Initiative program**, you can access Document Parsing for free until March 31st, 2026. Apply here: [LINK](https://www.upstage.ai/events/ai-initiative-2025-en)
{% endhint %}

* Extracts and structures both **text and layout elements** (e.g., paragraphs, tables, images) into HTML or Markdown.
* Transforms documents into formats that are easily understandable by LLMs, supporting various downstream tasks.
* Includes OCR capabilities but goes beyond traditional OCR by preserving higher-level structural information.

### [**Document OCR**](https://console.upstage.ai/docs/capabilities/document-digitization/document-ocr)

* Extracts only text and positional information, best suited for basic text recognition tasks.

👉 While Document Parsing uses OCR under the hood, it also includes **advanced layout detection** and **table/chart recognition**, making it significantly more powerful. If you only need fast text extraction, use Document OCR. If you need structured output, use Document Parsing.

## 2. When Should You Use the Document Digitization API?

Document automation goes beyond simply extracting text. Its goal is to structure documents so that AI can **understand and process** them.

While the console demo is excellent for testing and learning, workflows such as **service integration, automation pipelines**, and **production applications** require **API use**.

✅ Console Demo: Manual upload and testing

✅ API: Systematic, automated document ingestion and processing in production

### 👁️‍🗨️ Preprocessing for LLM Input

Document Digitization is a key preprocessing step for LLM pipelines.

For LLM-based applications like chatbots, search, or summarization, documents must first be segmented into paragraphs, tables, images, etc., and their structure must be recognized. Document parsing is crucial to ensure the LLM can accurately interpret the document.

Most LLM pipelines operate in backend systems. API integration is essential for processing documents in real time.

**Use Cases**:

* Building a RAG system that splits patent documents into paragraphs, then retrieves relevant content based on user queries.
* Parsing academic papers to power apps with summarization and highlighting features.

Let’s dive deeper into Document Parse, the engine that powers such workflows.

## 3. What is Document Parse?

Upstage Document Parse automatically converts various types of documents into structured HTML. It detects layout elements such as paragraphs, tables, images, formulas, and charts, and serializes them in a logical reading order for LLMs to consume.

### 3.1. Demo: Financial Statement Analysis Chatbot

Before we explore the technical details, check out this demo:

<figure><img src="/files/ZB1GuOAjQZtotywBtMzQ" alt=""><figcaption></figcaption></figure>

This chatbot uses Document Parse to convert a financial statement into HTML, enabling free-form Q\&A based on the content.

> [DEMO LINK](https://huggingface.co/spaces/Yescia/Document_Parse_Demo)

**📄 Financial Statement Q\&A Chatbot**

**✨ Key Features**

* Converts financial documents into HTML using Document Parse API
* Enables document-based Q\&A using Solar LLM

**🖥️ Example Code**

* Complete source code and explanation:

  🔗 [Huggingface Repo](https://huggingface.co/spaces/Yescia/Document_Parse_Demo/blob/main/app.py)

{% hint style="success" %}
⚡️ **Document Parse becomes the "eyes" of the LLM, enabling it to understand complex documents and generate accurate responses**
{% endhint %}

### 3.2. Document Parse - Input, Output Format

#### 📥 Input Requirements

* **Supported file formats**: JPEG, PNG, BMP, PDF, TIFF, HEIC, DOCX, PPTX, XLSX
* **Max file size**: 50MB
* **Max page count**:
  * Synchronous API: up to 100 pages
  * Asynchronous API: up to 1,000 pages
* **Max pixels per page**: 100 million pixels (based on 150 DPI conversion)
* **Supported languages (for OCR)**: Korean, English, and numerals
  * Beta support: Chinese (Hanzi), Japanese (Kanji)
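
The limits above can be checked locally before a file is uploaded. The following is a minimal sketch, not part of the Upstage SDK: `check_input` is a hypothetical helper, the `.jpg` and `.tif` extensions are assumed aliases for JPEG and TIFF, and "50MB" is assumed to mean 50 × 1024 × 1024 bytes. The page count is passed in by the caller, since counting pages depends on the file type.

```python
from pathlib import Path

# Limits copied from the input requirements listed above.
# Assumption: ".jpg" and ".tif" are accepted as aliases of JPEG and TIFF.
SUPPORTED_EXTENSIONS = {
    ".jpeg", ".jpg", ".png", ".bmp", ".pdf", ".tiff", ".tif",
    ".heic", ".docx", ".pptx", ".xlsx",
}
MAX_FILE_BYTES = 50 * 1024 * 1024  # assumption: "50MB" read as 50 MiB
SYNC_MAX_PAGES = 100
ASYNC_MAX_PAGES = 1000


def check_input(filename: str, size_bytes: int, page_count: int) -> str:
    """Return "sync" or "async" for an acceptable file, or raise ValueError."""
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported file format: {extension or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError("file exceeds the 50MB limit")
    if page_count > ASYNC_MAX_PAGES:
        raise ValueError("document exceeds the 1,000-page limit")
    # Documents within the synchronous limit can use either mode.
    return "sync" if page_count <= SYNC_MAX_PAGES else "async"
```

A call such as `check_input("report.pdf", size_bytes=2_000_000, page_count=250)` would suggest the asynchronous API, because the document is longer than 100 pages.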

#### 📤 Output Structure

Documents include more than plain text—they contain headings, tables, charts, images, and more. The structure is preserved by converting these elements into HTML, allowing LLMs to understand the content accurately.

**▶️ Layout Categories & HTML Tags**

Document Parse uses HTML tags to express the layout of each document component:

<table><thead><tr><th width="163.33331298828125">Layout category</th><th>HTML Tag Example</th></tr></thead><tbody><tr><td>table</td><td><code>&#x3C;table>...&#x3C;/table></code></td></tr><tr><td>figure</td><td><code>&#x3C;figure>&#x3C;img>...&#x3C;/img>&#x3C;/figure></code></td></tr><tr><td>chart</td><td><code>&#x3C;figure>&#x3C;img data-category="chart">...&#x3C;/img>&#x3C;/figure></code></td></tr><tr><td>heading1</td><td><code>&#x3C;h1>...&#x3C;/h1></code></td></tr><tr><td>paragraph</td><td><code>&#x3C;p data-category="paragraph">...&#x3C;/p></code></td></tr><tr><td>equation</td><td><code>&#x3C;p data-category="equation">$$...$$&#x3C;/p></code></td></tr><tr><td>list</td><td><code>&#x3C;p data-category="list">...&#x3C;/p></code></td></tr></tbody></table>

Other elements like headers, footers, captions, indexes, and footnotes are also recognized and appropriately tagged or assigned `data-category` attributes.
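
To make the tag-to-category mapping concrete, here is a small sketch that walks the HTML output and recovers a `(category, text)` pair for each top-level element. It uses only Python's standard `html.parser` module. `LayoutCollector` is a hypothetical helper, and the fallback mapping from tag names to categories covers only the rows shown in the table above.

```python
from html.parser import HTMLParser


class LayoutCollector(HTMLParser):
    """Collect (category, text) pairs for each top-level layout element."""

    # Fallback for tags that carry no data-category attribute (from the table above).
    TAG_CATEGORIES = {"h1": "heading1", "table": "table", "figure": "figure"}
    # Tags without meaningful closing tags; they must not affect nesting depth.
    VOID_TAGS = {"br", "img", "hr"}

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0
        self._category = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            if tag == "br" and self._depth > 0:
                self._text.append(" ")  # treat a line break as a space
            return
        if self._depth == 0:
            explicit = dict(attrs).get("data-category")
            self._category = explicit or self.TAG_CATEGORIES.get(tag, tag)
            self._text = []
        self._depth += 1

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            text = " ".join("".join(self._text).split())  # normalize whitespace
            self.items.append((self._category, text))


html = (
    "<h1 id='0'>INVOICE</h1>\n"
    "<p id='1' data-category='paragraph'>Vivamus vestibulum<br>sagittis sapien.</p>\n"
    "<p id='2' data-category='list'>1. First item</p>"
)
collector = LayoutCollector()
collector.feed(html)
collector.close()
for category, text in collector.items:
    print(category, "|", text)
```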

**▶️ Chart Recognition**

Charts, which are often embedded as images, are interpreted and converted into a **table format** for further use.

* Supported chart types: Bar, Line, Pie
* Embedded in HTML as:

  `<figure data-category="chart"><table>...</table></figure>`
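
Because a recognized chart arrives as an ordinary HTML table inside a `<figure data-category="chart">` element, its values can be read back with a standard HTML parser. The sketch below assumes exactly that wrapper and a simple table made of `<tr>`, `<th>`, and `<td>` tags. `ChartTableParser` is a hypothetical helper, and the sample HTML is invented for illustration.

```python
from html.parser import HTMLParser


class ChartTableParser(HTMLParser):
    """Collect table rows found inside <figure data-category="chart"> elements."""

    def __init__(self):
        super().__init__()
        self.charts = []      # one list of rows per chart figure
        self._in_chart = False
        self._row = None      # cells of the row currently being read
        self._cell = None     # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "figure" and dict(attrs).get("data-category") == "chart":
            self._in_chart = True
            self.charts.append([])
        elif self._in_chart and tag == "tr":
            self._row = []
        elif self._in_chart and self._row is not None and tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.charts[-1].append(self._row)
            self._row = None
        elif tag == "figure":
            self._in_chart = False


# Invented sample in the shape described above
chart_html = (
    '<figure data-category="chart"><table>'
    "<tr><th>Year</th><th>Revenue</th></tr>"
    "<tr><td>2023</td><td>120</td></tr>"
    "</table></figure>"
)
chart_parser = ChartTableParser()
chart_parser.feed(chart_html)
chart_parser.close()
print(chart_parser.charts)
```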

**▶️ Equation Recognition**

Mathematical expressions are converted to LaTeX and wrapped with:

```html
<p data-category="equation">$$...$$</p>
```

You can render these on web interfaces using libraries like MathJax.
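
If you need the LaTeX source itself, for example to index formulas or pass them to another tool, it can be pulled out of the HTML with a regular expression. This is a minimal sketch that assumes the exact `<p data-category="equation">$$...$$</p>` wrapper shown above. `extract_equations` is a hypothetical helper name.

```python
import re

# Matches <p ... data-category="equation" ...>$$ LaTeX $$</p> with either quote style.
EQUATION_RE = re.compile(
    r"<p\b[^>]*\bdata-category=['\"]equation['\"][^>]*>\s*\$\$(.*?)\$\$\s*</p>",
    re.DOTALL,
)


def extract_equations(html: str) -> list[str]:
    """Return the LaTeX source of every equation element in the HTML."""
    return [latex.strip() for latex in EQUATION_RE.findall(html)]


sample_html = (
    "<p id='3' data-category='equation'>$$E = mc^2$$</p>"
    "<p id='4' data-category='paragraph'>Plain text.</p>"
)
print(extract_equations(sample_html))
```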

**▶️ Coordinates (Relative Position)**

* Each layout element includes position metadata using **relative coordinates (0–1)**, which can be used for cropping or UI visualization.
* For example:

  ```json
  "coordinates": [
    { "x": 0.0276, "y": 0.0178 },
    { "x": 0.1755, "y": 0.0178 },
    { "x": 0.1755, "y": 0.0641 },
    { "x": 0.0276, "y": 0.0641 }
  ]
  ```
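
Since the coordinates are relative, they need to be scaled by the page size before cropping or drawing. A minimal sketch follows. `to_pixel_box` is a hypothetical helper, and the page dimensions are whatever your rendered page image measures in pixels.

```python
def to_pixel_box(coordinates, page_width, page_height):
    """Convert relative corner points into a (left, top, right, bottom) pixel box."""
    xs = [point["x"] * page_width for point in coordinates]
    ys = [point["y"] * page_height for point in coordinates]
    return (round(min(xs)), round(min(ys)), round(max(xs)), round(max(ys)))


# Corner points from the example above
coordinates = [
    {"x": 0.0276, "y": 0.0178},
    {"x": 0.1755, "y": 0.0178},
    {"x": 0.1755, "y": 0.0641},
    {"x": 0.0276, "y": 0.0641},
]
# Hypothetical page image of 2000 x 3000 pixels
box = to_pixel_box(coordinates, page_width=2000, page_height=3000)
print(box)
```

The resulting tuple follows the left, top, right, bottom order that many image libraries expect for a crop rectangle.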

### 3.3. Getting Started with Document Parse API

The API offers two modes, depending on your use case: synchronous and asynchronous.

#### 🔁 Synchronous API

In synchronous mode, the API waits for processing to finish and returns the result in the same response.

Think of it as ordering food and waiting at the restaurant for your dish.

**Key Characteristics**:

* Supports up to 100 pages
* Real-time response
* Ideal for testing or low-latency needs

**Python Example:**

```python
import requests

api_key = "UPSTAGE_API_KEY"  # ex: up_xxxYYYzzzAAAbbbCCC
filename = "your_file.pdf"  # ex: ./image.png

# Open the file in a context manager so the handle is closed after the upload
with open(filename, "rb") as document:
    response = requests.post(
        "https://api.upstage.ai/v1/document-digitization",
        headers={"Authorization": f"Bearer {api_key}"},
        files={"document": document},
        data={
            "ocr": "force",  # Force OCR (if set to "auto", OCR is performed on image documents only)
            "coordinates": True,  # Whether to return position information for each layout element
            "chart_recognition": True,  # Whether to recognize charts
            "output_formats": '["html"]',  # Return results in HTML format (text, markdown are also possible)
            "base64_encoding": '["table"]',  # Request base64 encoding for table
            "model": "document-parse",  # Specify model to use
        },
    )
print(response.json())
```

**Expected Output:**

```json
{"api": "2.0",
  "content": {
    "html": "<h1 id='0' style='font-size:22px'>INVOICE</h1>\\n<h1 id='1' style='font-size:20px'>Company<br>Upstage</h1>\\n<br><h1 id='2' style='font-size:18px'>Invoice ID</h1>\\n<br><h1 id='3' style='font-size:14px'>휴 INV-AJ355548</h1>\\n<h1 id='4' style='font-size:18px'>Invoice Date</h1>\\n<br><h1 id='5' style='font-size:18px'>9/7/1992</h1>\\n<h1 id='6' style='font-size:16px'>Mamo<br>Lucy Park</h1>\\n<h1 id='7' style='font-size:18px'>Address</h1>\\n<br><h1 id='8' style='font-size:16px'>7 Pepper Wood Street, 130 Stone Comer<br>Terrace<br>Wilkes Barre, Pennsylvania, 18768<br>United States</h1>\\n<h1 id='9' style='font-size:16px'>Email</h1>\\n<br><h1 id='10' style='font-size:16px'>Ikitchenman0@arizona.edu</h1>\\n<br><h1 id='11' style='font-size:20px'>Service Details Form</h1>\\n<h1 id='12' style='font-size:16px'>Name<br>Sung Kim</h1>\\n<h1 id='13' style='font-size:16px'>260 'ess<br>Gwangovolungang:co 338, Gyeongg do.<br>Sanghyeon-dong, Sui-gu<br>Yongin-si, South Korea</h1>\\n<h1 id='14' style='font-size:18px'>Additional Request</h1>\\n<br><p id='15' data-category='paragraph' style='font-size:14px'>Vivamus vestibulum sagittis sapien. Cum sociis natoque<br>penatibus 항목 magnis dfs parturient montes, nascetur ridiculus<br>mus.</p>\\n<h1 id='16' style='font-size:14px'>TERMS AND CONDITIONS</h1>\\n<p id='17' data-category='list' style='font-size:14px'>L TM Seir that not be lable 1층 the Buyer drectly indirectly for any loun or damage sufflered by 전액 Buyer<br>2. The 별 www. the product for ore 과 관한 from the date 설 shipment.<br>3. Any ourchase order received by ~ sele - be interpreted 추가 accepting the offer Ma the 18% offer writing The buyer may<br>purchase 15 The offer My the Terms and Conditions the Seller included The offer</p>",
    "markdown": "",
    "text": ""
  },
  "elements": [
    {
      "category": "heading1",
      "content": {
        "html": "<h1 id='0' style='font-size:22px'>INVOICE</h1>",
        "markdown": "",
        "text": ""
      },
      "coordinates": [
        {
          "x": 0.0648,
          "y": 0.0517
        },
        ...
      ],
      "id": 0,
      "page": 1
    },
    ...
```
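
Once the response is parsed, the `elements` array is usually the most convenient entry point for downstream work such as building a retrieval index. The sketch below flattens it into small records. `to_chunks` is a hypothetical helper, and it relies only on the fields visible in the expected output above (`category`, `content.html`, `id`, `page`). The `sample` dictionary is a trimmed, partly invented response for illustration.

```python
def to_chunks(result: dict, categories=("heading1", "paragraph", "table")) -> list[dict]:
    """Flatten the `elements` array into records suitable for indexing."""
    chunks = []
    for element in result.get("elements", []):
        if element.get("category") not in categories:
            continue  # skip layout categories the caller did not ask for
        chunks.append({
            "id": element["id"],
            "page": element["page"],
            "category": element["category"],
            "html": element["content"]["html"],
        })
    return chunks


# Trimmed response using the same fields as the expected output above
sample = {
    "elements": [
        {"category": "heading1", "id": 0, "page": 1,
         "content": {"html": "<h1 id='0'>INVOICE</h1>", "markdown": "", "text": ""}},
        {"category": "footer", "id": 1, "page": 1,
         "content": {"html": "<footer id='1'>Page 1</footer>", "markdown": "", "text": ""}},
    ]
}
for chunk in to_chunks(sample):
    print(chunk["page"], chunk["category"], chunk["html"])
```

In a real pipeline, `sample` would be replaced by `response.json()` from the request above.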

#### ⏳ Asynchronous API

In asynchronous mode, the API immediately returns a `request_id`, and the results can be retrieved later.

Think of this like ordering take-away food and receiving a notification when your food is ready.

**Key Characteristics**:

* Supports up to 1,000 pages
* Returns `request_id` instantly
* Ideal for large-scale batch processing

**Step-by-Step Flow**:

1. Send request → receive `request_id`
2. Use the status check API to monitor progress
3. Retrieve the final result from `download_url`

**Python Example:**

1. **Sending an asynchronous request**

```python
# 1. Sending an asynchronous request
import requests

api_key = "UPSTAGE_API_KEY"  # ex: up_xxxYYYzzzAAAbbbCCC
filename = "your_file.pdf"  # ex: ./image.png

url = "https://api.upstage.ai/v1/document-digitization/async"
headers = {"Authorization": f"Bearer {api_key}"}
data = {"model": "document-parse"}

# Open the file in a context manager so the handle is closed after the upload
with open(filename, "rb") as document:
    response = requests.post(url, headers=headers, files={"document": document}, data=data)
print(response.json())

# {"request_id": "e7b1..."}
```

2. **Checking the request status with `request_id`**

```python
import requests

api_key = "UPSTAGE_API_KEY"
request_id = "enter_request_id_you_received"  # e.g. e7b1b3b0-1b3b-...

url = f"https://api.upstage.ai/v1/document-digitization/requests/{request_id}"
headers = {"Authorization": f"Bearer {api_key}"}

response = requests.get(url, headers=headers)
result = response.json()
print(result)
```

**Expected Output:**

```json
{
    "id": "e7b1b3b0-1b3b-4b3b-8b3b-1b3b3b3b3b3b",
    "status": "completed",
    "model": "document-parse",
    "failure_message": "",
    "total_pages": 28,
    "completed_pages": 28,
    "batches": [
        {
            "id": 0,
            "model": "document-parse-240910",
            "status": "completed",
            "failure_message": "",
            "download_url": "https://download-url",
            "start_page": 1,
            "end_page": 10,
            "requested_at": "2024-07-01T14:47:01.863880448Z",
            "updated_at": "2024-07-01T14:47:15.901662097Z"
        },
        ...
```
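
Steps 2 and 3 of the flow are typically combined into a polling loop: check the status until it reads `completed`, then collect each batch's `download_url`. The sketch below keeps the HTTP call outside the loop so it can be tested without network access. `wait_for_completion` and `fetch_status` are hypothetical names, the polling interval and retry count are arbitrary, and the `"failed"` status value is an assumption (the output above shows only `"completed"` and a `failure_message` field).

```python
import time


def wait_for_completion(fetch_status, poll_seconds=5.0, max_polls=60):
    """Poll fetch_status() until the request completes; return the batch download URLs."""
    for _ in range(max_polls):
        status = fetch_status()
        if status["status"] == "completed":
            return [batch["download_url"] for batch in status["batches"]]
        if status["status"] == "failed":  # assumed status value, see the note above
            raise RuntimeError(status.get("failure_message") or "request failed")
        time.sleep(poll_seconds)
    raise TimeoutError("request did not complete within the polling budget")


# Usage with the status endpoint shown above (requires `requests` and a real request_id):
#   fetch = lambda: requests.get(url, headers=headers).json()
#   for download_url in wait_for_completion(fetch):
#       batch_result = requests.get(download_url).json()
```

Passing the status call in as a function keeps the loop independent of the HTTP client, so retries, logging, or a different client can be swapped in without touching the polling logic.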

## 4. What is Document OCR?

Document OCR (Optical Character Recognition) is a technology that extracts text from images of documents.

Upstage Document OCR delivers fast and accurate recognition across various document formats.

**When to Use OCR?**

* When you need just the text, not the layout or structure
* For scanned images or photos
* For simple preprocessing in automation pipelines

**Example Use Cases:**

* Extracting names and ID numbers from scanned identity cards
* Capturing key points from whiteboard snapshots

### 4.1. Demo: Handwritten letter translator using Document OCR

In this demo, users can upload an image of a handwritten Korean letter. The system uses **Upstage Document OCR API** to extract the text, and **Solar LLM** to translate the content into English.

📩 **Korean Handwriting Translator**

<figure><img src="/files/1DfqXoObEoQMMH1GRn0I" alt=""><figcaption></figcaption></figure>

> [**DEMO LINK**](https://huggingface.co/spaces/Yescia/Document_OCR_Demo)\
> 🔗Upload your handwritten letters and try out Document OCR in action!

📩 The sample letter used in this demo was recreated with generative AI, based on an actual handwritten note reported in [*Financial News* (Korea)](https://www.fnnews.com/news/202110091005245880).

**✨ Key Features**

* Extracts handwritten text from images using Upstage Document OCR
* Translates Korean to English using Solar LLM

**🖥️ Example Code**

* Full implementation and details:

  🔗 [View on Huggingface](https://huggingface.co/spaces/Yescia/Document_OCR_Demo/blob/main/app.py)

{% hint style="success" %}
⚡️ Need quick text extraction from documents? Try Document OCR for lightweight and fast results!
{% endhint %}

### 4.2. Document OCR - Input, Output Format

#### 📥 Input Requirements

* **Supported file formats**: JPEG, PNG, BMP, PDF, TIFF, HEIC, DOCX, PPTX, XLSX
* **Max file size**: 50MB
* **Max page count**: 30 pages
* **Max pixels per page**: 100 million (based on 150 DPI)
* **Supported languages**: Korean, English, Chinese (Hanzi)
* **Text size condition**: Optimized for text occupying ≤30% of the page. Larger text blocks may reduce accuracy.

#### 📤 Output Structure

```json
{
  "apiVersion": "1.1",
  "modelVersion": "ocr-2.2.1",
  "pages": [
    { "page": 1,
      "text": "Print the words \nhello, world",
      "confidence": 0.99, 
      "words": [
        {
          "text": "hello",
          "boundingBox": {
            "vertices": [
              { "x": 65, "y": 52 },
              { "x": 221, "y": 55 },
              { "x": 221, "y": 104 },
              { "x": 65, "y": 104 } ]}}]}]}
```

Each recognized word includes:

* `text`: the extracted string
* `confidence`: recognition score (0.0–1.0)
* `boundingBox`: word position in pixel coordinates
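
The four `vertices` describe a quadrilateral that may be slightly tilted, so a common first step is to reduce each word to an axis-aligned rectangle. The following sketch does that for every word in a response. `word_boxes` is a hypothetical helper, and `ocr_sample` repeats the structure shown above.

```python
def word_boxes(ocr_result: dict) -> list[tuple]:
    """Return (text, left, top, right, bottom) for every recognized word."""
    boxes = []
    for page in ocr_result.get("pages", []):
        for word in page.get("words", []):
            vertices = word["boundingBox"]["vertices"]
            xs = [vertex["x"] for vertex in vertices]
            ys = [vertex["y"] for vertex in vertices]
            # Smallest axis-aligned rectangle that contains all four vertices
            boxes.append((word["text"], min(xs), min(ys), max(xs), max(ys)))
    return boxes


# Same structure as the output shown above
ocr_sample = {
    "pages": [{
        "page": 1,
        "words": [{
            "text": "hello",
            "boundingBox": {"vertices": [
                {"x": 65, "y": 52}, {"x": 221, "y": 55},
                {"x": 221, "y": 104}, {"x": 65, "y": 104},
            ]},
        }],
    }]
}
print(word_boxes(ocr_sample))
```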

**▶️ OCR Robustness**

Upstage OCR remains accurate even under challenging conditions:

* Rotated or skewed text
* Background watermarks or checkboxes
* Low-quality scans or document noise

> The model precisely detects text boxes and filters out irrelevant elements like watermarks.

**▶️ Confidence Score**

Each word includes a **confidence score** based on character-level recognition.

Use this to:

* Filter out low-confidence outputs
* Prompt users to verify uncertain sections
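
Both uses can be sketched with a single threshold. The example assumes each word object carries a `confidence` field as described above. `split_by_confidence` is a hypothetical helper, the 0.8 threshold is an arbitrary starting point to tune on your own documents, and words without a score are treated as uncertain.

```python
def split_by_confidence(words, threshold=0.8):
    """Split recognized words into (confident, needs_review) text lists."""
    confident, needs_review = [], []
    for word in words:
        # Assumption: a word without a confidence score is sent to review.
        score = word.get("confidence", 0.0)
        target = confident if score >= threshold else needs_review
        target.append(word["text"])
    return confident, needs_review


# Invented words for illustration
words = [
    {"text": "hello", "confidence": 0.99},
    {"text": "w0rld", "confidence": 0.42},
    {"text": "unscored"},
]
accepted, flagged = split_by_confidence(words)
print("accepted:", accepted)
print("flagged for review:", flagged)
```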

### 4.3. Getting Started with Document OCR API

Here’s a simple example to get you started:

```python
# pip install requests

import requests

api_key = "UPSTAGE_API_KEY" # ex: up_xxxYYYzzzAAAbbbCCC
filename = "your_file.pdf" # ex: ./image.png

url = "https://api.upstage.ai/v1/document-digitization"
headers = {"Authorization": f"Bearer {api_key}"}
data = {"model": "ocr"}

# Open the file in a context manager so the handle is closed after the upload
with open(filename, "rb") as document:
    response = requests.post(url, headers=headers, files={"document": document}, data=data)

print(response.json())
```

* Sample response:

  ```json
  {
    "apiVersion": "1.1",
    "modelVersion": "ocr-2.2.1",
    "pages": [
      {
        "page": 1,
        "text": "Print the words \nhello, world",
        "confidence": 0.99, 
        "words": [
          {
            "text": "hello",
            "boundingBox": {
              "vertices": [
                { "x": 65, "y": 52 },
                { "x": 221, "y": 55 },
                { "x": 221, "y": 104 },
                { "x": 65, "y": 104 } ]}}]}]}
  ```

***

## Wrap-Up

Let’s summarize what we’ve learned:

**🔹 What is Document Digitization?**

The process of converting documents into machine-readable formats (HTML, Markdown, etc.) so AI systems can understand and use them for tasks like search, summarization, and Q\&A. It’s the **first step** in document-based AI workflows.

**🔹 Why is Document Digitization Important?**

Most documents contain structured, visual elements like tables, charts, and headings.

LLMs struggle with unstructured content, so digitizing and structuring documents enables integration into **automated pipelines and intelligent services**.

**🔹 Document OCR vs. Document Parse**

<table><thead><tr><th width="128.33331298828125">Feature</th><th width="299">Document OCR</th><th>Document Parse</th></tr></thead><tbody><tr><td>Goal</td><td>Extract text quickly</td><td>Structure layout for LLM input</td></tr><tr><td>Input Type</td><td>Scanned images, photos</td><td>PDFs, Office docs, scanned files</td></tr><tr><td>Output</td><td>Plain text with coordinates</td><td>HTML, Markdown, and structure</td></tr><tr><td>Strength</td><td>Fast preprocessing</td><td>Deep document understanding</td></tr></tbody></table>

**YoungHoon Jeon** | **AI Edu** | **Upstage**

