Getting Started with Document Digitization
1. What is Document Digitization?

Document digitization refers to converting documents into machine-readable formats, such as text, HTML, or Markdown. These digitized documents can then be used for various AI-driven tasks such as information retrieval, summarization, and data extraction.
If you are participating in the AI Initiative program, you can access Document Parsing for free until March 31st, 2026. Apply here: LINK
Document Parsing:
Extracts and structures both text and layout elements (e.g., paragraphs, tables, images) into HTML or Markdown.
Transforms documents into formats that are easily understood by LLMs, supporting various downstream tasks.
Includes OCR capabilities but goes beyond traditional OCR by preserving higher-level structural information.
Document OCR:
Extracts only text and positional information, best suited for basic text recognition tasks.
👉 While Document Parse uses OCR under the hood, it also includes advanced layout detection and table/chart recognition, making it significantly more powerful. If you only need fast text extraction, use Document OCR; if you need structured output, use Document Parse.
2. When Should You Use the Document Digitization API?
Document automation goes beyond simply extracting text. Its goal is to structure documents so that AI can understand and process them.
While the console demo is excellent for testing and learning, workflows such as service integration, automation pipelines, and production applications require API use.
✅ Console Demo: Manual upload and testing
✅ API: Systematic, automated document ingestion and processing in production
👁️🗨️ Preprocessing for LLM Input
Document Digitization is a key preprocessing step for LLM pipelines.
For LLM-based applications like chatbots, search, or summarization, documents must first be segmented into paragraphs, tables, images, etc., and their structure must be recognized. Document parsing is crucial to ensure the LLM can accurately interpret the document.
Most LLM pipelines operate in backend systems. API integration is essential for processing documents in real time.
Use Cases:
Building a RAG system that splits patent documents into paragraphs, then retrieves relevant content based on user queries.
Parsing academic papers to power apps with summarization and highlighting features.
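As a minimal sketch of the first step in such a pipeline, the snippet below splits Document Parse-style HTML (covered in the next section) into paragraph-level chunks using only the standard library. The ChunkCollector class and the sample HTML string are illustrative, not part of any Upstage SDK:

```python
# Hypothetical sketch: split structured HTML into block-level text chunks
# for a RAG pipeline. The sample HTML is illustrative, not real API output.
from html.parser import HTMLParser

class ChunkCollector(HTMLParser):
    """Collects the text of each top-level block element as one chunk."""
    BLOCK_TAGS = {"p", "h1", "h2", "h3", "table", "figure"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._buf = []
        self._depth = 0  # tracks nesting of block tags (e.g. table inside figure)

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._depth -= 1
            if self._depth == 0 and self._buf:
                self.chunks.append(" ".join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self._buf.append(data.strip())

html = "<h1 id='0'>INVOICE</h1><p data-category='paragraph'>Total due: $120</p>"
collector = ChunkCollector()
collector.feed(html)
print(collector.chunks)  # ['INVOICE', 'Total due: $120']
```

Each chunk can then be embedded and indexed for retrieval.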
Let’s dive deeper into Document Parse, the engine that powers such workflows.
3. What is Document Parse?
Upstage Document Parse automatically converts various types of documents into structured HTML. It detects layout elements such as paragraphs, tables, images, formulas, and charts, and serializes them in a logical reading order for LLMs to consume.
3.1. Demo: Financial Statement Analysis Chatbot
Before we explore the technical details, check out this demo:

This chatbot uses Document Parse to convert a financial statement into HTML, enabling free-form Q&A based on the content.
📄 Financial Statement Q&A Chatbot
✨ Key Features
Converts financial documents into HTML using Document Parse API
Enables document-based Q&A using Solar LLM
🖥️ Example Code
Complete source code and explanation:
⚡️ Document Parse becomes the "eyes" of the LLM, enabling it to understand complex documents and generate accurate responses.
3.2. Document Parse - Input, Output Format
📥 Input Requirements
Supported file formats: JPEG, PNG, BMP, PDF, TIFF, HEIC, DOCX, PPTX, XLSX
Max file size: 50MB
Max page count:
Synchronous API: up to 100 pages
Asynchronous API: up to 1,000 pages
Max pixels per page: 100 million pixels (based on 150 DPI conversion)
Supported languages (for OCR): Korean, English, numbers, Chinese (Hanzi, beta), Japanese (Kanji, beta)
📤 Output Structure
Documents include more than plain text—they contain headings, tables, charts, images, and more. The structure is preserved by converting these elements into HTML, allowing LLMs to understand the content accurately.
▶️ Layout Categories & HTML Tags
Document Parse uses HTML tags to express the layout of each document component:
table: <table>...</table>
figure: <figure><img>...</img></figure>
chart: <figure><img data-category="chart">...</img></figure>
heading1: <h1>...</h1>
paragraph: <p data-category="paragraph">...</p>
equation: <p data-category="equation">$$...$$</p>
list: <p data-category="list">...</p>
Other elements, such as headers, footers, captions, indexes, and footnotes, are also recognized and appropriately tagged or assigned data-category attributes.
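Because every element carries a data-category attribute, you can filter the HTML output by category. Below is a minimal sketch using a regular expression (fine for quick scripts; a real pipeline would use a proper HTML parser). The helper function and sample HTML are illustrative:

```python
# Hypothetical sketch: extract the inner text of <p> elements with a
# given data-category from Document Parse-style HTML.
import re

def elements_by_category(html: str, category: str) -> list:
    """Return inner content of <p> elements tagged with the category."""
    pattern = r"<p[^>]*data-category='" + re.escape(category) + r"'[^>]*>(.*?)</p>"
    return re.findall(pattern, html, flags=re.DOTALL)

html = ("<p data-category='paragraph'>Intro text</p>"
        "<p data-category='equation'>$$E = mc^2$$</p>")
print(elements_by_category(html, "equation"))  # ['$$E = mc^2$$']
```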
▶️ Chart Recognition
Charts, which are often embedded as images, are interpreted and converted into table format for further use.
Supported chart types: Bar, Line, Pie
Embedded in HTML as:
<figure data-category="chart"><table>...</table></figure>
▶️ Equation Recognition
Mathematical expressions are converted to LaTeX and wrapped with:
<p data-category="equation">$$...$$</p>
You can render these on web interfaces using libraries like MathJax.
▶️ Coordinates (Relative Position)
Each layout element includes position metadata using relative coordinates (0–1), which can be used for cropping or UI visualization.
For example:
"coordinates": [ { "x": 0.0276, "y": 0.0178 }, { "x": 0.1755, "y": 0.0178 }, { "x": 0.1755, "y": 0.0641 }, { "x": 0.0276, "y": 0.0641 } ]
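As a sketch of how these coordinates might be used, the hypothetical helper below (not part of the API) converts the four relative corner points into a pixel bounding box suitable for cropping; the page dimensions are illustrative:

```python
# Hypothetical helper: convert relative corner coordinates (0-1) from a
# Document Parse element into a pixel bounding box for cropping.
def to_pixel_box(coordinates, page_width, page_height):
    """Return (left, top, right, bottom) in pixels."""
    xs = [pt["x"] * page_width for pt in coordinates]
    ys = [pt["y"] * page_height for pt in coordinates]
    return (min(xs), min(ys), max(xs), max(ys))

# Corner points as returned by the API; 1000x1400 px is an illustrative page size.
coords = [
    {"x": 0.0276, "y": 0.0178}, {"x": 0.1755, "y": 0.0178},
    {"x": 0.1755, "y": 0.0641}, {"x": 0.0276, "y": 0.0641},
]
print(to_pixel_box(coords, 1000, 1400))
```

The resulting box can be passed to an image library's crop function to cut out the element for display or further processing.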
3.3. Getting Started with Document Parse API
Depending on your use case, the API can be called in two modes: synchronous and asynchronous.
🔁 Synchronous API
In synchronous mode, the API waits for the process to finish and returns the result immediately.
Think of it as ordering food and waiting at the restaurant for your dish.
Key Characteristics:
Supports up to 100 pages
Real-time response
Ideal for testing or low-latency needs
Python Example:
import requests

api_key = "UPSTAGE_API_KEY"  # ex: up_xxxYYYzzzAAAbbbCCC
filename = "your_file.pdf"  # ex: ./image.png

response = requests.post(
    "https://api.upstage.ai/v1/document-digitization",
    headers={"Authorization": f"Bearer {api_key}"},
    files={"document": open(filename, "rb")},
    data={
        "ocr": "force",  # Force OCR (if set to "auto", OCR is performed on image documents only)
        "coordinates": True,  # Whether to return position information for each layout element
        "chart_recognition": True,  # Whether to recognize charts
        "output_formats": '["html"]',  # Return results in HTML format (text, markdown are also possible)
        "base64_encoding": '["table"]',  # Request base64 encoding for tables
        "model": "document-parse",  # Specify model to use
    },
)
print(response.json())
Expected Output:
{
  "api": "2.0",
  "content": {
    "html": "<h1 id='0' style='font-size:22px'>INVOICE</h1>\n<h1 id='1' style='font-size:20px'>Company<br>Upstage</h1>\n<br><h1 id='2' style='font-size:18px'>Invoice ID</h1>\n<br><h1 id='3' style='font-size:14px'>휴 INV-AJ355548</h1>\n<h1 id='4' style='font-size:18px'>Invoice Date</h1>\n<br><h1 id='5' style='font-size:18px'>9/7/1992</h1>\n<h1 id='6' style='font-size:16px'>Mamo<br>Lucy Park</h1>\n<h1 id='7' style='font-size:18px'>Address</h1>\n<br><h1 id='8' style='font-size:16px'>7 Pepper Wood Street, 130 Stone Comer<br>Terrace<br>Wilkes Barre, Pennsylvania, 18768<br>United States</h1>\n<h1 id='9' style='font-size:16px'>Email</h1>\n<br><h1 id='10' style='font-size:16px'>Ikitchenman0@arizona.edu</h1>\n<br><h1 id='11' style='font-size:20px'>Service Details Form</h1>\n<h1 id='12' style='font-size:16px'>Name<br>Sung Kim</h1>\n<h1 id='13' style='font-size:16px'>260 'ess<br>Gwangovolungang:co 338, Gyeongg do.<br>Sanghyeon-dong, Sui-gu<br>Yongin-si, South Korea</h1>\n<h1 id='14' style='font-size:18px'>Additional Request</h1>\n<br><p id='15' data-category='paragraph' style='font-size:14px'>Vivamus vestibulum sagittis sapien. Cum sociis natoque<br>penatibus 항목 magnis dfs parturient montes, nascetur ridiculus<br>mus.</p>\n<h1 id='16' style='font-size:14px'>TERMS AND CONDITIONS</h1>\n<p id='17' data-category='list' style='font-size:14px'>L TM Seir that not be lable 1층 the Buyer drectly indirectly for any loun or damage sufflered by 전액 Buyer<br>2. The 별 www. the product for ore 과 관한 from the date 설 shipment.<br>3. Any ourchase order received by ~ sele - be interpreted 추가 accepting the offer Ma the 18% offer writing The buyer may<br>purchase 15 The offer My the Terms and Conditions the Seller included The offer</p>",
    "markdown": "",
    "text": ""
  },
  "elements": [
    {
      "category": "heading1",
      "content": {
        "html": "<h1 id='0' style='font-size:22px'>INVOICE</h1>",
        "markdown": "",
        "text": ""
      },
      "coordinates": [
        { "x": 0.0648, "y": 0.0517 },
        ...
      ],
      "id": 0,
      "page": 1
    },
    ...
  ],
  ...
}
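Assuming a response shaped like the output above, the fields can be read as follows. This is a minimal sketch; the abbreviated `result` dict stands in for `response.json()`:

```python
# Abbreviated stand-in for response.json() from the synchronous call above.
result = {
    "api": "2.0",
    "content": {"html": "<h1 id='0'>INVOICE</h1>", "markdown": "", "text": ""},
    "elements": [
        {"category": "heading1", "id": 0, "page": 1,
         "content": {"html": "<h1 id='0'>INVOICE</h1>", "markdown": "", "text": ""}},
    ],
}

full_html = result["content"]["html"]  # the whole document as one HTML string
categories = [el["category"] for el in result["elements"]]  # per-element layout labels
print(categories)  # ['heading1']
```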
⏳ Asynchronous API
In asynchronous mode, the API immediately returns a request_id, and the results can be retrieved later.
Think of this like ordering take-away food and receiving a notification when your food is ready.
Key Characteristics:
Supports up to 1,000 pages
Returns request_id instantly
Ideal for large-scale batch processing
Step-by-Step Flow:
1. Send request → receive request_id
2. Use the status check API to monitor progress
3. Retrieve the final result from download_url
Python Example:
Sending an asynchronous request
# 1. Sending an asynchronous request
import requests

api_key = "UPSTAGE_API_KEY"  # ex: up_xxxYYYzzzAAAbbbCCC
filename = "your_file.pdf"  # ex: ./image.png

url = "https://api.upstage.ai/v1/document-digitization/async"
headers = {"Authorization": f"Bearer {api_key}"}
files = {"document": open(filename, "rb")}
data = {"model": "document-parse"}

response = requests.post(url, headers=headers, files=files, data=data)
print(response.json())
# {"request_id": "e7b1..."}
Checking the result with the request_id
# 2. Checking the result with the request_id
import requests

api_key = "UPSTAGE_API_KEY"
request_id = "enter_request_id_you_received"  # e.g. e7b1b3b0-1b3b-...

url = f"https://api.upstage.ai/v1/document-digitization/requests/{request_id}"
headers = {"Authorization": f"Bearer {api_key}"}

response = requests.get(url, headers=headers)
result = response.json()
Expected Output:
{
"id": "e7b1b3b0-1b3b-4b3b-8b3b-1b3b3b3b3b3b",
"status": "completed",
"model": "document-parse",
"failure_message": "",
"total_pages": 28,
"completed_pages": 28,
"batches": [
{
"id": 0,
"model": "document-parse-240910",
"status": "completed",
"failure_message": "",
"download_url": "https://download-url",
"start_page": 1,
"end_page": 10,
"requested_at": "2024-07-01T14:47:01.863880448Z",
"updated_at": "2024-07-01T14:47:15.901662097Z"
},
...
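The three-step flow (request, poll, download) can be wrapped in a small helper. This is a hypothetical sketch: the endpoint path follows the examples in this guide, and the polling interval is an arbitrary choice:

```python
# Hypothetical polling helper for the asynchronous Document Parse flow.
import time

import requests

STATUS_URL = "https://api.upstage.ai/v1/document-digitization/requests/{request_id}"

def status_url(request_id: str) -> str:
    """Build the status-check URL for a given request_id."""
    return STATUS_URL.format(request_id=request_id)

def wait_for_result(api_key: str, request_id: str, interval: float = 5.0) -> list:
    """Poll until the job completes, then download every batch result."""
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        status = requests.get(status_url(request_id), headers=headers).json()
        if status["status"] == "completed":
            # Each batch carries its own download_url with the parsed pages.
            return [requests.get(b["download_url"]).json() for b in status["batches"]]
        if status["status"] == "failed":
            raise RuntimeError(status.get("failure_message", "parsing failed"))
        time.sleep(interval)
```

In production you would also cap the number of polling attempts and handle the "failed" status per batch rather than for the whole request.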
4. What is Document OCR?
Document OCR (Optical Character Recognition) is a technology that extracts text from images of documents.
Upstage Document OCR delivers fast and accurate recognition across various document formats.
When to Use OCR?
When you need just the text, not the layout or structure
For scanned images or photos
For simple preprocessing in automation pipelines
Example Use Cases:
Extracting names and ID numbers from scanned identity cards
Extracting key points from whiteboard snapshots
4.1. Demo: Handwritten letter translator using Document OCR
In this demo, users can upload an image of a handwritten Korean letter. The system uses Upstage Document OCR API to extract the text, and Solar LLM to translate the content into English.
📩 Korean Handwriting Translator

DEMO LINK 🔗 Upload your handwritten letters and try out Document OCR in action!
📩 The sample letter used in this demo is an actual handwritten note, recreated with generative AI based on an original featured in Korean financial news.
✨ Key Features
Extracts handwritten text from images using Upstage Document OCR
Translates Korean to English using Solar LLM
🖥️ Example Code
Full implementation and details:
⚡️ Need quick text extraction from documents? Try Document OCR for lightweight and fast results!
4.2. Document OCR - Input, Output Format
📥 Input Requirements
Supported file formats: JPEG, PNG, BMP, PDF, TIFF, HEIC, DOCX, PPTX, XLSX
Max file size: 50MB
Max page count: 30 pages
Max pixels per page: 100 million (based on 150 DPI)
Supported languages: Korean, English, Chinese (Hanzi)
Text size condition: Optimized for text occupying ≤30% of the page. Larger text blocks may reduce accuracy.
📤 Output Structure
{
  "apiVersion": "1.1",
  "modelVersion": "ocr-2.2.1",
  "pages": [
    {
      "page": 1,
      "text": "Print the words \nhello, world",
      "confidence": 0.99,
      "words": [
        {
          "text": "hello",
          "boundingBox": {
            "vertices": [
              { "x": 65, "y": 52 },
              { "x": 221, "y": 55 },
              { "x": 221, "y": 104 },
              { "x": 65, "y": 104 }
            ]
          }
        }
      ]
    }
  ]
}
Each recognized word includes:
text: the extracted string
confidence: recognition score (0.0–1.0)
boundingBox: word position in pixel coordinates
▶️ OCR Robustness
Upstage OCR remains accurate even under challenging conditions:
Rotated or skewed text
Background watermarks or checkboxes
Low-quality scans or document noise
The model precisely detects text boxes and filters out irrelevant elements like watermarks.
▶️ Confidence Score
Each word includes a confidence score based on character-level recognition.
Use this to:
Filter out low-confidence outputs
Prompt users to verify uncertain sections
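A minimal sketch of the first use, assuming the word-level fields described above; the threshold and sample data are illustrative:

```python
# Hypothetical sketch: flag low-confidence words in a Document OCR
# response so users can verify them.
def low_confidence_words(page: dict, threshold: float = 0.9) -> list:
    """Return the text of words whose confidence falls below the threshold."""
    return [w["text"] for w in page["words"] if w.get("confidence", 1.0) < threshold]

page = {
    "words": [
        {"text": "hello", "confidence": 0.99},
        {"text": "w0rld", "confidence": 0.62},
    ]
}
print(low_confidence_words(page))  # ['w0rld']
```

The flagged words can be highlighted in a review UI or routed to a human-in-the-loop check.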
4.3. Getting Started with Document OCR API
Here’s a simple example to get you started:
# pip install requests
import requests
api_key = "UPSTAGE_API_KEY" # ex: up_xxxYYYzzzAAAbbbCCC
filename = "your_file.pdf" # ex: ./image.png
url = "https://api.upstage.ai/v1/document-digitization"
headers = {"Authorization": f"Bearer {api_key}"}
files = {"document": open(filename, "rb")}
data = {"model": "ocr"}
response = requests.post(url, headers=headers, files=files, data=data)
print(response.json())
Sample response:
{ "apiVersion": "1.1", "modelVersion": "ocr-2.2.1", "pages": [ { "page": 1, "text": "Print the words \nhello, world", "confidence": 0.99, "words": [ { "text": "hello", "boundingBox": { "vertices": [ { "x": 65, "y": 52 }, { "x": 221, "y": 55 }, { "x": 221, "y": 104 }, { "x": 65, "y": 104 } ]}}]}]}
Wrap-Up
Let’s summarize what we’ve learned:
🔹 What is Document Digitization?
The process of converting documents into machine-readable formats (HTML, Markdown, etc.) so AI systems can understand and use them for tasks like search, summarization, and Q&A. It’s the first step in document-based AI workflows.
🔹 Why is Document Digitization Important?
Most documents contain structured, visual elements like tables, charts, and headings.
LLMs struggle with unstructured content, so digitizing and structuring documents enables integration into automated pipelines and intelligent services.
🔹 OCR vs Document Parse API
Document OCR:
Goal: Extract text quickly
Input Type: Scanned images, photos
Output: Plain text with coordinates
Strength: Fast preprocessing
Document Parse:
Goal: Structure layout for LLM input
Input Type: PDFs, Office docs, scanned files
Output: HTML, Markdown, and structure
Strength: Deep document understanding
YoungHoon Jeon | AI Edu | Upstage