Getting Started with Document Digitization
Document digitization refers to converting documents into machine-readable formats, such as text, HTML, or Markdown. These digitized documents can then be used for various AI-driven tasks such as information retrieval, summarization, and data extraction.
Document Parse:
- Extracts and structures both text and layout elements (e.g., paragraphs, tables, images) into HTML or Markdown.
- Transforms documents into formats that are easily understandable by LLMs, supporting various downstream tasks.
- Includes OCR capabilities but goes beyond traditional OCR by preserving higher-level structural information.

Document OCR:
- Extracts only text and positional information, best suited for basic text recognition tasks.
👉 While Document Parse uses OCR under the hood, it also includes advanced layout detection and table/chart recognition, making it significantly more powerful. If you only need fast text extraction, use Document OCR; if you need structured data, use Document Parse.
Document automation goes beyond simply extracting text. Its goal is to structure documents so that AI can understand and process them.
While the console demo is excellent for testing and learning, workflows such as service integration, automation pipelines, and production applications require API use.
✅ Console Demo: Manual upload and testing
✅ API: Systematic, automated document ingestion and processing in production
Document Digitization is a key preprocessing step for LLM pipelines.
For LLM-based applications like chatbots, search, or summarization, documents must first be segmented into paragraphs, tables, images, etc., and their structure must be recognized. Document parsing is crucial to ensure the LLM can accurately interpret the document.
Most LLM pipelines operate in backend systems. API integration is essential for processing documents in real time.
Use Cases:
Building a RAG system that splits patent documents into paragraphs, then retrieves relevant content based on user queries.
Parsing academic papers to power apps with summarization and highlighting features.
Let’s dive deeper into Document Parse, the engine that powers such workflows.
Upstage Document Parse automatically converts various types of documents into structured HTML. It detects layout elements such as paragraphs, tables, images, formulas, and charts, and serializes them in a logical reading order for LLMs to consume.
Before we explore the technical details, check out this demo:
This chatbot uses Document Parse to convert a financial statement into HTML, enabling free-form Q&A based on the content.
📄 Financial Statement Q&A Chatbot
✨ Key Features
Converts financial documents into HTML using Document Parse API
Enables document-based Q&A using Solar LLM
🖥️ Example Code
Complete source code and explanation:
⚡️ Document Parse becomes the "eyes" of the LLM, enabling it to understand complex documents and generate accurate responses.
Supported file formats: JPEG, PNG, BMP, PDF, TIFF, HEIC, DOCX, PPTX, XLSX
Max file size: 50MB
Max page count:
Synchronous API: up to 100 pages
Asynchronous API: up to 1,000 pages
Max pixels per page: 100 million pixels (based on 150 DPI conversion)
Supported languages (for OCR): Korean, English, numbers, Chinese (Hanzi, beta), Japanese (Kanji, beta)
Documents include more than plain text: they contain headings, tables, charts, images, and more. Converting these elements into HTML preserves their structure, allowing LLMs to understand the content accurately.
▶️ Layout Categories & HTML Tags
Document Parse uses HTML tags to express the layout of each document component:
| Category | HTML representation |
| --- | --- |
| table | `<table>...</table>` |
| figure | `<figure><img>...</img></figure>` |
| chart | `<figure><img data-category="chart">...</img></figure>` |
| heading1 | `<h1>...</h1>` |
| paragraph | `<p data-category="paragraph">...</p>` |
| equation | `<p data-category="equation">$$...$$</p>` |
| list | `<p data-category="list">...</p>` |
Other elements such as headers, footers, captions, indexes, and footnotes are also recognized and given appropriate tags or data-category attributes.
▶️ Chart Recognition
Charts, which are often embedded as images, are interpreted and converted into a table format for further use.
Supported chart types: Bar, Line, Pie
Embedded in HTML as:
<figure data-category="chart"><table>...</table></figure>
▶️ Equation Recognition
Mathematical expressions are converted to LaTeX and wrapped in $$...$$ delimiters inside a <p data-category="equation"> element.
You can render these on web interfaces using libraries like MathJax.
▶️ Coordinates (Relative Position)
Each layout element includes position metadata using relative coordinates (0–1), which can be used for cropping or UI visualization.
For example:
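As a minimal sketch (the element structure and field names below are illustrative, based on the relative-coordinate description above), here is how such coordinates can be converted into a pixel box for cropping or overlay drawing:

```python
# Hypothetical layout element, mirroring the relative-coordinate format
# described above (corner points with values in the 0-1 range).
element = {
    "category": "table",
    "coordinates": [
        {"x": 0.12, "y": 0.30},  # top-left
        {"x": 0.88, "y": 0.30},  # top-right
        {"x": 0.88, "y": 0.55},  # bottom-right
        {"x": 0.12, "y": 0.55},  # bottom-left
    ],
}

def to_pixel_box(coords, page_width, page_height):
    """Convert relative corner points to an absolute (left, top, right, bottom) box."""
    xs = [pt["x"] * page_width for pt in coords]
    ys = [pt["y"] * page_height for pt in coords]
    return (min(xs), min(ys), max(xs), max(ys))

# Pixel box for a page rendered at 1240 x 1754 pixels (A4 at ~150 DPI)
print(to_pixel_box(element["coordinates"], 1240, 1754))
```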
Depending on your use case, the API can be called in two ways: synchronous or asynchronous.
In synchronous mode, the API waits for the process to finish and returns the result immediately.
Think of it as ordering food and waiting at the restaurant for your dish.
Key Characteristics:
Supports up to 100 pages
Real-time response
Ideal for testing or low-latency needs
Python Example:
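A minimal sketch of the synchronous call (the endpoint URL, model name, and form-field names are assumptions based on Upstage's public documentation; verify against the current API reference):

```python
import requests

# Assumed endpoint and model name for the synchronous Document Parse API
API_URL = "https://api.upstage.ai/v1/document-digitization"

def parse_document(path: str, api_key: str) -> dict:
    """Upload a file and return the parsed result as JSON (synchronous call)."""
    headers = {"Authorization": f"Bearer {api_key}"}
    with open(path, "rb") as f:
        response = requests.post(
            API_URL,
            headers=headers,
            files={"document": f},
            data={"model": "document-parse"},
        )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = parse_document("sample.pdf", "UPSTAGE_API_KEY")
    print(result["content"]["html"][:300])  # first part of the generated HTML
```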
Expected Output: a JSON response whose content field contains the parsed document as HTML, along with the list of detected layout elements.
In asynchronous mode, the API immediately returns a request_id, and the results can be retrieved later.
Think of this like ordering take-away food and receiving a notification when your food is ready.
Key Characteristics:
Supports up to 1,000 pages
Returns a request_id instantly
Ideal for large-scale batch processing
Step-by-Step Flow:
Send request → receive request_id
Use the status check API to monitor progress
Retrieve the final result from download_url
Python Example:
Sending an asynchronous request
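A sketch of submitting an asynchronous request (the async endpoint URL and field names are assumptions based on Upstage's public documentation; check the API reference before use):

```python
import requests

# Assumed endpoint for asynchronous submission
ASYNC_URL = "https://api.upstage.ai/v1/document-digitization/async"

def submit_async(path: str, api_key: str) -> str:
    """Submit a document for asynchronous parsing and return its request_id."""
    headers = {"Authorization": f"Bearer {api_key}"}
    with open(path, "rb") as f:
        response = requests.post(
            ASYNC_URL,
            headers=headers,
            files={"document": f},
            data={"model": "document-parse"},
        )
    response.raise_for_status()
    return response.json()["request_id"]

if __name__ == "__main__":
    request_id = submit_async("large_report.pdf", "UPSTAGE_API_KEY")
    print(request_id)  # keep this id to poll for the result later
```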
Check request results with request_id
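A sketch of polling for completion (the status URL and the status/download_url response fields are assumptions based on the step-by-step flow described above):

```python
import time
import requests

# Assumed status-check endpoint; verify against the API reference
STATUS_URL = "https://api.upstage.ai/v1/requests/{request_id}"

def wait_for_result(request_id: str, api_key: str, interval: float = 5.0) -> dict:
    """Poll the status endpoint until processing finishes, then return the status JSON."""
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        response = requests.get(
            STATUS_URL.format(request_id=request_id), headers=headers
        )
        response.raise_for_status()
        status = response.json()
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)  # wait before polling again

if __name__ == "__main__":
    result = wait_for_result("YOUR_REQUEST_ID", "UPSTAGE_API_KEY")
    # Completed jobs list download URLs for the results; fetch each one.
    for batch in result.get("batches", []):
        print(batch.get("download_url"))
```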
Expected Output: a JSON status object; once processing is complete, it includes a download_url for retrieving the final result.
Document OCR (Optical Character Recognition) is a technology that extracts text from images of documents.
Upstage Document OCR delivers fast and accurate recognition across various document formats.
When to Use OCR?
When you need just the text, not the layout or structure
For scanned images or photos
For simple preprocessing in automation pipelines
Example Use Cases:
Extracting names and ID numbers from scanned identity cards
Extracting key points from whiteboard snapshots
In this demo, users can upload an image of a handwritten Korean letter. The system uses Upstage Document OCR API to extract the text, and Solar LLM to translate the content into English.
📩 Korean Handwriting Translator
✨ Key Features
Extracts handwritten text from images using Upstage Document OCR
Translates Korean to English using Solar LLM
🖥️ Example Code
Full implementation and details:
⚡️ Need quick text extraction from documents? Try Document OCR for lightweight and fast results!
Supported file formats: JPEG, PNG, BMP, PDF, TIFF, HEIC, DOCX, PPTX, XLSX
Max file size: 50MB
Max page count: 30 pages
Max pixels per page: 100 million (based on 150 DPI)
Supported languages: Korean, English, Chinese (Hanzi)
Text size condition: Optimized for text occupying ≤30% of the page. Larger text blocks may reduce accuracy.
Each recognized word includes:
- text: the extracted string
- confidence: recognition score (0.0–1.0)
- boundingBox: word position in pixel coordinates
▶️ OCR Robustness
Upstage OCR remains accurate even under challenging conditions:
Rotated or skewed text
Background watermarks or checkboxes
Low-quality scans or document noise
The model precisely detects text boxes and filters out irrelevant elements like watermarks.
▶️ Confidence Score
Each word includes a confidence score based on character-level recognition.
Use this to:
Filter out low-confidence outputs
Prompt users to verify uncertain sections
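The two uses above can be sketched as a simple filter (the word list and threshold below are illustrative; the per-word fields match those listed earlier):

```python
# Hypothetical OCR output, shaped like the per-word fields described earlier
words = [
    {"text": "Invoice", "confidence": 0.98},
    {"text": "I0taL", "confidence": 0.41},
    {"text": "2024-05-01", "confidence": 0.95},
]

THRESHOLD = 0.8  # illustrative cutoff; tune per use case

# Split words into trusted output and items needing human review
kept = [w["text"] for w in words if w["confidence"] >= THRESHOLD]
flagged = [w["text"] for w in words if w["confidence"] < THRESHOLD]

print(kept)     # high-confidence words to pass downstream
print(flagged)  # low-confidence words to surface for verification
```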
Here’s a simple example to get you started:
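A minimal sketch of an OCR call (assuming the same consolidated endpoint as Document Parse with the model set to "ocr"; verify the endpoint and model name against the current API reference):

```python
import requests

# Assumed: the consolidated endpoint accepts model="ocr" for plain text extraction
API_URL = "https://api.upstage.ai/v1/document-digitization"

def run_ocr(path: str, api_key: str) -> dict:
    """Run OCR on an image or document and return the JSON result."""
    headers = {"Authorization": f"Bearer {api_key}"}
    with open(path, "rb") as f:
        response = requests.post(
            API_URL,
            headers=headers,
            files={"document": f},
            data={"model": "ocr"},
        )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = run_ocr("scan.png", "UPSTAGE_API_KEY")
    print(result.get("text", ""))  # full extracted text, if present
```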
Sample response: JSON listing each recognized word with its text, confidence, and boundingBox fields.
Let’s summarize what we’ve learned:
🔹 What is Document Digitization?
The process of converting documents into machine-readable formats (HTML, Markdown, etc.) so AI systems can understand and use them for tasks like search, summarization, and Q&A. It’s the first step in document-based AI workflows.
🔹 Why is Document Digitization Important?
Most documents contain structured, visual elements like tables, charts, and headings.
LLMs struggle with unstructured content, so digitizing and structuring documents enables integration into automated pipelines and intelligent services.
🔹 OCR vs Document Parse API
| | Document OCR | Document Parse |
| --- | --- | --- |
| Goal | Extract text quickly | Structure layout for LLM input |
| Input Type | Scanned images, photos | PDFs, Office docs, scanned files |
| Output | Plain text with coordinates | HTML, Markdown, and structure |
| Strength | Fast preprocessing | Deep document understanding |
YoungHoon Jeon | AI Edu | Upstage
If you are participating in the AI Initiative program, you can access Document Parsing for free until March 31st, 2026. Apply here:
🔗 Upload your handwritten letters and try out Document OCR in action!
📩 The sample letter used in this demo is an actual handwritten note, recreated using generative AI based on the original from .