Frontend Guide

Parser options

Every Ragify job accepts a JSON options object that controls how the PDF is parsed. In the web UI, these options are exposed as controls in the "Show parser options" panel. Via the API, pass them as JSON in the options form field.

bash

# All options as JSON in the options form field
curl -X POST https://api.ragify.it/jobs \
  -H "X-Api-Key: rg_..." \
  -F "file=@doc.pdf" \
  -F 'options={
    "format": ["markdown", "json"],
    "reading_order": "xycut",
    "table_method": "cluster",
    "pages": "1-10",
    "sanitize": true
  }'

◆ Note

Options not specified default to the values shown below. Unknown keys are rejected with HTTP 400.
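Because unknown keys are rejected with HTTP 400, it can help to check key names locally before submitting a job. The following is a minimal sketch, not part of the Ragify API: is_known_option is a hypothetical helper, and its key list is simply copied from the tables on this page, so it would need updating if the API adds or removes options.

```bash
# Hypothetical client-side helper: returns 0 if the key is one of the
# options documented on this page, 1 otherwise.
is_known_option() {
  case "$1" in
    format|reading_order|table_method) return 0 ;;
    image_output|image_format|pages|password) return 0 ;;
    include_header_footer|sanitize|keep_line_breaks) return 0 ;;
    detect_strikethrough|use_struct_tree|replace_invalid_chars) return 0 ;;
    hybrid|hybrid_mode|force_ocr|hybrid_fallback|hybrid_timeout) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: catch a typo ("page" instead of "pages") before calling the API.
for key in format pages page; do
  if is_known_option "$key"; then
    echo "ok: $key"
  else
    echo "unknown option: $key"
  fi
done
```

A check like this only mirrors the documentation; the server remains the authority on which keys are accepted.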

Output format

Parameter	Type	Default	Description
format*	string[]	["markdown"]	Array of output formats to produce. Values: "markdown", "json", "html", "text", "tagged-pdf", "annotated-pdf". Free accounts are limited to ["markdown","text"].

Reading order

Controls how text blocks are ordered when the visual layout is ambiguous (e.g. multi-column, sidebar + main text).

Parameter	Type	Default	Description
reading_order	string	"xycut"	Algorithm used to determine reading order. "xycut" (default) uses recursive XY-cut splitting — excellent for academic papers and multi-column layouts. "off" preserves the raw PDF object order (faster but may produce incorrect ordering for complex layouts).

xycut (Recommended)

Recursive spatial partitioning. Best for multi-column documents, academic papers, magazines. Adds a few milliseconds.

off

Uses the raw PDF draw order. Fastest. Best for simple single-column documents where the PDF order matches reading order.

Table detection

Controls which algorithm is used to identify and extract tables from the PDF.

Parameter	Type	Default	Description
table_method	string	"default"	"default" uses border-based detection (fast, works for tables with visible cell borders). "cluster" uses spatial clustering (slower, works for borderless/headerless tables). Use "cluster" for financial reports and spreadsheet exports.

default

Border-based. Detects cells by their visible lines/borders. Fast. Recommended for forms and tables with explicit borders.

cluster

Cluster-based. Groups text spatially to infer table structure without needing visible borders. Required for plain spreadsheet exports.

✦ Tip

For maximum table accuracy on complex documents, combine table_method: "cluster" with Hybrid AI mode. The AI backend achieves +90% accuracy on borderless tables compared to the default engine.

Image extraction

Pro and Business accounts can extract images from PDFs alongside the text output.

Parameter	Type	Default	Description
image_output	string	"off"	"off" — no images extracted. "external" — images saved as separate files, packaged in a ZIP. "embedded" — images base64-encoded inside the HTML output (requires html format).
image_format	string	"png"	Format for extracted images. "png" (lossless, larger) or "jpeg" (lossy, smaller). Applies only when image_output != "off".

⚠ Warning

image_output: "embedded" requires "html" to be in your format array. If you select embedded without HTML, the system automatically falls back to external.
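The fallback rule in the warning can be mirrored on the client to predict which image mode a job will actually use. This is a sketch under stated assumptions: effective_image_output is a hypothetical helper that takes the requested image_output value followed by the entries of the format array, and it only restates the documented rule rather than querying the service.

```bash
# Hypothetical helper: prints the image_output mode a job would effectively
# use, given the requested mode ($1) and the requested formats ($2...).
effective_image_output() {
  mode="$1"
  shift
  if [ "$mode" = "embedded" ]; then
    for fmt in "$@"; do
      if [ "$fmt" = "html" ]; then
        echo "embedded"
        return 0
      fi
    done
    # Documented behaviour: embedded without html falls back to external.
    echo "external"
  else
    echo "$mode"
  fi
}

effective_image_output embedded markdown json
effective_image_output embedded markdown html
```

The first call prints external and the second prints embedded, which matches the rule described in the warning.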

Page selection

Parameter	Type	Default	Description
pages	string	""	Comma-separated page numbers or ranges to parse. Examples: "1", "1,3,5", "1-5", "1-3,7-10". Empty string (default) parses all pages. Only digits, commas, and hyphens are accepted.

bash

# Parse only the first 5 pages
-F 'options={"pages":"1-5"}'

# Parse pages 1, 3, and 7 through 12
-F 'options={"pages":"1,3,7-12"}'
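A pages value can also be checked locally before it is sent. The sketch below uses a hypothetical is_valid_pages helper and assumes a slightly stricter grammar than the one stated in the table (numbers or number-number ranges separated by commas, or an empty string for all pages); whether the API itself rejects inputs such as "1-" or "1,,2" is not documented here.

```bash
# Hypothetical helper: returns 0 if $1 looks like a valid pages string.
# Assumption: each comma-separated item is either N or N-M, with no spaces.
is_valid_pages() {
  # The empty string is the documented default and means "all pages".
  if [ -z "$1" ]; then
    return 0
  fi
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$'
}

for spec in "1-3,7-10" "1, 3" "1-"; do
  if is_valid_pages "$spec"; then
    echo "valid: $spec"
  else
    echo "invalid: $spec"
  fi
done
```

The check does not verify that ranges are ascending or that page numbers exist in the document; those can only be confirmed by the service.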

Password-protected PDFs

Parameter	Type	Default	Description
password	string	""	Password for encrypted PDFs. The password is used only during processing and is never stored in the database. Max 256 characters.

◆ Note

If a PDF is password-protected and no password is provided, the page count check will return null and the job will fail during processing.

Text and layout options

Parameter	Type	Default	Description
include_header_footer	boolean	false	Include running headers and footers (page numbers, document title, etc.) in the output. Disabled by default as they often add noise to RAG pipelines.
sanitize	boolean	false	Automatically redact PII: email addresses, phone numbers, IP addresses, credit card numbers, and national ID patterns. Replaced with [REDACTED] in the output.
keep_line_breaks	boolean	false	Preserve original line breaks from the PDF. By default, lines within the same paragraph are joined. Enable for poetry, legal clauses, or content where line breaks are meaningful.
detect_strikethrough	boolean	false	Detect strikethrough text and mark it in Markdown output with ~~text~~. Useful for redlined legal documents or tracked-changes exports.
use_struct_tree	boolean	false	Read the accessibility tag tree embedded by the authoring tool (Word, InDesign) instead of inferring structure visually. More accurate for tagged PDF/UA documents. Falls back to visual detection if no tags are found.
replace_invalid_chars	string	""	Single character to substitute for unrecognised glyphs (corrupt fonts, unsupported encodings). Empty string (default) replaces with a space. Use "?" or "_" to make substitutions visible.

Hybrid AI options

These options control the optional AI backend. See the Hybrid AI mode page for full details.

Parameter	Type	Default	Description
hybrid	string	""	Set to "docling-fast" to enable Hybrid AI routing. Pro/Business only. Empty string disables Hybrid.
hybrid_mode	string	""	"" — standard Hybrid (tables + OCR). "full" — includes AI picture descriptions (SmolVLM). Slower but adds natural-language captions to images in the JSON output.
force_ocr	boolean	false	Force OCR processing on every page, even pages that already have selectable text. Use for scanned PDFs where the text layer is corrupt or empty. Auto-enables Hybrid mode if not already set.
hybrid_fallback	boolean	true	If the Hybrid AI backend is unreachable or returns an error, automatically fall back to the standard fast engine. Recommended to keep true in production.
hybrid_timeout	string	"0"	Timeout in milliseconds for the Hybrid backend response per page. "0" means no timeout. Increase for very large or complex pages.
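One detail in the table that is easy to miss is that hybrid_timeout is a string, not a number. The sketch below builds an options object for a scanned PDF with the timeout quoted accordingly. hybrid_ocr_options is a hypothetical helper, and the 30000 ms value in the usage line is purely illustrative, not a recommended setting.

```bash
# Hypothetical helper: prints an options JSON object that enables Hybrid AI
# with forced OCR. $1 is the per-page timeout in milliseconds; it is emitted
# as a JSON string because the table above lists hybrid_timeout as a string.
hybrid_ocr_options() {
  case "$1" in
    ''|*[!0-9]*)
      echo "timeout must be a non-negative integer (milliseconds)" >&2
      return 1
      ;;
  esac
  printf '{"hybrid":"docling-fast","force_ocr":true,"hybrid_fallback":true,"hybrid_timeout":"%s"}' "$1"
}

hybrid_ocr_options 30000
```

The output can then be passed to the curl calls shown on this page as -F "options=$(hybrid_ocr_options 30000)".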

Complete example

A job submitting a financial report with all recommended options for maximum accuracy:

bash

curl -X POST https://api.ragify.it/jobs \
  -H "X-Api-Key: rg_your_key" \
  -F "file=@financial_report_q3.pdf" \
  -F 'options={
    "format": ["markdown", "json"],
    "reading_order": "xycut",
    "table_method": "cluster",
    "hybrid": "docling-fast",
    "hybrid_fallback": true,
    "pages": "",
    "include_header_footer": false,
    "sanitize": false,
    "image_output": "external",
    "image_format": "png"
  }'