Parser options

Every Ragify job accepts a JSON options object that controls how the PDF is parsed. In the web UI, these are exposed as controls in the Show parser options panel. Via API, pass them as the options form field.

bash
# All options as JSON in the options form field
curl -X POST https://api.ragify.it/jobs \
  -H "X-Api-Key: rg_..." \
  -F "file=@doc.pdf" \
  -F 'options={
    "format": ["markdown", "json"],
    "reading_order": "xycut",
    "table_method": "cluster",
    "pages": "1-10",
    "sanitize": true
  }'

Note

Options not specified default to the values shown below. Unknown keys are rejected with HTTP 400.
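
Because unknown keys cause the whole request to be rejected, it can help to check key names locally before submitting. The following is a minimal sketch, not part of any Ragify tooling: `unknown_keys` and `KNOWN_KEYS` are hypothetical names, and the key list is simply the set of options documented on this page (the server may accept others).

```bash
# Option keys documented on this page (assumption: this list is complete).
KNOWN_KEYS="format reading_order table_method image_output image_format pages password include_header_footer sanitize keep_line_breaks detect_strikethrough use_struct_tree replace_invalid_chars hybrid hybrid_mode force_ocr hybrid_fallback hybrid_timeout"

# Print every argument that is not a documented key.
# Returns 1 if at least one unknown key was found, 0 otherwise.
# unknown_keys is a hypothetical helper name.
unknown_keys() {
  found_unknown=0
  for key in "$@"; do
    case " $KNOWN_KEYS " in
      *" $key "*) ;;                          # documented key, nothing to report
      *) echo "$key"; found_unknown=1 ;;      # would be rejected with HTTP 400
    esac
  done
  return $found_unknown
}

# "table_mode" is a typo for "table_method", so it is reported.
unknown_keys format pages table_mode || echo "fix the keys listed above before submitting"
```

The check only covers key names; it does not validate the values.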

Output format

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| format* | string[] | ["markdown"] | Array of output formats to produce. Values: "markdown", "json", "html", "text", "tagged-pdf", "annotated-pdf". Free accounts are limited to ["markdown","text"]. |

Reading order

Controls how text blocks are ordered when the visual layout is ambiguous (e.g. multi-column, sidebar + main text).

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| reading_order | string | "xycut" | Algorithm used to determine reading order. "xycut" (default) uses recursive XY-cut splitting (excellent for academic papers and multi-column layouts). "off" preserves the raw PDF object order (faster, but may produce incorrect ordering for complex layouts). |

xycut (Recommended)

Recursive spatial partitioning. Best for multi-column documents, academic papers, magazines. Adds a few milliseconds.

off

Uses the raw PDF draw order. Fastest. Best for simple single-column documents where the PDF order matches reading order.

Table detection

Controls which algorithm is used to identify and extract tables from the PDF.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| table_method | string | "default" | "default" uses border-based detection (fast, works for tables with visible cell borders). "cluster" uses spatial clustering (slower, works for borderless/headerless tables). Use "cluster" for financial reports and spreadsheet exports. |

default

Border-based. Detects cells by their visible lines/borders. Fast. Recommended for forms and tables with explicit borders.

cluster

Cluster-based. Groups text spatially to infer table structure without needing visible borders. Required for plain spreadsheet exports.

Tip

For maximum table accuracy on complex documents, combine table_method: "cluster" with Hybrid AI mode. The AI backend achieves +90% accuracy on borderless tables compared to the default engine.

Image extraction

Pro and Business accounts can extract images from PDFs alongside the text output.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| image_output | string | "off" | "off": no images extracted. "external": images saved as separate files, packaged in a ZIP. "embedded": images base64-encoded inside the HTML output (requires html format). |
| image_format | string | "png" | Format for extracted images. "png" (lossless, larger) or "jpeg" (lossy, smaller). Applies only when image_output != "off". |

Warning

image_output: "embedded" requires "html" to be in your format array. If you select embedded without HTML, the system automatically falls back to external.
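
The fallback rule above can be mirrored on the client so that the output you get is never a surprise. This is a small sketch of that rule only; `effective_image_output` is a hypothetical helper name, and the server remains the source of truth.

```bash
# Mirror the documented fallback: "embedded" without "html" becomes "external".
# $1 = requested image_output, remaining arguments = requested output formats.
# effective_image_output is a hypothetical helper name.
effective_image_output() {
  requested=$1
  shift
  if [ "$requested" = "embedded" ]; then
    for fmt in "$@"; do
      if [ "$fmt" = "html" ]; then
        echo "embedded"      # html is requested, so embedding is possible
        return 0
      fi
    done
    echo "external"          # no html format: the documented fallback applies
    return 0
  fi
  echo "$requested"          # "off" and "external" are passed through unchanged
}

effective_image_output embedded markdown json   # prints: external
effective_image_output embedded html markdown   # prints: embedded
```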

Page selection

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pages | string | "" | Comma-separated page numbers or ranges to parse. Examples: "1", "1,3,5", "1-5", "1-3,7-10". Empty string (default) parses all pages. Only digits, commas, and hyphens are accepted. |

bash
# Parse only the first 5 pages
-F 'options={"pages":"1-5"}'

# Parse pages 1, 3, and 7 through 12
-F 'options={"pages":"1,3,7-12"}'
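
The pages syntax described above is easy to check before submitting a job. The sketch below is one possible reading of those rules: `validate_pages` is a hypothetical helper, and it deliberately does not check that a range's start is less than or equal to its end, since this page does not say how the server treats reversed ranges.

```bash
# Return 0 if $1 looks like a valid "pages" value, 1 otherwise.
# validate_pages is a hypothetical helper name.
validate_pages() {
  pages_spec=$1
  # An empty string means "parse all pages".
  if [ -z "$pages_spec" ]; then
    return 0
  fi
  # Only digits, commas, and hyphens are accepted.
  case $pages_spec in
    *[!0-9,-]*) return 1 ;;
  esac
  # No empty entries: reject leading, trailing, or doubled commas.
  case $pages_spec in
    ,*|*,|*,,*) return 1 ;;
  esac
  # Walk the comma-separated entries; each must be N or N-M.
  rest=$pages_spec
  while [ -n "$rest" ]; do
    token=${rest%%,*}
    case $token in
      -*|*-|*-*-*) return 1 ;;   # open-ended range or more than one hyphen
    esac
    case $rest in
      *,*) rest=${rest#*,} ;;    # drop the entry just checked
      *) rest= ;;                # that was the last entry
    esac
  done
  return 0
}

for candidate in "1-3,7-10" "1-" "1, 3"; do
  if validate_pages "$candidate"; then
    echo "ok: '$candidate'"
  else
    echo "invalid: '$candidate'"
  fi
done
```

Note that `"1, 3"` is rejected because of the space after the comma.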

Password-protected PDFs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| password | string | "" | Password for encrypted PDFs. The password is used only during processing and is never stored in the database. Max 256 characters. |

Note

If a PDF is password-protected and no password is provided, the page count check will return null and the job will fail during processing.

Text and layout options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| include_header_footer | boolean | false | Include running headers and footers (page numbers, document title, etc.) in the output. Disabled by default as they often add noise to RAG pipelines. |
| sanitize | boolean | false | Automatically redact PII: email addresses, phone numbers, IP addresses, credit card numbers, and national ID patterns. Replaced with [REDACTED] in the output. |
| keep_line_breaks | boolean | false | Preserve original line breaks from the PDF. By default, lines within the same paragraph are joined. Enable for poetry, legal clauses, or content where line breaks are meaningful. |
| detect_strikethrough | boolean | false | Detect strikethrough text and mark it in Markdown output with ~~text~~. Useful for redlined legal documents or tracked-changes exports. |
| use_struct_tree | boolean | false | Read the accessibility tag tree embedded by the authoring tool (Word, InDesign) instead of inferring structure visually. More accurate for tagged PDF/UA documents. Falls back to visual detection if no tags are found. |
| replace_invalid_chars | string | "" | Single character to substitute for unrecognised glyphs (corrupt fonts, unsupported encodings). Empty string (default) replaces with a space. Use "?" or "_" to make substitutions visible. |

Hybrid AI options

These options control the optional AI backend. See the Hybrid AI mode page for full details.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| hybrid | string | "" | Set to "docling-fast" to enable Hybrid AI routing. Pro/Business only. Empty string disables Hybrid. |
| hybrid_mode | string | "" | "": standard Hybrid (tables + OCR). "full": includes AI picture descriptions (SmolVLM). Slower, but adds natural-language captions to images in the JSON output. |
| force_ocr | boolean | false | Force OCR processing on every page, even pages that already have selectable text. Use for scanned PDFs where the text layer is corrupt or empty. Auto-enables Hybrid mode if not already set. |
| hybrid_fallback | boolean | true | If the Hybrid AI backend is unreachable or returns an error, automatically fall back to the standard fast engine. Recommended to keep true in production. |
| hybrid_timeout | string | "0" | Timeout in milliseconds for the Hybrid backend response per page. "0" means no timeout. Increase for very large or complex pages. |
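
Note that hybrid_timeout is a string holding milliseconds, which is easy to get wrong when a script thinks in seconds. The sketch below shows one way to build the Hybrid part of the options object; `hybrid_timeout_ms` and `hybrid_options` are hypothetical helper names, and the chosen combination of options is only an illustration.

```bash
# Convert whole seconds to the millisecond string expected by hybrid_timeout.
# hybrid_timeout_ms is a hypothetical helper name.
hybrid_timeout_ms() {
  case $1 in
    ''|*[!0-9]*|0?*)
      # Reject empty input, non-digits, and leading zeros.
      echo "seconds must be a non-negative integer" >&2
      return 1
      ;;
  esac
  echo "$(( $1 * 1000 ))"
}

# Print an options object that enables Hybrid with a per-page timeout.
# $1 = timeout in seconds (0 means no timeout).
hybrid_options() {
  timeout_ms=$(hybrid_timeout_ms "$1") || return 1
  printf '{"hybrid":"docling-fast","hybrid_fallback":true,"hybrid_timeout":"%s"}\n' "$timeout_ms"
}

hybrid_options 30
# prints: {"hybrid":"docling-fast","hybrid_fallback":true,"hybrid_timeout":"30000"}
```

The printed JSON can be passed as the options form field, for example with `-F "options=$(hybrid_options 30)"`.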

Complete example

A job submitting a financial report with all recommended options for maximum accuracy:

bash
curl -X POST https://api.ragify.it/jobs \
  -H "X-Api-Key: rg_your_key" \
  -F "file=@financial_report_q3.pdf" \
  -F 'options={
    "format": ["markdown", "json"],
    "reading_order": "xycut",
    "table_method": "cluster",
    "hybrid": "docling-fast",
    "hybrid_fallback": true,
    "pages": "",
    "include_header_footer": false,
    "sanitize": false,
    "image_output": "external",
    "image_format": "png"
  }'
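
For scripted use, it may be convenient to keep the options in a file and check the HTTP status of the submission. The following is a rough sketch under several assumptions: `submit_job` is a hypothetical helper, `RAGIFY_API_KEY` is an assumed environment variable, and `options.json` and `response.json` are arbitrary file names. The response body format is not described on this page, so the sketch only saves it.

```bash
# Submit a PDF with options read from a JSON file and report the HTTP status.
# $1 = path to the PDF, $2 = path to a file containing the options JSON.
# submit_job is a hypothetical helper; RAGIFY_API_KEY is an assumed variable.
submit_job() {
  pdf=$1
  options_file=$2
  # "options=<file" makes curl send the file's contents as a plain form field,
  # whereas "file=@path" uploads the file itself.
  http_status=$(curl -sS -o response.json -w '%{http_code}' \
    -X POST https://api.ragify.it/jobs \
    -H "X-Api-Key: $RAGIFY_API_KEY" \
    -F "file=@$pdf" \
    -F "options=<$options_file") || return 1
  if [ "$http_status" = "400" ]; then
    # One documented cause of HTTP 400 is an unknown option key.
    echo "Rejected with HTTP 400: check the option keys in $options_file" >&2
    return 1
  fi
  echo "HTTP $http_status, response body saved to response.json"
}

# Usage (not executed here):
#   submit_job financial_report_q3.pdf options.json
```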