Parser options
Every Ragify job accepts a JSON options object that controls how the PDF is parsed. In the web UI, these options are exposed as controls in the Show parser options panel. Via the API, pass them as the options form field.
# All options as JSON in the options form field
curl -X POST https://api.ragify.it/jobs \
  -H "X-Api-Key: rg_..." \
  -F "file=@doc.pdf" \
  -F 'options={
    "format": ["markdown", "json"],
    "reading_order": "xycut",
    "table_method": "cluster",
    "pages": "1-10",
    "sanitize": true
  }'
Output format
| Parameter | Type | Default | Description |
|---|---|---|---|
| format* | string[] | ["markdown"] | Array of output formats to produce. Values: "markdown", "json", "html", "text", "tagged-pdf", "annotated-pdf". Free accounts are limited to ["markdown","text"]. |
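A request can be checked against the values in this table before it is submitted. The sketch below is a client-side illustration only: `check_formats` and the `free_account` flag are hypothetical names, and the server's own validation may differ.

```python
# Values copied from the format table above.
ALLOWED_FORMATS = {"markdown", "json", "html", "text", "tagged-pdf", "annotated-pdf"}
FREE_FORMATS = {"markdown", "text"}


def check_formats(formats, free_account=False):
    """Return a list of problems with a requested `format` array (empty if none).

    Hypothetical helper: it only mirrors the documented values and the
    documented free-account limit.
    """
    problems = []
    for name in formats:
        if name not in ALLOWED_FORMATS:
            problems.append(f"unknown format: {name}")
        elif free_account and name not in FREE_FORMATS:
            problems.append(f"format not available on free accounts: {name}")
    return problems


print(check_formats(["markdown", "json"], free_account=True))
```

Running the check before upload avoids submitting a job that requests a format the account tier does not include.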
Reading order
Controls how text blocks are ordered when the visual layout is ambiguous (e.g. multi-column, sidebar + main text).
| Parameter | Type | Default | Description |
|---|---|---|---|
| reading_order | string | "xycut" | Algorithm used to determine reading order. "xycut" (default) uses recursive XY-cut splitting — excellent for academic papers and multi-column layouts. "off" preserves the raw PDF object order (faster but may produce incorrect ordering for complex layouts). |
xycut (Recommended): Recursive spatial partitioning. Best for multi-column documents, academic papers, and magazines. Adds a few milliseconds.
off: Uses the raw PDF draw order. Fastest. Best for simple single-column documents where the PDF order matches reading order.
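The idea behind recursive XY-cut can be illustrated with a small sketch. This is not Ragify's implementation: the `(x0, y0, x1, y1, label)` block tuples, the `xy_cut` and `_split` names, and the choice to try column cuts before row cuts are all assumptions made for illustration.

```python
def _split(blocks, axis):
    """Split blocks at the widest empty gap along one axis, or return None."""
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    ordered = sorted(blocks, key=lambda b: b[lo])
    best_gap, best_at = 0, None
    reach = ordered[0][hi]  # furthest extent covered by the blocks seen so far
    for i in range(1, len(ordered)):
        gap = ordered[i][lo] - reach
        if gap > best_gap:
            best_gap, best_at = gap, i
        reach = max(reach, ordered[i][hi])
    if best_at is None:
        return None
    return ordered[:best_at], ordered[best_at:]


def xy_cut(blocks):
    """Order blocks by recursively cutting the page into columns, then rows."""
    if len(blocks) <= 1:
        return list(blocks)
    for axis in ("x", "y"):
        parts = _split(blocks, axis)
        if parts:
            first, second = parts
            return xy_cut(first) + xy_cut(second)
    # No clean cut exists: fall back to top-to-bottom, left-to-right.
    return sorted(blocks, key=lambda b: (b[1], b[0]))


# A full-width title above two columns, listed in scrambled draw order.
page = [
    (55, 50, 100, 70, "R2"),
    (0, 20, 45, 40, "L1"),
    (0, 0, 100, 10, "title"),
    (55, 20, 100, 40, "R1"),
    (0, 50, 45, 70, "L2"),
]
print([b[4] for b in xy_cut(page)])  # -> ['title', 'L1', 'L2', 'R1', 'R2']
```

With "off", the equivalent result would simply be the input order, which is why draw order only works when it already matches reading order.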
Table detection
Controls which algorithm is used to identify and extract tables from the PDF.
| Parameter | Type | Default | Description |
|---|---|---|---|
| table_method | string | "default" | "default" uses border-based detection (fast, works for tables with visible cell borders). "cluster" uses spatial clustering (slower, works for borderless/headerless tables). Use "cluster" for financial reports and spreadsheet exports. |
default: Border-based. Detects cells by their visible lines/borders. Fast. Recommended for forms and tables with explicit borders.
cluster: Cluster-based. Groups text spatially to infer table structure without needing visible borders. Required for plain spreadsheet exports.
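To give a feel for what "groups text spatially" can mean, the sketch below infers column groups from the left edges of text fragments using a simple gap threshold. It is a deliberately simplified stand-in, not the algorithm Ragify uses; `cluster_columns` and `max_gap` are hypothetical names.

```python
def cluster_columns(x_positions, max_gap=15.0):
    """Group left-edge x coordinates into columns.

    Coordinates are sorted, and a new column starts whenever the distance to
    the previous coordinate exceeds `max_gap` (an illustrative threshold).
    """
    columns = []
    for x in sorted(x_positions):
        if columns and x - columns[-1][-1] <= max_gap:
            columns[-1].append(x)
        else:
            columns.append([x])
    return columns


# Left edges of words from a borderless three-column table.
print(cluster_columns([72, 74, 200, 203, 71, 330, 201]))
```

No ruling lines are needed for this kind of grouping, which is why a clustering approach suits spreadsheet exports where cell borders are absent.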
✦ Tip
Combine table_method: "cluster" with Hybrid AI mode. The AI backend achieves +90% accuracy on borderless tables compared to the default engine.
Image extraction
Pro and Business accounts can extract images from PDFs alongside the text output.
| Parameter | Type | Default | Description |
|---|---|---|---|
| image_output | string | "off" | "off" — no images extracted. "external" — images saved as separate files, packaged in a ZIP. "embedded" — images base64-encoded inside the HTML output (requires html format). |
| image_format | string | "png" | Format for extracted images. "png" (lossless, larger) or "jpeg" (lossy, smaller). Applies only when image_output != "off". |
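Because "embedded" depends on the html format, a client can predict which image mode a job will effectively use. The helper below is a sketch of that check; `effective_image_output` is a hypothetical name, and it assumes the server falls back to "external" when html is missing, as this page describes.

```python
def effective_image_output(options):
    """Return the image_output mode a job is expected to end up with.

    Defaults follow the tables on this page: format defaults to
    ["markdown"] and image_output defaults to "off".
    """
    formats = options.get("format", ["markdown"])
    image_output = options.get("image_output", "off")
    if image_output == "embedded" and "html" not in formats:
        # Documented behaviour: embedded without HTML falls back to external.
        return "external"
    return image_output


print(effective_image_output({"format": ["markdown"], "image_output": "embedded"}))
```

Checking this up front tells the caller whether to expect a ZIP of image files or base64 data inside the HTML.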
⚠ Warning
image_output: "embedded" requires "html" to be in your format array. If you select embedded without HTML, the system automatically falls back to external.
Page selection
| Parameter | Type | Default | Description |
|---|---|---|---|
| pages | string | "" | Comma-separated page numbers or ranges to parse. Examples: "1", "1,3,5", "1-5", "1-3,7-10". Empty string (default) parses all pages. Only digits, commas, and hyphens are accepted. |
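The accepted syntax is small enough to validate on the client. The sketch below expands a pages string into explicit page numbers; `expand_pages` is a hypothetical helper, and treating pages as 1-based and rejecting descending ranges such as "5-1" are assumptions not stated in the table.

```python
import re

# Digits, commas, and hyphens only: "1", "1,3,5", "1-5", "1-3,7-10".
_PAGES_RE = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")


def expand_pages(spec):
    """Expand a `pages` string into a sorted list of page numbers.

    An empty string means "all pages" and is returned as None.
    """
    if spec == "":
        return None
    if not _PAGES_RE.fullmatch(spec):
        raise ValueError(f"invalid pages value: {spec!r}")
    pages = set()
    for part in spec.split(","):
        start, _, end = part.partition("-")
        first, last = int(start), int(end or start)
        if first < 1 or last < first:  # assumption: 1-based, ascending ranges
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)


print(expand_pages("1-3,7-10"))  # -> [1, 2, 3, 7, 8, 9, 10]
```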
# Parse only the first 5 pages
-F 'options={"pages":"1-5"}'
# Parse pages 1, 3, and 7 through 12
-F 'options={"pages":"1,3,7-12"}'
Password-protected PDFs
| Parameter | Type | Default | Description |
|---|---|---|---|
| password | string | "" | Password for encrypted PDFs. The password is used only during processing and is never stored in the database. Max 256 characters. |
◆ Note
null and the job will fail during processing.
Text and layout options
| Parameter | Type | Default | Description |
|---|---|---|---|
| include_header_footer | boolean | false | Include running headers and footers (page numbers, document title, etc.) in the output. Disabled by default as they often add noise to RAG pipelines. |
| sanitize | boolean | false | Automatically redact PII: email addresses, phone numbers, IP addresses, credit card numbers, and national ID patterns. Matches are replaced with [REDACTED] in the output. |
| keep_line_breaks | boolean | false | Preserve original line breaks from the PDF. By default, lines within the same paragraph are joined. Enable for poetry, legal clauses, or content where line breaks are meaningful. |
| detect_strikethrough | boolean | false | Detect strikethrough text and mark it in Markdown output with ~~text~~. Useful for redlined legal documents or tracked-changes exports. |
| use_struct_tree | boolean | false | Read the accessibility tag tree embedded by the authoring tool (Word, InDesign) instead of inferring structure visually. More accurate for tagged PDF/UA documents. Falls back to visual detection if no tags are found. |
| replace_invalid_chars | string | "" | Single character to substitute for unrecognised glyphs (corrupt fonts, unsupported encodings). Empty string (default) replaces with a space. Use "?" or "_" to make substitutions visible. |
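To show the kind of output to expect from sanitize, the sketch below redacts two of the listed PII types with regular expressions. The patterns and the `redact` name are illustrative assumptions; the patterns Ragify actually applies are not documented here and will cover more cases.

```python
import re

# Illustrative patterns only; not the ones used by the service.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text):
    """Replace email addresses and IPv4 addresses with [REDACTED]."""
    for pattern in (EMAIL_RE, IPV4_RE):
        text = pattern.sub("[REDACTED]", text)
    return text


print(redact("Contact jane.doe@example.com from 192.168.0.1."))
# -> Contact [REDACTED] from [REDACTED].
```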
Hybrid AI options
These options control the optional AI backend. See the Hybrid AI mode page for full details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| hybrid | string | "" | Set to "docling-fast" to enable Hybrid AI routing. Pro/Business only. Empty string disables Hybrid. |
| hybrid_mode | string | "" | "" — standard Hybrid (tables + OCR). "full" — includes AI picture descriptions (SmolVLM). Slower but adds natural-language captions to images in the JSON output. |
| force_ocr | boolean | false | Force OCR processing on every page, even pages that already have selectable text. Use for scanned PDFs where the text layer is corrupt or empty. Auto-enables Hybrid mode if not already set. |
| hybrid_fallback | boolean | true | If the Hybrid AI backend is unreachable or returns an error, automatically fall back to the standard fast engine. Recommended to keep true in production. |
| hybrid_timeout | string | "0" | Timeout in milliseconds for the Hybrid backend response per page. "0" means no timeout. Increase for very large or complex pages. |
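The Hybrid parameters can be assembled with a small helper so that types match the table, in particular hybrid_timeout, which is a string. This is a sketch; `hybrid_options` and its keyword arguments are hypothetical names.

```python
def hybrid_options(picture_descriptions=False, timeout_ms=0, fallback=True):
    """Build the Hybrid AI part of an options object.

    Values follow the table above: "docling-fast" enables Hybrid routing,
    "full" adds AI picture descriptions, and the timeout is sent as a string
    of milliseconds where "0" means no timeout.
    """
    if timeout_ms < 0:
        raise ValueError("timeout_ms must be zero or positive")
    return {
        "hybrid": "docling-fast",
        "hybrid_mode": "full" if picture_descriptions else "",
        "hybrid_fallback": fallback,
        "hybrid_timeout": str(timeout_ms),
    }


print(hybrid_options(timeout_ms=30000))
```

The returned dict can be merged into the rest of the options before serializing them to JSON.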
Complete example
A job submitting a financial report with all recommended options for maximum accuracy:
curl -X POST https://api.ragify.it/jobs \
  -H "X-Api-Key: rg_your_key" \
  -F "file=@financial_report_q3.pdf" \
  -F 'options={
    "format": ["markdown", "json"],
    "reading_order": "xycut",
    "table_method": "cluster",
    "hybrid": "docling-fast",
    "hybrid_fallback": true,
    "pages": "",
    "include_header_footer": false,
    "sanitize": false,
    "image_output": "external",
    "image_format": "png"
  }'
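When the request is built from code instead of a shell, the same options can be kept as a dictionary and serialized once. The sketch below only produces the string for the options form field; `build_options_field` is a hypothetical helper, and sending the multipart request is left to whichever HTTP client is in use.

```python
import json

# Same values as the curl example above.
options = {
    "format": ["markdown", "json"],
    "reading_order": "xycut",
    "table_method": "cluster",
    "hybrid": "docling-fast",
    "hybrid_fallback": True,
    "pages": "",
    "include_header_footer": False,
    "sanitize": False,
    "image_output": "external",
    "image_format": "png",
}


def build_options_field(opts):
    """Serialize an options dict for the `options` multipart form field."""
    return json.dumps(opts)


options_field = build_options_field(options)
print(options_field)
```

Serializing with `json.dumps` converts Python booleans to JSON `true` and `false`, which avoids the quoting mistakes that are easy to make when writing the JSON by hand.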