Frontend Guide

Output formats

Ragify produces up to 6 output formats per job. Free accounts get Markdown and Plain Text; Pro and Business unlock all formats plus image extraction.

Format          Plan
markdown        Free+
json            Pro+
html            Pro+
text            Free+
tagged-pdf      Pro+
annotated-pdf   Pro+
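As a rough illustration of the tier list above, a client could check which format keys a plan includes before submitting a job. This is only a sketch: the plan identifiers (`free`, `pro`, `business`), their ordering, and the helper names are assumptions made for this example and are not part of Ragify's API.

```python
# Minimum plan per output format, transcribed from the tier list above.
# The plan identifiers and their ranking are assumptions for this sketch.
PLAN_RANK = {"free": 0, "pro": 1, "business": 2}

MIN_PLAN = {
    "markdown": "free",
    "text": "free",
    "json": "pro",
    "html": "pro",
    "tagged-pdf": "pro",
    "annotated-pdf": "pro",
}


def allowed_formats(plan: str) -> list[str]:
    """Return the format keys available on the given plan, sorted by name."""
    rank = PLAN_RANK[plan.lower()]
    return sorted(f for f, minimum in MIN_PLAN.items() if PLAN_RANK[minimum] <= rank)


def rejected_formats(plan: str, requested: list[str]) -> list[str]:
    """Return the requested formats the plan does not include (or that are unknown)."""
    allowed = set(allowed_formats(plan))
    return [f for f in requested if f not in allowed]
```

Calling `rejected_formats("free", ["markdown", "json"])` returns the formats that would need an upgrade, which lets a client warn the user before the job is sent.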

Markdown

Free · API key: markdown

The most versatile format. Ideal for RAG pipelines, LLM ingestion, and human review.

  • Headings mapped to #, ##, ### by hierarchy
  • Tables rendered as GFM (GitHub Flavored Markdown) pipe tables
  • Lists, bold, italic, and strikethrough (the last only if detect_strikethrough is enabled)
  • Images embedded as ![](path) or base64 depending on image_output setting
  • Page breaks as horizontal rules ---
```markdown
# Q3 Financial Report

## Revenue Overview

| Quarter | Revenue | Growth |
|---------|---------|--------|
| Q1 2026 | €2.1M   | +18%   |
| Q2 2026 | €2.3M   | +21%   |
| Q3 2026 | €2.4M   | +34%   |

Year-over-year growth reached **34%** driven by expansion
in the European market.
```

Tip

For RAG pipelines, Markdown is the recommended format. The section boundaries and heading hierarchy make chunking straightforward, and the output is human-readable for debugging.
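One way to use the heading hierarchy for chunking is sketched below. It is a minimal example, not Ragify's own chunker: the function name and the chunk layout (`path` plus `text`) are choices made for this illustration, and it assumes headings are ATX style (`#`, `##`, ...) as described above.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, keeping the heading trail."""
    chunks = []
    path = []        # current heading trail, outermost first
    body = []        # lines collected since the last heading
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        # Do not treat '#' lines inside fenced code as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]          # drop deeper or same-level headings
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its heading trail, so a retriever can show where a passage came from. Page-break rules (`---`) stay inside the chunk text here; a pipeline that wants page-level chunks could split on them as well.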

JSON

Pro+ · API key: json

The richest format. Every document element is an object with type, content, page number, bounding box, font information, and hierarchy. Best for programmatic processing and data extraction.

```json
{
  "number of pages": 12,
  "title": "Q3 Financial Report 2026",
  "kids": [
    {
      "type": "heading",
      "level": "Title",
      "content": "Q3 Financial Report",
      "page_number": 1,
      "bbox": [72, 54, 540, 80]
    },
    {
      "type": "table",
      "content": "Quarter | Revenue | Growth\nQ1 | €2.1M | +18%",
      "page_number": 3,
      "bbox": [72, 200, 540, 350],
      "rows": [
        ["Quarter", "Revenue", "Growth"],
        ["Q1 2026", "€2.1M", "+18%"]
      ]
    },
    {
      "type": "paragraph",
      "content": "Year-over-year growth reached 34%...",
      "page_number": 3,
      "font_size": 11,
      "font_name": "Arial"
    }
  ]
}
```
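As a small example of programmatic processing, the sketch below pulls the `rows` of every table element out of a parsed JSON document and serialises each table as CSV. It assumes the shape shown in the sample above and only looks at the top-level `kids` list; whether elements can nest further is not covered here, and the function name is hypothetical.

```python
import csv
import io
import json


def tables_to_csv(document: dict) -> list[str]:
    """Return one CSV string per table element in the top-level "kids" list."""
    out = []
    for element in document.get("kids", []):
        if element.get("type") != "table":
            continue
        buffer = io.StringIO()
        # Each entry in "rows" is a list of cell strings.
        csv.writer(buffer, lineterminator="\n").writerows(element.get("rows", []))
        out.append(buffer.getvalue())
    return out


if __name__ == "__main__":
    sample = json.loads(
        '{"kids": [{"type": "table", "rows": [["Quarter", "Revenue"], ["Q1 2026", "2.1M"]]}]}'
    )
    print(tables_to_csv(sample)[0])
```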

Element types

Type           Description
heading        Section heading with level (Title, H1–H6)
paragraph      Body text block
table          Table with rows/cells, also serialised as plain text in content
list_item      Bullet or numbered list item
figure         Image or diagram with optional caption
formula        Mathematical formula (Hybrid mode)
caption        Figure or table caption
page_header    Running header (requires include_header_footer)
page_footer    Running footer (requires include_header_footer)
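The element types listed above lend themselves to simple filtering. The following sketch, with hypothetical helper names, counts elements by type, drops running headers and footers, and builds a flat outline from the headings. It assumes every element sits in the top-level `kids` list and carries the `type`, `level`, and `content` fields shown in the JSON sample.

```python
from collections import Counter

# Element types that repeat on every page and usually add noise downstream.
RUNNING_TYPES = {"page_header", "page_footer"}


def count_types(document: dict) -> Counter:
    """Count top-level elements by their "type" field."""
    return Counter(el.get("type", "unknown") for el in document.get("kids", []))


def body_elements(document: dict) -> list[dict]:
    """Return top-level elements with running headers and footers removed."""
    return [el for el in document.get("kids", []) if el.get("type") not in RUNNING_TYPES]


def outline(document: dict) -> list[tuple[str, str]]:
    """Return (level, content) pairs for every heading, in document order."""
    return [
        (el.get("level", ""), el.get("content", ""))
        for el in document.get("kids", [])
        if el.get("type") == "heading"
    ]
```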

HTML

Pro+ · API key: html

A self-contained HTML file with embedded styles. Suitable for web rendering, archiving, or feeding into HTML-aware processing pipelines.

  • Full document structure with <h1>–<h6>, <p>, <table>, <ul>
  • Basic CSS included inline — renders in any browser
  • If image_output=embedded is set, images are base64-encoded data URIs inside the HTML
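When images are embedded as data URIs, they can be recovered from the HTML with the standard library alone. The sketch below is one possible approach and makes an assumption the text above does not confirm: that embedded images appear as `<img>` tags whose `src` starts with `data:<mime>;base64,`. The class and function names are hypothetical.

```python
import base64
from html.parser import HTMLParser


class DataUriImages(HTMLParser):
    """Collect (mime_type, raw_bytes) for every <img> with a base64 data URI."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if not src.startswith("data:") or ";base64," not in src:
            return  # a regular path or URL, not an embedded image
        header, payload = src.split(",", 1)
        mime_type = header[len("data:"):].split(";", 1)[0]
        self.images.append((mime_type, base64.b64decode(payload)))


def extract_embedded_images(html: str) -> list[tuple[str, bytes]]:
    """Return the decoded images embedded in an HTML string."""
    parser = DataUriImages()
    parser.feed(html)
    parser.close()
    return parser.images
```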

Warning

For embedded images, you must select HTML as one of your output formats. Embedded mode is not compatible with JSON or Markdown output.
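A client can catch this mistake before submitting a job. The check below is a minimal sketch: the function name and the idea of passing the selected formats as a list are assumptions for illustration, and it enforces only the rule stated above (embedded images need HTML among the selected formats).

```python
def check_image_options(formats: list[str], image_output: str) -> None:
    """Raise ValueError if embedded images are requested without HTML output."""
    if image_output == "embedded" and "html" not in formats:
        raise ValueError(
            "image_output=embedded requires 'html' among the output formats; "
            f"got {formats!r}"
        )
```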

Plain Text

Free · API key: text

Raw text extraction with minimal formatting. Reading order is applied (xycut by default). Useful for full-text search indexing, simple NLP pipelines, or when structure is not needed.

Tip

Plain Text is the fastest format to produce and consumes the least storage. Use it when you only need the words, not the structure.
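To illustrate the full-text search use case, here is a toy inverted index over plain-text outputs. It is a sketch only: the document IDs are made up, tokenisation is a simple `\w+` split, and a production system would normally rely on a dedicated search engine instead.

```python
import re
from collections import defaultdict


def build_index(texts: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in texts.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the IDs of documents that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```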

Tagged PDF

Pro+ · API key: tagged-pdf

A PDF with embedded accessibility tags (PDF/UA). The document structure (headings, paragraphs, tables, reading order) is stored in the PDF's logical structure tree. Suitable for accessibility compliance, screen readers, and archiving workflows.

Annotated PDF

Pro+ · API key: annotated-pdf

The original PDF with bounding boxes drawn as visible annotations around each detected element. Primarily a debugging and quality-control tool — useful for verifying that the parser is correctly identifying headings, tables, and paragraphs.

Note

The colour of each bounding box corresponds to the element type (e.g. blue for headings, green for paragraphs, orange for tables). Open the output in any PDF viewer to inspect.

Image extraction (ZIP)

Pro+ · API key: images

When image_output is set to external or embedded, images extracted from the PDF are packaged into a ZIP archive alongside your other output files. The ZIP contains one file per image (PNG or JPEG depending on image_format).

  • external — images saved as separate files in the ZIP. References in Markdown/HTML point to relative paths.
  • embedded — images base64-encoded inside the HTML output. Requires HTML format to be selected.
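With external images, a quick sanity check is to confirm that every image referenced in the Markdown output is actually present in the ZIP. The sketch below assumes the relative paths in the Markdown match the entry names inside the archive, which is not confirmed above; the function names are hypothetical.

```python
import io
import re
import zipfile

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")
# Markdown image syntax: ![alt](path)
MD_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def list_images(zip_bytes: bytes) -> list[str]:
    """Return the sorted names of PNG/JPEG entries in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(n for n in archive.namelist() if n.lower().endswith(IMAGE_SUFFIXES))


def missing_references(markdown: str, zip_bytes: bytes) -> list[str]:
    """Return image paths referenced in the Markdown but absent from the ZIP."""
    available = set(list_images(zip_bytes))
    referenced = MD_IMAGE.findall(markdown)
    return [path for path in referenced if path not in available]
```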