PDF

Use this contract for PDF results. It is designed for fixed-layout document workflows where teams need search, OCR-aware extraction, policy review, contract analysis, and RAG over page-faithful content.

Supported modes

  • Basic Single Column: Standard extraction for single-column PDF layouts.
  • Basic Multi Column: Standard extraction for multi-column PDF layouts.
  • Advanced: OCR for scanned PDFs and OCR-aware extraction for complex digital layouts. Limited to 100 pages per run in beta.

Beta limits

  • Maximum file size: 25 MB
  • Advanced mode page limit: 100 pages per run
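
The limits above lend themselves to a client-side preflight check before upload. The following is a minimal sketch, not part of the TrueParser API: the function name, the reading of "25 MB" as 25 × 1024 × 1024 bytes, and the mode label "advanced" are all assumptions made for illustration.

```python
# Hypothetical preflight check; names and the mode label are assumptions.
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" interpreted as 25 MiB
ADVANCED_MAX_PAGES = 100           # beta page limit for Advanced runs


def check_beta_limits(size_bytes: int, page_count: int, mode: str) -> list[str]:
    """Return human-readable problems; an empty list means the file fits."""
    problems = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(
            f"file is {size_bytes} bytes; the beta limit is {MAX_FILE_BYTES}"
        )
    # The page limit applies only to Advanced runs.
    if mode == "advanced" and page_count > ADVANCED_MAX_PAGES:
        problems.append(
            f"{page_count} pages exceeds the Advanced limit of {ADVANCED_MAX_PAGES}"
        )
    return problems
```

Returning a list rather than raising lets a caller report both problems at once.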

Top-level envelope

{
  "schema_version": "1.0",
  "document": {},
  "warnings": [],
  "content": []
}
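
A client may want to confirm the four top-level keys before reading further. The sketch below uses only the Python standard library; the function name `check_envelope` is hypothetical, and the expected types are inferred from the envelope shown above rather than from a published schema.

```python
import json

# Expected top-level keys and their JSON types, inferred from the envelope above.
REQUIRED_KEYS = {
    "schema_version": str,
    "document": dict,
    "warnings": list,
    "content": list,
}


def check_envelope(raw: str) -> dict:
    """Parse a result payload and verify the top-level envelope shape."""
    envelope = json.loads(raw)
    if not isinstance(envelope, dict):
        raise ValueError("payload is not a JSON object")
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in envelope:
            raise ValueError(f"missing top-level key: {key}")
        if not isinstance(envelope[key], expected_type):
            raise ValueError(f"{key} should be of type {expected_type.__name__}")
    return envelope
```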

Document fields

Field                              Notes
source_file                        Source file name.
format                             Always pdf.
format_family                      Always pdf.
title, author, subject, company    Document metadata fields.
created_at, modified_at            Timestamps when available.
page_count                         Page count when available.
source_mode                        Public extraction mode label.
source_engine                      Engine identity.

Universal content/block shape

Every public content record uses the same base shape.

Field          Notes
id             Stable record id.
type           Public record type.
path           Public structural path.
parent_id      Parent record id.
depth          Structural depth.
page_number    Page number.
order          Deterministic order.
bbox           Required bounding box.
source_ref     Provenance object.
is_inferred    Inference marker.
chunk_hint     Present only when you request it.
text           Searchable text projection.
attributes     PDF-specific structured data.

Common record types include paragraph, heading, list, table, image, and header_footer.
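
The base shape can be checked record by record. This is a minimal sketch that assumes every field except chunk_hint appears as a key on each record (the contract describes chunk_hint as present only on request); the helper name `missing_base_fields` is hypothetical.

```python
# Base fields taken from the table above; chunk_hint is left out because
# it is present only when requested.
BASE_FIELDS = (
    "id", "type", "path", "parent_id", "depth", "page_number", "order",
    "bbox", "source_ref", "is_inferred", "text", "attributes",
)


def missing_base_fields(record: dict) -> list[str]:
    """Return the base field names that are absent from a content record."""
    return [name for name in BASE_FIELDS if name not in record]
```

A value of null (None in Python) still counts as present here, which matches a top-level record whose parent_id is null.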

Warnings

  • warnings is always present, even when it is empty.
  • Each warning is a plain string.
  • Warnings are human-readable.
  • Warnings carry client-facing notes; they do not describe hidden parser behavior.

What clients can rely on

  • Page order stays stable.
  • Section structure stays explicit.
  • bbox and source_ref stay attached to each block.
  • attributes carries PDF-specific details.
  • The public contract does not expose internal transport or worker details.
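
Given stable ordering and explicit parent links, a client can rebuild reading order and the section tree from the flat content list. The sketch below assumes that sorting by (page_number, order) yields reading order, which holds whether order is scoped per page or runs across the whole document; the function names are hypothetical.

```python
from collections import defaultdict


def reading_order(content: list[dict]) -> list[dict]:
    """Sort records by page, then by their order value."""
    return sorted(content, key=lambda r: (r["page_number"], r["order"]))


def children_by_parent(content: list[dict]) -> dict:
    """Map each parent_id (None for top-level records) to its child ids."""
    children = defaultdict(list)
    for record in reading_order(content):
        children[record["parent_id"]].append(record["id"])
    return dict(children)
```

Usage: `children_by_parent(envelope["content"])[None]` lists the top-level record ids in reading order.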

Example

{
  "schema_version": "1.0",
  "document": {
    "source_file": "policy.pdf",
    "format": "pdf",
    "format_family": "pdf",
    "title": "Insurance Policy",
    "page_count": 2,
    "source_mode": "basic",
    "source_engine": "TrueParserPdf.Basic"
  },
  "warnings": [],
  "content": [
    {
      "id": "p1-o1-paragraph",
      "type": "heading",
      "path": ["EXECUTIVE SUMMARY"],
      "parent_id": null,
      "depth": 1,
      "page_number": 1,
      "order": 1,
      "bbox": { "x": 72, "y": 50, "width": 200, "height": 16 },
      "source_ref": {
        "page": 1,
        "block_id": "p1-o1-paragraph",
        "source_mode": "basic",
        "source_engine": "TrueParserPdf.Basic"
      },
      "is_inferred": true,
      "text": "EXECUTIVE SUMMARY",
      "attributes": {
        "level": 1
      }
    }
  ]
}
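
For search or RAG, a record like the one above can be flattened into a chunk that keeps its provenance. This is a minimal sketch under stated assumptions: the `to_chunk` helper and its output keys are hypothetical, path is treated as a list of strings as in the example, and bbox is read as x, y, width, height in whatever units the parser emits.

```python
def to_chunk(envelope: dict, record: dict) -> dict:
    """Flatten one content record into a retrieval chunk with a citation."""
    bbox = record["bbox"]
    source_file = envelope["document"]["source_file"]
    return {
        "text": record["text"],
        # Join the structural path into a readable section label.
        "section": " > ".join(record["path"]),
        "citation": f"{source_file}, page {record['page_number']}",
        # Convert x/y/width/height into corner coordinates (x0, y0, x1, y1).
        "box": (
            bbox["x"],
            bbox["y"],
            bbox["x"] + bbox["width"],
            bbox["y"] + bbox["height"],
        ),
    }
```

Applied to the heading record above, the citation is "policy.pdf, page 1" and the box is (72, 50, 272, 66).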
Last modified on April 28, 2026