TrueParser provides a unified API for extracting structured JSON from CAD, GIS, SQL, email archives, and enterprise files while giving tenants a control plane for apps, plans, regions, storage, and demos.

Why TrueParser?

Unified API

One endpoint for all your parsing needs, regardless of the underlying file format.

Streaming-First

Stream-based processing keeps memory usage low even for very large documents.

Format Agnostic

Built-in detection and routing for CAD, SQL, PDF, and enterprise Office formats.

Control Plane

The dashboard is where tenant teams manage access and setup.
  • create tenant apps and credentials
  • assign plans and billing state
  • select a license region
  • configure optional tenant-owned storage
  • use the Demo UI for manual testing

Format Universe

TrueParser orchestrates specialized engines to deliver deep extraction capabilities across technical domains.

GIS & Geospatial

Preserve geometry (points, lines, polygons) and spatial metadata (CRS, attributes).

CAD & Engineering

DWG and DXF. Preserve entities, layers, blocks, and attributes in structured JSON.

SQL & Database

Multi-dialect support covering 12 dialects, including PostgreSQL, T-SQL (MS SQL Server), Snowflake, and BigQuery.

Email & Archives

PST, OST, EML, and MSG. Extract body, metadata, and attachments.

Enterprise Documents

Microsoft Office (Word, Excel, PowerPoint) and OpenDocument (ODF) support.

Technical PDF

Structured extraction from complex PDF documents with semantic awareness.

Parquet Datasets

Columnar extraction with metadata-first or metadata-plus-rows modes.

Core Capabilities

TrueParser handles the cross-cutting concerns that modern data pipelines require, allowing you to focus on the extracted data rather than the parsing mechanics.

High-Performance Architecture

Designed for high-volume forensics and enterprise ingestion:
  • Zero-Disk Buffering: Direct memory-to-stream processing reduces I/O overhead.
  • Lazy Attachment Loading: Large email attachments are only read when explicitly requested, minimizing memory footprint.
  • High Throughput: The MailKit and GIS engines target metadata extraction rates exceeding 400 MB/s.
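
As an illustration of the lazy attachment loading idea above, here is a minimal Python sketch. It is not TrueParser's implementation: the `LazyAttachment` class and its `opener` callback are hypothetical names used only to show metadata being available before any payload bytes are read.

```python
import io
from typing import BinaryIO, Callable, Optional


class LazyAttachment:
    """Hypothetical holder: metadata is eager, the payload is read on demand."""

    def __init__(self, name: str, size: int, opener: Callable[[], BinaryIO]):
        self.name = name
        self.size = size
        self._opener = opener              # called only when bytes are requested
        self._data: Optional[bytes] = None

    @property
    def loaded(self) -> bool:
        return self._data is not None

    def read(self) -> bytes:
        # The first call opens the stream; later calls reuse the cached bytes.
        if self._data is None:
            with self._opener() as stream:
                self._data = stream.read()
        return self._data


payload = b"%PDF-1.7 example bytes"
att = LazyAttachment("report.pdf", len(payload), lambda: io.BytesIO(payload))
print(att.name, att.size, att.loaded)   # metadata only, nothing read yet
print(len(att.read()), att.loaded)      # payload read on explicit request
```

Listing names and sizes for a large mailbox then costs no payload I/O; only the attachments a caller actually asks for are read.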

Normalized Spatial Contracts

TrueParser unifies technical geometry across disparate engines. Whether you are parsing a complex CAD drawing or a multi-layer GIS archive, the platform delivers a consistent spatial contract:
  • WKT (Well-Known Text): All geometry is normalized into standard WKT for easy database and application ingestion.
  • BBOX & Centroids: Every spatial record includes pre-calculated Bounding Box and Centroid metadata.
  • World-Coordinate Flattening: CAD blocks and GIS layers are projected into a unified coordinate space for direct interoperability.
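
To make the spatial contract more concrete, the sketch below derives a bounding box and centroid from a WKT polygon using only the Python standard library. It handles a single-ring `POLYGON` only, and the function names are illustrative assumptions; TrueParser returns these values pre-calculated, so this is mainly useful for sanity-checking a record on the client side.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def parse_polygon_wkt(wkt: str) -> List[Point]:
    """Parse the exterior ring of a simple 'POLYGON ((x y, ...))' string."""
    text = wkt.strip()
    if not text.upper().startswith("POLYGON"):
        raise ValueError("only POLYGON is handled in this sketch")
    inner = text[text.index("((") + 2 : text.index(")")]
    ring = []
    for pair in inner.split(","):
        x, y = pair.split()[:2]            # ignore an optional Z value
        ring.append((float(x), float(y)))
    return ring


def bbox(ring: List[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y)."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def centroid(ring: List[Point]) -> Point:
    """Area-weighted centroid of a closed ring (shoelace formula)."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate ring")
    return (cx / (3 * area2), cy / (3 * area2))


ring = parse_polygon_wkt("POLYGON ((0 0, 4 0, 4 2, 0 2, 0 0))")
print(bbox(ring))       # (0.0, 0.0, 4.0, 2.0)
print(centroid(ring))   # (2.0, 1.0)
```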

Result Materialization

The platform transforms parsed output into final, canonical JSON artifacts stored in managed storage or your tenant-scoped S3-compatible storage, depending on how the app is configured.

Document Units

TrueParser measures parsing consumption in Document Units. A Document Unit is the commercial unit of usage behind plan entitlements and quota enforcement. Depending on the parser family, one Document Unit may correspond to:
  • one page for PDF, Microsoft Office, and OpenDocument files
  • one email or message item for MailKit sources
  • one SQL statement for SQL sources
  • one logical dataset for GIS sources
  • one processed document for CAD sources
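
The mapping above can be sketched as a small estimator for planning quota usage before a batch is submitted. The family keys (`"pdf"`, `"mailkit"`, and so on) and the `counts` dictionary shape are assumptions made for this example, not TrueParser identifiers, and actual metering is decided by the platform.

```python
from typing import Dict, Iterable, Tuple

# What one Document Unit is counted in, per parser family (from the list above).
UNIT_BASIS: Dict[str, str] = {
    "pdf": "pages",
    "msoffice": "pages",
    "opendoc": "pages",
    "mailkit": "items",
    "sql": "statements",
    "gis": "datasets",
    "cad": "documents",
}


def estimate_units(family: str, counts: Dict[str, int]) -> int:
    """Estimate Document Units for one file from its measured counts."""
    basis = UNIT_BASIS[family]     # KeyError for families not covered above
    return int(counts[basis])


def estimate_batch(batch: Iterable[Tuple[str, Dict[str, int]]]) -> int:
    return sum(estimate_units(family, counts) for family, counts in batch)


batch = [
    ("pdf", {"pages": 42}),
    ("sql", {"statements": 310}),
    ("cad", {"documents": 1}),
]
print(estimate_batch(batch))   # 353
```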

Governance & Security

Built for enterprise use with:
  • Multi-tenant isolation: Logical and physical separation of tenant data.
  • JWT-based auth: Secure machine-to-machine authentication.
  • Resource Controls: Granular rate limiting and quotas.
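
For machine-to-machine clients, a common pattern with JWT-based auth is to check a cached token's expiry before reusing it. The sketch below reads the standard `exp` claim (RFC 7519) with the Python standard library. It does not verify the signature, which remains the server's job, and the demo token, the `app-123` subject, and the 30-second leeway are illustrative assumptions rather than TrueParser specifics.

```python
import base64
import json
import time
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    padded = payload + "=" * (-len(payload) % 4)   # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def is_expired(token: str, now: Optional[float] = None, leeway: int = 30) -> bool:
    """True if the token is expired or will expire within `leeway` seconds."""
    exp = jwt_claims(token).get("exp")
    if exp is None:
        return False                               # no expiry claim present
    current = time.time() if now is None else now
    return current >= exp - leeway


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


# Unsigned demo token, built here only so the example is self-contained.
demo = ".".join([
    _b64url(b'{"alg":"none"}'),
    _b64url(json.dumps({"sub": "app-123", "exp": 2_000_000_000}).encode()),
    "",
])
print(jwt_claims(demo)["sub"])
print(is_expired(demo, now=1_999_999_000))
```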

Support Matrix

Below is a detailed view of the formats supported by the TrueParser engine library.

PDF (TrueParserPdf)

TrueParserPdf supports three PDF modes:
  • Basic Single Column: Optimized for standard text and table extraction in single-column layouts.
  • Basic Multi Column: Optimized for multi-column reading flows in standard digital PDFs.
  • Advanced: OCR for scanned PDFs and OCR-aware extraction for complex digital layouts.
Beta limits for PDF processing:
  • Maximum file size: 25 MB
  • Advanced mode page limit: 100 pages per run
Output is returned as canonical JSON for retrieval and downstream use.
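
The beta limits above lend themselves to a client-side pre-flight check, sketched below. The mode identifiers are hypothetical spellings of the three modes, and "25 MB" is read here as 25 × 1024 × 1024 bytes, which is an assumption; the service remains the authority on what it accepts.

```python
from typing import List

MAX_FILE_BYTES = 25 * 1024 * 1024     # assumption: "25 MB" interpreted as 25 MiB
ADVANCED_MAX_PAGES = 100              # beta page limit for Advanced runs
PDF_MODES = ("basic_single_column", "basic_multi_column", "advanced")  # hypothetical names


def preflight_pdf(size_bytes: int, page_count: int, mode: str) -> List[str]:
    """Return a list of problems; an empty list means the upload looks acceptable."""
    problems = []
    if mode not in PDF_MODES:
        problems.append(f"unknown mode: {mode}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    if mode == "advanced" and page_count > ADVANCED_MAX_PAGES:
        problems.append(f"{page_count} pages exceeds the Advanced limit of {ADVANCED_MAX_PAGES}")
    return problems


print(preflight_pdf(3_000_000, 250, "basic_multi_column"))   # []
print(preflight_pdf(3_000_000, 250, "advanced"))
```

A file that fails only the Advanced page limit could be split into runs of at most 100 pages or processed in one of the Basic modes.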

Parquet (TrueParserParquet)

  • Modes: MetadataOnly and MetadataPlusRows.
  • Default: If no mode is provided, ingestion defaults to MetadataOnly.
  • Use Cases: Fast schema validation, profile generation, and row-level extraction for downstream analytics.
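
The defaulting rule can be sketched as follows. The mode names `MetadataOnly` and `MetadataPlusRows` come from the list above; the `ParquetMode` enum and `resolve_mode` helper are hypothetical client-side names used only for illustration.

```python
from enum import Enum
from typing import Optional


class ParquetMode(Enum):
    METADATA_ONLY = "MetadataOnly"
    METADATA_PLUS_ROWS = "MetadataPlusRows"


def resolve_mode(requested: Optional[str]) -> ParquetMode:
    """Apply the documented default: no mode means MetadataOnly."""
    if not requested:
        return ParquetMode.METADATA_ONLY
    return ParquetMode(requested)      # raises ValueError for unknown names


print(resolve_mode(None).value)                 # MetadataOnly
print(resolve_mode("MetadataPlusRows").value)   # MetadataPlusRows
```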

Messaging & Archives (TrueParserMailKit)

Optimized for high-volume forensics and large-scale migrations, the MailKit engine covers PST, OST, MSG, EML, MBOX, and MHT formats. It extracts rich structured items, including calendar events, appointments, contacts, tasks, and distribution lists. The engine also handles Outlook-specific encapsulation such as TNEF (winmail.dat) and unwraps S/MIME signed or encrypted messages (.p7s/.p7m), preserving the underlying metadata.
Enterprise Access: Rich item extraction (Calendar, Tasks, Contacts) is available on all plans. Binary attachment extraction is available only on Enterprise-tier plans.

GIS & Geospatial (TrueParserGis)

Stateless, stream-first vector parsing engine:
  • Spatial Extraction: Normalizes geometry, layers, and features into WKT, BBOX, and Centroid.
  • Contracts: Native support for zipped uploads (SHP_ZIP, FileGDB.zip, MapInfo.zip).
  • Throughput: Targets 1,000+ features/sec for structured spatial ingestion.
  • Formats: SHP, GeoJSON, GPKG, KML, KMZ, GML, FGB, FileGDB, MapInfo, SpatiaLite, and more.
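
Before uploading a zipped shapefile, it can help to confirm the archive is complete. The sketch below checks for the three mandatory shapefile components (.shp, .shx, .dbf) using the Python standard library. Whether TrueParser rejects or tolerates incomplete archives is not stated here, so treat this as a client-side convenience under that assumption; the helper name is illustrative.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Dict, Set

REQUIRED_PARTS = {".shp", ".shx", ".dbf"}   # mandatory per the shapefile format


def missing_shapefile_parts(zip_bytes: bytes) -> Dict[str, Set[str]]:
    """Map each shapefile base name in the zip to its missing required parts."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        paths = [PurePosixPath(n) for n in zf.namelist() if not n.endswith("/")]
    parts_by_stem: Dict[str, Set[str]] = {}
    for path in paths:
        stem = str(path.with_suffix("")).lower()
        parts_by_stem.setdefault(stem, set()).add(path.suffix.lower())
    return {
        stem: REQUIRED_PARTS - parts
        for stem, parts in parts_by_stem.items()
        if ".shp" in parts
    }


# Build a small in-memory archive for demonstration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ("roads.shp", "roads.shx", "roads.dbf", "roads.prj", "parcels.shp"):
        zf.writestr(name, b"")
print(missing_shapefile_parts(buf.getvalue()))
```

In this demo archive, `roads` is complete while `parcels` lacks its .shx and .dbf components. The optional .prj file, which carries the CRS, is not required by the check.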

CAD & Engineering (TrueParserCad)

Developer-friendly CAD-to-JSON conversion:
  • Geometry Extraction: Normalizes 30+ CAD entities (LINE, POLYLINE, HATCH, etc.) into structured geometry.
  • Normalization: Flattens DWG/DXF blocks into world coordinates with bbox, centroid, and WKT.
  • Hierarchy: Maintains drawing relationships with containerId and parentId for deep traceability.
  • Spatial Transformation: Automatic 3D-to-2D projections for spatial fields.
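
Flattening a block into world coordinates amounts to applying the block insert's scale, rotation, and translation to each block-local point. The sketch below shows that standard 2D transform and a plain drop-Z projection. It ignores extrusion directions and nested inserts, and the function names are illustrative; it is not a description of how TrueParserCad is implemented.

```python
import math
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def block_to_world(
    points: Sequence[Sequence[float]],
    insert: Point2D,
    rotation_deg: float = 0.0,
    scale: Point2D = (1.0, 1.0),
) -> List[Point2D]:
    """Scale, rotate, then translate block-local points; any Z value is dropped."""
    theta = math.radians(rotation_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    world = []
    for p in points:
        x, y = p[0] * scale[0], p[1] * scale[1]
        world.append((
            insert[0] + x * cos_t - y * sin_t,
            insert[1] + x * sin_t + y * cos_t,
        ))
    return world


def to_linestring_wkt(points: Sequence[Point2D], ndigits: int = 6) -> str:
    coords = ", ".join(f"{round(x, ndigits):g} {round(y, ndigits):g}" for x, y in points)
    return f"LINESTRING ({coords})"


# A LINE defined inside a block, inserted at (10, 5), rotated 90 degrees, scaled 2x.
local_line = [(0.0, 0.0, 7.0), (1.0, 0.0, 7.0)]
world_line = block_to_world(local_line, insert=(10.0, 5.0), rotation_deg=90.0, scale=(2.0, 2.0))
print(to_linestring_wkt(world_line))   # LINESTRING (10 5, 10 7)
```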

SQL (TrueParserSql)

  • Dialects: ANSI, BigQuery, ClickHouse, Databricks, DuckDB, MS SQL Server, MySQL, Oracle, PostgreSQL, Redshift, Snowflake, SQLite.
  • Advanced: BOM-aware ingestion, malformed statement recovery, and semantic extraction for join/filter analysis.
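
BOM-aware ingestion can be pictured as detecting a byte order mark, decoding accordingly, and only then splitting statements. The sketch below does this with the Python standard library. The splitter is deliberately simple (it respects quoted strings but not comments or dialect-specific delimiters) and is an illustration, not the TrueParserSql algorithm.

```python
import codecs
from typing import List

# UTF-32 marks are checked before UTF-16 because the UTF-32 LE mark begins
# with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_sql(raw: bytes, default: str = "utf-8") -> str:
    """Strip a leading BOM if present and decode with the matching codec."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    return raw.decode(default)


def split_statements(sql: str) -> List[str]:
    """Split on semicolons that are not inside single or double quotes."""
    statements, buf, quote = [], [], None
    for ch in sql:
        if quote:
            if ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif ch == ";":
            stmt = "".join(buf).strip()
            if stmt:
                statements.append(stmt)
            buf = []
            continue
        buf.append(ch)
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements


raw = codecs.BOM_UTF8 + b"SELECT 'a;b'; SELECT 2;"
print(split_statements(decode_sql(raw)))   # ["SELECT 'a;b'", 'SELECT 2']
```

Because one SQL statement corresponds to one Document Unit, the length of the returned list also gives a rough usage estimate for a script.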

Enterprise & OpenDocument (TrueParserMsOffice / TrueParserOpenDoc)

  • Microsoft Office: .docx, .doc, .rtf, .xlsx, .xls, .pptx, .ppt, plus web and markup formats (.html, .md).
  • OpenDocument (ODF): .odt, .ods, .odp, .odg, .odf.
  • Technical Formats: EPUB, IDML (InDesign), DBF, DIF, and MIF (FrameMaker).

Use Cases

RAG & Semantic Search

Power your LLM pipelines with structured, meaningful context from technical documents.

Compliance & Audit

Automate sensitive data discovery and extraction across large archive sets.

Analytics & Lineage

Track SQL statement lineage and extract business logic for data governance.

GIS Data Ingestion

Streamline spatial data into modern web applications and spatial databases.

Next Steps

Quickstart

Get your first document parsed in minutes.

Dashboard Overview

Learn how tenants, apps, plans, and regions fit together.

Demo UI

Try the tenant-facing test workspace with one app.
Last modified on April 28, 2026