TrueParser provides a unified API for extracting structured JSON from CAD, GIS, SQL, email archives, and enterprise files while preserving geometry, semantics, relationships, and layout for search, compliance, and GenAI workflows.

Why TrueParser?

Unified API

One endpoint for all your parsing needs, regardless of the underlying file format.

Streaming-First

Stream-based processing keeps memory usage low even for very large documents.

Format Agnostic

Built-in detection and routing for CAD, SQL, PDF, and Enterprise Office formats.

Format Universe

TrueParser orchestrates specialized engines to deliver deep extraction capabilities across technical domains.

GIS & Geospatial

Preserve geometry (points, lines, polygons) and spatial metadata (CRS, attributes).

CAD & Engineering

DWG and DXF. Preserve entities, layers, blocks, and attributes in structured JSON.

SQL & Database

Multi-dialect support for PostgreSQL, T-SQL, Snowflake, BigQuery, and more (12 dialects are listed in the Support Matrix).

Email & Archives

PST, OST, EML, and MSG. Extract body, metadata, and attachments.

Enterprise Documents

Microsoft Office (Word, Excel, PowerPoint) and OpenDocument (ODF) support.

Technical PDF

Structured extraction from complex PDF documents with semantic awareness.

Core Capabilities

TrueParser handles the cross-cutting concerns that modern data pipelines require, allowing you to focus on the extracted data rather than the parsing mechanics.

High-Performance Architecture

Designed for high-volume forensics and enterprise ingestion:
  • Zero-Disk Buffering: Direct memory-to-stream processing reduces I/O overhead.
  • Lazy Attachment Loading: Large email attachments are only read when explicitly requested, minimizing memory footprint.
  • High Throughput: Engines like MailKit and GIS target metadata extraction rates exceeding 400 MB/s.
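
As an illustration of the lazy attachment loading described above, the following is a minimal Python sketch. The `LazyAttachment` class and its `loader` callback are hypothetical names chosen for this example and are not part of the TrueParser API; the point is only that metadata (name, size) is available immediately while the bytes are read on first request.

```python
from typing import Callable, List, Optional


class LazyAttachment:
    """Attachment whose bytes are fetched only on first access (hypothetical type)."""

    def __init__(self, name: str, size: int, loader: Callable[[], bytes]) -> None:
        self.name = name
        self.size = size        # known from metadata, no payload read required
        self._loader = loader   # deferred read, invoked at most once
        self._data: Optional[bytes] = None

    @property
    def loaded(self) -> bool:
        return self._data is not None

    def read(self) -> bytes:
        # The payload is pulled from the source only when a caller asks for it.
        if self._data is None:
            self._data = self._loader()
        return self._data


# Usage: record every time the underlying source is actually touched.
calls: List[str] = []


def fake_loader() -> bytes:
    calls.append("read")
    return b"%PDF-1.7 placeholder payload"


att = LazyAttachment("report.pdf", size=28, loader=fake_loader)
```

Until `att.read()` is called, `calls` stays empty, so listing attachments costs no payload I/O; repeated `read()` calls reuse the cached bytes.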

Normalized Spatial Contracts

TrueParser unifies technical geometry across disparate engines. Whether you are parsing a complex CAD drawing or a multi-layer GIS archive, the platform delivers a consistent spatial contract:
  • WKT (Well-Known Text): All geometry is normalized into standard WKT for easy database and application ingestion.
  • BBOX & Centroids: Every spatial record includes pre-calculated Bounding Box and Centroid metadata.
  • World-Coordinate Flattening: CAD blocks and GIS layers are projected into a unified coordinate space for direct interoperability.
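
To make the spatial contract above concrete, here is a small Python sketch that derives WKT, a bounding box, and a centroid from a single polygon ring. The record keys (`wkt`, `bbox`, `centroid`) and function names are assumptions for illustration; the actual TrueParser output schema may differ.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def _closed(ring: List[Point]) -> List[Point]:
    # WKT polygons repeat the first vertex at the end.
    return ring if ring[0] == ring[-1] else ring + [ring[0]]


def polygon_wkt(ring: List[Point]) -> str:
    coords = ", ".join(f"{x:g} {y:g}" for x, y in _closed(ring))
    return f"POLYGON (({coords}))"


def bbox(ring: List[Point]) -> Tuple[float, float, float, float]:
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def centroid(ring: List[Point]) -> Point:
    # Area-weighted centroid via the shoelace formula.
    # Assumes a simple polygon with non-zero area.
    pts = _closed(ring)
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3.0 * area2), cy / (3.0 * area2))


def spatial_record(ring: List[Point]) -> Dict[str, object]:
    # Hypothetical record shape, not the documented schema.
    return {"wkt": polygon_wkt(ring), "bbox": bbox(ring), "centroid": centroid(ring)}


record = spatial_record([(0, 0), (4, 0), (4, 2), (0, 2)])
```

For the 4 by 2 rectangle above, the record holds `POLYGON ((0 0, 4 0, 4 2, 0 2, 0 0))`, a bounding box of `(0, 0, 4, 2)`, and a centroid of `(2.0, 1.0)`.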

Result Materialization

The platform transforms parsed output into final, canonical JSON artifacts stored directly in your tenant-scoped S3-compatible storage.

Document Units

TrueParser measures parsing consumption in Document Units, the commercial unit that plan limits and quota enforcement are based on. Depending on the parser family, one Document Unit may correspond to:
  • one page for PDF, Microsoft Office, and OpenDocument files
  • one email or message item for MailKit sources
  • one SQL statement for SQL sources
  • one logical dataset for GIS sources
  • one processed document for CAD sources
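
The mapping above can be sketched as a simple lookup for estimating consumption before submitting a job. The family keys (`pdf`, `mailkit`, and so on) and the `document_units` helper are hypothetical names for this example; actual metering is performed by the platform and may differ.

```python
from typing import Dict

# Which count each parser family is assumed to bill on, per the list above.
UNIT_BASIS: Dict[str, str] = {
    "pdf": "pages",
    "msoffice": "pages",
    "opendoc": "pages",
    "mailkit": "messages",
    "sql": "statements",
    "gis": "datasets",
    "cad": "documents",
}


def document_units(family: str, counts: Dict[str, int]) -> int:
    """Estimate Document Units for one input, given its measured counts."""
    try:
        basis = UNIT_BASIS[family]
    except KeyError:
        raise ValueError(f"unknown parser family: {family!r}") from None
    return counts[basis]


# A 12-page PDF, a mailbox with 3,400 messages, and a single DWG drawing.
estimate = (
    document_units("pdf", {"pages": 12})
    + document_units("mailkit", {"messages": 3400})
    + document_units("cad", {"documents": 1})
)
```

Under these assumptions the three inputs together are estimated at 3,413 Document Units, dominated by the mailbox.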

Governance & Security

Built for enterprise use with:
  • Multi-tenant isolation: Logical and physical separation of tenant data.
  • JWT-based auth: Secure machine-to-machine authentication.
  • Resource Controls: Granular rate limiting and quotas.

Support Matrix

Below is a detailed view of the formats supported by the TrueParser engine library.

PDF & Vision (TrueParserPdf)

TrueParserPdf uses a dual-pipeline approach for document intelligence:
  • Basic Pipeline: High-performance text and table extraction using Syncfusion and Tabula heuristics. Supports SingleColumn and MultiColumn reading flows.
  • Advanced Pipeline: Vision model-based extraction for complex layouts and OCR-required documents, normalizing provider-specific output into canonical blocks.
  • Output: Canonical JSON artifacts for product retrieval and downstream use.
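
One plausible reading of the SingleColumn and MultiColumn reading flows is a difference in how text blocks are ordered. The Python sketch below illustrates that idea only; the `Block` type, the fixed `column_split` threshold, and the two-column assumption are simplifications for this example and do not describe how TrueParserPdf is implemented.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    x: float   # left edge of the text block
    y: float   # top edge, increasing down the page
    text: str


def reading_order(blocks: List[Block], flow: str, column_split: float = 300.0) -> List[str]:
    if flow == "SingleColumn":
        # Read strictly top to bottom, then left to right.
        key = lambda b: (b.y, b.x)
    elif flow == "MultiColumn":
        # Finish the left column before starting the right one.
        key = lambda b: (b.x >= column_split, b.y, b.x)
    else:
        raise ValueError(f"unsupported flow: {flow!r}")
    return [b.text for b in sorted(blocks, key=key)]


page = [
    Block(320, 10, "C"),
    Block(20, 10, "A"),
    Block(20, 50, "B"),
    Block(320, 50, "D"),
]
single = reading_order(page, "SingleColumn")
multi = reading_order(page, "MultiColumn")
```

On this two-column page, the single-column flow interleaves the columns (`A, C, B, D`) while the multi-column flow reads each column in full (`A, B, C, D`), which is why choosing the right flow matters for downstream text quality.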

Messaging & Archives (TrueParserMailKit)

Optimized for high-volume forensics and large-scale migrations, the MailKit engine provides exhaustive coverage for PST, OST, MSG, EML, MBOX, and MHT formats. TrueParser delivers deep extraction for rich structured items including Calendar events, Appointments, Contacts, Tasks, and Distribution Lists. The engine handles complex Outlook-specific encapsulations like TNEF (winmail.dat) and provides cryptographic unwrapping for S/MIME (.p7s/.p7m) signed or encrypted messages, preserving all underlying metadata.
Enterprise Access: Rich item extraction (Calendar, Tasks, Contacts) is available on all plans. Binary attachment extraction is available only on Enterprise-tier plans.
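
To show the kind of body, metadata, and attachment structure that can be pulled from a single EML message, here is a sketch using Python's standard `email` package. It is not the MailKit engine and handles none of the PST, OST, TNEF, or S/MIME cases mentioned above; the output keys (`subject`, `from`, `body`, `attachments`) are hypothetical.

```python
from email import message_from_bytes, policy

RAW_EML = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Site plan\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"See attached.\r\n"
    b"--b1\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b'Content-Disposition: attachment; filename="plan.dxf"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"MAo=\r\n"
    b"--b1--\r\n"
)


def summarize(raw: bytes) -> dict:
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return {
        "subject": str(msg["Subject"]),
        "from": str(msg["From"]),
        "body": body.get_content().strip() if body is not None else "",
        # Only attachment metadata is collected; payloads are not decoded here.
        "attachments": [
            {"filename": part.get_filename(), "contentType": part.get_content_type()}
            for part in msg.iter_attachments()
        ],
    }


summary = summarize(RAW_EML)
```

The summary lists `plan.dxf` as an attachment without decoding its payload, which mirrors the separation between rich item metadata and binary attachment extraction described above.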

GIS & Geospatial (TrueParserGis)

Stateless, stream-first vector parsing engine:
  • Spatial Extraction: Normalizes geometry, layers, and features into WKT, BBOX, and Centroid.
  • Contracts: Native support for zipped uploads (SHP_ZIP, FileGDB.zip, MapInfo.zip).
  • Throughput: Targets 1,000+ features/sec for structured spatial ingestion.
  • Formats: SHP, GeoJSON, GPKG, KML, KMZ, GML, FGB, FileGDB, MapInfo, SpatiaLite, and more.
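
Before uploading a zipped shapefile, it can help to confirm that the archive is complete. The sketch below is a client-side pre-flight check in Python and makes no claim about how TrueParserGis validates uploads; it relies only on the fact that a shapefile needs its `.shp`, `.shx`, and `.dbf` parts. The function names are hypothetical.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Dict, Iterable, Set

REQUIRED = {".shp", ".shx", ".dbf"}


def missing_shapefile_parts(zip_bytes: bytes) -> Dict[str, Set[str]]:
    """Map each layer found in the archive to its missing mandatory parts."""
    found: Dict[str, Set[str]] = {}
    # The archive is inspected in memory; nothing is extracted to disk.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            path = PurePosixPath(name)
            suffix = path.suffix.lower()
            if suffix in REQUIRED:
                layer = str(path.with_suffix(""))
                found.setdefault(layer, set()).add(suffix)
    # A layer is anything that has a .shp file.
    return {layer: REQUIRED - parts for layer, parts in found.items() if ".shp" in parts}


def make_zip(names: Iterable[str]) -> bytes:
    # Helper for the example: builds an archive of empty files.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()


report = missing_shapefile_parts(
    make_zip(["roads.shp", "roads.shx", "roads.dbf", "parcels.shp"])
)
```

Here `roads` is complete (an empty set of missing parts) while `parcels` lacks its `.shx` and `.dbf` companions, so it would be worth fixing before upload.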

CAD & Engineering (TrueParserCad)

Developer-friendly CAD-to-JSON conversion:
  • Geometry Extraction: Normalizes 30+ CAD entities (LINE, POLYLINE, HATCH, etc.) into structured geometry.
  • Normalization: Flattens DWG/DXF blocks into world coordinates with bbox, centroid, and WKT.
  • Hierarchy: Maintains drawing relationships with containerId and parentId for deep traceability.
  • Spatial Transformation: Automatic 3D-to-2D projections for spatial fields.
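
Flattening a block into world coordinates amounts to applying the block reference's transform to each block-local vertex. The following Python sketch shows the common 2D case (scale, then rotate, then translate); it ignores block base points, nested inserts, and 3D extrusion directions, and the function name and parameters are assumptions for illustration rather than TrueParserCad's implementation.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def block_to_world(
    points: List[Point],
    insert: Point,
    rotation_deg: float = 0.0,
    scale: Tuple[float, float] = (1.0, 1.0),
) -> List[Point]:
    """Transform block-local points into world coordinates."""
    theta = math.radians(rotation_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    ix, iy = insert
    sx, sy = scale
    world = []
    for x, y in points:
        xs, ys = x * sx, y * sy                # 1. scale in block space
        xr = xs * cos_t - ys * sin_t           # 2. rotate counter-clockwise
        yr = xs * sin_t + ys * cos_t
        world.append((ix + xr, iy + yr))       # 3. translate to the insertion point
    return world


# A unit segment in a block inserted at (10, 5), scaled 2x and rotated 90 degrees.
segment = block_to_world([(0.0, 0.0), (1.0, 0.0)], insert=(10.0, 5.0),
                         rotation_deg=90.0, scale=(2.0, 2.0))
```

The segment ends up running from (10, 5) to approximately (10, 7) in world space, at which point bbox, centroid, and WKT can be computed the same way as for any other geometry.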

SQL (TrueParserSql)

  • Dialects: ANSI, BigQuery, ClickHouse, Databricks, DuckDB, MS SQL Server, MySQL, Oracle, PostgreSQL, Redshift, Snowflake, SQLite.
  • Advanced: BOM-aware ingestion, malformed statement recovery, and semantic extraction for join/filter analysis.
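
As a rough picture of what BOM-aware ingestion involves, the Python sketch below strips a byte order mark while decoding and then splits the script into statements. It is deliberately naive: it only respects single-quoted strings, does not handle comments, dollar-quoting, or dialect-specific delimiters, and does not handle UTF-32. Both function names are hypothetical and this is not TrueParserSql's parser.

```python
import codecs
from typing import List


def decode_sql(raw: bytes) -> str:
    """Decode a SQL script, removing a UTF-8 or UTF-16 byte order mark if present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")   # the codec consumes the BOM and picks the byte order
    return raw.decode("utf-8")


def split_statements(sql: str) -> List[str]:
    """Split on semicolons that are outside single-quoted string literals."""
    statements: List[str] = []
    current: List[str] = []
    in_string = False
    for ch in sql:
        if ch == "'":
            # A doubled quote ('') toggles twice, leaving the state unchanged.
            in_string = not in_string
        if ch == ";" and not in_string:
            stmt = "".join(current).strip()
            if stmt:
                statements.append(stmt)
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


raw_script = codecs.BOM_UTF8 + b"SELECT 'a;b' AS x; SELECT 2;"
statements = split_statements(decode_sql(raw_script))
```

The script yields two statements, `SELECT 'a;b' AS x` and `SELECT 2`, with the BOM removed and the semicolon inside the string literal left intact. Since SQL sources are metered per statement, a count like this also gives a rough Document Unit estimate.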

Enterprise & OpenDocument (TrueParserMsOffice / TrueParserOpenDoc)

  • Microsoft Office: .docx, .doc, .rtf, .xlsx, .xls, .pptx, .ppt, and web formats (.html, .md).
  • OpenDocument (ODF): .odt, .ods, .odp, .odg, .odf.
  • Technical Formats: EPUB, IDML (InDesign), DBF, DIF, and MIF (FrameMaker).

Use Cases

RAG & Semantic Search

Power your LLM pipelines with structured, meaningful context from technical documents.

Compliance & Audit

Automate sensitive data discovery and extraction across large archive sets.

Analytics & Lineage

Track SQL statement lineage and extract business logic for data governance.

GIS Data Ingestion

Streamline spatial data into modern web applications and spatial databases.

Next Steps

Quickstart

Get your first document parsed in minutes.

Architecture

Learn how TrueParser handles scale and reliability.
Last modified on April 1, 2026