TrueParser provides a unified API for extracting structured JSON from CAD, GIS, SQL, email archives, and enterprise files while giving tenants a control plane for apps, plans, regions, storage, and demos.Documentation Index
Fetch the complete documentation index at: https://docs.trueparser.com/llms.txt
Use this file to discover all available pages before exploring further.
Why TrueParser?
Unified API
One endpoint for all your parsing needs, regardless of the underlying file format.
Streaming-First
Stream-based processing keeps memory usage low even for very large documents.
Format Agnostic
Built-in detection and routing for CAD, SQL, PDF, and enterprise Office formats.
Control Plane
The dashboard is where tenant teams manage access and setup.- create tenant apps and credentials
- assign plans and billing state
- select a license region
- configure optional tenant-owned storage
- use the Demo UI for manual testing
Format Universe
TrueParser orchestrates specialized engines to deliver deep extraction capabilities across technical domains.GIS & Geospatial
Preserve geometry (points, lines, polygons) and spatial metadata (CRS, attributes).
CAD & Engineering
DWG and DXF. Preserve entities, layers, blocks, and attributes in structured JSON.
SQL & Database
Multi-dialect support for PostgreSQL, T-SQL, Snowflake, BigQuery, and 10+ others.
Email & Archives
PST, OST, EML, and MSG. Extract body, metadata, and attachments.
Enterprise Documents
Microsoft Office (Word, Excel, PowerPoint) and OpenDocument (ODF) support.
Technical PDF
Structured extraction from complex PDF documents with semantic awareness.
Parquet Datasets
Columnar extraction with metadata-first or metadata-plus-rows modes.
Core Capabilities
TrueParser handles the cross-cutting concerns that modern data pipelines require, allowing you to focus on the extracted data rather than the parsing mechanics.High-Performance Architecture
Designed for high-volume forensics and enterprise ingestion:- Zero-Disk Buffering: Direct memory-to-stream processing reduces I/O overhead.
- Lazy Attachment Loading: Large email attachments are only read when explicitly requested, minimizing memory footprint.
- High Throughput: Engines like MailKit and GIS target metadata extraction rates exceeding 400 MB/s.
Normalized Spatial Contracts
TrueParser unifies technical geometry across disparate engines. Whether you are parsing a complex CAD drawing or a multi-layer GIS archive, the platform delivers a consistent spatial contract:- WKT (Well-Known Text): All geometry is normalized into standard WKT for easy database and application ingestion.
- BBOX & Centroids: Every spatial record includes pre-calculated Bounding Box and Centroid metadata.
- World-Coordinate Flattening: CAD blocks and GIS layers are projected into a unified coordinate space for direct interoperability.
Result Materialization
The platform transforms parsed output into final, canonical JSON artifacts stored in managed storage or your tenant-scoped S3-compatible storage, depending on how the app is configured.Document Units
TrueParser measures parsing consumption in Document Units. A Document Unit is the commercial usage unit used for plan entitlements and quota enforcement. Depending on the parser family, one Document Unit may correspond to:- one page for PDF, Microsoft Office, and OpenDocument files
- one email or message item for MailKit sources
- one SQL statement for SQL sources
- one logical dataset for GIS sources
- one processed document for CAD sources
Governance & Security
Built for enterprise use with:- Multi-tenant isolation: Logical and physical separation of tenant data.
- JWT-based auth: Secure machine-to-machine authentication.
- Resource Controls: Granular rate limiting and quotas.
Support Matrix
Below is a detailed view of the formats supported by the TrueParser engine library.PDF (TrueParserPdf)
TrueParserPdf supports three PDF modes:- Basic Single Column: Optimized for standard text and table extraction in single-column layouts.
- Basic Multi Column: Optimized for multi-column reading flows in standard digital PDFs.
- Advanced: OCR for scanned PDFs and OCR-aware extraction for complex digital layouts. Advanced runs are limited to 100 pages per run in beta.
- Maximum file size: 25 MB
- Advanced mode page limit: 100 pages per run
Parquet (TrueParserParquet)
- Modes:
MetadataOnlyandMetadataPlusRows. - Default: If no mode is provided, ingestion defaults to
MetadataOnly. - Use Cases: Fast schema validation, profile generation, and row-level extraction for downstream analytics.
Messaging & Archives (TrueParserMailKit)
Optimized for high-volume forensics and large-scale migrations, the MailKit engine provides exhaustive coverage for PST, OST, MSG, EML, MBOX, and MHT formats. TrueParser delivers deep extraction for rich structured items including Calendar events, Appointments, Contacts, Tasks, and Distribution Lists. The engine handles complex Outlook-specific encapsulations like TNEF (winmail.dat) and provides cryptographic unwrapping for S/MIME (.p7s/.p7m) signed or encrypted messages, preserving all underlying metadata.Enterprise Access: Rich item extraction (Calendar, Tasks, Contacts) is
available on all plans. However, binary attachment extraction is exclusively
supported on Enterprise-tier plans.
GIS & Geospatial (TrueParserGis)
Stateless, stream-first vector parsing engine:- Spatial Extraction: Normalizes geometry, layers, and features into WKT, BBOX, and Centroid.
- Contracts: Native support for zipped uploads (SHP_ZIP, FileGDB.zip, MapInfo.zip).
- Throughput: Targets 1,000+ features/sec for structured spatial ingestion.
- Formats: SHP, GeoJSON, GPKG, KML, KMZ, GML, FGB, FileGDB, MapInfo, SpatiaLite, and more.
CAD & Engineering (TrueParserCad)
Developer-friendly CAD-to-JSON conversion:- Geometry Extraction: Normalizes 30+ CAD entities (LINE, POLYLINE, HATCH, etc.) into structured geometry.
- Normalization: Flattens DWG/DXF blocks into world coordinates with bbox, centroid, and WKT.
- Hierarchy: Maintains drawing relationships with
containerIdandparentIdfor deep traceability. - Spatial Transformation: Automatic 3D-to-2D projections for spatial fields.
SQL (TrueParserSql)
- Dialects: ANSI, BigQuery, ClickHouse, Databricks, DuckDB, MS SQL Server, MySQL, Oracle, PostgreSQL, Redshift, Snowflake, SQLite.
- Advanced: BOM-aware ingestion, malformed statement recovery, and semantic extraction for join/filter analysis.
Enterprise & OpenDocument (TrueParserMsOffice / TrueParserOpenDoc)
- Microsoft Office: .docx, .doc, .rtf, .xlsx, .xls, .pptx, .ppt, and web formats (.html, .md).
- OpenDocument (ODF): .odt, .ods, .odp, .odg, .odf.
- Technical Formats: EPUB, IDML (InDesign), DBF, DIF, and MIF (FrameMaker).
Use Cases
RAG & Semantic Search
Power your LLM pipelines with structured, meaningful context from technical documents.
Compliance & Audit
Automate sensitive data discovery and extraction across large archive sets.
Analytics & Lineage
Track SQL statement lineage and extract business logic for data governance.
GIS Data Ingestion
Streamline spatial data into modern web applications and spatial databases.
Next Steps
Quickstart
Get your first document parsed in minutes.
Dashboard Overview
Learn how tenants, apps, plans, and regions fit together.
Demo UI
Try the tenant-facing test workspace with one app.

