Why TrueParser?
Unified API
One endpoint for all your parsing needs, regardless of the underlying file
format.
Streaming-First
Stream-based processing keeps memory usage low even for very large
documents.
Format Agnostic
Built-in detection and routing for CAD, SQL, PDF, and Enterprise Office
formats.
Format Universe
TrueParser orchestrates specialized engines to deliver deep extraction capabilities across technical domains.
GIS & Geospatial
Preserve geometry (points, lines, polygons) and spatial metadata (CRS,
attributes).
CAD & Engineering
DWG and DXF. Preserve entities, layers, blocks, and attributes in structured
JSON.
SQL & Database
Multi-dialect support for PostgreSQL, T-SQL, Snowflake, BigQuery, and the
other dialects listed in the Support Matrix.
Email & Archives
PST, OST, EML, and MSG. Extract body, metadata, and attachments.
Enterprise Documents
Microsoft Office (Word, Excel, PowerPoint) and OpenDocument (ODF) support.
Technical PDF
Structured extraction from complex PDF documents with semantic awareness.
Core Capabilities
TrueParser handles the cross-cutting concerns that modern data pipelines require, allowing you to focus on the extracted data rather than the parsing mechanics.
High-Performance Architecture
Designed for high-volume forensics and enterprise ingestion:
- Zero-Disk Buffering: Direct memory-to-stream processing reduces I/O overhead.
- Lazy Attachment Loading: Large email attachments are only read when explicitly requested, minimizing memory footprint.
- High Throughput: Engines like MailKit and GIS target metadata extraction rates exceeding 400 MB/s.
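The lazy attachment loading described above can be pictured with a small sketch. This is not TrueParser's implementation: `LazyAttachment` and its loader callback are hypothetical names, used only to show the idea that attachment bytes are read on first access and never before.

```python
from typing import Callable, Optional


class LazyAttachment:
    """Hypothetical attachment handle that defers reading its bytes."""

    def __init__(self, name: str, size: int, loader: Callable[[], bytes]) -> None:
        self.name = name          # cheap metadata, always available
        self.size = size
        self._loader = loader     # deferred read, e.g. a seek + read on the source
        self._data: Optional[bytes] = None

    def read(self) -> bytes:
        # The payload is fetched once, on the first explicit request.
        if self._data is None:
            self._data = self._loader()
        return self._data


# Listing attachments touches only metadata; no payload is loaded here.
attachment = LazyAttachment("report.pdf", 7, loader=lambda: b"payload")
print(attachment.name, attachment.size)

# The payload is read only when the caller asks for it.
payload = attachment.read()
```

Under this assumption, a pipeline that only needs attachment names and sizes never pays the memory cost of the payloads.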
Normalized Spatial Contracts
TrueParser unifies technical geometry across disparate engines. Whether you are parsing a complex CAD drawing or a multi-layer GIS archive, the platform delivers a consistent spatial contract:
- WKT (Well-Known Text): All geometry is normalized into standard WKT for easy database and application ingestion.
- BBOX & Centroids: Every spatial record includes pre-calculated Bounding Box and Centroid metadata.
- World-Coordinate Flattening: CAD blocks and GIS layers are projected into a unified coordinate space for direct interoperability.
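To make the spatial contract concrete, here is a minimal sketch that derives WKT, a bounding box, and a centroid for a single polygon ring. The field names (`wkt`, `bbox`, `centroid`) and the `spatial_record` helper are assumptions for illustration, not TrueParser's documented output schema. The centroid uses the standard area-weighted (shoelace) formula.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def _closed(ring: List[Point]) -> List[Point]:
    # WKT polygon rings must repeat the first vertex at the end.
    ring = list(ring)
    return ring if ring[0] == ring[-1] else ring + [ring[0]]


def polygon_wkt(ring: List[Point]) -> str:
    coords = ", ".join(f"{x:g} {y:g}" for x, y in _closed(ring))
    return f"POLYGON (({coords}))"


def bbox(ring: List[Point]) -> Tuple[float, float, float, float]:
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return (min(xs), min(ys), max(xs), max(ys))


def centroid(ring: List[Point]) -> Point:
    # Area-weighted centroid; assumes a simple ring with non-zero area.
    pts = _closed(ring)
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3.0 * area2), cy / (3.0 * area2))


def spatial_record(ring: List[Point]) -> Dict[str, object]:
    # Hypothetical record shape combining the three contract fields.
    return {"wkt": polygon_wkt(ring), "bbox": bbox(ring), "centroid": centroid(ring)}


record = spatial_record([(0, 0), (4, 0), (4, 2), (0, 2)])
print(record["wkt"])  # POLYGON ((0 0, 4 0, 4 2, 0 2, 0 0))
```

For the rectangle above, the bounding box is `(0, 0, 4, 2)` and the centroid is `(2.0, 1.0)`.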
Result Materialization
The platform transforms parsed output into final, canonical JSON artifacts stored directly in your tenant-scoped S3-compatible storage.
Document Units
TrueParser measures parsing consumption in Document Units. A Document Unit is the commercial usage unit used for plan limits and quota enforcement. Depending on the parser family, one Document Unit may correspond to:
- one page for PDF, Microsoft Office, and OpenDocument files
- one email or message item for MailKit sources
- one SQL statement for SQL sources
- one logical dataset for GIS sources
- one processed document for CAD sources
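The mapping above can be sketched as a small lookup. The family keys, item names, and both helper functions are hypothetical, and the sketch assumes a strict 1:1 relationship between counted items and Document Units, which your plan terms may refine.

```python
# Hypothetical mapping from parser family to the item that one Document Unit
# counts, following the list above. Keys and item names are illustrative only.
UNIT_BASIS = {
    "pdf": "pages",
    "msoffice": "pages",
    "opendoc": "pages",
    "mailkit": "messages",
    "sql": "statements",
    "gis": "datasets",
    "cad": "documents",
}


def document_units(family: str, item_counts: dict) -> int:
    """Return the Document Units for one parse job, assuming a 1:1 mapping."""
    basis = UNIT_BASIS[family]
    return int(item_counts[basis])


def within_quota(used: int, job_units: int, plan_limit: int) -> bool:
    """Check whether a job would fit in the remaining plan allowance."""
    return used + job_units <= plan_limit


units = document_units("pdf", {"pages": 12})   # 12 pages -> 12 units
fits = within_quota(used=990, job_units=units, plan_limit=1000)
print(units, fits)  # 12 False
```

Estimating units before submission in this way lets a client skip or split jobs that would exceed the remaining quota.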
Governance & Security
Built for enterprise use with:
- Multi-tenant isolation: Logical and physical separation of tenant data.
- JWT-based auth: Secure machine-to-machine authentication.
- Resource Controls: Granular rate limiting and quotas.
Support Matrix
Below is a detailed view of the formats supported by the TrueParser engine library.
PDF & Vision (TrueParserPdf)
TrueParserPdf uses a dual-pipeline approach for document intelligence:
- Basic Pipeline: High-performance text and table extraction using Syncfusion and Tabula heuristics. Supports SingleColumn and MultiColumn reading flows.
- Advanced Pipeline: Vision model-based extraction for complex layouts and OCR-required documents, normalizing provider-specific output into canonical blocks.
- Output: Canonical JSON artifacts for product retrieval and downstream use.
Messaging & Archives (TrueParserMailKit)
Optimized for high-volume forensics and large-scale migrations, the MailKit engine provides exhaustive coverage for PST, OST, MSG, EML, MBOX, and MHT formats. TrueParser delivers deep extraction for rich structured items including Calendar events, Appointments, Contacts, Tasks, and Distribution Lists. The engine handles complex Outlook-specific encapsulations like TNEF (winmail.dat) and provides cryptographic unwrapping for S/MIME (.p7s/.p7m) signed or encrypted messages, preserving all underlying metadata.
Enterprise Access: Rich item extraction (Calendar, Tasks, Contacts) is
available on all plans. However, binary attachment extraction is
exclusively supported on Enterprise-tier plans.
GIS & Geospatial (TrueParserGis)
Stateless, stream-first vector parsing engine:
- Spatial Extraction: Normalizes geometry, layers, and features into WKT, BBOX, and Centroid.
- Contracts: Native support for zipped uploads (SHP_ZIP, FileGDB.zip, MapInfo.zip).
- Throughput: Targets 1,000+ features/sec for structured spatial ingestion.
- Formats: SHP, GeoJSON, GPKG, KML, KMZ, GML, FGB, FileGDB, MapInfo, SpatiaLite, and more.
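Before sending a zipped shapefile, it can help to confirm that the archive is complete. The sketch below is a client-side pre-check, not part of TrueParser: `missing_shapefile_parts` is a hypothetical helper, and the required set reflects the shapefile format itself, which needs a main file (.shp), an index file (.shx), and a dBASE table (.dbf). As a simplification, it does not verify that the three files share a base name.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import Set

# Mandatory members of a shapefile dataset.
REQUIRED_SUFFIXES = {".shp", ".shx", ".dbf"}


def missing_shapefile_parts(zip_bytes: bytes) -> Set[str]:
    """Return the mandatory shapefile suffixes absent from a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        present = {PurePosixPath(name).suffix.lower() for name in archive.namelist()}
    return REQUIRED_SUFFIXES - present


# Build a deliberately incomplete archive in memory to show the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("roads.shp", b"")
    archive.writestr("roads.shx", b"")

print(missing_shapefile_parts(buffer.getvalue()))  # {'.dbf'}
```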
CAD & Engineering (TrueParserCad)
Developer-friendly CAD-to-JSON conversion:
- Geometry Extraction: Normalizes 30+ CAD entities (LINE, POLYLINE, HATCH, etc.) into structured geometry.
- Normalization: Flattens DWG/DXF blocks into world coordinates with bbox, centroid, and WKT.
- Hierarchy: Maintains drawing relationships with containerId and parentId for deep traceability.
- Spatial Transformation: Automatic 3D-to-2D projections for spatial fields.
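Flattening a block into world coordinates amounts to applying the block insertion's transform to each block-local vertex. The following is a minimal 2D sketch of that idea under simplifying assumptions: it applies scale, then rotation, then translation, and ignores block base points, nested inserts, and 3D extrusion. The `to_world` function and its parameters are hypothetical and do not describe TrueParserCad's internals.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def to_world(
    local_points: List[Point],
    insert: Point,
    rotation_deg: float = 0.0,
    scale: Point = (1.0, 1.0),
) -> List[Point]:
    """Map block-local 2D points to world coordinates (scale, rotate, translate)."""
    theta = math.radians(rotation_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    insert_x, insert_y = insert
    scale_x, scale_y = scale
    world = []
    for x, y in local_points:
        x, y = x * scale_x, y * scale_y            # 1. scale in block space
        world.append((
            insert_x + x * cos_t - y * sin_t,      # 2. rotate, 3. translate
            insert_y + x * sin_t + y * cos_t,
        ))
    return world


# A unit segment in a block inserted at (10, 5) and rotated by 90 degrees
# ends up pointing along the world Y axis, up to floating-point rounding.
segment = to_world([(0.0, 0.0), (1.0, 0.0)], insert=(10.0, 5.0), rotation_deg=90.0)
print([(round(x, 6), round(y, 6)) for x, y in segment])
```

Once every entity is expressed in world coordinates, bounding boxes, centroids, and WKT can be computed uniformly regardless of which block the entity came from.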
SQL (TrueParserSql)
- Dialects: ANSI, BigQuery, ClickHouse, Databricks, DuckDB, MS SQL Server, MySQL, Oracle, PostgreSQL, Redshift, Snowflake, SQLite.
- Advanced: BOM-aware ingestion, malformed statement recovery, and semantic extraction for join/filter analysis.
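As an illustration of what BOM-aware ingestion involves, the sketch below detects a byte order mark at the start of a SQL file and decodes accordingly. It uses only the BOM constants from Python's standard `codecs` module; `decode_sql` is a hypothetical helper and is not TrueParserSql's API. UTF-32 marks are checked first because the UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.

```python
import codecs

# Order matters: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_sql(raw: bytes, default: str = "utf-8") -> str:
    """Decode SQL source bytes, honoring and stripping a leading BOM if present."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    # No BOM found: fall back to the caller's default encoding.
    return raw.decode(default)


utf8_script = codecs.BOM_UTF8 + b"SELECT 1;"
utf16_script = codecs.BOM_UTF16_LE + "SELECT 1;".encode("utf-16-le")
print(decode_sql(utf8_script))   # SELECT 1;
print(decode_sql(utf16_script))  # SELECT 1;
```

Stripping the mark before tokenizing matters because a stray U+FEFF in front of the first keyword can otherwise cause the first statement to be rejected.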
Enterprise & OpenDocument (TrueParserMsOffice / TrueParserOpenDoc)
- Microsoft Office: .docx, .doc, .rtf, .xlsx, .xls, .pptx, .ppt, and web formats (.html, .md).
- OpenDocument (ODF): .odt, .ods, .odp, .odg, .odf.
- Technical Formats: EPUB, IDML (InDesign), DBF, DIF, and MIF (FrameMaker).
Use Cases
RAG & Semantic Search
Power your LLM pipelines with structured, meaningful context from technical
documents.
Compliance & Audit
Automate sensitive data discovery and extraction across large archive sets.
Analytics & Lineage
Track SQL statement lineage and extract business logic for data governance.
GIS Data Ingestion
Streamline spatial data into modern web applications and spatial databases.
Next Steps
Quickstart
Get your first document parsed in minutes.
Architecture
Learn how TrueParser handles scale and reliability.

