TrueParser identifies most documents (e.g., PDF, email, SQL) automatically by inspecting their binary signature and structure.
However, formats delivered as ZIP archives (e.g., GIS Shapefiles, MapInfo) require an explicit format hint during ingestion. Because a ZIP file is an opaque container, the system cannot tell whether a payload holds a CAD archive, a GIS dataset, or a generic compressed archive unless you declare it through the documentType parameter.
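The container problem can be illustrated with a minimal sketch. It is not TrueParser's detector: sniff_signature and make_zip are hypothetical helpers, and the check looks only at the leading bytes. The PDF and ZIP signatures themselves are the standard ones.

```python
import io
import zipfile

# Standard leading byte signatures ("magic numbers").
PDF_MAGIC = b"%PDF-"
ZIP_MAGIC = b"PK\x03\x04"  # local file header of the first ZIP member


def sniff_signature(payload: bytes) -> str:
    """Return a coarse format label based only on the leading bytes."""
    if payload.startswith(PDF_MAGIC):
        return "pdf"
    if payload.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def make_zip(member_name: str) -> bytes:
    """Build a tiny in-memory ZIP holding one empty member (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, b"")
    return buffer.getvalue()


# Two archives with very different contents: a GIS layer and a CAD drawing.
shapefile_zip = make_zip("parcels.shp")
drawing_zip = make_zip("site-plan.dwg")

# Both report the same container signature, so the signature alone
# does not reveal which engine family should receive the payload.
print(sniff_signature(shapefile_zip), sniff_signature(drawing_zip))
```

Both payloads come back as the same container type, which is why the declared documentType has to carry the routing decision for these uploads.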
The Routing Decision
When a document enters the system, TrueParser selects an engine in the following order of priority:
1. Explicit Declaration
If you provide a documentType in your request, TrueParser honors that routing immediately. This is the recommended approach for production pipelines where the source format is known.
2. Automatic Detection
If no type is provided, the platform combines filename heuristics, MIME types, and signature sniffing to identify the format. Once the format is identified, the document is routed to the corresponding engine family (e.g., TrueParserGis for geospatial data, TrueParserMsOffice for Word/Excel).
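The two-step priority above can be sketched as follows. This is an illustration under stated assumptions, not TrueParser's implementation: the documentType keys ("pdf", "gis", "cad", "sql"), the detect_type stand-in, and the choose_family function are all hypothetical. Only the engine family names come from this page.

```python
from typing import Optional

# Hypothetical mapping from documentType values to engine families.
# The family names appear in this page; the keys are assumptions.
FAMILY_BY_TYPE = {
    "pdf": "TrueParserPdf",
    "gis": "TrueParserGis",
    "cad": "TrueParserCad",
    "sql": "TrueParserSql",
}


def detect_type(filename: str, payload: bytes) -> Optional[str]:
    """Stand-in for automatic detection: a signature check, then an extension check."""
    if payload.startswith(b"%PDF-"):
        return "pdf"
    if filename.lower().endswith(".sql"):
        return "sql"
    return None  # e.g. a ZIP container whose contents are not declared


def choose_family(filename: str, payload: bytes,
                  document_type: Optional[str] = None) -> str:
    # 1. Explicit declaration takes priority over anything detected.
    if document_type is not None:
        if document_type not in FAMILY_BY_TYPE:
            raise ValueError(f"unknown documentType: {document_type!r}")
        return FAMILY_BY_TYPE[document_type]

    # 2. Automatic detection is used only when no type was declared.
    detected = detect_type(filename, payload)
    if detected is None:
        raise ValueError("format not detected; declare documentType explicitly")
    return FAMILY_BY_TYPE[detected]


# An undeclared PDF is detected; a declared ZIP upload follows the declaration.
print(choose_family("report.pdf", b"%PDF-1.7\n"))
print(choose_family("parcels.zip", b"PK\x03\x04", document_type="gis"))
```

Checking the declared type first keeps known-format production pipelines independent of the detection heuristics.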
Case Study: The CSV Dilemma
CSV is ambiguous because the same format can carry two very different kinds of data:
- GIS/Spatial: Coordinate lists and point layers.
- Tabular/Office: Standard spreadsheets and reports.
To resolve this, TrueParser requires an explicit csvRoute at ingestion time. This ensures that your spatial CSV is processed with the full GIS engine (geometry, WKT, BBOX) while your financial reports are handled by the lightweight tabular engine.
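As a rough sketch of how a client might enforce this rule before sending a request: the csvRoute parameter name comes from this page, but the values "gis" and "tabular", the request shape, and build_ingest_request are assumptions made for illustration only.

```python
from typing import Optional

# Assumed values: this page names the csvRoute parameter but not its legal values.
CSV_ROUTES = {"gis", "tabular"}


def build_ingest_request(filename: str, csv_route: Optional[str] = None) -> dict:
    """Build a hypothetical ingestion request body, refusing CSV files without a route."""
    request = {"filename": filename}
    if filename.lower().endswith(".csv"):
        if csv_route not in CSV_ROUTES:
            raise ValueError(
                f"CSV uploads need csvRoute set to one of {sorted(CSV_ROUTES)}"
            )
        request["csvRoute"] = csv_route
    return request


# A coordinate list goes to the GIS engine, a financial report to the tabular one.
print(build_ingest_request("wells.csv", csv_route="gis"))
print(build_ingest_request("q3-report.csv", csv_route="tabular"))
```

Failing on the client side when the route is missing surfaces the ambiguity before the file is uploaded.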
Specialized Engine Families
TrueParser orchestrates the following specialized engine families:
- TrueParserPdf: Technical and Vision-based PDF extraction.
- TrueParserGis: High-throughput vector spatial ingestion.
- TrueParserCad: DWG/DXF normalization and world-coordinate flattening.
- TrueParserSql: Multi-dialect SQL statement analysis and logic extraction.
- TrueParserMailKit: Email forensics for PST/OST/EML archives.
- TrueParserMsOffice / TrueParserOpenDoc: Enterprise document processing.
Benefits of Decoupled Routing
Separating routing from the core API provides the following benefits:
- Scalable Processing: The system can scale processing capacity independently for different format families based on your actual traffic patterns.
- Version Isolation: You can benefit from engine-specific updates (e.g., a new SQL dialect) without any changes to your ingestion code.
- Reliability: A failure in one engine family does not impact the rest of the parsing platform.
Last modified on April 1, 2026