Intelligent Format Identification

TrueParser identifies most documents automatically by inspecting their binary signature and structure (e.g., PDF, email, SQL). ZIP-based formats (GIS Shapefiles, MapInfo), however, require an explicit format hint during ingestion. Because a ZIP file is an opaque container, the system cannot tell whether a payload holds a CAD archive, a GIS dataset, or a generic compressed archive unless you say so via the documentType parameter.
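
The ambiguity can be seen with nothing but the Python standard library. The sketch below builds two in-memory archives with placeholder contents (the member names and bytes are illustrative, not real GIS data) and shows that both begin with the same ZIP local-file-header signature, so the leading bytes alone say nothing about what is inside.

```python
import io
import zipfile


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP archive from a name -> content mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Placeholder payloads: dummy bytes only, not valid Shapefile or text data.
gis_like = make_zip({"parcels.shp": b"\x00", "parcels.dbf": b"\x00"})
generic = make_zip({"notes.txt": b"hello"})

# Every non-empty ZIP archive starts with the local file header magic "PK\x03\x04".
ZIP_LOCAL_HEADER = b"PK\x03\x04"
gis_signature = gis_like[:4]
generic_signature = generic[:4]

print(gis_signature == generic_signature == ZIP_LOCAL_HEADER)
```

Both archives look identical to a signature check, which is why a declared documentType is the dependable way to route them.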

The Routing Decision

When a document enters the system, TrueParser determines the optimal engine using the following priority:

1. Explicit Declaration

If you provide a documentType in your request, TrueParser honors that routing immediately. This is the recommended approach for production pipelines where the source format is known.

2. Automatic Detection

If no type is provided, the platform combines filename heuristics, MIME types, and signature sniffing to identify the format. Once the format is identified, the document is routed to the corresponding engine family (e.g., TrueParserGis for geospatial data, TrueParserMsOffice for Word/Excel).
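
The two-step priority above can be sketched as a small function. This is an illustration only, not TrueParser's implementation: the lookup tables, the documentType values ("gis", "cad", "pdf"), the function name, and the order of checks inside automatic detection are all assumptions made for the example. Only the engine family names come from this page.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical tables; the actual detection rules are not documented here.
DECLARED_TYPE_TO_FAMILY = {"gis": "TrueParserGis", "cad": "TrueParserCad", "pdf": "TrueParserPdf"}
SIGNATURE_TO_FAMILY = {b"%PDF-": "TrueParserPdf"}
MIME_TO_FAMILY = {"application/pdf": "TrueParserPdf", "message/rfc822": "TrueParserMailKit"}
EXTENSION_TO_FAMILY = {".dxf": "TrueParserCad", ".sql": "TrueParserSql", ".docx": "TrueParserMsOffice"}


def choose_family(
    filename: str,
    head: bytes,
    mime_type: Optional[str] = None,
    document_type: Optional[str] = None,
) -> Optional[str]:
    """Return an engine family name, or None when nothing matches."""
    # 1. Explicit declaration: honored immediately, no sniffing.
    if document_type is not None:
        if document_type not in DECLARED_TYPE_TO_FAMILY:
            raise ValueError(f"unknown documentType: {document_type!r}")
        return DECLARED_TYPE_TO_FAMILY[document_type]

    # 2. Automatic detection: signature, then MIME type, then file extension
    #    (this ordering is an assumption of the sketch).
    for signature, family in SIGNATURE_TO_FAMILY.items():
        if head.startswith(signature):
            return family
    if mime_type in MIME_TO_FAMILY:
        return MIME_TO_FAMILY[mime_type]
    return EXTENSION_TO_FAMILY.get(PurePath(filename).suffix.lower())


print(choose_family("survey.zip", b"PK\x03\x04", document_type="gis"))
print(choose_family("survey.zip", b"PK\x03\x04"))
```

In this sketch the same ZIP payload resolves to TrueParserGis when the type is declared and to None when it is not, mirroring the reason an explicit hint is recommended for known sources.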

Case Study: The CSV Dilemma

CSV is a unique format because it can represent two very different kinds of data:
  • GIS/Spatial: Coordinate lists and point layers.
  • Tabular/Office: Standard spreadsheets and reports.

To resolve this ambiguity, TrueParser requires an explicit csvRoute at ingestion time. This ensures that a spatial CSV is processed by the full GIS engine (geometry, WKT, BBOX) while a financial report is handled by the lightweight tabular engine.
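
A client-side guard can make the csvRoute requirement hard to forget. The sketch below is hypothetical: this page names the csvRoute parameter but not its accepted values, so "gis" and "tabular" are placeholders, and the function name and payload shape are invented for illustration.

```python
from typing import Optional

# Placeholder route values; check the API reference for the real ones.
ALLOWED_CSV_ROUTES = {"gis", "tabular"}


def build_csv_request(filename: str, csv_route: Optional[str]) -> dict:
    """Build a hypothetical ingestion payload, refusing CSVs without a route."""
    if csv_route is None:
        raise ValueError("CSV ingestion needs an explicit csvRoute")
    if csv_route not in ALLOWED_CSV_ROUTES:
        raise ValueError(f"unsupported csvRoute: {csv_route!r}")
    return {"filename": filename, "csvRoute": csv_route}


# A coordinate list goes to the spatial route, a report to the tabular route.
spatial_request = build_csv_request("well_locations.csv", "gis")
report_request = build_csv_request("q3_revenue.csv", "tabular")
print(spatial_request)
print(report_request)
```

Failing fast in the client keeps a missing route from surfacing later as an ingestion error.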

Specialized Engine Families

TrueParser orchestrates the following specialized engine families:
  • TrueParserPdf: Technical and Vision-based PDF extraction.
  • TrueParserGis: High-throughput vector spatial ingestion.
  • TrueParserCad: DWG/DXF normalization and world-coordinate flattening.
  • TrueParserSql: Multi-dialect SQL statement analysis and logic extraction.
  • TrueParserMailKit: Email forensics for PST/OST/EML archives.
  • TrueParserMsOffice / TrueParserOpenDoc: Enterprise document processing.

Benefits of Decoupled Routing

Separating routing from the core API provides the following benefits:
  • Scalable Processing: The system can scale processing capacity independently for different format families based on your actual traffic patterns.
  • Version Isolation: You can benefit from engine-specific updates (e.g., a new SQL dialect) without any changes to your ingestion code.
  • Reliability: A failure in one engine family does not impact the rest of the parsing platform.

Last modified on April 1, 2026