Document processing

The document processing pipeline

Every document travels through an eight-stage, AI-driven pipeline, from secure access control all the way to relevant results. Everything runs on-premises, inside your own system.

Processing in numbers

8 processing stages
6144 vector dimensions
3+ search methods
0.95 recognition accuracy
🔐 Access → 📷 OCR → 📝 Normalise → 🏷️ Doc type → 🔑 Extract → 🔍 Elastic → 🧠 Semantic → Hybrid RAG
🔐 1. Access control and filtering
Every search begins with access control. The system reads the Passport JSON to determine which organisation the user belongs to and which documents they can access. This ensures that every response contains only documents the user is entitled to see.
Input: Search term + Passport JSON
Output: Filtered document ID list
🔧 Passport JSON ⚙️ Search microservice 📐 tenant / orga / user id 📐 doc type id / directory
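The filtering step could look roughly like the sketch below. The passport field names (`tenant`, `orga`, `user_id`, `doc_type_ids`, `directories`) and the `Document` record are assumptions made for illustration; the actual Passport JSON schema is not shown on this page.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    # Hypothetical document metadata used for access decisions.
    doc_id: str
    tenant: str
    orga: str
    doc_type_id: str
    directory: str


def allowed_document_ids(passport_json: str, documents: list[Document]) -> list[str]:
    """Return the IDs of the documents the passport holder may see."""
    passport = json.loads(passport_json)
    # Missing lists mean "nothing allowed": the filter denies by default.
    allowed_types = set(passport.get("doc_type_ids", []))
    allowed_dirs = set(passport.get("directories", []))
    return [
        doc.doc_id
        for doc in documents
        if doc.tenant == passport["tenant"]
        and doc.orga == passport["orga"]
        and doc.doc_type_id in allowed_types
        and doc.directory in allowed_dirs
    ]


passport = json.dumps({
    "tenant": "t1", "orga": "acme", "user_id": "u7",
    "doc_type_ids": ["invoice"], "directories": ["/finance"],
})
docs = [
    Document("d1", "t1", "acme", "invoice", "/finance"),
    Document("d2", "t1", "acme", "contract", "/legal"),
    Document("d3", "t2", "other", "invoice", "/finance"),
]
print(allowed_document_ids(passport, docs))  # ['d1']
```

Running the filter before any search means the later stages only ever see document IDs the user is entitled to.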
📷 2. Basic document processing (OCR)
The text layer of the document is produced via optical character recognition (OCR). For scanned documents this step is essential: it is the foundation for every downstream operation.
Input: Physical or digital document
Output: Text layer (full text)
🔧 OCR ⚙️ DocIT (base system) 📐 Exact text alignment
📝 3. Text normalisation
The raw OCR text is converted into structured Markdown. Noise is removed (repeated headers/footers, page numbers), producing clean, consistent text ready for AI processing.
Input: OCR text
Output: Normalised Markdown document
🔧 Markdown generation 🔧 Noise filtering ⚙️ Pipeline service
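As an illustration of the noise-filtering part only (not the Markdown generation), the sketch below drops page-number lines and any line that repeats on several pages. The function name, the per-page input format and the page-number pattern are assumptions; a production filter would need to be more careful, for example with lines that consist of a bare number.

```python
import re
from collections import Counter

# Matches lines such as "3", "Page 3" or "Page 3 of 12" (assumed formats).
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s*(/|of)\s*\d+)?\s*$", re.IGNORECASE)


def normalise_pages(pages: list[str], min_repeats: int = 2) -> str:
    """Drop page numbers and lines repeated across pages, then join the rest."""
    page_lines = [[line.strip() for line in page.splitlines()] for page in pages]
    # Count on how many pages each non-empty line appears.
    seen_on_pages = Counter(
        line for lines in page_lines for line in set(lines) if line
    )
    kept: list[str] = []
    for lines in page_lines:
        for line in lines:
            if not line or PAGE_NUMBER.match(line):
                continue
            if len(pages) > 1 and seen_on_pages[line] >= min_repeats:
                continue  # treated as a repeated header or footer
            kept.append(line)
    return "\n".join(kept)


pages = [
    "ACME Ltd.\nInvoice 2024-001\nTotal: 100 EUR\nPage 1 of 2",
    "ACME Ltd.\nPayment due in 30 days\nPage 2 of 2",
]
print(normalise_pages(pages))
```

In this example the repeated "ACME Ltd." header and both page-number lines are removed, leaving only the three content lines.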
🏷️ 4. Document type detection
The document is converted into a 6144-dimensional vector (its fingerprint), and machine learning recognises its type. The system compares the fingerprint with the fingerprints of trained sample documents and classifies the document using a threshold of 0.9–0.95.
Input: 6144-dimensional vector (fingerprint)
Output: Document type (e.g. invoice)
🔧 Machine Learning 🔧 Snapshot fingerprint ⚙️ ML service 📐 Threshold: 0.9–0.95
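One way to picture threshold-based classification is the sketch below. It assumes cosine similarity as the comparison measure and one sample fingerprint per type; neither is stated on this page, and the toy vectors have 3 dimensions instead of 6144 to keep the example readable.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(
    fingerprint: list[float],
    samples: dict[str, list[float]],
    threshold: float = 0.9,
) -> Optional[str]:
    """Return the best-matching type, or None if no sample reaches the threshold."""
    best_type, best_score = None, 0.0
    for doc_type, sample in samples.items():
        score = cosine_similarity(fingerprint, sample)
        if score > best_score:
            best_type, best_score = doc_type, score
    return best_type if best_score >= threshold else None


samples = {"invoice": [1.0, 0.0, 0.0], "contract": [0.0, 1.0, 0.0]}
print(classify([0.95, 0.05, 0.0], samples))  # invoice
print(classify([1.0, 1.0, 0.0], samples))    # None
```

Returning `None` below the threshold keeps ambiguous documents unclassified instead of forcing them into the nearest type.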
🔑 5. Structured data extraction
Based on the detected document type, the system extracts key-value pairs: tax numbers, amounts, dates, names. Three techniques are combined: regular expressions, named-entity recognition (NER), and LLM-based extraction.
Input: Normalised Markdown text
Output: Key-value pairs (structured fields)
🔧 Regex 🔧 NER 🔧 LLM ⚙️ Extraction Pipeline 📐 geolocation / org / person
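The regular-expression part of the extraction can be sketched as follows; the NER and LLM steps are left out. The field names and the three patterns (an 8-1-2 digit tax number, ISO dates, amounts followed by a currency code) are assumptions chosen for illustration, not the patterns the pipeline actually uses.

```python
import re

# Hypothetical patterns; a real deployment would hold one set per document type.
FIELD_PATTERNS = {
    "tax_number": re.compile(r"\b\d{8}-\d-\d{2}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "amount": re.compile(r"\b\d+(?:[.,]\d{2})?\s?(?:EUR|USD|HUF)\b"),
}


def extract_fields(text: str) -> dict[str, str]:
    """Return the first match of every pattern as a key-value pair."""
    fields: dict[str, str] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(0)
    return fields


text = (
    "Invoice date: 2024-03-15\n"
    "Supplier tax number: 12345678-2-41\n"
    "Total: 1250.00 EUR"
)
print(extract_fields(text))
# {'tax_number': '12345678-2-41', 'date': '2024-03-15', 'amount': '1250.00 EUR'}
```

Fields with no match are simply absent from the result, which leaves room for the NER and LLM steps to fill them in.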
🔍 6. Elastic search
The normalised text is placed into an elastic search index. This provides classic text search: exact phrase, partial match, filters, and similarity search on the structured fields.
Input: Normalised text
Output: Indexed documents
🔧 Elastic index 🔧 Full-text search 📐 Similarity search
🧠 7. Semantic search
Documents are chunked and vectorised: each chunk becomes a 6144-dimensional vector. These vectors are stored in the Qdrant vector database, enabling meaning-based search that matches intent, not just words.
Input: Document chunks
Output: Vector representation
🔧 Vectorisation 🔧 LLM ⚙️ Qdrant 📐 Context awareness
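The chunking step might be sketched as below. Word-based chunks with a fixed overlap are an assumption (the page does not say how chunks are cut), and the embedding call and the Qdrant upload are deliberately left out.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks that share `overlap` words with the previous one."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text(sample, max_words=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of storing some words twice.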
8. Hybrid search (RAG preparation)
In the final step, elastic and semantic search results are combined. The best matches from both sources are merged and re-ranked: the first pass picks 30 hits, and the Top 10 are returned to the user.
Input: Elastic + semantic hits
Output: Top 10 ranked results
🔧 Result re-ranking ⚙️ Search microservice (Spark) 📐 Top 30 → Top 10
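The merge can be illustrated with reciprocal rank fusion (RRF), a common way to combine two ranked lists. RRF is swapped in here as an assumption: the page does not name the fusion or re-ranking method, so the sketch simply shortlists 30 fused hits and returns the first 10 in fused order.

```python
def hybrid_rank(
    elastic_hits: list[str],
    semantic_hits: list[str],
    candidates: int = 30,
    top_k: int = 10,
    k: int = 60,
) -> list[str]:
    """Fuse two ranked ID lists with RRF, shortlist `candidates`, return `top_k`."""
    scores: dict[str, float] = {}
    for hits in (elastic_hits, semantic_hits):
        for rank, doc_id in enumerate(hits, start=1):
            # Each list contributes 1 / (k + rank); k dampens the weight of top ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the order is deterministic.
    ranked = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    shortlist = ranked[:candidates]  # first pass: Top 30
    return shortlist[:top_k]         # returned to the user: Top 10


elastic = ["a", "b", "c"]
semantic = ["b", "d", "a"]
print(hybrid_rank(elastic, semantic))  # ['b', 'a', 'd', 'c']
```

Documents found by both searches ("a" and "b" above) rise to the top, because they collect a score contribution from each list.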
