LayoutLM / LayoutAI
The industry-standard multimodal transformer for layout-aware document intelligence and automated information extraction.

Excalibur is a web interface and extraction engine for pulling tables out of PDF documents, built on top of the Camelot library. By 2026 it is positioned as a bridge between unstructured document layouts and the structured data pipelines used in enterprise ETL (Extract, Transform, Load) processes. Unlike standard OCR tools that treat documents as flat images, Excalibur uses spatial analysis to detect cell boundaries through two methods: 'Lattice', for tables with visible borders, and 'Stream', for whitespace-delimited layouts. This dual-method design is credited with 99% accuracy in preserving table structure during conversion.

The architecture is decoupled, so it can be deployed locally where data privacy is paramount or as a cloud-native instance for high-throughput batch processing. Its 2026 positioning centers on human-in-the-loop (HITL) workflows, in which data scientists refine detection parameters through the UI before committing to large-scale automation. As LLMs evolve, Excalibur supplies the structured tabular data that RAG (Retrieval-Augmented Generation) systems need when they draw on legacy corporate documents.
Lattice mode uses OpenCV image processing to detect table lines, which suits cell-based tables with explicit borders.
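The border-based idea can be illustrated with a minimal sketch. It assumes the image-processing step has already produced the x positions of vertical ruling lines and the y positions of horizontal ones, and that the table is a full grid with no merged cells. The function names `merge_close` and `cells_from_lines` are hypothetical and are not part of Camelot or Excalibur.

```python
from typing import List, Tuple

Cell = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def merge_close(values: List[float], tol: float = 2.0) -> List[float]:
    """Collapse coordinates within `tol` of the last kept one into a single line."""
    merged: List[float] = []
    for v in sorted(values):
        if merged and v - merged[-1] <= tol:
            continue  # the same ruling line detected twice; keep the first
        merged.append(v)
    return merged


def cells_from_lines(xs: List[float], ys: List[float], tol: float = 2.0) -> List[List[Cell]]:
    """Build a row-major grid of cell boxes from ruling-line positions."""
    cols = merge_close(xs, tol)
    rows = merge_close(ys, tol)
    return [
        [(cols[c], rows[r], cols[c + 1], rows[r + 1]) for c in range(len(cols) - 1)]
        for r in range(len(rows) - 1)
    ]


# Four vertical detections (two of them the same line) and three horizontal lines
grid = cells_from_lines(xs=[10, 11, 110, 210], ys=[20, 50, 80])
print(len(grid), len(grid[0]))  # 2 2
print(grid[0][0])               # (10, 20, 110, 50)
```

A real border-based parser also has to handle spanning cells and partially drawn borders, which this sketch ignores.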
The open-source toolkit for deep learning-based document image analysis and structured data extraction.
Automate contract review and revenue recognition with Generative AI-driven document intelligence.
Deterministic Python-based data extraction from PDF and image invoices using template matching.
Stream mode analyzes whitespace and character grouping (text alignment) to reconstruct tables that have no visual lines.
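A rough sketch of the whitespace-based approach, under simplifying assumptions: each word arrives as `(x0, x1, y, text)`, rows are words sharing nearly the same `y`, and columns are the horizontal bands left after merging overlapping word extents. The helper names and the `row_tol` / `min_gap` thresholds are illustrative choices, not the tool's actual algorithm.

```python
from typing import List, Tuple

Word = Tuple[float, float, float, str]  # (x0, x1, y, text)


def group_rows(words: List[Word], row_tol: float = 2.0) -> List[List[Word]]:
    """Group words whose y coordinate is within `row_tol` of the row's first word."""
    rows: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w[2], w[0])):
        if rows and abs(w[2] - rows[-1][0][2]) <= row_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def column_spans(words: List[Word], min_gap: float = 5.0) -> List[Tuple[float, float]]:
    """Merge word extents; a horizontal gap of at least `min_gap` starts a new column."""
    spans: List[List[float]] = []
    for x0, x1, _, _ in sorted(words, key=lambda w: w[0]):
        if spans and x0 - spans[-1][1] < min_gap:
            spans[-1][1] = max(spans[-1][1], x1)
        else:
            spans.append([x0, x1])
    return [(a, b) for a, b in spans]


def to_table(words: List[Word], row_tol: float = 2.0, min_gap: float = 5.0) -> List[List[str]]:
    """Assign each word to the column span containing its horizontal midpoint."""
    spans = column_spans(words, min_gap)
    table: List[List[str]] = []
    for row in group_rows(words, row_tol):
        cells = [""] * len(spans)
        for x0, x1, _, text in sorted(row, key=lambda w: w[0]):
            mid = (x0 + x1) / 2
            idx = next(i for i, (a, b) in enumerate(spans) if a <= mid <= b)
            cells[idx] = (cells[idx] + " " + text).strip()
        table.append(cells)
    return table


words = [
    (10, 40, 100, "Date"), (80, 120, 100, "Amount"),
    (10, 60, 112, "2024-01-05"), (90, 120, 112, "42.00"),
    (10, 60, 124, "2024-01-09"), (95, 120, 124, "7.50"),
]
table = to_table(words)
print(table)
```

The thresholds matter in practice: too small a `min_gap` splits one column into several, while too large a value fuses neighbouring columns.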
A Matplotlib-powered overlay that shows exactly how the tool 'sees' the table structure during the extraction process.
Allows the saving of table coordinates and flavor parameters as JSON objects for reuse on identical document layouts.
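As an illustration of what such a reusable rule might look like, the sketch below round-trips a small rule object through JSON. The `ExtractionRule` class and its field layout are hypothetical and do not reproduce Excalibur's real schema. The `table_areas` strings follow Camelot's `"x1,y1,x2,y2"` convention, and `row_tol` is one of Camelot's Stream options.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ExtractionRule:
    """Hypothetical container for a saved extraction rule."""
    flavor: str                      # "lattice" or "stream"
    pages: str                       # page selection, e.g. "1" or "1-end"
    table_areas: List[str]           # one "x1,y1,x2,y2" string per table
    options: Dict[str, object] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ExtractionRule":
        data = json.loads(text)
        if data["flavor"] not in ("lattice", "stream"):
            raise ValueError(f"unknown flavor: {data['flavor']!r}")
        return cls(**data)


rule = ExtractionRule(
    flavor="stream",
    pages="1-end",
    table_areas=["50,700,560,120"],
    options={"row_tol": 10},
)
restored = ExtractionRule.from_json(rule.to_json())
print(restored == rule)  # True
```

Storing the flavor next to the coordinates keeps a rule self-describing, so the same file can be applied to every document that shares the layout.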
Seamless integration with Ghostscript and Tesseract to handle scanned images within PDFs.
Separates the parsing engine from the UI, allowing the core library to be used in headless server environments.
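For headless use, the core library can be called directly from a script. The sketch below assumes the `camelot` package is installed and uses its `read_pdf` function with the `flavor`, `pages`, and `table_areas` arguments. The wrapper functions and the file name `statement.pdf` are hypothetical.

```python
from typing import Any, Dict, List, Optional


def build_read_kwargs(
    flavor: str = "lattice",
    pages: str = "1",
    table_areas: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Validate and assemble keyword arguments for camelot.read_pdf."""
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"unknown flavor: {flavor!r}")
    kwargs: Dict[str, Any] = {"flavor": flavor, "pages": pages}
    if table_areas:
        kwargs["table_areas"] = table_areas
    return kwargs


def extract_tables(pdf_path: str, **read_kwargs: Any) -> list:
    """Run Camelot without any UI and return one pandas DataFrame per table."""
    import camelot  # imported lazily so this module loads without Camelot installed

    tables = camelot.read_pdf(pdf_path, **read_kwargs)
    return [table.df for table in tables]


kwargs = build_read_kwargs(flavor="stream", pages="1-3")
print(kwargs)
# frames = extract_tables("statement.pdf", **kwargs)  # hypothetical input file
```

Keeping argument validation separate from the Camelot call means the same settings can be checked in a UI, a batch job, or a test without touching a PDF.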
Provides bounding box coordinates for every extracted cell for use in training custom ML models.
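One way such coordinates might be prepared for model training is sketched below. It assumes boxes are given in PDF points with the origin at the bottom-left of the page and converts them to a top-left origin on the 0 to 1000 integer scale used by layout-aware models such as LayoutLM. The function name `normalize_box` and the sample values are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def normalize_box(box: Box, page_width: float, page_height: float, scale: int = 1000) -> Tuple[int, int, int, int]:
    """Convert a PDF-space box to (left, top, right, bottom) on a 0..scale grid."""
    x1, y1, x2, y2 = box
    left, right = sorted((x1, x2))
    bottom, top = sorted((y1, y2))  # in PDF space, larger y is higher on the page
    return (
        round(left / page_width * scale),
        round((page_height - top) / page_height * scale),
        round(right / page_width * scale),
        round((page_height - bottom) / page_height * scale),
    )


# A cell on a US Letter page (612 x 792 points)
normalized = normalize_box((61.2, 712.8, 306.0, 676.8), page_width=612, page_height=792)
print(normalized)  # (100, 100, 500, 145)
```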
Auditors need to extract data from thousands of bank statements that arrive in varying formats.
Registry Updated: 2/7/2026
Researchers extracting statistical results from hundreds of multi-column journal articles.
Extracting line items from shipping manifests that use complex, nested tables.