Digitized Yet Hidden: FAIRifying Archaeological Archives through Vision and Language Models
- Creators: Kateryna Lutsai, David Novák, Dana Křivánková, Pavel Straňák, Petr Pajdla, Ronald Harasim, and Olga Lečbychová
We present a compact, reusable suite of digital-research workflows designed to improve Findability, Accessibility, Interoperability, and Reusability (the FAIR principles) for large institutional archives of scanned documents and photographs created in the 20th and early 21st centuries. The workflows combine content-specific page classification, selective OCR/HTR processing, and vocabulary-driven NLP to improve (meta)data quality in a Czech digital repository of archaeological fieldwork and its related discovery services. We share these resources to help archivists and researchers target analysis where it is needed while minimizing irrelevant processing of pages.
Many heritage collections hold high-resolution scans of paper-born materials with little or inconsistent metadata. This legacy material (handwritten notes, machine-typed reports, photo cards, maps, drawings, and mixed-content pages) presents a dual problem: (1) it is valuable but hard to discover, and (2) modern off-the-shelf recognition tools often fail on century-old or repeatedly re-scanned pages that exhibit folds, stains, noise, and atypical layouts. To address these gaps, we developed workflows that first sort page images by high-level content type (table, photograph, drawing/map, plain text, mixed/handwritten cutouts, etc.). Knowing a page’s content allows selective application of specialized tools (OCR, HTR, object classification, or manual review), saving compute and improving downstream quality. A short, domain-adapted lexicon (field-specific Czech terms translated to English) supports vocabulary-driven labeling and later metadata expansion, so that extracted entities map to stable controlled values.
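As a minimal illustration of this content-aware routing (not the released code), the sketch below dispatches a page to a processing queue based on its predicted class; the class labels and queue names are hypothetical placeholders.

```python
# Hypothetical routing of classified pages to specialized tools.
# Class labels and queue names are illustrative placeholders; the released
# workflows use their own label set and tooling.
from pathlib import Path

ROUTES = {
    "plain_text": "ocr",           # machine-typed pages -> OCR (e.g. ABBYY FineReader)
    "handwritten": "htr",          # handwritten cutouts -> HTR or manual transcription
    "table": "layout_analysis",    # tables -> structure-aware processing
    "photograph": "image_only",    # photographs -> object classification, no OCR
    "drawing_map": "image_only",
    "mixed": "manual_review",
}

def route_page(page_path: Path, predicted_class: str) -> str:
    """Return the processing queue a page should join, given its predicted class."""
    return ROUTES.get(predicted_class, "manual_review")

if __name__ == "__main__":
    print(route_page(Path("scan_0001.jpg"), "plain_text"))  # -> "ocr"
```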
We assembled and annotated a dataset of almost 50,000 scanned pages mapped to 11 semantic content classes, hosted in a LINDAT repository; the accompanying training code, README, and prediction utilities are publicly available and prepackaged for easy use. For image-based page sorting, we fine-tuned a range of backbones, including transformers (DiT, ViT), convolutional networks (EfficientNetV2, RegNetY), and a hybrid image-text model (CLIP), to capture both layout and visual cues. Evaluation reports per-class accuracy and confusion matrices (predicted vs. true classes), and the trained checkpoints and evaluation scripts are published for reproducibility.
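A minimal fine-tuning sketch along these lines, using the Hugging Face transformers API, is shown below; the checkpoint name, directory layout, and hyperparameters are assumptions for illustration and do not reproduce the published training code.

```python
# Illustrative fine-tuning sketch for page-content classification.
# Assumes a folder-per-class image dataset under pages/<class>/<scan>.jpg.
import torch
from datasets import load_dataset
from transformers import (AutoImageProcessor, AutoModelForImageClassification,
                          Trainer, TrainingArguments)

checkpoint = "google/vit-base-patch16-224-in21k"   # or e.g. "microsoft/dit-base"
dataset = load_dataset("imagefolder", data_dir="pages/")
labels = dataset["train"].features["label"].names

processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(
    checkpoint, num_labels=len(labels), ignore_mismatched_sizes=True
)

def transform(batch):
    # Convert scans to RGB tensors expected by the backbone.
    inputs = processor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

dataset = dataset.with_transform(transform)

def collate(examples):
    return {
        "pixel_values": torch.stack([e["pixel_values"] for e in examples]),
        "labels": torch.tensor([e["labels"] for e in examples]),
    }

trainer = Trainer(
    model=model,
    args=TrainingArguments("page-classifier", per_device_train_batch_size=8,
                           num_train_epochs=3, remove_unused_columns=False),
    train_dataset=dataset["train"],
    data_collator=collate,
)
trainer.train()
```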
The content-specific classifiers provide a practical systematization of collections that substantially reduces unnecessary processing: pages routed to targeted tools require fewer retries and yield higher-quality outputs than a one-size-fits-all pipeline. We also found that contemporary Document Layout Analysis (DLA) frameworks such as DeepDoctection (which combines Google’s Tesseract for OCR with Facebook AI Research’s Detectron2 for structure recognition), whose models are typically trained on clean born-digital PDFs, often misidentify tables and miss text blocks on heavily degraded archival pages; this confirms the value of a front-end content classifier tuned on historical scans to gate further processing.
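For reference, such a comparison against a generic DLA stack can be scripted roughly as follows; this is a sketch assuming deepdoctection’s high-level analyzer API (get_dd_analyzer / analyze) and its Page attributes, with the input path as a placeholder, so details should be checked against the installed version.

```python
# Illustrative check of a generic DLA pipeline on a degraded archival scan.
# Assumes deepdoctection's built-in analyzer (Detectron2 layout models + Tesseract OCR);
# attribute names follow current deepdoctection docs and should be verified locally.
import deepdoctection as dd

analyzer = dd.get_dd_analyzer()               # default layout + OCR pipeline
df = analyzer.analyze(path="scan_0001.png")   # placeholder path to a page image
df.reset_state()

for page in df:
    print("detected tables:", len(page.tables))
    print(page.text[:500])                    # inspect how much text was recovered
```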
When pages are classified as text-containing, we run OCR/HTR selectively. In our experiments, we used ABBYY FineReader for automated OCR extraction (ALTO XML outputs) and measured language and quality characteristics with LanguageID and a causal-LM perplexity probe (distilgpt2) to separate reliably extracted text from noisy results. Noisy handwritten segments are flagged for more specialized HTR (e.g., fine-tuned Kraken/PERO-style workflows) or manual transcription. We also implemented lightweight shell and Python “glue” scripts to parse the XML outputs, extract plain text, and compute simple XML-element statistics across directories; a student-developed ALTO viewer that maps the XML onto the JPEG page image and allows editing is available for demonstrations.
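The quality gate can be sketched as follows: extract plain text from the ALTO XML, identify the language, and score fluency with a distilgpt2 perplexity probe. The ALTO namespace URI, the thresholds, and the use of the langid package for language identification are assumptions for illustration, not the released scripts.

```python
# Sketch of the OCR quality gate: ALTO XML -> plain text -> language ID + perplexity.
# Namespace, thresholds, and the langid package are illustrative assumptions.
import math
import xml.etree.ElementTree as ET

import langid
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v3#"  # adjust to the ALTO version in use

def alto_to_text(path: str) -> str:
    """Concatenate the CONTENT attributes of ALTO <String> elements into plain text."""
    root = ET.parse(path).getroot()
    words = [s.attrib.get("CONTENT", "") for s in root.iter(f"{{{ALTO_NS}}}String")]
    return " ".join(words)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()

def perplexity(text: str) -> float:
    """Causal-LM perplexity; higher values suggest noisier OCR output."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def keep_for_ner(path: str, expected_langs=("cs", "en"), max_ppl=200.0) -> bool:
    """Accept a page for downstream NER only if language and fluency checks pass."""
    text = alto_to_text(path)
    lang, _ = langid.classify(text)
    return lang in expected_langs and perplexity(text) <= max_ppl
```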
Selective OCR plus language-detection and perplexity thresholds reliably separate well-extracted text from low-quality OCR output, enabling automated downstream NER with the off-the-shelf NameTag3 on the cleaner subset. Experimental results with the Czech and multilingual NameTag3 models are promising for adapting language-specific NER tools and fine-tuning them on domain-created CoNLL-U-style datasets (UDPipe-derived parses with domain-specific entity annotations). This would enable both structured metadata extraction (geographic names, artifact types, numeric identifiers) and vocabulary-driven entity extraction.
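NER on the clean subset can then be scripted against the LINDAT NameTag web service; the sketch below assumes the public REST endpoint and its recognize method, leaves model selection to the service default, and should be checked against the current NameTag3 service documentation.

```python
# Sketch of sending filtered OCR text to the LINDAT NameTag REST service for NER.
# Endpoint URL and response layout follow the public NameTag web service; verify
# the current NameTag3 model names and parameters before relying on this.
import requests

NAMETAG_URL = "https://lindat.mff.cuni.cz/services/nametag/api/recognize"

def recognize_entities(text: str) -> str:
    """Return the service's annotated output for a chunk of OCR text."""
    response = requests.post(NAMETAG_URL, data={"data": text})
    response.raise_for_status()
    return response.json()["result"]

if __name__ == "__main__":
    # "The excavation took place in Brno in 1958."
    print(recognize_entities("Výzkum probíhal v Brně roku 1958."))
```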
Our workflows strike a pragmatic balance between automation and targeted human intervention. By combining content-aware classification, automated text recognition, controlled-vocabulary extraction, and simple ALTO XML tooling, we increase the discoverability of legacy archaeological documentation while controlling processing costs and error propagation. The released dataset, models, and console tools are intended as baseline solutions that others in digital humanities and cultural heritage can adapt: they make it straightforward to plug in alternative OCR/HTR engines, swap backbone models, or extend the domain lexicon. Future work will expand HTR adaptation to handwritten Czech variants, introduce object-level photo classification with domain labels, and integrate the pipeline into repository ingestion workflows so that both legacy holdings and new uploads benefit from automated, FAIR-aware metadata enrichment.