AI Lund Lunch seminar: Reading Älvsborg’s Ransom: How to turn 16th-century handwritten tax records into structured economic information
We present a system for extracting tabular information from loosely structured handwritten documents. The system consists of three parts:
- a U-Net-like, CNN-based method for text detection and segmentation,
- an attention-based method for simultaneous text recognition and classification of word parts, and
- a method for matching the word parts into a tabular structure for each entry.
A key contribution is the observation that the attention-based recognition and classification module enables improved spatial analysis of the tabular information. The method is evaluated on a unique historical document: the Swedish Wealth Tax of 1571, consisting of 11,453 pages of handwritten tax records. The evaluation shows that the system provides a significant improvement over the state of the art in tabular extraction from loosely structured historical documents.
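
To give a feel for the third part of the system, the matching of word parts into a tabular structure, here is a minimal sketch. It is not the method presented in the seminar. It assumes that each recognized word part comes with a class label and a position on the page, and it simply groups parts that lie on roughly the same line into one entry. The `WordPart` type, the labels `name` and `amount`, the tolerance `y_tol`, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WordPart:
    text: str   # recognized text of the word part
    label: str  # predicted class (hypothetical labels, e.g. "name", "amount")
    x: float    # horizontal centre on the page
    y: float    # vertical centre on the page


def group_into_rows(parts, y_tol=10.0):
    """Greedily group word parts whose vertical centre lies within
    y_tol of the first part of the current row."""
    rows = []
    for part in sorted(parts, key=lambda p: p.y):
        if rows and abs(part.y - rows[-1][0].y) <= y_tol:
            rows[-1].append(part)
        else:
            rows.append([part])
    return rows


def rows_to_table(rows):
    """Turn each row into a dict mapping class label to its text,
    joining parts with the same label in left-to-right order."""
    table = []
    for row in rows:
        record = {}
        for part in sorted(row, key=lambda p: p.x):
            record.setdefault(part.label, []).append(part.text)
        table.append({label: " ".join(texts) for label, texts in record.items()})
    return table


# Made-up sample input, not taken from the tax records.
parts = [
    WordPart("Oluf", "name", 40, 100),
    WordPart("3", "amount", 300, 103),
    WordPart("Per", "name", 42, 140),
    WordPart("Jonsson", "name", 90, 141),
    WordPart("5", "amount", 305, 138),
]
table = rows_to_table(group_into_rows(parts))
```

In this toy setting the class labels decide which column a word part belongs to, while the positions decide which entry it belongs to. Loosely structured pages would need a more robust spatial analysis than a fixed vertical tolerance.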