Abstract
This study presents a keyword search system that indexes and retrieves historical Ottoman documents by matching the visual shapes of words. The system lets a user search for any keyword across thousands of documents fully automatically. First, the document collection is preprocessed with a binarization method, and small noise is cleaned by removing connected components smaller than a predefined threshold. The pages are then segmented into lines by a run-length smoothing algorithm. Words are then manually extracted and represented by patch-based and column-based features. The similarity between two words is computed as the Euclidean distance between their feature vectors, and words are matched by thresholding this distance. An indexing and retrieval scheme is provided for all words in the collection, so that a user can search for keywords as in a search engine and retrieve all documents related to a keyword. Our experiments on an Ottoman collection show promising results for both intra- and cross-document word retrieval.
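The steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a binarized image stored as a list of rows with 1 for ink and 0 for background, and it assumes word images have already been scaled to a common width. The function names, the 4-connectivity choice, and the per-column ink count (a simple stand-in for the paper's column-based features) are all hypothetical.

```python
from collections import deque
from math import dist


def remove_small_components(img, min_size):
    """Erase 4-connected ink components with fewer than min_size pixels."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            comp, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                comp.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(comp) < min_size:
                for cy, cx in comp:
                    out[cy][cx] = 0
    return out


def rlsa_horizontal(img, max_gap):
    """Horizontal run-length smoothing: fill background runs of at most
    max_gap pixels that lie between two ink pixels on the same row."""
    out = [row[:] for row in img]
    for row in out:
        last_ink = None
        for x, value in enumerate(row):
            if value == 1:
                if last_ink is not None and 0 < x - last_ink - 1 <= max_gap:
                    for k in range(last_ink + 1, x):
                        row[k] = 1
                last_ink = x
    return out


def column_profile(word):
    """Hypothetical column-based feature: ink pixel count per column."""
    return [sum(col) for col in zip(*word)]


def match_words(query, candidates, threshold):
    """Return names of candidate words whose feature vector lies within
    `threshold` Euclidean distance of the query's feature vector."""
    q = column_profile(query)
    return [name for name, img in candidates.items()
            if dist(q, column_profile(img)) <= threshold]
```

In this sketch, a smaller `threshold` returns fewer, visually closer matches; the value and the `min_size` and `max_gap` parameters would need tuning on the actual collection.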
| Original language | English |
|---|---|
| Title of host publication | Digital Historical Research on Southeast Europe and the Ottoman Space |
| Publisher | Peter Lang AG |
| Pages | 211-223 |
| Number of pages | 13 |
| ISBN (Electronic) | 9783631826157 |
| ISBN (Print) | 9783631825112 |
| Publication status | Published - 14 Dec 2020 |
| Externally published | Yes |
| Title | A keyword search engine for historical Ottoman documents |