A keyword search engine for historical Ottoman documents

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

Abstract

This study presents a keyword search system for indexing and retrieving historical Ottoman documents by matching the visual shapes of words. With this system, a user can search for any keyword across thousands of documents in a fully automatic manner. First, each document in a collection is preprocessed with a binarization method, and small noise is cleaned by removing connected components smaller than a predefined threshold. The pages are then segmented into lines by a run-length smoothing algorithm. Words are then manually extracted and represented by patch-based and column-based features. The similarity between words is calculated as the Euclidean distance between their feature vectors, and words are matched based on a threshold on this similarity. An indexing and retrieval scheme is provided for all words in the collection, so that a user can search for keywords as in a search engine and retrieve all documents related to a keyword. Our experiments on an Ottoman collection show promising results for both intra- and cross-document word retrieval schemes.
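The steps named in the abstract can be illustrated with a minimal sketch in Python. It is not the chapter's implementation. It assumes a page is already binarized into a list of rows of 0/1 values, uses 4-connectivity for connected components, applies run-length smoothing in the horizontal direction only, and treats feature vectors as plain tuples of numbers. The patch-based and column-based features themselves are not reproduced here. All function names, parameters, and threshold values are hypothetical.

```python
from collections import deque
from math import dist


def remove_small_components(img, min_size):
    """Erase 4-connected foreground components smaller than min_size pixels.

    img is a list of rows of 0/1 values (assumed already binarized).
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill collects one connected component.
            comp, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and img[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(comp) < min_size:
                for y, x in comp:
                    out[y][x] = 0
    return out


def rlsa_horizontal(img, max_gap):
    """Horizontal run-length smoothing.

    Fills every background run of at most max_gap pixels that lies
    between two foreground pixels in the same row.
    """
    out = [row[:] for row in img]
    for row in out:
        last = None  # column of the most recent foreground pixel
        for x in range(len(row)):
            if row[x] == 1:
                if last is not None and 0 < x - last - 1 <= max_gap:
                    for k in range(last + 1, x):
                        row[k] = 1
                last = x
    return out


def match_words(query, index, threshold):
    """Return (word_id, distance) pairs within threshold, nearest first.

    index maps a word id to its feature vector; distance is Euclidean.
    """
    hits = [(word_id, dist(query, vec)) for word_id, vec in index.items()]
    return sorted((h for h in hits if h[1] <= threshold), key=lambda h: h[1])
```

For example, `match_words((0.0, 0.0), {"a": (3.0, 4.0), "c": (10.0, 0.0)}, 5.0)` keeps only word `"a"`, whose distance to the query is 5.0. In practice the component-size, gap, and distance thresholds would have to be tuned to the collection.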

Original language: English
Title of host publication: Digital Historical Research on Southeast Europe and the Ottoman Space
Publisher: Peter Lang AG
Pages: 211-223
Number of pages: 13
ISBN (Electronic): 9783631826157
ISBN (Print): 9783631825112
Publication status: Published - 14 Dec 2020
Externally published: Yes
