CAP Data Technologies teamed up with researchers from the University of Jyväskylä to process medieval magic texts automatically. The goal of the project was to identify similarities between documents, specifically alchemical instructions from old books written in Latin.
The texts were prepared by Lauri Ockenström, a researcher at the Department of Music, Art and Culture Studies. His research interests include the imagery of magic and astrology, Renaissance magic, the hermetic tradition and the history of occultism.
Before analysis, the orthography was standardized and words were reduced to their lemma forms. Some consideration was also given to how a sentence should be defined and how to handle almost identical copies of the same document. Latin is a language like any other, and good tools exist both for text analysis in general, such as scikit-learn, and for Latin specifically, such as CLTK.
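As a rough illustration of this kind of preprocessing, the sketch below normalizes spelling and splits text into sentences using only the Python standard library. The specific rules (lowercasing, expanding ligatures, dropping diacritics, merging j/i and v/u) and the choice of sentence delimiters are assumptions based on common conventions for Latin, not the rules actually used in the project, and the sample text is invented. Lemmatization is left out because it needs a dedicated Latin lemmatizer such as the one CLTK provides.

```python
import re
import unicodedata


def normalize_orthography(text):
    """Lowercase Latin text and collapse common spelling variants (assumed rules)."""
    text = text.lower()
    # Expand ligatures before stripping diacritics.
    text = text.replace("æ", "ae").replace("œ", "oe")
    # Decompose accented letters and drop the combining marks (e.g. macrons).
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Treat j/i and v/u as the same letter, a common convention for Latin.
    return text.replace("j", "i").replace("v", "u")


def split_sentences(text):
    """Split on . ; : ? ! which is one possible definition of a sentence."""
    parts = re.split(r"[.;:?!]+", text)
    # Collapse internal whitespace and discard empty fragments.
    return [" ".join(part.split()) for part in parts if part.strip()]


# Invented sample text, for illustration only.
sample = "Accipe Mercurium vivum. Et pone in vase; deinde coque."
sentences = split_sentences(normalize_orthography(sample))
print(sentences)
# ['accipe mercurium uiuum', 'et pone in uase', 'deinde coque']
```

Treating the semicolon as a sentence boundary makes passages shorter and matches stricter; keeping it inside a sentence would do the opposite, which is why the definition of a sentence matters for the later comparison.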
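For the analysis, we took two approaches, both based on term frequency–inverse document frequency (TF-IDF), which weights a word more heavily the more often it occurs in a text and the rarer it is in the rest of the corpus. The first approach compared whole documents with each other. The second compared individual sentences across the whole corpus. Below is an image of the first approach, showing the distances between the documents.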
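The document-level comparison can be sketched in a few lines of plain Python. This is a minimal sketch, not the project's code: it assumes whitespace tokenization of already lemmatized text, a smoothed IDF, L2-normalized vectors and cosine distance, and the three short documents are invented. In practice a library implementation such as scikit-learn's TfidfVectorizer would normally replace the hand-written function.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one L2-normalized {term: weight} dict per input text."""
    tokenized = [text.split() for text in texts]
    n_texts = len(tokenized)
    # Document frequency: the number of texts in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, so terms found in every text keep a small weight.
        vec = {t: c * (math.log((1 + n_texts) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


# Invented, already lemmatized documents (hypothetical names and content).
docs = {
    "copy_a": "accipio mercurius et pono in uas",
    "copy_b": "accipio mercurius et pono in uas",
    "other": "luna et sol sum in caelum",
}
names = list(docs)
vecs = tfidf_vectors([docs[name] for name in names])

# Print the pairwise cosine distance (1 - similarity) between documents.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        distance = 1.0 - cosine(vecs[i], vecs[j])
        print(names[i], names[j], round(distance, 3))
```

Identical copies end up at a distance of approximately zero, while documents that share only common function words stay far apart.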
The second approach gave more insight into which files contain passages copied from another document. The screenshot below shows that ex scientia Abel is indeed similar to other copies of the same document. The level_0 column shows the passage in the current file, the level_1 column gives the passage found in another document, and the last column shows the similarity score, which in this case is the highest possible.
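A sentence-level search of this kind might look like the sketch below, which builds a table of (passage, matching passage, similarity) rows comparable to the three columns described above. It is a sketch under assumptions: the corpus, the document names and the TF-IDF details (smoothed IDF, cosine similarity, weights fitted on all sentences at once) are illustrative and not taken from the project.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one L2-normalized {term: weight} dict per input text."""
    tokenized = [text.split() for text in texts]
    n_texts = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: c * (math.log((1 + n_texts) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


# Hypothetical corpus: document name -> list of lemmatized sentences.
corpus = {
    "abel_1": ["accipio mercurius et pono in uas", "deinde coquo per dies septem"],
    "abel_2": ["accipio mercurius et pono in uas", "luna sum in caelum"],
}

# Flatten to (document, sentence) pairs so the weights reflect the whole corpus.
passages = [(doc, sent) for doc, sents in corpus.items() for sent in sents]
vecs = tfidf_vectors([sent for _, sent in passages])

# Compare every sentence with every sentence from a different document.
rows = []
for i, (doc_i, sent_i) in enumerate(passages):
    for j in range(i + 1, len(passages)):
        doc_j, sent_j = passages[j]
        if doc_i != doc_j:
            score = sum(w * vecs[j].get(t, 0.0) for t, w in vecs[i].items())
            rows.append((sent_i, sent_j, score))

# The most similar passage pairs come first.
rows.sort(key=lambda row: row[2], reverse=True)
for passage, match, score in rows[:3]:
    print(f"{passage!r} | {match!r} | {score:.3f}")
```

A score of approximately 1.0 means the two passages contain the same words with the same frequencies, which is what a verbatim copy produces.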
Modern text mining methods can be useful when trying to find relationships between documents in any language and across multiple domains. This research effort focused on alchemical texts, but many other domains could benefit from natural language processing, such as customer behavior analytics or medical data mining.