CITADEL [1]
The accompanying code is the backend of a toponym disambiguation tool developed for and used in the article "Historical GIS and Guidebooks: A Scalable Reading of Czechoslovak Tourist Attractions":
@article{pedersen2023historical,
title={Historical GIS and Guidebooks: A Scalable Reading of Czechoslovak Tourist Attractions},
author={Pedersen, Sune Bechmann and Johansson, Mathias},
journal={Digital Humanities Quarterly},
volume={17},
number={2},
year={2023},
publisher={Alliance of Digital Humanities Organisations}
}
The application has been assembled from a range of open-source tools, so anyone interested can clone this repository and adapt it to their own needs. If you do, we kindly ask that you cite the above-mentioned article and this repository.
In tandem with this backend we have also built a graphical user interface with the AnvilWorks application framework.
At the core of the application is a small relational database, which we fill with known positions from the countries we are interested in and all the toponyms that we can find in the relevant languages. This gives the application something to compare newly added, unknown toponyms against, in the hope of finding their correct location.
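To make the idea of positions linked to toponyms concrete, here is a minimal sketch of such a database. The use of sqlite3, the table names, and the column names are all assumptions made for illustration; the repository's actual schema and database engine may differ.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical position/toponym tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE position (
            id INTEGER PRIMARY KEY,
            latitude REAL NOT NULL,
            longitude REAL NOT NULL,
            country TEXT NOT NULL
        );
        CREATE TABLE toponym (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            language TEXT,
            position_id INTEGER REFERENCES position(id)  -- NULL while unknown
        );
        """
    )
    # One known position with two names, plus one unknown (unlinked) toponym.
    conn.execute("INSERT INTO position VALUES (1, 50.0755, 14.4378, 'CZ')")
    conn.executemany(
        "INSERT INTO toponym (name, language, position_id) VALUES (?, ?, ?)",
        [("Praha", "cs", 1), ("Prague", "en", 1), ("Prag", None, None)],
    )
    conn.commit()
    return conn


def known_toponyms(conn):
    """Return the names that are already linked to a position, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM toponym WHERE position_id IS NOT NULL ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

In this sketch an unknown toponym is simply a row whose position_id is still empty, which is what the matching step would later try to fill in.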
After new toponyms are added to the database, each one is compared against all toponyms that have a link to a position, in a series of sequentially looser string comparisons. If the comparison finds a single matching position, this position is recorded. If there are multiple options, all of these suggestions are recorded and will need to be manually disambiguated using the GUI.
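The matching idea can be sketched as follows. This is not the repository's implementation: the four comparison levels, the 0.85 similarity threshold, the function names, and the choice to stop at the strictest level that yields any match are all assumptions made for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_accents(text):
    """Remove combining diacritics, e.g. 'Březno' becomes 'Brezno'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def _normalise(text):
    return strip_accents(text).casefold()


def _fuzzy(a, b, threshold=0.85):
    """Loosest level: similarity ratio on normalised strings (threshold is a guess)."""
    return SequenceMatcher(None, _normalise(a), _normalise(b)).ratio() >= threshold


# Ordered from strictest to loosest.
COMPARISONS = [
    lambda a, b: a == b,
    lambda a, b: a.casefold() == b.casefold(),
    lambda a, b: _normalise(a) == _normalise(b),
    _fuzzy,
]


def match_positions(unknown, known):
    """Return candidate position ids for an unknown toponym.

    known is a list of (name, position_id) pairs. The result comes from the
    strictest comparison that matches anything: one id means an unambiguous
    match, several ids mean manual disambiguation is needed.
    """
    for same in COMPARISONS:
        hits = {position_id for name, position_id in known if same(unknown, name)}
        if hits:
            return sorted(hits)
    return []
```

With known = [("Brno", 2), ("Březno", 3), ("Březno", 4)], the call match_positions("brno", known) resolves to a single position, while match_positions("Brezno", known) returns two candidates that would be left for the GUI.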
The key part of the application lies in the Python code, which is run either locally on a machine or on a server somewhere. This backend has been written for and tested in Python 3.9.5, and should therefore work with Python 3.9.5 and later versions.
The graphical user interface is created using AnvilWorks, and all you need in order to run this application is a free-tier account.
- Install the requirements from the requirements.txt file.
- Run settings.py to select which countries and languages should be used to seed the database. For a complete set of instructions on how the script works, run settings.py -h.
- Run seed.py to fill the database with positions, toponyms, and further alternative names from all the selected countries and in all the selected languages.
  - Retrieving data from WikiData can take a long time. If the script gets interrupted during this stage of the process, run operations.py to finish retrieving data from WikiData.
- Open https://anvil.works/build#page:apps and select [import from file]. Take note of your server token and use it to replace the placeholder in your settings.yaml file. This token is personal and should not be shared with anyone, because it lets the web interface interact with your local Python installation.
To start the server, all you need to do is run the toponym_main.py script. It will then be ready to take instructions from the web application and handle all connections to the database.
todo: add text
There are two types of exports supported in the GUI: a .tsv file with the toponyms and their associated positions by source or year, and a clustered export.
For these exports you select a source (year) to include, with the option of including another source (year) and excluding another source (year). The results are presented in a simple text area, from which they can be copied and pasted into a .tsv file of your choice.
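As an illustration of the include/exclude selection, the sketch below builds tab-separated text from a list of records. The record layout, the column names, the function name, and the reading of "exclude" (drop every toponym that also occurs in the excluded source) are assumptions, not the GUI's documented behaviour.

```python
import csv
import io


def export_tsv(records, include, also_include=None, exclude=None):
    """Return TSV text for toponyms from the selected source(s).

    records is a list of dicts with the hypothetical keys
    'toponym', 'latitude', 'longitude' and 'source'.
    """
    wanted = {include}
    if also_include is not None:
        wanted.add(also_include)

    # Toponyms that occur in the excluded source are left out entirely.
    excluded_names = set()
    if exclude is not None:
        excluded_names = {r["toponym"] for r in records if r["source"] == exclude}

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["toponym", "latitude", "longitude", "source"])
    for r in records:
        if r["source"] in wanted and r["toponym"] not in excluded_names:
            writer.writerow([r["toponym"], r["latitude"], r["longitude"], r["source"]])
    return buffer.getvalue()
```

The returned string corresponds to what would be shown in the text area and can be saved directly as a .tsv file.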
The second export function gives an output similar to the primary export, with one important difference: it first clusters points that are within a set radius of each other. Clustering may be necessary for two reasons. Either many toponyms required a lot of unique coordinates to be recorded, splintering the toponyms unnecessarily, or there are too many toponyms in a relatively small area, which is difficult to plot properly on a national-level map.
For simplicity's sake, the distance between two points is calculated as the Cartesian distance between their coordinates (longitude and latitude). Any two used positions (positions linked to an added toponym) within the set radius of each other are put in the same cluster. This simple clustering approach means that we have to be careful when selecting the radius: in the unlikely event that there is a long line of points, each within a short distance of the next, the whole line would become a single cluster. Each cluster's final coordinates are calculated as the unweighted arithmetic mean of the latitudes and longitudes of all points in the cluster.
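The clustering described above can be sketched as follows. This is an illustrative reading of the text, not the repository's code: the function name, the (latitude, longitude) tuple layout, the union-find bookkeeping, and treating the radius as inclusive are assumptions. The radius is expressed in degrees, since the distance is taken directly between coordinates.

```python
from math import hypot


def cluster_points(points, radius):
    """Group (latitude, longitude) points that are chained within radius.

    Returns a list of (mean_latitude, mean_longitude, members) tuples.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair of points within the radius.
    for i, (lat_i, lon_i) in enumerate(points):
        for j in range(i + 1, len(points)):
            lat_j, lon_j = points[j]
            if hypot(lat_i - lat_j, lon_i - lon_j) <= radius:
                parent[find(i)] = find(j)

    groups = {}
    for i, point in enumerate(points):
        groups.setdefault(find(i), []).append(point)

    clusters = []
    for members in groups.values():
        mean_lat = sum(lat for lat, _ in members) / len(members)
        mean_lon = sum(lon for _, lon in members) / len(members)
        clusters.append((mean_lat, mean_lon, members))
    return clusters
```

Calling cluster_points([(50.0, 14.0), (50.0, 14.1), (50.0, 14.2), (49.0, 16.0)], 0.15) shows the chaining effect mentioned above: the first and third points are 0.2 degrees apart, yet they end up in the same cluster because both are within the radius of the second point.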
[] - Make the Browsing facility more user friendly
[] - Add user authentication and logging
[] - Add attractions
Footnotes
1. plaCename dIsambiguaTion AnD gEocoding appLication