A script for harvesting metadata from Wikimedia Commons for use in Europeana
Given a (set of) categories on Commons, along with templates and matching patterns for external links, specified in a json file (see the examples in the projects folder), the script queries the Commons API for metadata about the images and then inspects the templates used and the external links on each file page. The resulting information is output to an xml file per the Europeana specifications.
Additionally, the data is output (along with a few unused fields) as a csv to allow for easier analysis/post-processing, together with an analysis of the categories used and a logfile detailing potential problems in the data.
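As an illustration of the kind of Commons API call involved (a sketch using the requests library rather than the script's own WikiApi wrapper; the category name is a placeholder), the file pages in a category can be listed through the MediaWiki API:

```python
# Illustration only: list the file pages in a Commons category via the
# MediaWiki API. 'Category:Example' is a placeholder, not a real project.
import requests

resp = requests.get(
    'https://commons.wikimedia.org/w/api.php',
    params={
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': 'Category:Example',
        'cmtype': 'file',    # restrict to file pages
        'cmlimit': 'max',
        'format': 'json',
    },
)
for member in resp.json()['query']['categorymembers']:
    print(member['title'])
```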
For convenient/frequent use, put your Wikimedia Commons username/password into config.py as the variables user/password (as unicode strings). If config.py is not present, getpass is imported and used to prompt for the username and password.
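A minimal sketch of that fallback (the variable names user/password follow the text; the prompt strings are assumptions):

```python
# Sketch of the credential fallback described above; the prompt wording
# is illustrative, not the script's exact text.
try:
    import config
    user = config.user
    password = config.password
except ImportError:
    # config.py is absent, so prompt interactively instead
    import getpass
    user = getpass.getpass('Wikimedia Commons username: ')
    password = getpass.getpass('Wikimedia Commons password: ')
```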
Usage: python Europeana.py filename option
where:
    filename (required): the (unicode) string relative pathname to the json file for the project
    option (optional): if set to:
        verbose: toggles on verbose mode with additional output to the terminal
        test: toggles on testing (a verbose and limited run)
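A rough sketch of how that command line could be interpreted (Europeana.py's actual argument handling may differ):

```python
# Illustrative handling of the usage described above.
import sys

def parse_args(argv):
    if len(argv) < 2:
        sys.exit('Usage: python Europeana.py filename [verbose|test]')
    filename = argv[1]
    option = argv[2] if len(argv) > 2 else None
    verbose = option in ('verbose', 'test')  # test also implies verbose
    testing = option == 'test'               # a verbose and limited run
    return filename, verbose, testing

if __name__ == '__main__':
    filename, verbose, testing = parse_args(sys.argv)
```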
Requires WikiApi.
WikiApi is based on PyCJWiki Version 1.31, (C) Smallman12q, released under the GPL (see http://www.gnu.org/licenses/).