Wikidata <> authors integration: first steps #8236

Closed
64 commits
c8413cf
wikidata proof of concept
RayBB Aug 28, 2023
a2c0f63
first pass with postgres
RayBB Sep 1, 2023
8039cb1
cleanup naming and add comments
RayBB Sep 1, 2023
cd4acbc
add ttl check
RayBB Sep 1, 2023
dd61d6c
move wikidata to template and setup caching for templates
RayBB Sep 1, 2023
ae14380
prettier infobox, user language
RayBB Sep 1, 2023
7bd2c96
simplify renter template
RayBB Sep 1, 2023
23cb83b
200 check
RayBB Sep 1, 2023
b826f6a
fix inserting
RayBB Sep 1, 2023
f6df97f
fix bug with inserting vars
RayBB Sep 1, 2023
8438f71
use Optional[]
RayBB Sep 1, 2023
58842ec
move svg to file
RayBB Sep 1, 2023
b529402
note about QIDs
RayBB Sep 1, 2023
ff83ce4
comment to docstring
RayBB Sep 1, 2023
3e7fe15
remove optionals
RayBB Sep 1, 2023
d775c72
use the read-options css
RayBB Sep 1, 2023
2523f5b
add __init__.py
RayBB Sep 1, 2023
d5ad3c3
address some small feedback
RayBB Sep 24, 2023
bb7359a
Update openlibrary/templates/wikidata_author.html
RayBB Sep 24, 2023
768fd24
Update openlibrary/templates/wikidata_author.html
RayBB Sep 24, 2023
bfb8758
move css to less file
RayBB Sep 24, 2023
2da53d9
add less file
RayBB Sep 24, 2023
4172212
move css to less
RayBB Sep 24, 2023
17dd9d5
simplify cache
RayBB Sep 24, 2023
31280cf
first steps to refactor python
RayBB Sep 25, 2023
a646b4b
get rid of WikidataEntities
RayBB Sep 25, 2023
ddc5e9f
remove wikidatarow
RayBB Sep 25, 2023
e4f2556
cache typehints
RayBB Sep 26, 2023
5fb0af6
fix from_db_query
RayBB Sep 26, 2023
c50a3b5
fix datetime
RayBB Sep 26, 2023
32c1917
fix dict
RayBB Sep 26, 2023
e24ffcf
use []
RayBB Sep 26, 2023
cffc7d6
remove extra blank line
RayBB Sep 26, 2023
aee7edd
ttl -> use_cache
RayBB Oct 24, 2023
739f374
rename to APIResponse
RayBB Oct 24, 2023
45cc3fd
fix capitalization
RayBB Oct 24, 2023
a846401
merge wikidata classes
RayBB Oct 26, 2023
ceba66d
simplify with one from_dict method
RayBB Oct 26, 2023
19d179f
improve when we call cache
RayBB Oct 26, 2023
fb0326c
move endpoint to const
RayBB Oct 26, 2023
d970dc5
remove unused import
RayBB Oct 26, 2023
299dce4
only use datetime.now once
RayBB Oct 26, 2023
c9cf97f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 20, 2024
3ca35bd
https link
RayBB Mar 25, 2024
3f50100
Update openlibrary/core/models.py
RayBB Mar 25, 2024
f29317a
default to using cache
RayBB Apr 14, 2024
9521310
remove visual changes
RayBB Apr 14, 2024
d903c81
better comment
RayBB Apr 15, 2024
bab60d7
add fetch_missing
RayBB Apr 15, 2024
fdfbbe6
simplify html
RayBB Apr 15, 2024
d9f50c9
text align center p tags
RayBB Apr 15, 2024
03e6d28
lowercase qid
RayBB Apr 15, 2024
3a543bb
add typehints
RayBB Apr 15, 2024
5679a67
_updated
RayBB Apr 15, 2024
3166950
get_description
RayBB Apr 15, 2024
195b040
delete empty code.py
RayBB Apr 15, 2024
1683f48
**response
RayBB Apr 16, 2024
78a6cf2
handle no wikidata case
RayBB Apr 16, 2024
751899e
add error logging
RayBB Apr 16, 2024
e34c6a5
simplify if
RayBB Apr 16, 2024
be86563
typo
RayBB Apr 16, 2024
3fd3620
to_wikidata_api_json_format
RayBB Apr 16, 2024
2662f2b
lower wikidata section for testing
RayBB Apr 16, 2024
37f34d3
restore extra line
RayBB Apr 20, 2024
10 changes: 10 additions & 0 deletions openlibrary/core/models.py
@@ -28,6 +28,7 @@
from openlibrary.core.ratings import Ratings
from openlibrary.utils import extract_numeric_id_from_olid, dateutil
from openlibrary.utils.isbn import to_isbn_13, isbn_13_to_isbn_10, canonical
from openlibrary.core.wikidata import WikidataEntity, get_wikidata_entity

from . import cache, waitinglist

@@ -756,6 +757,15 @@ def url(self, suffix="", **params):
    def get_url_suffix(self):
        return self.name or "unnamed"

    def wikidata(
        self, bust_cache: bool = False, fetch_missing: bool = False
    ) -> WikidataEntity | None:
        if wd_id := self.remote_ids.get("wikidata"):
            return get_wikidata_entity(
                qid=wd_id, bust_cache=bust_cache, fetch_missing=fetch_missing
            )
        return None

    def __repr__(self):
        return "<Author: %s>" % repr(self.key)

6 changes: 6 additions & 0 deletions openlibrary/core/schema.sql
@@ -90,3 +90,9 @@ CREATE TABLE yearly_reading_goals (
    updated timestamp without time zone default (current_timestamp at time zone 'utc'),
    primary key (username, year)
);

CREATE TABLE wikidata (
    id text not null primary key,
    data json,
    updated timestamp without time zone default (current_timestamp at time zone 'utc')
);
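
The `_add_to_cache` helper later in this PR does a select followed by an update or an insert, and its TODO mentions switching to an upsert. As a rough sketch of what a single-statement write against this table could look like, the example below uses Python's built-in `sqlite3` as a stand-in for Postgres (an assumption made only so it runs anywhere; SQLite 3.24+ and Postgres 9.5+ both accept `INSERT ... ON CONFLICT ... DO UPDATE`). The `upsert_entity` name and the `TEXT` column types are hypothetical and not part of the PR.

```python
import json
import sqlite3

# Stand-in for the Postgres `wikidata` table; SQLite is used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wikidata ("
    " id TEXT NOT NULL PRIMARY KEY,"
    " data TEXT,"
    " updated TEXT DEFAULT CURRENT_TIMESTAMP)"
)


def upsert_entity(qid: str, payload: dict) -> None:
    """Insert a row, or overwrite data and updated if the id already exists."""
    conn.execute(
        "INSERT INTO wikidata (id, data) VALUES (?, ?) "
        "ON CONFLICT (id) DO UPDATE SET "
        "data = excluded.data, updated = CURRENT_TIMESTAMP",
        (qid, json.dumps(payload)),
    )


# Writing the same QID twice leaves a single row holding the latest payload.
upsert_entity("Q42", {"id": "Q42", "descriptions": {"en": "first"}})
upsert_entity("Q42", {"id": "Q42", "descriptions": {"en": "second"}})
rows = conn.execute("SELECT id, data FROM wikidata").fetchall()
```

Whether the project's database wrapper can pass such a statement through unchanged is not something this sketch verifies.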
145 changes: 145 additions & 0 deletions openlibrary/core/wikidata.py
@@ -0,0 +1,145 @@
"""
The purpose of this file is to:
1. Interact with the Wikidata API
2. Store the results
3. Make the results easy to access from other files
"""

import requests
import logging
from dataclasses import dataclass
from openlibrary.core.helpers import days_since

from datetime import datetime
import json
from openlibrary.core import db

logger = logging.getLogger("core.wikidata")

WIKIDATA_API_URL = 'https://www.wikidata.org/w/rest.php/wikibase/v0/entities/items/'
WIKIDATA_CACHE_TTL_DAYS = 30


@dataclass
class WikidataEntity:
    """
    This is the model of the API response from Wikidata plus the _updated field
    https://www.wikidata.org/wiki/Wikidata:REST_API
    """

    id: str
    type: str
    labels: dict[str, str]
    descriptions: dict[str, str]
    aliases: dict[str, list[str]]
    statements: dict[str, dict]
    sitelinks: dict[str, dict]
    _updated: datetime  # This is when we fetched the data, not when the entity was changed in Wikidata

    def get_description(self, language: str = 'en') -> str | None:
        """If a description isn't available in the requested language default to English"""
        return self.descriptions.get(language) or self.descriptions.get('en')

    @classmethod
    def from_dict(cls, response: dict, updated: datetime):
        return cls(
            **response,
            _updated=updated,
        )

    def to_wikidata_api_json_format(self) -> str:
        """
        Transforms the dataclass into a JSON string like the one we get from the Wikidata API.
        This is used for storing the JSON in the database.
        """
        entity_dict = {
            'id': self.id,
            'type': self.type,
            'labels': self.labels,
            'descriptions': self.descriptions,
            'aliases': self.aliases,
            'statements': self.statements,
            'sitelinks': self.sitelinks,
        }
        return json.dumps(entity_dict)


def _cache_expired(entity: WikidataEntity) -> bool:
    return days_since(entity._updated) > WIKIDATA_CACHE_TTL_DAYS


def get_wikidata_entity(
    qid: str, bust_cache: bool = False, fetch_missing: bool = False
) -> WikidataEntity | None:
    """
    This only supports QIDs. To support PIDs we would need to use different endpoints.
    By default this only reads from the cache, refetching an entry only when it has expired.
    This is to avoid overwhelming Wikidata servers with requests from every visit to an author page.
    Set bust_cache to True to force a fresh fetch from Wikidata.
    Set fetch_missing to True to fetch entities that are not in the cache yet.
    # TODO: After bulk data imports we should set fetch_missing to true (or remove it).
    """
    if bust_cache:
        # Return the fresh result directly instead of fetching and then re-reading the cache
        return _get_from_web(qid)

    if entity := _get_from_cache(qid):
        if _cache_expired(entity):
            return _get_from_web(qid)
        return entity

    # Nothing was cached for this QID
    if fetch_missing:
        return _get_from_web(qid)

    return None


def _get_from_web(id: str) -> WikidataEntity | None:
    response = requests.get(f'{WIKIDATA_API_URL}{id}')
    if response.status_code == 200:
        entity = WikidataEntity.from_dict(
            response=response.json(), updated=datetime.now()
        )
        _add_to_cache(entity)
        return entity
    else:
        logger.error(f'Wikidata Response: {response.status_code}, id: {id}')
        return None
    # Responses documented here https://doc.wikimedia.org/Wikibase/master/js/rest-api/


def _get_from_cache_by_ids(ids: list[str]) -> list[WikidataEntity]:
    response = list(
        db.get_db().query(
            'select * from wikidata where id IN ($ids)',
            vars={'ids': ids},
        )
    )
    return [
        WikidataEntity.from_dict(response=r.data, updated=r.updated) for r in response
    ]


def _get_from_cache(id: str) -> WikidataEntity | None:
    """
    The cache is OpenLibrary's Postgres database, which we read instead of calling the Wikidata API
    """
    if result := _get_from_cache_by_ids([id]):
        return result[0]
    return None


def _add_to_cache(entity: WikidataEntity) -> None:
    # TODO: after we upgrade to postgres 9.5+ we should use upsert here
    oldb = db.get_db()
    json_data = entity.to_wikidata_api_json_format()

    if _get_from_cache(entity.id):
        return oldb.update(
            "wikidata",
            where="id=$id",
            vars={'id': entity.id},
            data=json_data,
            updated=entity._updated,
        )
    else:
        # We don't provide the updated column on insert because postgres defaults to the current time
        return oldb.insert("wikidata", id=entity.id, data=json_data)
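
The lookup order in `get_wikidata_entity` can be easier to review as a pure decision function with no database or network involved. The sketch below is a hedged reading of that logic, not code from the PR: `choose_source` and `TTL_DAYS` are hypothetical names, and it assumes `days_since` counts whole days between the stored timestamp and now.

```python
from datetime import datetime, timedelta
from typing import Optional

TTL_DAYS = 30  # assumed to mirror WIKIDATA_CACHE_TTL_DAYS


def choose_source(
    cached_at: Optional[datetime],
    now: datetime,
    bust_cache: bool = False,
    fetch_missing: bool = False,
) -> str:
    """Return 'web', 'cache', or 'none' for one lookup.

    cached_at is when the row was stored, or None when no row exists.
    """
    if bust_cache:
        return "web"  # caller explicitly asked for fresh data
    if cached_at is not None:
        expired = (now - cached_at).days > TTL_DAYS
        return "web" if expired else "cache"
    # No cached row: only go to the network when allowed to
    return "web" if fetch_missing else "none"


now = datetime(2024, 4, 20)
fresh = now - timedelta(days=5)
stale = now - timedelta(days=31)
decisions = [
    choose_source(None, now),
    choose_source(None, now, fetch_missing=True),
    choose_source(fresh, now),
    choose_source(stale, now),
    choose_source(fresh, now, bust_cache=True),
]
```

Read this way, an ordinary author page view (both flags left at `False`) contacts Wikidata only when an existing row has aged past the TTL.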
1 change: 1 addition & 0 deletions openlibrary/plugins/wikidata/__init__.py
@@ -0,0 +1 @@
'wikidata plugin.'
5 changes: 5 additions & 0 deletions openlibrary/templates/type/author/edit.html
@@ -35,6 +35,11 @@ <h1>$_("Edit Author")</h1>
<input type="text" name="author--name" id="name" value="$page.name" class="required"/>
</div>
</div>
<p style="text-align: center;">
    $ wikidata = page.wikidata(bust_cache=True, fetch_missing=True)
    $if wikidata:
        $wikidata.get_description(i18n.get_locale())
</p>
</div>

<div>
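
Both templates call `get_description` with the user's locale, so the language fallback decides what is rendered. The snippet below is a minimal standalone sketch of that fallback, assuming the locale is a short language code that matches the keys of the `descriptions` dict; `pick_description` and the sample data are hypothetical.

```python
from typing import Optional


def pick_description(descriptions: dict, language: str = "en") -> Optional[str]:
    """Prefer the requested language, then English, else None."""
    return descriptions.get(language) or descriptions.get("en")


# Illustrative data only, not a real API response
sample = {"en": "English writer", "fr": "écrivain anglais"}

in_french = pick_description(sample, "fr")
fallback = pick_description(sample, "de")  # no German entry, so English is used
missing = pick_description({}, "de")  # nothing available at all
```

Because the fallback uses `or`, an empty-string description in the requested language also falls through to English.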
10 changes: 10 additions & 0 deletions openlibrary/templates/type/author/view.html
@@ -170,6 +170,16 @@ <h6 class="collapse black uppercase">$label</h6>
$:render_subjects(_("Time"), books.facet_counts.get('time_facet'), 'time:')
<!-- /SUBJECTS -->

<div class="section">
    <h6 class="collapse black uppercase">TESTING ONLY WIKIDATA SECTION</h6>
    <p style="text-align: center;">
        <!-- This is only below subject temporarily to avoid merge conflicts on testing -->
        $ wikidata = page.wikidata(fetch_missing=show_librarian_extras)
        $if wikidata:
            $wikidata.get_description(i18n.get_locale())
    </p>
</div>

Comment on lines +173 to +182
@RayBB RayBB Apr 21, 2024

I am leaving this code here for now so that this PR is easier to verify on testing.openlibrary.org.

I can either remove it before we merge this, or we can rely on the commit in #9130 that already removes it.

$if "lists" in ctx.features:
<div class="section Tools">
$:render_template("lists/widget", page, include_rating=False, exclude_own_lists=True, show_active_lists=True)