Wikidata <> authors integration: first steps #8236

RayBB · 2023-08-28T22:37:18Z

CLOSED IN FAVOR OF #9130

v0 goals:

Start pulling Wikidata into the database and show something basic.
Upstream code, class Author -> add get_wikidata
It should fetch from Wikidata based on a parameter; otherwise it only checks the wikidata_cache table.
The author edit/view templates will call this.
- Render_wikidata_infobox method for rendering the Wikidata author box

Next steps:

Decide on a bulk import method
Enrich the infobox
Use wikidata for author autocomplete

Technical

Trying to keep it as simple as possible.
The wikidata method is on the authors model because works/editions store wikidata IDs differently, so we'll have to handle that when we get there. My current goal is just authors.

Testing

Add a wikidata ID to an author and then you'll start seeing this "short description" field showing up on the side.

Screenshot

Aug 30 demo video (slightly outdated), see below

wd_demo.mp4

September 2 screenshot (most recent)

Stakeholders

@cdrini

davidscotson · 2023-08-29T07:04:08Z

It would be really neat if this was added to the autocomplete, which currently pulls in the author's name, date-of-birth/death, genres and top work.

On the Wikidata side this is mostly designed to be used to differentiate between two identically named items, so it's an ideal use case.

It might also be worth looking at the Wikidata-powered infoboxes on Wikipedia to see what kind of info they surface.

https://en.wikipedia.org/wiki/Template:Infobox_writer/Wikidata

openlibrary/core/models.py

tfmorris · 2023-08-29T16:10:05Z

Is there an issue associated with this PR? Making better use of Wikidata data is definitely a good idea, but I'm not sure description is the best place to start. In addition to the I18N issues, these are also mostly machine generated from templates so you're going to end up with lots of " author (birth-death)" which doesn't really add a lot of value.

Also, given the volumes of data involved, it's probably more appropriate to use the data dumps than be hitting their API.

openlibrary/plugins/wikidata/code.py

openlibrary/core/wikidata.py

openlibrary/core/models.py

hornc · 2023-09-04T20:01:55Z

What is the use case behind this? Is there going to be a bulk QID fetching UI or something?
Individual author lookups on an individual UI could be done with links or simple requests with a review feature.
I'm not sure what a Wikidata table is for.

RayBB · 2023-09-04T22:40:57Z

@hornc
use case: this is an MVP to start integrating data from wikidata.
There will be many next steps to improving the reader experience and using the data for search.
Also I made this to lay out some ideas:
https://github.com/internetarchive/openlibrary/wiki/Wikidata-Integration

cdrini

Nice! A few small things, but I think the main things are:

The split between the two files is a little confusing, especially since we have WikidataEntities, WikidataEntity, and WikidataRow, and the distinction between them isn't obvious. I think we want:

from dataclasses import dataclass
from datetime import datetime


@dataclass
class WikidataEntity:
    id: str
    data: dict
    updated: datetime

    # In general we prefer functions to have verb names
    def get_description(self, locale: str) -> str | None: ...


# Public entry point; the private helpers below do the actual work
def get_entity(id: str, cache_only: bool = True) -> WikidataEntity | None: ...

def _get_from_web(id: str) -> WikidataEntity: ...
def _get_from_cache(id: str) -> WikidataEntity | None: ...
def _add_to_cache(id: str) -> None: ...

I think that will keep all the code in one place, and since I doubt our wikidata caching will ever grow to be more complicated than this, it'll keep things a touch tidier!

openlibrary/core/wikidata.py

openlibrary/core/models.py

openlibrary/templates/type/author/edit.html

openlibrary/templates/wikidata_author.html

cdrini · 2023-09-22T01:32:07Z

openlibrary/templates/wikidata_author.html

Put the CSS for this in a new Less file at static/css/wikidatabox.less.

Then import that file from page-user.less.

(I'm sorry, this flow is not great 😅 )

cdrini · 2023-09-22T01:48:22Z

openlibrary/plugins/wikidata/code.py

+    ttl (time to live) inspired by the cachetools api https://cachetools.readthedocs.io/en/latest/#cachetools.TTLCache
+    """
+    entity = WikidataEntities.get_by_id(QID)
+    if entity and seconds_since(entity.updated) < ttl:


We can probably inline this function, and the timedelta helpers might be a little clearer here!

Suggested change

-    if entity and seconds_since(entity.updated) < ttl:
+    if entity and (datetime.now() - entity.updated) < timedelta(days=30):
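As a rough sketch of the timedelta-based check suggested here, the freshness test can be isolated into a small function. The name `is_fresh` and the `now` parameter are hypothetical (added only to make the check testable), and naive datetimes are assumed, matching the `datetime.now()` call in the suggestion.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def is_fresh(
    updated: datetime,
    ttl: timedelta = timedelta(days=30),
    now: datetime | None = None,
) -> bool:
    """Return True if `updated` is strictly less than `ttl` old."""
    # `now` is injectable for tests; by default use the current local time.
    now = now or datetime.now()
    return (now - updated) < ttl
```

Comparing two timedelta values directly avoids converting to seconds by hand, which is the readability gain the suggestion is after.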

codecov-commenter · 2023-09-26T00:17:18Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 15.96%. Comparing base (45ed081) to head (9594dd1).

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #8236   +/-   ##
=======================================
  Coverage   15.96%   15.96%           
=======================================
  Files          89       89           
  Lines        4710     4710           
  Branches      821      821           
=======================================
  Hits          752      752           
  Misses       3449     3449           
  Partials      509      509

RayBB · 2023-09-26T00:29:56Z

@cdrini I think I've addressed all your concerns and the code is looking a lot cleaner.
Ready for your next round of feedback!

cdrini

This looks good to me! Would be great if we can merge those two classes + one other comment.

Next step for me is to create this table on prod so we can deploy this.

Next step for the project is to begin working on bulk import. We can make tweaks to how/where the data is displayed at any point, but getting the bulk data import sorted is likely the next most impactful step.

openlibrary/core/wikidata.py

cdrini

Nice! Logic looks good; a few style things. I created the wikidata table and will toss this up on testing now since the logic is already 👍

openlibrary/core/wikidata.py

openlibrary/plugins/wikidata/code.py

RayBB · 2024-04-21T00:20:47Z

openlibrary/templates/type/author/view.html

+        <div class="section">
+            <h6 class="collapse black uppercase">TESTING ONLY WIKIDATA SECTION</h6>
+            <p style="text-align: center;">
+                <!-- This is only below subject temporarily to avoid merge conflicts on testing -->
+                $ wikidata = page.wikidata(fetch_missing=show_librarian_extras)
+                $if wikidata:
+                    $wikidata.get_description(i18n.get_locale())
+            </p>
+        </div>
+


I am leaving this code here for now so that this PR is easier to verify on testing.openlibrary.org.

I can either remove it before we merge this, or rely on the commit in #9130 that already removes it.
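For readers following the template snippet, here is a hedged sketch of what a locale-aware `get_description` could look like. It assumes the entity data uses the Wikidata `wbgetentities` JSON shape, where `descriptions` maps a language code to an object with a `value` key. The fallback order (exact locale, then base language, then English) is an assumption for illustration and not necessarily what this PR does.

```python
from __future__ import annotations


def get_description(data: dict, locale: str, fallback: str = "en") -> str | None:
    """Pick a description for `locale`, falling back when it is missing."""
    descriptions = data.get("descriptions", {})
    # Try the exact locale, then its base language (e.g. "fr" for "fr-ca"),
    # then the fallback language. This order is an assumption.
    for lang in (locale, locale.split("-")[0], fallback):
        entry = descriptions.get(lang)
        if entry:
            return entry["value"]
    return None
```

Returning `None` when nothing matches lets the template's `$if wikidata:` style checks skip rendering instead of showing an empty string.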

RayBB · 2024-05-02T16:27:03Z

Closing in favor of #9130

cclauss reviewed Aug 29, 2023

openlibrary/core/models.py

cclauss reviewed Aug 29, 2023

openlibrary/plugins/wikidata/code.py

cclauss reviewed Sep 1, 2023

openlibrary/plugins/wikidata/code.py

cclauss reviewed Sep 1, 2023

openlibrary/core/wikidata.py

RayBB changed the title from "wikidata proof of concept" to "Wikidata <> authors integration: first steps" Sep 1, 2023

RayBB marked this pull request as ready for review September 1, 2023 11:35

cclauss reviewed Sep 1, 2023

openlibrary/core/models.py

cclauss mentioned this pull request Sep 1, 2023

ruff rule UP007: Use X | Y for type annotations from PEP 604 #8252

Merged

RayBB mentioned this pull request Sep 1, 2023

Add author configs to pg_dump file #8248

Merged

mekarpeles assigned cdrini Sep 5, 2023

cdrini added the Priority: 1 label Sep 11, 2023

cdrini reviewed Sep 22, 2023

cdrini added the Needs: Submitter Input label Sep 22, 2023

RayBB mentioned this pull request Sep 24, 2023

ruff rule UP007: Use X | Y for type annotations for solr #8327

Closed

cdrini removed the Priority: 1 label Sep 25, 2023

RayBB requested a review from cdrini September 26, 2023 00:29

RayBB added the Needs: Review label and removed the Needs: Submitter Input label Sep 26, 2023

mekarpeles added the Priority: 1 label Oct 2, 2023

This was referenced Oct 6, 2023

Integrate Wikidata #710

Open

Accept full URLs any place an identifier can be entered #866

Open

cdrini reviewed Oct 17, 2023

openlibrary/core/wikidata.py

openlibrary/core/wikidata.py

RayBB requested a review from cdrini October 24, 2023 16:30

RayBB added 6 commits April 14, 2024 23:31

default to using cache

f29317a

remove visual changes

9521310

better comment

d903c81

add fetch_missing

bab60d7

simplify html

fdfbbe6

text align center p tags

d9f50c9

cdrini requested changes Apr 15, 2024

RayBB added 11 commits April 16, 2024 00:12

lowercase qid

03e6d28

add typehints

3a543bb

_updated

5679a67

get_description

3166950

delete empty code.py

195b040

**response

1683f48

handle no wikidata case

78a6cf2

add error logging

751899e

simplify if

e34c6a5

typo

be86563

to_wikidata_api_json_format

3fd3620

RayBB requested a review from cdrini April 16, 2024 15:46

lower wikidata section for testing

2662f2b

RayBB added the On testing.openlibrary.org label Apr 16, 2024

restore extra line

37f34d3

RayBB mentioned this pull request Apr 21, 2024

Wikidata v0 with author description and infobox #9130

Merged

RayBB commented Apr 21, 2024

cdrini removed the On testing.openlibrary.org label Apr 23, 2024

RayBB closed this May 2, 2024
