Wikidata <> authors integration: first steps #8236

Closed

Conversation

RayBB
Collaborator

@RayBB RayBB commented Aug 28, 2023

CLOSED IN FAVOR OF #9130

Document with the goal of this project

v0 goals:

  • Start pulling Wikidata into the database and showing something basic
  • Upstream code: class Author -> add get_wikidata
  • It should fetch from Wikidata only when a parameter asks for it; otherwise it only checks the wikidata_cache table
  • Author edit/view templates will call this
    • Render_wikidata_infobox method for rendering the Wikidata author box

Next steps:

  • Decide on a bulk import method
  • Enrich the infobox
  • Use wikidata for author autocomplete

Technical

Trying to keep it as simple as possible.
The wikidata method is on the authors model because works/editions store wikidata IDs differently, so we'll have to handle that when we get there. My current goal is just authors.

Testing

Add a wikidata ID to an author and then you'll start seeing this "short description" field showing up on the side.

Screenshot

Aug 30 demo video (slightly outdated), see below
wd_demo.mp4
September 2 screenshot (most recent) image

Stakeholders

@cdrini

@davidscotson
Collaborator

It would be really neat if this was added to the autocomplete, which currently pulls in the author's name, date-of-birth/death, genres and top work.

On the Wikidata side this is mostly designed to be used to differentiate between two identically named items, so it's an ideal use case.

It might also be worth looking at the Wikidata-powered infoboxes on Wikipedia to see what kind of info they surface.

https://en.wikipedia.org/wiki/Template:Infobox_writer/Wikidata

openlibrary/core/models.py Outdated
@tfmorris
Contributor

Is there an issue associated with this PR? Making better use of Wikidata data is definitely a good idea, but I'm not sure description is the best place to start. In addition to the I18N issues, these are also mostly machine generated from templates so you're going to end up with lots of " author (birth-death)" which doesn't really add a lot of value.

Also, given the volumes of data involved, it's probably more appropriate to use the data dumps than be hitting their API.
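
As a rough illustration of the dump-based approach, here is a minimal sketch that streams entities out of a Wikidata JSON dump instead of calling the API. It assumes the published dump layout (one JSON array, one entity per line, each line ending in a comma) and the `descriptions` shape of Wikidata's entity JSON. The helper names `iter_entities` and `collect_descriptions` are hypothetical and not part of this PR.

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a Wikidata JSON dump."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip blank lines and the brackets that wrap the dump
        yield json.loads(line)


def collect_descriptions(
    lines: Iterable[str], wanted: set[str], lang: str = "en"
) -> dict[str, str]:
    """Map each wanted QID to its description in `lang`, when present."""
    found: dict[str, str] = {}
    for entity in iter_entities(lines):
        if entity.get("id") not in wanted:
            continue
        entry = entity.get("descriptions", {}).get(lang)
        if entry:
            found[entity["id"]] = entry["value"]
    return found
```

For a real dump, `lines` could be a file opened in text mode with `gzip.open` or `bz2.open`, so the whole file never has to sit in memory.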

@RayBB RayBB changed the title wikidata proof of concept Wikidata <> authors integration: first steps Sep 1, 2023
@RayBB RayBB marked this pull request as ready for review September 1, 2023 11:35
openlibrary/core/models.py Outdated
@hornc
Collaborator

hornc commented Sep 4, 2023

What is the use case behind this? Is there going to be a bulk QID-fetching UI or something?
Individual author lookups on an individual UI could be done with links or simple requests with a review feature.
I'm not sure what a Wikidata table is for.

@RayBB
Collaborator Author

RayBB commented Sep 4, 2023

@hornc
use case: this is an MVP to start integrating data from wikidata.
There will be many next steps to improving the reader experience and using the data for search.
Also I made this to lay out some ideas:
https://github.com/internetarchive/openlibrary/wiki/Wikidata-Integration

@cdrini cdrini added the Priority: 1 Do this week, receiving emails, time sensitive, . [managed] label Sep 11, 2023
Collaborator

@cdrini cdrini left a comment

Nice! A few small things, but I think the main things are:

The split between the two files is a little confusing, especially since we have WikidataEntities, WikidataEntity, and WikidataRow, and the distinction between these isn't obvious. I think we want:

from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass
class WikidataEntity:
    id: str
    data: dict
    updated: datetime

    # In general we prefer functions to have verb names
    def get_description(self, locale: str) -> str | None: ...


def get_entity(id: str, cache_only: bool = True) -> WikidataEntity | None: ...

def _get_from_web(id: str) -> WikidataEntity: ...
def _get_from_cache(id: str) -> WikidataEntity | None: ...
def _add_to_cache(id: str) -> None: ...

That will, I think, keep all the code in one place, and since I doubt our wikidata caching will ever really grow to be more complicated than this, I think it'll keep things a touch tidier!
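
A minimal runnable sketch of that layout, for illustration only. It assumes an in-memory dict in place of the wikidata cache table and a stubbed `_get_from_web` in place of the real HTTP request. The `descriptions` structure follows Wikidata's entity JSON. The English fallback in `get_description` and the choice to pass the entity (rather than an id) to `_add_to_cache` are assumptions, not necessarily what this PR does.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass
class WikidataEntity:
    id: str
    data: dict
    updated: datetime

    def get_description(self, locale: str = "en") -> str | None:
        descriptions = self.data.get("descriptions", {})
        # Assumption: fall back to English when the locale has no description.
        entry = descriptions.get(locale) or descriptions.get("en")
        return entry["value"] if entry else None


# Hypothetical stand-in for the wikidata cache table.
_CACHE: dict[str, WikidataEntity] = {}


def _get_from_web(id: str) -> WikidataEntity:
    # Stub: a real version would request the entity JSON from Wikidata.
    data = {"descriptions": {"en": {"language": "en", "value": "example description"}}}
    return WikidataEntity(id=id, data=data, updated=datetime.now())


def _get_from_cache(id: str) -> WikidataEntity | None:
    return _CACHE.get(id)


def _add_to_cache(entity: WikidataEntity) -> None:
    _CACHE[entity.id] = entity


def get_entity(id: str, cache_only: bool = True) -> WikidataEntity | None:
    entity = _get_from_cache(id)
    if entity or cache_only:
        return entity  # cache hit, or a miss we are not allowed to fetch
    entity = _get_from_web(id)
    _add_to_cache(entity)
    return entity
```

With `cache_only=True` a miss simply returns `None`, which matches the v0 goal of only hitting Wikidata when a parameter asks for it.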

openlibrary/core/wikidata.py Outdated
openlibrary/core/wikidata.py Outdated
openlibrary/core/models.py Outdated
openlibrary/templates/type/author/edit.html Outdated
openlibrary/templates/wikidata_author.html Outdated
openlibrary/templates/wikidata_author.html Outdated
Collaborator

Put the CSS for this in a new LESS file at static/css/wikidatabox.less.

Then import that CSS from page-user.less.

(I'm sorry, this flow is not great 😅 )

ttl (time to live) inspired by the cachetools api https://cachetools.readthedocs.io/en/latest/#cachetools.TTLCache
"""
entity = WikidataEntities.get_by_id(QID)
if entity and seconds_since(entity.updated) < ttl:
Collaborator

We can probably inline this function, and the timedelta helpers might be a little clearer here!

Suggested change
if entity and seconds_since(entity.updated) < ttl:
if entity and (datetime.now() - entity.updated) < timedelta(days=30):
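
A small sketch of the timedelta comparison in isolation, assuming the 30-day window from the suggestion. The `is_fresh` helper and the `now` parameter are hypothetical and exist only so the check can be exercised without a database; in the PR the comparison would be inlined as suggested.

```python
from __future__ import annotations

from datetime import datetime, timedelta

CACHE_TTL = timedelta(days=30)  # window taken from the suggested change


def is_fresh(updated: datetime, now: datetime | None = None,
             ttl: timedelta = CACHE_TTL) -> bool:
    """Return True when a cached row was updated within the TTL window."""
    now = now or datetime.now()
    # Subtracting two datetimes gives a timedelta, so no manual
    # seconds arithmetic is needed.
    return (now - updated) < ttl
```

Comparing timedeltas directly keeps the unit visible at the call site, which is the readability gain the suggestion is after.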

@cdrini cdrini added the Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] label Sep 22, 2023
@cdrini cdrini removed the Priority: 1 Do this week, receiving emails, time sensitive, . [managed] label Sep 25, 2023
@codecov-commenter

codecov-commenter commented Sep 26, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 15.96%. Comparing base (45ed081) to head (9594dd1).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #8236   +/-   ##
=======================================
  Coverage   15.96%   15.96%           
=======================================
  Files          89       89           
  Lines        4710     4710           
  Branches      821      821           
=======================================
  Hits          752      752           
  Misses       3449     3449           
  Partials      509      509           


@RayBB RayBB requested a review from cdrini September 26, 2023 00:29
@RayBB
Collaborator Author

RayBB commented Sep 26, 2023

@cdrini I think I've addressed all your concerns and the code is looking a lot cleaner.
Ready for your next round of feedback!

@RayBB RayBB added Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] and removed Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] labels Sep 26, 2023
@mekarpeles mekarpeles added the Priority: 1 Do this week, receiving emails, time sensitive, . [managed] label Oct 2, 2023
Collaborator

@cdrini cdrini left a comment

This looks good to me! Would be great if we can merge those two classes + one other comment.

Next step for me is to create this table on prod so we can deploy this.

Next step for the project is to begin working on bulk import. We can make tweaks to how/where the data is displayed at any point, but getting the bulk data import sorted is likely the next most impactful step.

openlibrary/core/wikidata.py Outdated
openlibrary/core/wikidata.py Outdated
@RayBB RayBB requested a review from cdrini October 24, 2023 16:30
Collaborator

@cdrini cdrini left a comment

Nice! Logic looks good; a few style things. I created the wikidata table and will toss this up on testing now since the logic is already 👍

openlibrary/core/wikidata.py Outdated
openlibrary/core/wikidata.py Outdated
openlibrary/core/wikidata.py Outdated
openlibrary/core/wikidata.py Outdated
openlibrary/core/wikidata.py Outdated
openlibrary/core/wikidata.py Outdated
openlibrary/core/wikidata.py
openlibrary/core/wikidata.py Outdated
openlibrary/plugins/wikidata/code.py Outdated
@RayBB RayBB requested a review from cdrini April 16, 2024 15:46
@RayBB RayBB added Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed] and removed State: Blocked Work has stopped, waiting for something (Info, Dependent fix, etc. See comments). [managed] Needs: Submitter Input Waiting on input from the creator of the issue/pr [managed] labels Apr 16, 2024
@RayBB RayBB added the On testing.openlibrary.org This PR has been deployed to testing.openlibrary.org for testing label Apr 16, 2024
Comment on lines +173 to +182
<div class="section">
<h6 class="collapse black uppercase">TESTING ONLY WIKIDATA SECTION</h6>
<p style="text-align: center;">
<!-- This is only below subject temporarily to avoid merge conflicts on testing -->
$ wikidata = page.wikidata(fetch_missing=show_librarian_extras)
$if wikidata:
$wikidata.get_description(i18n.get_locale())
</p>
</div>

Collaborator Author

@RayBB RayBB Apr 21, 2024

I am leaving this code here for now so that this PR is easier to verify on testing.openlibrary.org.

I can remove it before we merge this; alternatively, there is already a commit that removes it in #9130.

@cdrini cdrini removed the On testing.openlibrary.org This PR has been deployed to testing.openlibrary.org for testing label Apr 23, 2024
@RayBB
Collaborator Author

RayBB commented May 2, 2024

Closing in favor of #9130

@RayBB RayBB closed this May 2, 2024
Labels
Needs: Review This issue/PR needs to be reviewed in order to be closed or merged (see comments). [managed]
Projects
Status: Done
8 participants