Retrieving results after search #429

Open
ntorrescsuc opened this issue Mar 27, 2020 · 1 comment

Comments

@ntorrescsuc

We installed the latest OpenWayback version, reindexed all crawled content using CDX, and started searching.
Reviewing the results table after querying for a URL, some of the results have more than one entry for a date even though only one crawl was done with Heritrix. Why?
Sometimes more than one date has an *. I looked for the meaning of the * but couldn't find any information.

@ato
Member

ato commented Mar 27, 2020

One possible reason is that multiple URLs with slight variants (e.g. www vs. no-www, http vs. https, or uppercase vs. lowercase) are grouped due to URL canonicalization. It's also not impossible that Heritrix really did collect the same URL multiple times (check the crawl log).
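
To illustrate how grouping by canonical URL can put several entries under one date, here is a rough sketch in Python. It is not OpenWayback's actual canonicalizer: `canonical_key` and `group_captures` are hypothetical helpers, and the rules (lowercase everything, drop the scheme, drop a leading `www.`) are simplifying assumptions covering only the variants mentioned above.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def canonical_key(url: str) -> str:
    """Hypothetical, simplified canonicalization (NOT OpenWayback's real rules)."""
    parts = urlsplit(url.lower())          # uppercase vs lowercase collapses
    host = parts.netloc
    if host.startswith("www."):            # www vs no-www collapses
        host = host[4:]
    path = parts.path or "/"
    query = "?" + parts.query if parts.query else ""
    return host + path + query             # scheme dropped: http vs https collapses


def group_captures(captures):
    """Group (date, url) captures by (canonical key, date)."""
    groups = defaultdict(list)
    for date, url in captures:
        groups[(canonical_key(url), date)].append(url)
    return dict(groups)


# Made-up captures from a single crawl, for illustration only.
captures = [
    ("2020-03-01", "http://www.example.org/Page"),
    ("2020-03-01", "https://example.org/page"),
    ("2020-03-01", "http://example.org/other"),
]
grouped = group_captures(captures)
for (key, date), urls in grouped.items():
    print(date, key, len(urls))
```

In this sketch the first two captures share one key, so a query for either URL would list two entries on the same date even though there was only one crawl.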

The * means the content of the page changed on this date as determined by comparing its sha1 digest with the previous snapshot.
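
The digest comparison can be sketched roughly as follows, in Python. This is an illustration of the idea, not OpenWayback's code: `mark_changes` is a hypothetical helper, digests are computed here as hex strings for simplicity, and treating the first snapshot as unmarked (it has no predecessor to compare against) is an assumption of this sketch.

```python
import hashlib


def sha1_digest(payload: bytes) -> str:
    """SHA-1 digest of a capture's content, as a hex string."""
    return hashlib.sha1(payload).hexdigest()


def mark_changes(snapshots):
    """Flag snapshots whose digest differs from the previous snapshot's.

    snapshots: list of (date, payload) pairs, already sorted by date.
    Returns a list of (date, changed) pairs.
    """
    marked = []
    previous = None
    for date, payload in snapshots:
        digest = sha1_digest(payload)
        # Assumption: the first snapshot has nothing to compare with, so it is not flagged.
        changed = previous is not None and digest != previous
        marked.append((date, changed))
        previous = digest
    return marked


# Made-up payloads for illustration only.
marks = mark_changes([
    ("2020-03-01", b"<html>v1</html>"),
    ("2020-03-02", b"<html>v1</html>"),
    ("2020-03-03", b"<html>v2</html>"),
])
for date, changed in marks:
    print(date + (" *" if changed else ""))
```

Because any byte-level difference changes the digest, a page with dynamic content such as a timestamp or session token would be flagged on every capture under this scheme.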
