
Duplicates or 403 are not taken into account by the maxUrlPerSchemeAuthority limit #4

Open
guillaumepitel opened this issue Sep 15, 2017 · 2 comments

Comments

@guillaumepitel (Contributor)

On several occasions, I've seen a lot of URLs requested on a single host, even though maxUrlsPerSchemeAuthority was low (maybe 50-100).

It seems that duplicates and other non-content responses (401, 403) are not counted. This behaviour makes sense for a lot of sites, but I think there should be a limit "maxRequestsPerSchemeAuthority" to avoid wasting time on sites with a lot of inlinks that lead to nothing (for instance, there are a lot of links pointing toward stumbleupon.com/submit?...... which all produce an error).
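
For illustration, this is roughly what it could look like in the configuration file (maxRequestsPerSchemeAuthority is the proposed, hypothetical option; only the existing URL limit is real):

```properties
# Existing limit: URLs actually stored per scheme+authority.
maxUrlsPerSchemeAuthority=100
# Proposed (hypothetical): cap on total requests, so that duplicates and
# error responses (401, 403, ...) also count toward a limit.
maxRequestsPerSchemeAuthority=400
```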

@guillaumepitel changed the title from "Duplicates or 403 are taken into account by the maxUrlPerSchemeAuthority limit" to "Duplicates or 403 are not taken into account by the maxUrlPerSchemeAuthority limit" on Sep 15, 2017
@vigna (Member) commented Sep 15, 2017

Good point. It could be infinity by default for people who really need content and are willing to wait for it.

@guillaumepitel (Contributor, Author)

I've thought of two possible implementations for this issue.

  1. First, the naive way: create a second counter (like schemeAuthority2Count) to count requests, and stop when either of the two limits is reached. This wastes some resources, but not many, since the number of scheme+authorities should be low anyway.
  2. Second option: from a maxFailedRequestsPerSchemeAuthority parameter, we could compute a relative weight for unsuccessful requests, and increment COUNT with a weight of (maxUrls - COUNT)/maxFailedRequestsPerSchemeAuthority until it reaches maxUrls (see the sketch after the example below).

Example:

  • maxUrlsPerSchemeAuthority=20
  • maxFailedRequestsPerSchemeAuthority=40

We increment with weight (20 - COUNT)/40 for each failed request and with weight 1 for each successful request. If we only hit error pages, the counter will reach the value 20 after ~40 requests. If there are no errors or duplicates, we eventually get 20 URLs. If we already have 10 good URLs, then weight = 1/4, so we need about 40 errors to reach the limit of maxUrls = 20.

Since the counter is an integer, we need to increment by floor(weight) + (random.nextDouble() < weight - floor(weight) ? 1 : 0), so that the expected increment equals the weight (see Morris counters).
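
A minimal sketch of this scheme in Java (illustrative only; the class and method names below are hypothetical, not BUbiNG's actual code):

```java
import java.util.Random;

// Hypothetical sketch of the weighted per-(scheme+authority) counter
// proposed above; not BUbiNG's actual implementation.
public class WeightedSchemeAuthorityCounter {
    private final int maxUrls;           // maxUrlsPerSchemeAuthority
    private final int maxFailedRequests; // proposed maxFailedRequestsPerSchemeAuthority
    private final Random random = new Random();
    private int count;                   // the integer COUNT discussed above

    public WeightedSchemeAuthorityCounter(final int maxUrls, final int maxFailedRequests) {
        this.maxUrls = maxUrls;
        this.maxFailedRequests = maxFailedRequests;
    }

    /** Records one request; returns true while further requests are still allowed. */
    public boolean record(final boolean success) {
        if (success) count++; // a successful, non-duplicate fetch counts with weight 1
        else {
            // Failed or duplicate request: fractional weight that shrinks
            // as the counter approaches maxUrls.
            final double weight = (double) (maxUrls - count) / maxFailedRequests;
            // Randomized rounding (Morris-counter style): the expected
            // integer increment equals the fractional weight.
            count += (int) Math.floor(weight)
                    + (random.nextDouble() < weight - Math.floor(weight) ? 1 : 0);
        }
        return count < maxUrls;
    }
}
```

Note that because the weight is recomputed from the current count, each failure contributes less as the counter approaches maxUrls; the randomized rounding only keeps the expected increment equal to the fractional weight.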

Of course this is an approximation, but it should work, requires very little code, and has no impact on memory consumption. I'll submit a PR after testing it.
