On several occasions, I've seen a lot of URLs requested on a host, even though maxURLPerSchemeAuthority was low (maybe 50-100).
It seems that duplicates and other non-content responses (401, 403) are not counted. This behaviour makes sense for a lot of sites, but I think there should be a limit "maxRequestPerSchemeAuthority" to avoid wasting time on sites with a lot of inlinks that lead to nothing (for instance, there are a lot of links pointing toward stumbleupon.com/submit?...... which produce an error).
I've thought of two possible implementations for this issue.
First, the naive way: create a second counter (something like schemeAuthority2Count) to count requests, and stop when either of the two limits is reached. This is arguably a waste of resources, but not an important one, because the number of scheme authorities should be low anyway. A sketch of this option follows below.
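For concreteness, here is a minimal sketch of the dual-counter option; the class, field, and parameter names are hypothetical, not BUbiNG's actual ones:

```java
// Hypothetical sketch of the dual-counter approach; names are illustrative.
final class SchemeAuthorityLimiter {
    private final int maxUrlsPerSchemeAuthority;
    private final int maxRequestsPerSchemeAuthority;
    private int urlCount;      // successful (content) fetches only
    private int requestCount;  // every request, including duplicates and 401/403

    SchemeAuthorityLimiter(final int maxUrls, final int maxRequests) {
        this.maxUrlsPerSchemeAuthority = maxUrls;
        this.maxRequestsPerSchemeAuthority = maxRequests;
    }

    /** Records a fetch; returns true once this scheme+authority should be stopped. */
    boolean record(final boolean success) {
        requestCount++;
        if (success) urlCount++;
        return urlCount >= maxUrlsPerSchemeAuthority
            || requestCount >= maxRequestsPerSchemeAuthority;
    }
}
```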
Second option: from a maxFailedRequestsPerSchemeAuthority parameter, we could compute a relative weight for unsuccessful requests, and increment the COUNT with a weight of (maxURLs - COUNT)/maxFailedRequestsPerSchemeAuthority until it reaches maxURLs.
Example:
maxURLsPerSchemeAuthority=20
maxFailedRequestsPerSchemeAuthority=40
We increment with weight (20-COUNT)/40 for each failed request and with weight 1 for each successful request. If we only hit error pages, the counter will reach the value 20 after ~40 requests. If there are no errors or duplicates, we should eventually get 20 URLs. If we have 10 good URLs, the weight is (20-10)/40 = 1/4, so we need roughly 40 more errors to reach the limit of 20 maxURLs.
Since the counter is an integer, we need to increment by floor(weight), plus 1 with probability equal to the fractional part, i.e. increment = floor(weight) + (random.nextDouble() < weight - floor(weight) ? 1 : 0). (See Morris counters.)
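To make the weighted increment concrete, here is a hedged sketch (again with hypothetical names; this is not BUbiNG's actual code):

```java
import java.util.Random;

// Illustrative sketch of the weighted-increment idea with randomized rounding.
final class WeightedUrlCounter {
    private final int maxUrls;   // e.g. maxURLsPerSchemeAuthority = 20
    private final int maxFailed; // e.g. maxFailedRequestsPerSchemeAuthority = 40
    private final Random random = new Random();
    private int count;           // integer counter, capped at maxUrls

    WeightedUrlCounter(final int maxUrls, final int maxFailed) {
        this.maxUrls = maxUrls;
        this.maxFailed = maxFailed;
    }

    /** Successful fetches count with weight 1. */
    void recordSuccess() {
        if (count < maxUrls) count++;
    }

    /** Failed fetches count with weight (maxUrls - count) / maxFailed,
     *  applied via randomized rounding so the counter stays an integer. */
    void recordFailure() {
        final double weight = (double) (maxUrls - count) / maxFailed;
        int increment = (int) Math.floor(weight);
        if (random.nextDouble() < weight - Math.floor(weight)) increment++;
        count = Math.min(maxUrls, count + increment);
    }

    boolean limitReached() {
        return count >= maxUrls;
    }
}
```

With maxUrls = 20 and maxFailed = 40 as in the example, a failure at COUNT = 10 adds 1/4 in expectation, so the counter drifts toward 20 without ever overshooting it.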
Of course this is an approximation, but it should work, requires very little code, and has no impact on memory consumption. I'll submit a PR after testing it.