[DOCS] Document how to deal with bots on live sites #2286
We've experienced this as well and it's brought our site to its knees.
Same. TikTok ignores robots.txt. We have one site that was getting several hits per second before we stuck a user agent filter in.
Drupal-specific info here: https://dev.acquia.com/blog/automated-bot-traffic-strategies-handle-and-manage-it
Suggestions from the tech call:

- Blocking bots by user agent
- Stopping legit bots from crawling facets
- Remaining questions
Nginx allows for multiple conf files. We could add an include in nginx.conf pointing to a file in /var/www/drupal, which would eliminate the need for a separate mount.
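A minimal sketch of what that include could look like, assuming it goes inside the http block of nginx.conf; the bots*.conf name is hypothetical:

```nginx
# In nginx.conf, inside the http block: pick up site-managed rules
# kept alongside the Drupal codebase. A glob that matches zero files
# is not an error, so nginx still starts if no such file exists yet.
include /var/www/drupal/bots*.conf;
```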
FWIW, I've found the patch for facets at https://www.drupal.org/node/2937191 useful; it converts the facets into actual checkboxes, instead of the default that renders them as links (followable by bots) that get converted to checkboxes by JS.
@kayakr that would be awesome if we could get that patched into the facets module. I really like the checkboxes for facets but am having the same issue with bots.
I just wanted to share a link to a presentation created by the Islandora community itself that was given during the October 2024 Open Meeting on the topic of bots. Maybe we can use the presentation as a source of content for the official docs. At the very least we can link to a PDF version of this document: "Bots - Islandora Open Meeting - Oct 29 2024"
I'm still looking into this, and have a few things I can update on. Blocking IP addresses can be done in iptables, and blocking user agents can be done in nginx by appending to drupal.defaults.conf (see the sketches below). I have added lines to my Dockerfile that block certain user agents and also add a couple of lines to my robots.txt. The above idea of adding a new conf and mounting it is probably a better solution, but this works in the meantime.
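A couple of hedged sketches of what those pieces could look like, rather than the exact code used. First, dropping all traffic from one address with iptables (the IP is a placeholder):

```shell
# Insert a rule dropping all packets from one abusive address (placeholder IP)
iptables -I INPUT -s 203.0.113.45 -j DROP
```

And a Dockerfile fragment in the same spirit. The bot names are examples only, and both file paths are assumptions about where drupal.defaults.conf and robots.txt live in the image:

```dockerfile
# Paths and user agent names below are assumptions, not verified against the image.
# Return 403 to matching user agents, and extend robots.txt.
RUN echo 'if ($http_user_agent ~* "(Bytespider|PetalBot|SemrushBot)") { return 403; }' \
        >> /etc/nginx/shared/drupal.defaults.conf && \
    printf 'Disallow: /search\nDisallow: /_flysystem/\n' \
        >> /var/www/drupal/web/robots.txt
```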
In theory this should block legit crawlers like Google from crawling my search pages and my flysystem URLs (direct linking to PDFs). This all helps a bit, but it only really helps with legit crawlers that respect robots.txt or are honest about their user agent. I've not tried the following yet, but a few possible solutions for managing this:
We should write up some docs about how to block bots from crawling a live site that is set up with Docker. This has come up a few times in Slack and it would be good to have something to explain how to deal with it.
One option some of us have been using is to edit drupal.defaults.conf to return a 403 based on user agent. I have done this by adding the following to my Dockerfile, but you could also mount the conf file and edit it manually:
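The Dockerfile approach is sketched in a comment above; for the mount variant, a docker-compose fragment could look like the following. Both paths are assumptions; the in-container location of drupal.defaults.conf depends on the image:

```yaml
# docker-compose.yml fragment; both paths are assumptions
services:
  drupal:
    volumes:
      - ./config/drupal.defaults.conf:/etc/nginx/shared/drupal.defaults.conf:ro
```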
It would also be nice to document how to block by IP address using Docker.
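One possible approach, assuming the same conf file is already being customized: an nginx deny rule, which travels with the container in a way host-level iptables rules do not.

```nginx
# Deny placeholder addresses; single IPs and CIDR ranges both work
deny 203.0.113.45;
deny 198.51.100.0/24;
```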
Related, but possibly a separate issue, is that bots are getting stuck looping over facets. I'm seeing this on my site with legit bots as well, like bingbot. If there is a way to prevent this we should document that as well.
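For crawlers that do honor robots.txt, one possible mitigation (assuming facet links carry their state in the query string, as the facets module does with f[0]=...) is to disallow faceted URLs while leaving the plain search page crawlable. The * wildcard is honored by major crawlers like Googlebot and bingbot, but not necessarily by everything:

```text
# robots.txt fragment; the f[0] parameter name is an assumption
User-agent: *
Disallow: /*?f%5B0%5D=
Disallow: /*?*&f%5B0%5D=
```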