Configure browsertrix proxies #1847

Merged: 58 commits, Oct 3, 2024
Changes from 51 commits
f0e67c8
backend: add ssh proxies configuration
vnznznz Jul 30, 2024
d96fff4
frontend: add wip ssh proxy selection
vnznznz Jul 30, 2024
2d3e9ef
scripts: add minikube utilities
vnznznz Jul 30, 2024
fca5886
ssh proxy: fix changing proxy in workflow editor
vnznznz Jul 30, 2024
25b813c
formatting
vnznznz Jul 30, 2024
425bed6
Merge branch 'main' into configure-socks-proxies
ikreymer Jul 30, 2024
80542df
cleanup: various renaming / simplifications, remove 'ssh' from names,…
ikreymer Jul 31, 2024
eb4f9f1
fixes: ensure proxyId defaults to "" if none
ikreymer Jul 31, 2024
ba07896
version: bump to 1.12.0-beta.0
ikreymer Jul 31, 2024
f0a3d11
fixes: ssh proxy - allow multiline known_hosts file
vnznznz Jul 31, 2024
e893f89
add proxy support for profiles!
ikreymer Jul 31, 2024
e59e1c8
make proxies more generic, can support ssh://, socks5:// and http://
ikreymer Aug 1, 2024
d575b87
show default proxy in `select-crawler-proxy` + misc visual fixes
vnznznz Aug 7, 2024
dbd51ed
Merge branch 'main' into configure-socks-proxies
ikreymer Aug 8, 2024
3969513
reformat
ikreymer Aug 8, 2024
d96ee8c
Merge branch 'main' into configure-socks-proxies
ikreymer Aug 9, 2024
bd43426
Merge branch 'main' into configure-socks-proxies
ikreymer Aug 15, 2024
ce71535
fix ui post frontend refactor, remove authstate
ikreymer Aug 15, 2024
c7b33fc
more removal of authstate, including from comments
ikreymer Aug 15, 2024
310b647
move proxy config to subchart, allow updating proxies without re-depl…
vnznznz Aug 15, 2024
e48a074
move passwd hack to main chart
vnznznz Aug 15, 2024
7266d1d
add missing docstring
vnznznz Aug 15, 2024
cfaa3b8
fix lint error
vnznznz Aug 29, 2024
8663875
proxies: add shared flag, org proxy settings
vnznznz Sep 2, 2024
c702ba7
proxies: fix backend bugs
vnznznz Sep 3, 2024
b63322c
frontend: add `proxy_not_found` error message
vnznznz Sep 3, 2024
b3dbfe1
frontend: add wip admin proxy gui
vnznznz Sep 3, 2024
2e5fa5f
add missing docstring
vnznznz Sep 3, 2024
379f0b7
Merge branch 'main' into configure-socks-proxies
ikreymer Sep 12, 2024
0cb5d0e
proxy UI fixes after merge
ikreymer Sep 12, 2024
f591b4c
use proxyId from existing profile when running profile browser for ex…
ikreymer Sep 17, 2024
e08500a
proxies subchart: default to 'crawlers' namespace
ikreymer Sep 18, 2024
8d54e28
Merge branch 'main' into configure-socks-proxies
ikreymer Sep 18, 2024
9549123
backend: unpin motor dependency, fixes ImportError on backend start
vnznznz Sep 20, 2024
ca37b2b
backend: improve `get_all_crawler_proxies` endpoint path
vnznznz Sep 20, 2024
d958fa6
backend: disable org shared proxies by default
vnznznz Sep 20, 2024
eaff240
frontend: few more labels to org proxy admin modal
vnznznz Sep 20, 2024
827023a
frontend: misc text changes
vnznznz Sep 20, 2024
d0839b4
ensure proxyId saved on Profile
ikreymer Sep 20, 2024
4214572
Merge branch 'main' into configure-socks-proxies
ikreymer Sep 20, 2024
ae3e909
ensure proxyId is passed through to profile creation
ikreymer Sep 20, 2024
7b052b5
add proxy selector to org defaults
ikreymer Sep 21, 2024
81b07a6
form name fix
ikreymer Sep 21, 2024
8925a2b
fix proxy clearing
ikreymer Sep 21, 2024
4477a1f
misc tweaks: fix workflow default, EmailStr cast, add comments for bt…
ikreymer Sep 21, 2024
f94f31b
Merge branch 'main' into configure-socks-proxies
ikreymer Sep 25, 2024
4dc72a9
reextract strings
ikreymer Sep 25, 2024
4fd3631
WIP: Start adding documentation
tw4l Sep 26, 2024
27c753e
adjust placement of socks proxy to be below profiles
ikreymer Oct 1, 2024
74fa4a8
ensure proxyId included in cronjob, skip cronjob if proxy is missing
ikreymer Oct 2, 2024
68571db
lint fixes
ikreymer Oct 2, 2024
1b3c5dc
Update documentation based on review comments
tw4l Oct 2, 2024
f192bbd
Wordsmith docs
tw4l Oct 2, 2024
a07b4c6
More wordsmithing
tw4l Oct 2, 2024
7214895
update proxy docs
ikreymer Oct 3, 2024
93feaf2
update docs, add proxies subchart to release
ikreymer Oct 3, 2024
c90bc0a
more docs tweaks
ikreymer Oct 3, 2024
3e5302c
rename proxies-passwd-hack -> force-user-and-group-name for clarity
ikreymer Oct 3, 2024
114 changes: 113 additions & 1 deletion backend/btrixcloud/crawlconfigs.py
@@ -10,6 +10,7 @@
import json
import re
import os
import traceback
from datetime import datetime
from uuid import UUID, uuid4
import urllib.parse
@@ -39,6 +40,8 @@
    CrawlConfigSearchValues,
    CrawlConfigUpdateResponse,
    CrawlConfigDeletedResponse,
    CrawlerProxy,
    CrawlerProxies,
)
from .utils import dt_now, slug_from_name

@@ -63,6 +66,8 @@
    "name",
)

DEFAULT_PROXY_ID: str | None = os.environ.get("DEFAULT_PROXY_ID")


# ============================================================================
class CrawlConfigOps:
@@ -125,6 +130,14 @@ def __init__(
        if "default" not in self.crawler_images_map:
            raise TypeError("The channel list must include a 'default' channel")

        self._crawler_proxies_last_updated = None
        self._crawler_proxies_map = None

        if DEFAULT_PROXY_ID and DEFAULT_PROXY_ID not in self.get_crawler_proxies_map():
            raise ValueError(
                f"Configured proxies must include DEFAULT_PROXY_ID: {DEFAULT_PROXY_ID}"
            )

    def set_crawl_ops(self, ops):
        """set crawl ops reference"""
        self.crawl_ops = ops
@@ -168,7 +181,9 @@ async def get_profile_filename(
        if not profileid:
            return ""

        profile_filename = await self.profiles.get_profile_storage_path(profileid, org)
        profile_filename, _ = await self.profiles.get_profile_storage_path_and_proxy(
            profileid, org
        )
        if not profile_filename:
            raise HTTPException(status_code=400, detail="invalid_profile_id")

@@ -195,6 +210,11 @@ async def add_crawl_config(
        if profileid:
            await self.profiles.get_profile(profileid, org)

        # ensure proxyId is valid and available for org
        if config_in.proxyId:
            if not self.can_org_use_proxy(org, config_in.proxyId):
                raise HTTPException(status_code=404, detail="proxy_not_found")

        now = dt_now()
        crawlconfig = CrawlConfig(
            id=uuid4(),
@@ -218,6 +238,7 @@ async def add_crawl_config(
            profileid=profileid,
            crawlerChannel=config_in.crawlerChannel,
            crawlFilenameTemplate=config_in.crawlFilenameTemplate,
            proxyId=config_in.proxyId,
        )

        if config_in.runNow:
@@ -331,6 +352,8 @@ async def update_crawl_config(
            and ((not update.profileid) != (not orig_crawl_config.profileid))
        )

        changed = changed or (orig_crawl_config.proxyId != update.proxyId)

        metadata_changed = self.check_attr_changed(orig_crawl_config, update, "name")
        metadata_changed = metadata_changed or self.check_attr_changed(
            orig_crawl_config, update, "description"
@@ -829,6 +852,9 @@ async def run_now_internal(
        if await self.get_running_crawl(crawlconfig.id):
            raise HTTPException(status_code=400, detail="crawl_already_running")

        if crawlconfig.proxyId and not self.can_org_use_proxy(org, crawlconfig.proxyId):
            raise HTTPException(status_code=404, detail="proxy_not_found")

        profile_filename = await self.get_profile_filename(crawlconfig.profileid, org)
        storage_filename = (
            crawlconfig.crawlFilenameTemplate or self.default_filename_template
@@ -848,6 +874,7 @@ async def run_now_internal(

        except Exception as exc:
            # pylint: disable=raise-missing-from
            print(traceback.format_exc())
            raise HTTPException(status_code=500, detail=f"Error starting crawl: {exc}")

    async def set_config_current_crawl_info(
@@ -897,6 +924,68 @@ def get_channel_crawler_image(
        """Get crawler image name by id"""
        return self.crawler_images_map.get(crawler_channel or "")

    def get_crawler_proxies_map(self) -> dict[str, CrawlerProxy]:
        """Load CrawlerProxy mapping from config"""
        proxies_last_update_path = os.environ["CRAWLER_PROXIES_LAST_UPDATE"]

        if not os.path.isfile(proxies_last_update_path):
            return {}

        # return cached data, when last_update timestamp hasn't changed
        if self._crawler_proxies_last_updated and self._crawler_proxies_map:
            with open(proxies_last_update_path, encoding="utf-8") as fh:
                proxies_last_update = int(fh.read().strip())
                if proxies_last_update == self._crawler_proxies_last_updated:
                    return self._crawler_proxies_map
                self._crawler_proxies_last_updated = proxies_last_update

        crawler_proxies_map: dict[str, CrawlerProxy] = {}
        with open(os.environ["CRAWLER_PROXIES_JSON"], encoding="utf-8") as fh:
            proxy_list = json.loads(fh.read())
            for proxy_data in proxy_list:
                proxy = CrawlerProxy(
                    id=proxy_data["id"],
                    label=proxy_data["label"],
                    description=proxy_data.get("description", ""),
                    country_code=proxy_data.get("country_code", ""),
                    url=proxy_data["url"],
                    has_host_public_key=bool(proxy_data.get("ssh_host_public_key")),
                    has_private_key=bool(proxy_data.get("ssh_private_key")),
                    shared=proxy_data.get("shared", False)
                    or proxy_data["id"] == DEFAULT_PROXY_ID,
                )

                crawler_proxies_map[proxy.id] = proxy

        self._crawler_proxies_map = crawler_proxies_map
        return self._crawler_proxies_map

    def get_crawler_proxies(self):
        """Get CrawlerProxy configuration"""
        return CrawlerProxies(
            default_proxy_id=DEFAULT_PROXY_ID,
            servers=list(self.get_crawler_proxies_map().values()),
        )

    def get_crawler_proxy(self, proxy_id: str) -> Optional[CrawlerProxy]:
        """Get crawlerProxy by id"""
        return self.get_crawler_proxies_map().get(proxy_id)

    def can_org_use_proxy(self, org: Organization, proxy: CrawlerProxy | str) -> bool:
        """Checks if org is able to use proxy"""

        if isinstance(proxy, str):
            _proxy = self.get_crawler_proxy(proxy)
        else:
            _proxy = proxy

        if _proxy is None:
            return False

        return (
            _proxy.shared and org.allowSharedProxies
        ) or _proxy.id in org.allowedProxies

    def get_warc_prefix(self, org: Organization, crawlconfig: CrawlConfig) -> str:
        """Generate WARC prefix slug from org slug, name or url
        if no name is provided, hostname is used from url, otherwise
@@ -983,6 +1072,7 @@ async def stats_recompute_all(crawl_configs, crawls, cid: UUID):
# ============================================================================
# pylint: disable=redefined-builtin,invalid-name,too-many-locals,too-many-arguments
def init_crawl_config_api(
    app,
    dbclient,
    mdb,
    user_dep,
@@ -1060,6 +1150,28 @@ async def get_crawler_channels(
    ):
        return ops.crawler_channels

    @router.get("/crawler-proxies", response_model=CrawlerProxies)
    async def get_crawler_proxies(
        org: Organization = Depends(org_crawl_dep),
    ):
        return CrawlerProxies(
            default_proxy_id=DEFAULT_PROXY_ID,
            servers=[
                proxy
                for proxy in ops.get_crawler_proxies_map().values()
                if ops.can_org_use_proxy(org, proxy)
            ],
        )

    @app.get("/orgs/all/crawlconfigs/crawler-proxies", response_model=CrawlerProxies)
    async def get_all_crawler_proxies(
        user: User = Depends(user_dep),
    ):
        if not user.is_superuser:
            raise HTTPException(status_code=403, detail="Not Allowed")

        return ops.get_crawler_proxies()

    @router.get("/{cid}/seeds", response_model=PaginatedSeedResponse)
    async def get_crawl_config_seeds(
        cid: UUID,
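
As a rough illustration of the proxy handling in backend/btrixcloud/crawlconfigs.py, the standalone sketch below mirrors two ideas from the hunks above: re-reading the proxy list only when a last-update timestamp changes, and the rule that decides whether an org may use a proxy. `Proxy`, `Org`, `ProxyCache`, and the `loads` counter are hypothetical stand-ins, not the PR's classes. The sketch also records the timestamp on the first load, which is a choice made for this example and may differ from the merged code.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class Proxy:
    """Hypothetical stand-in for CrawlerProxy (only the fields used here)."""

    id: str
    url: str
    shared: bool = False


@dataclass
class Org:
    """Hypothetical stand-in for Organization."""

    allowSharedProxies: bool = False
    allowedProxies: list[str] = field(default_factory=list)


class ProxyCache:
    """Re-read the proxies JSON only when the timestamp file changes."""

    def __init__(
        self, json_path: Path, stamp_path: Path, default_id: Optional[str] = None
    ):
        self.json_path = json_path
        self.stamp_path = stamp_path
        self.default_id = default_id
        self._stamp: Optional[int] = None
        self._proxies: dict[str, Proxy] = {}
        self.loads = 0  # counts JSON reads, for illustration only

    def get_map(self) -> dict[str, Proxy]:
        # No timestamp file means no proxies are configured
        if not self.stamp_path.is_file():
            return {}

        stamp = int(self.stamp_path.read_text(encoding="utf-8").strip())
        if stamp == self._stamp:
            return self._proxies  # timestamp unchanged: serve the cached map

        data = json.loads(self.json_path.read_text(encoding="utf-8"))
        self._proxies = {
            item["id"]: Proxy(
                id=item["id"],
                url=item["url"],
                # the default proxy is always treated as shared
                shared=item.get("shared", False) or item["id"] == self.default_id,
            )
            for item in data
        }
        self._stamp = stamp
        self.loads += 1
        return self._proxies


def can_org_use_proxy(org: Org, proxy: Optional[Proxy]) -> bool:
    """Shared proxies need the org opt-in; others need an allow-list entry."""
    if proxy is None:
        return False
    return (proxy.shared and org.allowSharedProxies) or proxy.id in org.allowedProxies
```

Under these assumptions, a proxy marked `shared` is visible only to orgs with `allowSharedProxies` set, while any proxy listed in `allowedProxies` is usable regardless of the shared flag.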
13 changes: 7 additions & 6 deletions backend/btrixcloud/crawlmanager.py
@@ -1,7 +1,6 @@
""" shared crawl manager implementation """

import os
import asyncio
import secrets

from typing import Optional, Dict
@@ -16,13 +15,12 @@


# ============================================================================
class CrawlManager(K8sAPI):
    """abstract crawl manager"""
DEFAULT_PROXY_ID: str = os.environ.get("DEFAULT_PROXY_ID", "")

    def __init__(self):
        super().__init__()

        self.loop = asyncio.get_running_loop()
# ============================================================================
class CrawlManager(K8sAPI):
    """abstract crawl manager"""

    # pylint: disable=too-many-arguments
    async def run_profile_browser(
@@ -34,6 +32,7 @@ async def run_profile_browser(
        crawler_image: str,
        baseprofile: str = "",
        profile_filename: str = "",
        proxy_id: str = "",
    ) -> str:
        """run browser for profile creation"""

@@ -55,6 +54,7 @@ async def run_profile_browser(
            "vnc_password": secrets.token_hex(16),
            "expire_time": date_to_str(dt_now() + timedelta(seconds=30)),
            "crawler_image": crawler_image,
            "proxy_id": proxy_id or DEFAULT_PROXY_ID,
        }

        data = self.templates.env.get_template("profile_job.yaml").render(params)
@@ -138,6 +138,7 @@ async def create_crawl_job(
            warc_prefix=warc_prefix,
            storage_filename=storage_filename,
            profile_filename=profile_filename,
            proxy_id=crawlconfig.proxyId or DEFAULT_PROXY_ID,
        )

    async def create_qa_crawl_job(
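
The `proxy_id or DEFAULT_PROXY_ID` pattern used in both crawlmanager.py hunks above can be sketched in isolation as follows. `resolve_proxy_id` is a hypothetical helper introduced only for illustration; the PR inlines the expression rather than defining such a function.

```python
import os

# Read once at import time, as in the hunk above; "" means no default is configured.
DEFAULT_PROXY_ID: str = os.environ.get("DEFAULT_PROXY_ID", "")


def resolve_proxy_id(requested: str, default: str = DEFAULT_PROXY_ID) -> str:
    """Hypothetical helper: an empty proxy id falls back to the default."""
    # `or` treats "" (and None) as falsy, so only a non-empty id wins
    return requested or default
```

Because the fallback relies on truthiness, a workflow saved with `proxyId` set to `""` would pick up the deployment default, and the result stays `""` when no default is configured either.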
1 change: 1 addition & 0 deletions backend/btrixcloud/crawls.py
@@ -379,6 +379,7 @@ async def add_new_crawl(
            tags=crawlconfig.tags,
            name=crawlconfig.name,
            crawlerChannel=crawlconfig.crawlerChannel,
            proxyId=crawlconfig.proxyId,
            image=image,
        )

6 changes: 5 additions & 1 deletion backend/btrixcloud/k8sapi.py
@@ -2,8 +2,8 @@

import os
import traceback

from typing import Optional

import yaml

from kubernetes_asyncio import client, config
@@ -93,6 +93,7 @@ def new_crawl_job_yaml(
        storage_filename: str = "",
        profile_filename: str = "",
        qa_source: str = "",
        proxy_id: str = "",
    ):
        """load job template from yaml"""
        if not crawl_id:
@@ -115,6 +116,7 @@ def new_crawl_job_yaml(
            "storage_filename": storage_filename,
            "profile_filename": profile_filename,
            "qa_source": qa_source,
            "proxy_id": proxy_id,
        }

        data = self.templates.env.get_template("crawl_job.yaml").render(params)
@@ -136,6 +138,7 @@ async def new_crawl_job(
        storage_filename: str = "",
        profile_filename: str = "",
        qa_source: str = "",
        proxy_id: str = "",
    ) -> str:
        """load and init crawl job via k8s api"""
        crawl_id, data = self.new_crawl_job_yaml(
@@ -153,6 +156,7 @@ async def new_crawl_job(
            storage_filename=storage_filename,
            profile_filename=profile_filename,
            qa_source=qa_source,
            proxy_id=proxy_id,
        )

        # create job directly
1 change: 1 addition & 0 deletions backend/btrixcloud/main.py
@@ -205,6 +205,7 @@ def main() -> None:
    )

    crawl_config_ops = init_crawl_config_api(
        app,
        dbclient,
        mdb,
        current_active_user,