This repository contains a Docker Compose setup for a complete DomainRadar testing environment. It includes a Kafka cluster using encrypted communication, the prefilter, the pipeline components (collectors, data merger, feature extractor, classifier), a PostgreSQL database, a MongoDB database, Kafka Connect configured to push data to them, and a web UI for Kafka.
The compose.yml Compose file provides several services assigned to several profiles.
- `kafka1`, the first Kafka broker:
  - Exposed on 31013 (through the `kafka-outside-world` network).
  - Internally, clients use `kafka1:9093`.
  - SSL authentication, see below.
- `kafka-ui`, Kafbat UI, a web UI for Kafka: 31000
  - No authentication is used!
- `kafka-connect`, the Kafka Connect REST API: 31002
  - No authentication (will be changed).
  - Included in two flavors: `kafka-connect-full` and `kafka-connect-without-postgres`.
- `postgres`, the PostgreSQL database: 31010
  - Password (SCRAM) authentication (probably will be changed, see below).
- `mongo`, the MongoDB Community database: 31011
  - Password (SCRAM) authentication (probably will be changed, see below).
- `initializer` invokes the `wait_for_startup` script that exits only when successfully connected to Kafka, and the `prepare_topics` script that creates or updates the topics.
- `config-manager` is the configuration manager. It requires the `config_manager_daemon` script to be executed on the host machine first.
- `standalone-input` can be executed to load domain names into the system.
- `mongo-domains-refresher` and `mongo-raw-data-refresher` execute the `run_periodically` script to run MongoDB aggregations.
- Obtain your GeoLite2 City & ASN databases and place them in geoip_data.
- Obtain a NERD token and place it in your client_properties/nerd.properties.
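For example, the GeoLite2 databases can be dropped in like this (the standard MaxMind file names are an assumption here; verify which names the collectors expect):

```sh
# Standard MaxMind file names assumed; verify which names the collectors expect.
cp ~/Downloads/GeoLite2-City.mmdb ~/Downloads/GeoLite2-ASN.mmdb geoip_data/
```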
You need to generate a CA, broker certificates and client certificates. Ensure that you have OpenSSL and Java installed (JRE is fine). Then you can run:
```sh
./generate_secrets.sh
```
You can also use the included Docker image:
```sh
./generate_secrets_docker.sh
```
You can change the certificates' validity and passwords by setting the variables at the top of the `generate_secrets.sh` script. If you do, you must also change the passwords in the `envs/kafka*.env` files, in the files in `client_properties/` for all the clients, and in `connect_properties/10_main.properties`.
For the love of god, if you use the generated keys and certificates outside of development, change the passwords and store the CA somewhere safe.
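A hypothetical sketch of the kind of edit meant here; apart from `NUM_BROKERS` and `BROKER_PASSWORDS` (both referenced in the section on adding a Kafka node), the variable names and formats below are placeholders, so check the top of `generate_secrets.sh` for the real ones:

```sh
# Top of generate_secrets.sh -- placeholder names except NUM_BROKERS / BROKER_PASSWORDS
CERT_VALIDITY_DAYS=365          # hypothetical: certificate validity
CA_PASSWORD="change-me"         # hypothetical: CA key password
NUM_BROKERS=1
declare -A BROKER_PASSWORDS=( [kafka1]="change-me-too" )   # format is an assumption

# Any changed password must be mirrored in envs/kafka*.env, client_properties/
# and connect_properties/10_main.properties.
```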
The `db` directory contains configuration for the databases, including user passwords. Be sure to change them when actually deploying this somewhere. The passwords must be set accordingly in the services that use them, i.e., Kafka Connect (`connect_properties`), the prefilter, the UI, and the ingestion controller (not yet included).
You can use a provided script to clone and build all the images at once.
Alternatively, you can build the individual images by hand:
- Clone the `domainradar-colext` repo. Follow its README to build the images!
- Clone the `domainradar-input` repo and use `dockerfiles/prefilter.Dockerfile` to build it. Tag it with `domrad/loader`.
- Clone the `domainradar-ui` repo and use the Dockerfile included in it to build the webui image. Tag it with `domrad/webui`.
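A hedged sketch of the manual build; the repository URLs are assumptions and the build contexts may differ, so follow each repo's README for the authoritative steps:

```sh
# domainradar-input -> domrad/loader; the clone URL is an assumption and the location
# of dockerfiles/prefilter.Dockerfile may differ -- adjust paths as needed.
git clone https://github.com/nesfit/domainradar-input.git
cd domainradar-input
docker build -f dockerfiles/prefilter.Dockerfile -t domrad/loader .
cd ..

# domainradar-ui -> domrad/webui (clone URL is an assumption)
git clone https://github.com/nesfit/domainradar-ui.git
docker build -t domrad/webui domainradar-ui
```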
You can adjust the scaling of the components by changing the variables in `.env`. Note that to achieve parallelism, the scaling factor must be less than or equal to the partition count of the component's input topic. Modify the partitioning accordingly in `prepare_topics` and set the `UPDATE_EXISTING_TOPICS` environment variable of the initializer service to `1` to update an existing deployment. Note that you can only increase the number of partitions (but there can be more partitions than instances).
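A rough sketch of the workflow; the scaling variable name is a placeholder (check `.env` for the real names), and how `UPDATE_EXISTING_TOPICS` reaches the initializer depends on the Compose file:

```sh
# 1. In .env, raise the scaling factor of a component (placeholder variable name):
#      EXTRACTOR_SCALE=4
# 2. In prepare_topics, give the component's input topic at least 4 partitions.
# 3. Re-run the initializer with UPDATE_EXISTING_TOPICS=1 in its environment:
docker compose --profile full up initializer
```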
Start the system using:
```sh
docker compose --profile full up
```
Remember to always specify the profile in all compose commands. Otherwise, weird things are going to happen.
You can also add the `-d` flag to run the services in the background. The `follow-component-logs.sh` script can then be used to “reattach” to the output of all the pipeline components, without the Kafka cluster.
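For example, to start everything detached and then follow just the pipeline components' logs:

```sh
docker compose --profile full up -d
./follow-component-logs.sh
```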
All the included configuration files are set up for the default single-broker Kafka configuration. To use the two-broker configuration, or even extend it to more nodes, follow the instructions in the Adding a Kafka node section.
The configuration manager is not included in the `full` profile. To use it, first refer to its README to see how the script should be set up on the host. Then add the `configmanager` profile to the Compose commands:

```sh
docker compose --profile full --profile configmanager up -d config-manager
```
The “standalone” configurations do not include PostgreSQL or the MongoDB data aggregations. The standalone input controller can be used to send data for processing. First, start the system:

```sh
docker compose --profile col up -d
```
The standalone input controller can then be executed as follows:

```sh
docker compose --profile col run --rm -v /file/to/load.txt:/app/file.txt standalone-input load -d -y /app/file.txt
```
The command mounts the file from `/file/to/load.txt` into the container, where the controller is executed to load this file in direct mode (i.e., it expects one domain name per line) and with no interaction. The container is deleted after it finishes.
The `col` profile only starts the collectors. To enable feature extraction, use the `colext` profile instead.
Should you need to connect to Kafka from the outside world, the broker is published to the host machine on port 31013. You must modify your `/etc/hosts` file to point `kafka1` to 127.0.0.1 and connect through this name.
Mind that in the default configuration, client authentication is required, so you have to use one of the generated client certificates. You can also modify the broker configuration to allow plaintext communication (see below).
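A sketch of the host-side setup; the client-side SSL settings depend on which generated keystore/truststore you use:

```sh
# Make the advertised broker name resolve to the forwarded port on this machine:
echo "127.0.0.1 kafka1" | sudo tee -a /etc/hosts

# Then point your client at kafka1:31013 with security.protocol=SSL and one of the
# generated client keystores/truststores.
```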
The override Compose file changes the setup so that a Kafka cluster of two nodes is used. They both run in the combined mode where each instance works both as a controller and as a broker. Node-client communication enforces the use of SSL with client authentication; inter-controller and inter-broker communication are done in plaintext over a separate network (`kafka-inter-node-network`) to reduce overhead.
Before using this setup, you should change the `connection.brokers` setting in all the `client_properties/*.toml` client configuration files!
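A quick way to list every file that needs this edit (plain `grep`, nothing repository-specific):

```sh
grep -rn "connection.brokers" client_properties/*.toml
```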
For some reason, the two-node setup tends to break randomly. I suggest first starting the Kafka nodes, then the initializer, and if it succeeds, starting the rest of the services. You can use the `compose_cluster.sh` script, which is just a shorthand for `docker compose -f compose.yml -f compose.cluster-override.yml [args]`.
```sh
# If some services were started before, remove them
./compose_cluster.sh down
# Start the databases
./compose_cluster.sh up -d postgres mongo
# Start the cluster
./compose_cluster.sh up -d kafka1 kafka2
# Initialize the cluster
# If this fails, try restarting the cluster
./compose_cluster.sh up initializer
# Start the pipeline services
./service_runner.sh cluster up
```
If you want to test with more Kafka nodes, you have to (a rough sketch follows the list):
- In `generate_secrets.sh`, change `NUM_BROKERS` and add an entry to `BROKER_PASSWORDS`; generate the new certificate(s).
- Add a new `envs/kafkaN.env` file:
  - change the IP and hostname in `KAFKA_LISTENERS` and `KAFKA_ADVERTISED_LISTENERS`,
  - change the paths in `KAFKA_SSL_KEYSTORE_LOCATION` and the password in `KAFKA_SSL_KEYSTORE_PASSWORD`.
- Add the new internal broker IPs to `KAFKA_CONTROLLER_QUORUM_VOTERS` in all the `kafkaN.env` files.
- Add a new service to the Compose file:
  - copy an existing definition,
  - change the IP address in the service.
- Update the `BOOTSTRAP` environment variable for the initializer service (in the Compose file).
- Update the `KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS` environment variable for the kafka-ui service (in the Compose file).
- Update the `-s` argument in all the component services (in the Compose file).
- Preferably (though the clients should manage with just one bootstrap server):
  - Update the .toml configurations for the Python clients (in the `client_properties/` directory).
  - Update the `bootstrap.servers` property in `connect_properties/10_main.properties`.
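The sketch below shows only the general shape of such an addition; every IP, port and path is a placeholder, and the real listener set (including `CLIENTSOUT`) and values should be copied from the existing `envs/kafka1.env` and `envs/kafka2.env` files:

```sh
# envs/kafka3.env -- hypothetical third broker; all IPs, ports and paths are placeholders.
KAFKA_LISTENERS=CONTROLLER://10.0.0.3:9091,INTERNAL://10.0.0.3:9092,CLIENTS://10.0.0.3:9093
KAFKA_ADVERTISED_LISTENERS=INTERNAL://10.0.0.3:9092,CLIENTS://kafka3:9093
KAFKA_SSL_KEYSTORE_LOCATION=/etc/kafka/secrets/kafka3.keystore.jks
KAFKA_SSL_KEYSTORE_PASSWORD=<the new entry from BROKER_PASSWORDS>

# In *every* envs/kafkaN.env, extend the voter list (Kafka's nodeId@host:port format):
KAFKA_CONTROLLER_QUORUM_VOTERS=1@10.0.0.1:9091,2@10.0.0.2:9091,3@10.0.0.3:9091
```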
If you want to use SSL for inter-broker communication as well (for some reason), it should suffice to change `KAFKA_LISTENER_SECURITY_PROTOCOL_MAP` in all `envs/kafkaN.env` files. Set the controller and internal listeners to use SSL: `CONTROLLER:SSL,INTERNAL:SSL`. Not tested in the current config.
If you want to enable plaintext node-client communication, you can switch the listener to plaintext. Modify the `KAFKA_LISTENER_SECURITY_PROTOCOL_MAP` in the envs/kafka1.env file to contain `CLIENTS:PLAINTEXT` instead of `CLIENTS:SSL`. This only applies to the internal clients, i.e. the ones connected to the isolated `kafka-clients` network. For this to have an effect on the “outside world” clients that connect through the forwarded port 31013, modify `CLIENTSOUT` instead.
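Illustrative only; the entries other than `CLIENTS`/`CLIENTSOUT` are placeholders, so keep whatever the existing file already defines for them:

```sh
# envs/kafka1.env -- switch only the client-facing listener(s) you want to open up:
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:SSL,INTERNAL:SSL,CLIENTS:PLAINTEXT,CLIENTSOUT:SSL
```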
To disable client authentication, change `KAFKA_SSL_CLIENT_AUTH` to `none` or `requested`.
If you need to debug the Java-based apps, you can enable the Java Debug Wire Protocol. Add this to the target service:
```yaml
environment:
  - JAVA_TOOL_OPTIONS=-agentlib:jdwp=transport=dt_socket,address=0.0.0.0:8111,server=y,suspend=n
ports:
  - "8111:8111"
```
Adjust the host port if you need to. In IntelliJ IDEA, you can then add a Remote JVM Debug run configuration.
- client_properties contains the configuration files for the pipeline components.
- connect_plugins is used to load plugins into the Kafka Connect instance. Note that the MongoDB connector is added to the container at build time.
- connect_properties contains the definitions of the Kafka Connect connectors.
- db contains the initialization scripts and configuration files for the database management systems. The passwords for the users, set only during the first execution of the services, are defined in the .secrets files.
- dockerfiles contains supplementary Dockerfiles:
- initializer.Dockerfile builds a simple container with the two scripts from kafka_scripts/.
- generate_secrets.Dockerfile builds a container with the JRE to run the secrets generation procedure. It is used through the generate_secrets_docker script.
- kafka_connect.Dockerfile builds a container based on domrad/kafka-connect that contains the MongoDB connector.
- run_aggregation.Dockerfile builds a container with the Mongo shell to execute the aggregations.
- envs contains the environment variables that control the settings of Kafka and Kafbat UI.
- extractor_data contains the data files for the feature extractor, created in the DomainRadar research.
- geoip_data contains the MaxMind GeoLite2 databases.
- kafka_scripts contains the scripts for the initializer.
- misc contains a list of 400,000 domain names for testing, an SQL script that inserts 200 domain names into PostgreSQL for testing, and the configuration for the secrets generation procedure.
- mongo_aggregations contains, well, various example MongoDB aggregations and a common script that executes them to create a view.