DomainRadar

This repository contains a Docker Compose setup for a complete DomainRadar testing environment. It includes a Kafka cluster using encrypted communication, the prefilter, the pipeline components (collectors, data merger, feature extractor, classifier), a PostgreSQL database, a MongoDB database, Kafka Connect configured to push data to them, and a web UI for Kafka.

The compose.yml Compose file defines several services assigned to various profiles.

Services with exposed ports

  • kafka1, the first Kafka broker:
    • Exposed on 31013 (through the kafka-outside-world network).
    • Internally, clients use kafka1:9093.
    • SSL authentication, see below.
  • kafka-ui, Kafbat UI, a web UI for Kafka: 31000
    • No authentication is used!
  • kafka-connect, the Kafka Connect REST API: 31002
    • No authentication (will be changed).
    • Included in two flavors: kafka-connect-full and kafka-connect-without-postgres.
  • postgres, the PostgreSQL database: 31010
    • Password (SCRAM) authentication (will probably be changed; see below).
  • mongo, the MongoDB Community database: 31011
    • Password (SCRAM) authentication (will probably be changed; see below).
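
As a quick check from the host, the exposed services can be reached as sketched below; the database credentials are placeholders (see the db/ notes further down).

curl http://localhost:31002/connectors        # list Kafka Connect connectors (no auth)
# Kafbat UI: open http://localhost:31000 in a browser
psql -h localhost -p 31010 -U <postgres user>
mongosh "mongodb://<mongo user>@localhost:31011"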

Other services

  • initializer invokes the wait_for_startup script, which exits only after it successfully connects to Kafka, and the prepare_topics script, which creates or updates the topics.
  • config-manager is the configuration manager. It requires the config_manager_daemon script to be executed on the host machine first.
  • standalone-input can be executed to load domain names into the system.
  • mongo-domains-refresher and mongo-raw-data-refresher execute the run_periodically script to run MongoDB aggregations.

Preparation

Data

Security

You need to generate a CA, broker certificates and client certificates. Ensure that you have OpenSSL and Java installed (JRE is fine). Then you can run:

./generate_secrets.sh

You can also use the included Docker image:

./generate_secrets_docker.sh

You can change the certificates' validity and passwords by setting the variables at the top of the generate_secrets.sh script. If you do, you also have to change the passwords in the envs/kafka*.env files, in the files in client_properties/ for all the clients, and in connect_properties/10_main.properties.
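
For example, to find every file that still references an old password before replacing it (the password below is only a placeholder):

# Substitute the placeholder with the actual password set in generate_secrets.sh
grep -rl 'old-password' envs/ client_properties/ connect_properties/10_main.properties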

For the love of god, if you use the generated keys and certificates outside of development, change the passwords and store the CA somewhere safe.

The db directory contains configuration for the databases, including user passwords. Be sure to change them when actually deploying this somewhere. The passwords must be set accordingly in the services that use them, i.e., Kafka Connect (connect_properties), the prefilter, the UI, and the ingestion controller (not yet included).

Component images

You can use a provided script to clone and build all the images at once.

Alternatively, you can build the individual images by hand (an example sequence follows the list):

  1. Clone the domainradar-colext repo. Follow its README to build the images!
  2. Clone the domainradar-input repo and use dockerfiles/prefilter.Dockerfile to build it. Tag it with domrad/loader.
  3. Clone the domainradar-ui repo and use the Dockerfile included in it to build the webui image. Tag it with domrad/webui.
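
For orientation only, the by-hand sequence might look like the following. The repository URLs and build contexts are assumptions (based on the nesfit organization), so follow the instructions in each repository if they differ.

# Repository URLs and build contexts are assumptions — adjust as needed
git clone https://github.com/nesfit/domainradar-colext.git
# build its images as described in its README, then:
git clone https://github.com/nesfit/domainradar-input.git
docker build -f domainradar-input/dockerfiles/prefilter.Dockerfile -t domrad/loader domainradar-input
git clone https://github.com/nesfit/domainradar-ui.git
docker build -t domrad/webui domainradar-ui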

Scaling

You can adjust the scaling of the components by changing the variables in .env. Note that to achieve parallelism, the scaling factor must be less than or equal to the partition count of the component's input topic. Modify the partitioning accordingly in prepare_topics and set the UPDATE_EXISTING_TOPICS environment variable of the initializer service to 1 to update an existing deployment. Note that the partition count of an existing topic can only be increased (though there may be more partitions than instances).
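
As an illustration (the scaling variable name below is a placeholder for whatever .env actually defines):

# .env — placeholder variable name; use the names actually present in the file
EXTRACTOR_SCALE=3

# After raising the partition count of the component's input topic in prepare_topics,
# re-run the initializer with UPDATE_EXISTING_TOPICS set to 1 in its service definition:
docker compose --profile full up initializer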

Usage (full system)

Start the system using:

docker compose --profile full up

Remember to always specify the profile in all compose commands. Otherwise, weird things are going to happen.

You can also add the -d flag to run the services in the background. The follow-component-logs.sh script can then be used to “reattach” to the output of all the pipeline components, without the Kafka cluster.
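
For example:

docker compose --profile full up -d
./follow-component-logs.sh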

All the included configuration files are set up for the default single-broker Kafka configuration. To use the two-broker configuration, or to extend it to more nodes, follow the instructions in the Adding a Kafka node section.

Using the configuration manager

The configuration manager is not included in the full profile. To use it, first refer to its README to see how the script should be set up on the host. Then add the configmanager profile to the Compose commands:

docker compose --profile full --profile configmanager up -d config-manager

Usage (standalone)

The “standalone” configurations do not include PostgreSQL and the MongoDB data aggregations. The standalone input controller can be used to send data for processing. First, start the system:

docker compose --profile col up -d

The standalone input controller can be then executed as follows:

docker compose --profile col run --rm -v /file/to/load.txt:/app/file.txt standalone-input load -d -y /app/file.txt

The command mounts the file from /file/to/load.txt into the container, where the controller is executed to load this file in direct mode (i.e., it expects one domain name per line) and with no interaction. The container is deleted after it finishes.

The col profile only starts the collectors. To enable feature extraction, use the colext profile instead.

Kafka

Should you need to connect to Kafka from the outside world, the broker is published to the host machine on port 31013. You must modify your /etc/hosts file to point kafka1 to 127.0.0.1 and connect through this name.

Mind that in the default configuration, client authentication is required, so you have to use one of the generated client certificates. You can also modify the broker configuration to allow plaintext communication (see below).
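
For instance, you could add the hosts entry and verify connectivity with kcat (used here only as an example client; the certificate paths are placeholders for the files produced by generate_secrets.sh):

echo "127.0.0.1 kafka1" | sudo tee -a /etc/hosts
# List cluster metadata over SSL using a client certificate
kcat -L -b kafka1:31013 \
  -X security.protocol=ssl \
  -X ssl.ca.location=<path to the generated CA certificate> \
  -X ssl.certificate.location=<path to a generated client certificate> \
  -X ssl.key.location=<path to the client key> \
  -X ssl.key.password=<client key password>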

Using Kafka with two nodes

The override Compose file changes the setup so that a Kafka cluster of two nodes is used. Both nodes run in combined mode, where each instance acts as both a controller and a broker. Node-client communication enforces the use of SSL with client authentication; inter-controller and inter-broker communication is done in plaintext over a separate network (kafka-inter-node-network) to reduce overhead.

Before using this setup, you should change the connection.brokers setting in all the client_properties/*.toml client configuration files!
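
As a rough sketch, the change might look like this in one of the TOML files; the actual key layout and value format may differ, so check the existing files first.

# client_properties/<component>.toml — hypothetical layout
connection.brokers = "kafka1:9093,kafka2:9093"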

For some reason, the two-node setup tends to break randomly. I suggest first starting the Kafka nodes, then the initializer, and, if it succeeds, the rest of the services. You can use the compose_cluster.sh script, which is just a shorthand for docker compose -f compose.yml -f compose.cluster-override.yml [args].

# If some services were started before, remove them
./compose_cluster.sh down
# Start the databases
./compose_cluster.sh up -d postgres mongo
# Start the cluster
./compose_cluster.sh up -d kafka1 kafka2
# Initialize the cluster
# If this fails, try restarting the cluster
./compose_cluster.sh up initializer
# Start the pipeline services
./service_runner.sh cluster up

Adding a Kafka node

If you want to test with more Kafka nodes, you have to:

  • In generate_secrets.sh, change NUM_BROKERS and add an entry to BROKER_PASSWORDS; generate the new certificate(s).
  • Add a new envs/kafkaN.env file (a rough sketch follows this list):
    • change the IP and hostname in KAFKA_LISTENERS and KAFKA_ADVERTISED_LISTENERS,
    • change the paths in KAFKA_SSL_KEYSTORE_LOCATION and password in KAFKA_SSL_KEYSTORE_PASSWORD.
  • Add the new internal broker IPs to KAFKA_CONTROLLER_QUORUM_VOTERS in all the kafkaN.env files.
  • Add a new service to the Compose file:
    • copy an existing definition,
    • change the IP address in the service.
  • Update the BOOTSTRAP environment variable for the initializer service (in the Compose file).
  • Update the KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS env. variable for the kafka-ui service (in the Compose file).
  • Update the -s argument in all the component services (in the Compose file).
  • Preferably (though the clients should manage with just one bootstrap server):
    • Update the .toml configurations for the Python clients (in the client_properties/ directory).
    • Update the bootstrap.servers property in connect_properties/10_main.properties.
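
By analogy with the existing files, a new envs/kafka3.env might look roughly like the sketch below. Everything in angle brackets is a placeholder; copy the real structure and values from envs/kafka1.env and envs/kafka2.env.

# envs/kafka3.env — rough sketch, placeholders throughout
KAFKA_LISTENERS=<as in kafka1.env, with the new node's IP/hostname>
KAFKA_ADVERTISED_LISTENERS=<as in kafka1.env, with the new node's IP/hostname>
KAFKA_SSL_KEYSTORE_LOCATION=<path to the new broker keystore from generate_secrets.sh>
KAFKA_SSL_KEYSTORE_PASSWORD=<the password added to BROKER_PASSWORDS>
# In *all* envs/kafkaN.env files — standard KRaft "id@host:port" entries:
KAFKA_CONTROLLER_QUORUM_VOTERS=1@<node1-ip>:<port>,2@<node2-ip>:<port>,3@<node3-ip>:<port>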

Using SSL for inter-broker communication

If you want to use SSL for inter-broker communication as well (for some reason), it should suffice to change KAFKA_LISTENER_SECURITY_PROTOCOL_MAP in all envs/kafkaN.env files. Set the controller and internal listeners to use SSL: CONTROLLER:SSL,INTERNAL:SSL. This has not been tested in the current configuration.
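
A sketch of the resulting map, assuming the listener names used elsewhere in this README (verify them against your envs/kafkaN.env files):

# envs/kafkaN.env — illustrative only
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:SSL,INTERNAL:SSL,CLIENTS:SSL,CLIENTSOUT:SSL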

Enabling plaintext node-client communication

If you want to enable plaintext node-client communication, switch the client listener to plaintext: modify KAFKA_LISTENER_SECURITY_PROTOCOL_MAP in the envs/kafka1.env file to contain CLIENTS:PLAINTEXT instead of CLIENTS:SSL. This only applies to the internal clients, i.e., the ones connected to the isolated kafka-clients network. To affect the “outside world” clients that connect through the forwarded port 31013, modify CLIENTSOUT instead.

To disable client authentication, change KAFKA_SSL_CLIENT_AUTH to none or requested.
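
A sketch of the corresponding changes in envs/kafka1.env; the CONTROLLER and INTERNAL entries shown here are assumptions, so keep whatever your file already contains for them:

# envs/kafka1.env — illustrative only; only the CLIENTS (or CLIENTSOUT) entry
# and KAFKA_SSL_CLIENT_AUTH need to change
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,INTERNAL:PLAINTEXT,CLIENTS:PLAINTEXT,CLIENTSOUT:SSL
KAFKA_SSL_CLIENT_AUTH=none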

Debugging the Java components

If you need to debug the Java-based apps, you can enable the Java Debug Wire Protocol. Add this to the target service:

environment:
    - JAVA_TOOL_OPTIONS=-agentlib:jdwp=transport=dt_socket,address=0.0.0.0:8111,server=y,suspend=n
ports:
    - "8111:8111"

Adjust the host port if needed. In IntelliJ IDEA, you can then add a Remote JVM Debug run configuration.

Included files breakdown

  • client_properties contains the configuration files for the pipeline components.
  • connect_plugins is used to load plugins into the Kafka Connect instance. Note that the MongoDB connector is added to the container at build time.
  • connect_properties contains the definitions of the Kafka Connect connectors.
  • db contains the initialization scripts and configuration files for the database management systems. The passwords for the users, set only during the first execution of the services, are defined in the .secrets files.
  • dockerfiles contains supplementary Dockerfiles:
    • initializer.Dockerfile builds a simple container with the two scripts from kafka_scripts/.
    • generate_secrets.Dockerfile builds a container with the JRE to run the secrets generation procedure. It is used through the generate_secrets_docker script.
    • kafka_connect.Dockerfile builds a container based on domrad/kafka-connect that contains the MongoDB connector.
    • run_aggregation.Dockerfile builds a container with the Mongo shell to execute the aggregations.
  • envs contains the environment variables that control the settings of Kafka and Kafbat UI.
  • extractor_data contains the data files for the feature extractor, created in the DomainRadar research.
  • geoip_data contains the MaxMind GeoLite2 databases.
  • kafka_scripts contains the scripts for the initializer.
  • misc contains a list of 400,000 domain names for testing, an SQL script that inserts 200 domain names into PostgreSQL for testing, and the configuration for the secrets generation procedure.
  • mongo_aggregations contains various example MongoDB aggregations and a common script that executes them to create a view.
