IMPORTANT: Metrics issue (Abnormal status of task) - master node automatically restarting, each worker running tasks generate `abnormal` tasks #1539

KrystianJanas · 2025-01-07T10:59:47Z

Describe the bug
For a long time we have been seeing a problem caused by the refreshing of metrics that the master-node collects from the worker-nodes. We currently run Crawlab with Docker on AWS, with 1 master-node and 8 worker-nodes.

The master-node often restarts for no apparent reason. After lengthy analysis, it turned out that the cause is the metrics, which cannot be turned off because no such option has been implemented. Such an option would be very useful.

Sometimes we can temporarily clear the problem by restarting the entire infrastructure or adding one more worker-node. This is not a permanent solution, because it only helps for 1-2 days.

Because of the metrics, a worker-node often loses its connection to the master-node when a task starts. The task then gets the status "abnormal", and we have to check manually whether it has completed or is still running. This is very burdensome for us, as each worker has at least 4-5 tasks running.

We're running master-node and each worker-node on the crawlab-pro:latest image.

Master-node configuration:

version: '3.4'
services:
  crawlab:
    image: crawlabteam/crawlab-pro:latest
    container_name: crawlab
    restart: always
    environment:
      - CRAWLAB_LICENSE
      - CRAWLAB_NODE_MASTER
      - CRAWLAB_MONGO_DB
      - CRAWLAB_MONGO_URI
      - CRAWLAB_DISABLE_METRICS
    volumes:
      - "/opt/.crawlab/master:/root/.crawlab"  # persistent crawlab metadata
      - "/opt/crawlab/master:/data"  # persistent crawlab data
    ports:
      - "9666:9666"  # exposed grpc port
    mem_limit: 7G
    logging:
      options:
        max-size: "15g"
        max-file: "4"


  auth:
    build: .
    container_name: auth
    environment:
      - CRAWLAB_FORWARD_PORT
      - HTPASSWD
    ports:
      - "80:8080"  # crawlab
    depends_on:
      - crawlab
    mem_limit: 1G
    logging:
      options:
        max-size: "2g"
        max-file: "5"

Worker-node configuration:

version: '3.5'
services:
  worker:
    image: crawlabteam/crawlab-pro:latest
    container_name: crawlab_worker
    restart: always
    environment:
      CRAWLAB_LICENSE: "${CRAWLAB_LICENSE}"
      CRAWLAB_NODE_MASTER: "N"  # N: worker node
      CRAWLAB_GRPC_ADDRESS: "${MASTER_NODE_IP}:9666"  # grpc address
      CRAWLAB_FS_FILER_URL: "http://${MASTER_NODE_IP}/api/filer"  # seaweedfs api
    volumes:
      - "/opt/.crawlab/worker:/root/.crawlab"  # persistent crawlab metadata
      - "/opt/crawlab/worker:/data"  # persistent crawlab data
      - "/opt/crawlab/worker/download:/download" # folder for storing downloaded files
    mem_limit: 7G
    logging:
      options:
        max-size: "3g"
        max-file: "3"

Expected behavior
Add an option to enable or disable metrics, or fix this issue.


KrystianJanas · 2025-01-07T11:00:34Z

@tikazyq please take a look at this. We created a similar issue a few months ago, but unfortunately it was forgotten.

tikazyq · 2025-01-09T14:16:57Z

Hi @KrystianJanas, thanks for using Crawlab Pro; I really appreciate your invaluable feedback. I noticed the issue as well, but unfortunately there is no quick fix for the performance issue potentially caused by the metrics module, as the underlying engine is Prometheus. If you can, please record the resource consumption metrics (memory, CPU, disk I/O) for the main processes such as crawlab-server, prometheus, weed, etc., so that we can precisely locate the root cause.
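One way the requested per-process measurements could be recorded is sketched below. It assumes a Linux container with the procps `ps` command available (for example run through `docker exec`); the process names are the ones mentioned above, while the helper names, the sampling interval and the comma-separated output are assumptions made for illustration. Disk I/O is not covered by `ps`; a tool such as `docker stats` or `pidstat -d` would be needed for that.

```python
import subprocess
import time

# Process names taken from the request above; adjust to what actually runs.
WATCHED = ("crawlab-server", "prometheus", "weed")


def parse_ps(output, watched=WATCHED):
    """Parse `ps -eo pid,pcpu,rss,comm` output into a list of sample dicts."""
    samples = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) != 4:
            continue
        pid, cpu, rss_kib, comm = parts
        # Substring match, so "weed" would also match e.g. "weed-volume".
        if any(name in comm for name in watched):
            samples.append({
                "pid": int(pid),
                "cpu_percent": float(cpu),
                "rss_mib": round(int(rss_kib) / 1024, 1),  # ps reports RSS in KiB
                "command": comm,
            })
    return samples


def sample_once():
    """Take one snapshot of CPU and resident memory for the watched processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,rss,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)


def record(interval_s=30, rounds=10):
    """Print timestamped samples; redirect stdout to a file to keep a record."""
    for _ in range(rounds):
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        for s in sample_once():
            print(stamp, s["pid"], s["command"], s["cpu_percent"], s["rss_mib"], sep=",")
        time.sleep(interval_s)
```

Calling `record()` on the master-node shortly before a restart is expected would give a timeline of CPU and memory per process that can be attached to the issue.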

In the meantime, we are close to a new major release (0.7.0), which is in the final stage of testing before the formal announcement. It should address the issue you mentioned, given that we have removed most third-party middleware dependencies such as Prometheus and SeaweedFS and replaced them with native Golang code. If you are interested in early access (EA), please let me know and I'll push the latest "test" version for your trial.

KrystianJanas · 2025-01-12T14:35:45Z

Thanks @tikazyq for your reply.
Yes, I'm interested in EA testing. Please push the changes and let me know how to use them. We would really appreciate it!

KrystianJanas · 2025-01-20T10:27:58Z

@tikazyq any update? We would really appreciate a quick answer, as we use this tool every day and keep running into these issues.

tikazyq · 2025-01-21T02:02:02Z

@KrystianJanas please follow the instructions below.

Pull the Docker image with the "test" tag

docker pull crawlabteam/crawlab-pro:test

Update your existing docker-compose.yml with the new image tag

...
    image: crawlabteam/crawlab-pro:test
...

Restart your docker containers

docker compose down
docker compose up -d

KrystianJanas · 2025-01-23T10:09:09Z

@tikazyq I tried to set it up and successfully configured the master-node, but:

The menu on the left side is not visible.

We can't configure the workers (for example worker 1, worker 2): they do not connect to the master-node. I use the same config as in the original post of this issue, only changing the image tag from latest to test.
Logs from the master-node (from startup; they repeat about every 10 seconds):

 ERROR [2025-01-23 18:07:20] [MongoService] [MongoService] serverStatus error: (Unauthorized) not authorized on admin to execute command { serverStatus: 1, lsid: { id: UUID("4ea22b3d-8b4d-4b13-972e-75a22cce8ed1") }, $clusterTime: { clusterTime: Timestamp(1737626836, 1), signature: { hash: BinData(0, 4B42A194E994515493FCC4EBCE0113A2AB94044F), keyId: 7431649540823842825 } }, $db: "admin" }
 ERROR [2025-01-23 18:07:20] [DatabaseMetricService] error getting current metric: (Unauthorized) not authorized on admin to execute command { serverStatus: 1, lsid: { id: UUID("4ea22b3d-8b4d-4b13-972e-75a22cce8ed1") }, $clusterTime: { clusterTime: Timestamp(1737626836, 1), signature: { hash: BinData(0, 4B42A194E994515493FCC4EBCE0113A2AB94044F), keyId: 7431649540823842825 } }, $db: "admin" }
[GIN] 2025/01/23 - 18:07:23 | 200 |    3.875488ms |      172.18.0.3 | GET      "/nodes/metrics?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:23 | 200 |    3.362444ms |      172.18.0.3 | GET      "/nodes?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:28 | 200 |    4.090835ms |      172.18.0.3 | GET      "/nodes/metrics?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:28 | 200 |    3.580681ms |      172.18.0.3 | GET      "/nodes?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:33 | 200 |     3.94405ms |      172.18.0.3 | GET      "/nodes/metrics?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:33 | 200 |    3.344703ms |      172.18.0.3 | GET      "/nodes?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:38 | 200 |    3.923739ms |      172.18.0.3 | GET      "/nodes/metrics?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:38 | 200 |    3.378544ms |      172.18.0.3 | GET      "/nodes?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:43 | 200 |    4.368392ms |      172.18.0.3 | GET      "/nodes/metrics?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:43 | 200 |    3.882629ms |      172.18.0.3 | GET      "/nodes?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:48 | 200 |    4.053943ms |      172.18.0.3 | GET      "/nodes/metrics?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:48 | 200 |    3.373024ms |      172.18.0.3 | GET      "/nodes?page=1&size=10&conditions=[]&sort=[]"
 ERROR [2025-01-23 18:07:50] [MongoService] [MongoService] serverStatus error: (Unauthorized) not authorized on admin to execute command { serverStatus: 1, lsid: { id: UUID("38f479b8-6914-470a-9418-ca3efc0438b3") }, $clusterTime: { clusterTime: Timestamp(1737626866, 1), signature: { hash: BinData(0, 1D4BAFD54571825B5A37CC596DBAA890A62E3268), keyId: 7431649540823842825 } }, $db: "admin" }
 ERROR [2025-01-23 18:07:50] [DatabaseMetricService] error getting current metric: (Unauthorized) not authorized on admin to execute command { serverStatus: 1, lsid: { id: UUID("38f479b8-6914-470a-9418-ca3efc0438b3") }, $clusterTime: { clusterTime: Timestamp(1737626866, 1), signature: { hash: BinData(0, 1D4BAFD54571825B5A37CC596DBAA890A62E3268), keyId: 7431649540823842825 } }, $db: "admin" }
[GIN] 2025/01/23 - 18:07:53 | 200 |    4.081704ms |      172.18.0.3 | GET      "/nodes/metrics?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:53 | 200 |    3.464947ms |      172.18.0.3 | GET      "/nodes?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:54 | 200 |      46.951µs |       127.0.0.1 | GET      "/health"
[GIN] 2025/01/23 - 18:07:58 | 200 |     3.95296ms |      172.18.0.3 | GET      "/nodes/metrics?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:07:58 | 200 |    3.457457ms |      172.18.0.3 | GET      "/nodes?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:08:03 | 200 |    4.102885ms |      172.18.0.3 | GET      "/nodes/metrics?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:08:03 | 200 |    3.628012ms |      172.18.0.3 | GET      "/nodes?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:08:08 | 200 |    3.958161ms |      172.18.0.3 | GET      "/nodes/metrics?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:08:08 | 200 |    3.530829ms |      172.18.0.3 | GET      "/nodes?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:08:13 | 200 |     3.92847ms |      172.18.0.3 | GET      "/nodes/metrics?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:08:13 | 200 |    3.463147ms |      172.18.0.3 | GET      "/nodes?page=1&size=10&conditions=[]&sort=[]"
[GIN] 2025/01/23 - 18:08:14 | 200 |    4.564527ms |      172.18.0.3 | GET      "/schedules?page=1&size=10&conditions=[]&sort=[]"

Could you please check the configuration? Are we missing some environment variables that were not updated in the README.md file?
It's really important for us to get a fully functional server with workers.

KrystianJanas · 2025-01-23T10:11:44Z

@tikazyq the master-node logs also show no new entries when a new or existing worker-node is started.
Before that, I followed the instructions you provided.

KrystianJanas · 2025-01-23T10:28:14Z

@tikazyq I solved the Unauthorized problem in MongoDB on the master-node by adding the clusterMonitor permission in Mongo.
But I still have the issue that the master-node does not see the workers.
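The permission fix described above could look roughly like this with PyMongo. This is a sketch, not the exact command that was run: the user name and connection URI are placeholders, and the function names are made up for illustration. `grantRolesToUser` is a standard MongoDB command and `clusterMonitor` is a built-in role defined in the `admin` database; the account used to connect must itself be allowed to grant roles.

```python
def build_grant_command(username):
    """Build the MongoDB `grantRolesToUser` command document.

    clusterMonitor is a built-in admin-database role that includes the
    serverStatus action rejected in the log above.
    """
    return {
        "grantRolesToUser": username,  # the command name must be the first key
        "roles": [{"role": "clusterMonitor", "db": "admin"}],
    }


def grant_cluster_monitor(uri, username, user_db="admin"):
    """Run the command against the database in which the user is defined."""
    from pymongo import MongoClient  # third-party: pip install pymongo

    client = MongoClient(uri)
    try:
        return client[user_db].command(build_grant_command(username))
    finally:
        client.close()
```

A call such as `grant_cluster_monitor("mongodb://<admin-user>:<password>@<host>:27017/admin", "<crawlab-user>")` would apply it; if the Crawlab user was created in a different database, pass that database name as `user_db`.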

KrystianJanas · 2025-01-23T11:02:59Z

@tikazyq Okay, I see you changed the environment variables from MASTER_NODE_IP to CRAWLAB_MASTER_HOST and CRAWLAB_MASTER_PORT. I updated them and the workers now connect to the master-node successfully.

But the problem with the left sidebar is still present.

MASTER_NODE_IP="IP"
CRAWLAB_LICENSE="XXX"
CRAWLAB_MASTER_HOST="IP"
CRAWLAB_MASTER_PORT="9666"
CRAWLAB_LOG_LEVEL="debug"
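A small check like the following could catch the renamed variables before a worker is started. It is only a sketch: the list of required keys is inferred from the variables shown above and is not confirmed as the complete set the "test" image needs, and the helper names are hypothetical.

```python
# Keys inferred from this thread; the full set required by the image is an assumption.
REQUIRED = ("CRAWLAB_LICENSE", "CRAWLAB_MASTER_HOST", "CRAWLAB_MASTER_PORT")


def parse_env(text):
    """Parse simple KEY="value" lines, ignoring blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def missing_keys(text, required=REQUIRED):
    """Return the required keys that are absent or empty in the env text."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]
```

For example, `missing_keys(open(".env").read())` returns an empty list for the file above, and lists `CRAWLAB_MASTER_HOST` and `CRAWLAB_MASTER_PORT` for an older file that only sets `MASTER_NODE_IP`.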

anaghaKruko · 2025-01-23T11:38:27Z

Please have a look at the bugs below:

Trying to run generates an error, and no tasks are running at the moment.
The earlier version allowed viewing the Schedules, Tasks, Git, Data and Dependencies within a spider. This is no longer available.

Earlier version screenshot below

Current view

KrystianJanas added the "bug" (Something isn't working) label on Jan 7, 2025

tikazyq added the "performance" (Performance related) label on Jan 9, 2025
