TaskVine Futures: High latency on chained tasks. #3892
Comments
@BarrySlyDelgado you are the futures expert, can you help out here?
I'm not familiar with the issue with the Parsl TaskVine executor. However, TaskVine Futures were initially implemented directly on top of TaskVine. @gpauloski, if you have any of the TaskVine logs from these runs, is there a way you can send them to me?
Also, @colinthomas-z80, can you give us some insight into the performance gap between plain TaskVine and Parsl+TaskVine?
The task throughput improvement for Work Queue was applied to TaskVine as well in Parsl #3038. While those PRs did improve throughput, the submission pattern they address is one where a task is submitted and waited on until completion before the next task is submitted, and so on. That pattern unfortunately exposes the difficulty of communicating tasks one at a time to the executor, and it remains inefficient. If your application allows, it would be better to submit the tasks without waiting on the result after each submission, instead appending the futures to a list and waiting on them with as_completed. This way Parsl will hand all of the tasks to TaskVine in short order.
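For illustration, here is a minimal sketch of that pattern using the standard concurrent.futures interface; ThreadPoolExecutor stands in for whichever Parsl/TaskVine executor is in use, and the task body is made up:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def task(i: int) -> int:
    # stand-in for the real application task
    return i * i

# Submit every task up front, then consume results as they complete,
# instead of blocking on each result before submitting the next task.
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(task, i) for i in range(100)]
    for future in as_completed(futures):
        print(future.result())
```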
Parsl+TaskVine integrated performance is disappointing to me, but let's not tangle that discussion too much here with @gpauloski's original pure-TaskVine issue...
@BarrySlyDelgado, do you have a preferred way of sharing the logs?

@colinthomas-z80, I am running a benchmark, so the application is as simple as "submit N tasks given N workers and submit another task when one completes." The actual benchmark does use

I understand that this kind of benchmark doesn't play to TaskVine's strengths (it's not the only benchmark I am hoping to run), but I'm still surprised by the delta between the basic example and the executor configurations.

To @benclifford's point, the Parsl stuff may very well be a red herring. I used the
Ok, to summarize, we really have two different things going on here.

First: TaskVine Futures have high latency when sequentially chained. This has two subproblems. 1 - The benchmark is calling

@gpauloski would it make sense to pass the future to the next function, instead of evaluating it? 2 - The method of transferring the result back to the manager involves submitting another task for that purpose. @BarrySlyDelgado would it make sense to call File.contents instead?

Second: it seems that the Parsl-TaskVine executor still has higher than expected latency when dealing with sequentially chained tasks. I'm going to separate that out into issue #3894 since that is a separate problem.

I agree that the sequentially-chained-task pattern is not really our target workflow, but it nonetheless measures the submit-wait latency, and improving that will have a variety of good downstream effects.
@gpauloski logs can be emailed to me directly - [email protected]
@dthain it would be better to have an option to directly stream the contents of a given file. However, it seems
Hmm, do you have a test case showing that it doesn't work? (See cctools/taskvine/src/manager/vine_manager.c, line 6084, at commit ac9f0b5.)
Oh, wait a minute, there are several layers in play here:
Right, within the Python wrapper,
Ah, my mistake for adding to the confusion:
I dug a bit more into my logs and noticed that my no-ops are not really no-ops and were taking ~0.5s. This turns out to be my module import overhead, so I'll probably need to figure out how to incorporate a "serverless" mode like in the Parsl executor. I still observe about a 50% slowdown when calling

I do notice that the 0.5s import overhead is present in both the actual task and the retriever task, so the total overhead ends up being 1s. Does the retriever have the side effect of importing the dependencies of my task, and if so, is that intended? The retrieval task is very fast when I run a minimal example with no imports.

@BarrySlyDelgado, here are the logs (with the 0.5s imports for all task types) if you want to look further.
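As a quick sanity check (a sketch, assuming TaPS is installed and that taps.apps.synthetic is representative of the modules the tasks import), the cold-import cost can be timed in a fresh interpreter:

```python
import importlib
import time

# Run this in a fresh Python process so the import is actually cold.
start = time.perf_counter()
importlib.import_module("taps.apps.synthetic")
print(f"cold import took {time.perf_counter() - start:.3f}s")
```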
#3899 fixes the high latency for the futures example above. @gpauloski, were you able to observe the overhead from imports with this fix as well? It is not intended that regular tasks also bring along imports; this could be an issue with how we are serializing the functions.
@BarrySlyDelgado yeah, I still see the high import costs with the fix (<1 task/s). Commenting out all the imports gets me back to ~25 tasks/s.
@JinZhou5042 here are the logs for two different runs with TaPS. The first run is a bag of tasks, and the second is a chain of tasks. All tasks are no-ops and I get ~2 tasks/sec in both cases (it was previously around ~1 task/sec before Barry's fix in #3899). Each directory contains the vine logs and some other logging by TaPS: taps-taskvine-issue-3892.tar.gz

I used commit 7c0510b of TaskVine (PR #3899) and this branch of TaPS: https://github.com/proxystore/taps/tree/taskvine-issue-3892.

To reproduce, I ran this for the bag-of-tasks run:
And this for the sequential run:
TaPS records each function execution time with a decorator. Looking at the
@gpauloski I git cloned your repo but didn't find
That works! I observe a very similar performance issue to yours. Is there a way to modify the source code that uses TaskVine in order to enable serverless mode?
To enable serverless mode, would you need to modify the app or the executor? I ask because there is a strong separation between the two in TaPS, i.e., you won't be able to modify the app to add any TaskVine-specific features; rather, you would have to enable it within the executor.

For example, the app I used above is defined here. I did start trying to modify the
Per the discussion with @dthain this morning, the slowdown was caused by the deserialization of the Data objects.
Thanks, @JinZhou5042! This is super informative.

I also ran into the issue of including functions in a library that were not defined at the top level of a module (e.g., wrapped/decorated functions, partial functions, etc.). That said, I have been considering changing TaPS in a future update to support statically decorating the app task functions, so I think we could better support TaskVine serverless after that. (I probably won't get to that feature until the end of the month, though.)

Nice work narrowing it down to the result serialization. This is helpful; I could, in the meantime, run experiments just returning the raw data.
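As a rough illustration of that interim experiment, a task could return plain bytes instead of a taps.apps.synthetic.Data wrapper, so deserializing the result does not drag the taps package back in on the manager side. The function name and signature here are hypothetical, not the actual TaPS task:

```python
import time

def noop_task_raw(data: bytes, output_size: int, sleep: float) -> bytes:
    # Sleep to emulate work, then return a primitive type (bytes) rather
    # than a library-defined wrapper object, so unpickling the result on
    # the manager does not trigger imports of the application package.
    time.sleep(sleep)
    return bytes(output_size)
```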
@gpauloski I am not sure you would get such a speedup just by not passing a Python object to the task, because your code is using non-serverless mode, and when I tried that yesterday it didn't give me any speedup.
But we are pretty sure that un-cloudpickling the taskvine-generated pkl files is the culprit behind the latency. I did a quick test cloudpickle-loading files created by TaskVine and by my little script; essentially they are the same data, dumped in the same way, running on the same machine, but the time cost to load each of them is significantly different:
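A comparison along these lines reproduces that quick test; the file names are placeholders for the TaskVine-generated pickle and the hand-made one:

```python
import time
import cloudpickle

def time_load(path: str) -> None:
    start = time.perf_counter()
    with open(path, "rb") as f:
        obj = cloudpickle.load(f)
    print(f"{path}: {time.perf_counter() - start:.4f}s -> {type(obj).__name__}")

time_load("taskvine_generated.pkl")  # placeholder for the TaskVine-created file
time_load("local_test.pkl")          # placeholder for the hand-made file
```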
Oh, I believe I understand now. Just to make sure, the slow serialization of

I wonder if the deserialization is also triggering imports as a side effect?
Deserialization of stuff will often trigger imports - if you have a particular binary pickle file/dump that is "bad", I can help you understand what it's actually doing on deserialization if you attach it here.
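One simple way to confirm those import side effects is to diff sys.modules around the load (a sketch; data.pkl stands in for the "bad" pickle):

```python
import sys
import cloudpickle

before = set(sys.modules)
with open("data.pkl", "rb") as f:
    cloudpickle.load(f)

# Any module that appears here was imported purely as a side effect of
# deserializing the pickled object.
imported = sorted(set(sys.modules) - before)
print(f"{len(imported)} modules imported during load")
print(imported[:20])
```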
Jin, the two different pickled examples have the same size. Do they also have exactly the same content?
@dthain They don't have the same content; the taskvine-created pkl file has
@benclifford My local test pkl file:

The taskvine-created one is the "bad" one:
Since the one created by TaskVine is much smaller but turned out to be slow, I am assuming it does some expensive "import" work, which tries to import the
Some of the discussion from #3901 may be relevant here.
Ok, I think we finally ran the issue to ground.

The problem: @benclifford was correct in that the un-cloudpickling of the Data object resulted in a horrific number of imports (12,000 system calls) processed on each reference, which were lost b/c we fork for each function execution. Imports in the function itself are not such a problem b/c the function is an intrinsic part of the serverless process, and only gets loaded once.

Workaround: the application could avoid this by only sending primitive types as arguments and results. Probably not a good solution for complex applications.

Solution: @JinZhou5042 is putting together a few small PRs that will connect the FuturesExecutor to use the serverless mode properly, and then adjust the serverless mode to perform argument un-cloudpickling in the parent process (rather than the child) so that imports are only suffered once. More to follow tonight and tomorrow.
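A minimal sketch of the idea behind that fix, assuming a Unix fork-based worker and a made-up on-disk layout (this is not the actual TaskVine serverless code): deserialize the cloudpickled call once in the long-lived parent, then fork, so children inherit the already-imported modules.

```python
import os
import cloudpickle

def run_call(call_path: str, result_path: str) -> None:
    # Un-cloudpickle in the parent: any imports triggered here happen once
    # and persist for every subsequent forked execution.
    with open(call_path, "rb") as f:
        fn, args, kwargs = cloudpickle.load(f)

    pid = os.fork()
    if pid == 0:
        # Child: modules are already loaded (copy-on-write), so no repeated
        # import cost is paid per function invocation.
        result = fn(*args, **kwargs)
        with open(result_path, "wb") as f:
            cloudpickle.dump(result, f)
        os._exit(0)

    os.waitpid(pid, 0)
```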
I just went through the cloudpickle doc and it seems that there are two ways of serializing an object: by value or by reference. An important difference between cloudpickle and pickle is that cloudpickle can serialize functions and classes by value, whereas pickle can only serialize them by reference.
By default, TaskVine's use of cloudpickle dumps arguments by reference, which will try to collect all of the required modules and trigger the import of those modules when loading. However, if we leverage the feature provided by cloudpickle to explicitly indicate which module we want to serialize by value, we don't keep the reference and will not trigger imports on loading.

For example, this took me 0.8s or so while deserializing:

    import cloudpickle
    from taps.apps.synthetic import Data

    with open('data.pkl', 'wb') as f:
        cloudpickle.dump(Data(None), f)

And this only took me around 0.0001s:

    import cloudpickle
    from taps.apps.synthetic import Data
    import taps

    cloudpickle.register_pickle_by_value(taps)

    with open('data.pkl', 'wb') as f:
        cloudpickle.dump(Data(None), f)
Given that it's dumping data by reference, hoisting the imports should definitely help. With data generated by this code (which dumps by reference):

    import cloudpickle
    from taps.apps.synthetic import Data

    with open('data.pkl', 'wb') as f:
        cloudpickle.dump(Data(None), f)

Deserializing with the package import hoisted:

    import time
    import taps
    import cloudpickle

    time_start = time.time()
    with open('data.pkl', 'rb') as f:
        cloudpickle.load(f)
    print(time.time() - time_start)

However, deserializing with the submodule imported directly:

    import time
    import cloudpickle
    import taps.apps.synthetic

    time_start = time.time()
    with open('data.pkl', 'rb') as f:
        cloudpickle.load(f)
    print(time.time() - time_start)
That said, we have three approaches:

I am going to adopt the first solution here, and would like to know if there is a better way to solve it.
@gpauloski You could make some small changes to your application to support the serverless mode, and you are likely to get relatively better performance once #3902 and #3903 are merged. Take the run_bag_of_tasks function as an example:

    def run_bag_of_tasks(
        engine: AppEngine,
        task_count: int,
        task_data_bytes: int,
        task_sleep: float,
        max_running_tasks: int,
    ) -> None:
        """Run bag of tasks workflow."""
        max_running_tasks = min(max_running_tasks, task_count)
        start = time.monotonic()

        taskvine_serverless = True
        if taskvine_serverless:
            # declare the library
            libtask = engine.executor.create_library_from_functions(
                "test-library", noop_task,
                hoisting_modules=[Data, uuid, generate_data, random],
                add_env=False,
            )
            engine.executor.install_library(libtask)

            for i in range(1, task_count + 1):
                # declare the function call
                function_call = engine.executor.future_funcall(
                    "test-library", "noop_task",
                    generate_data(task_data_bytes),
                    output_size=task_data_bytes,
                    sleep=task_sleep,
                    task_id=uuid.uuid4(),
                )
                task_future = engine.executor.submit(function_call)

                # wait for the completion of the function call
                task_future.result()

                rate = i / (time.monotonic() - start)
                logger.log(
                    APP_LOG_LEVEL,
                    f'Completed {i}/{task_count} tasks '
                    f'(rate: {rate:.2f} tasks/s)',
                )

On my machine I got this throughput (compared to ~2 tasks/s previously):
Looks like you've moved beyond this a bit now, but for completeness here is some dissection of the two pickle files from the comment above. I've disassembled the two files using pickletools (which is built into Python). I'm not expecting you to understand the whole assembly output, but I included it alongside the discussion for completeness.
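For reference, a disassembly along these lines can be produced with pickletools (the file name is a placeholder for either of the pickles discussed above):

```python
import pickletools

# Print the pickle opcodes; look for GLOBAL/STACK_GLOBAL (an import reference)
# versus embedded code objects (a by-value class/function definition).
with open("data.pkl", "rb") as f:
    pickletools.dis(f.read())
```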
One description of this Data class is essentially just an import of it. The other, much longer pickle says:
which says (instead of importing that class) "here is the definition for a completely new class". So you'll see in there things like the actual function definition for __init__ as compiled byte code. dill behaves pretty much the same way, although it might have different heuristics than cloudpickle for deciding when to send an import and when to send the entire definition.

I usually prefer the import route - which is the opposite direction to the way this thread has been going! - because (in no particular order):
That doesn't mean I think you should go that way too, but I feel like the preferences I have developed are well founded on past experience with Parsl.
@benclifford That's really informative, thank you for sharing your insights! The pickletools output aligns with the actual test cases. I agree that Python bytecode depends heavily on import behavior, and deserializing bytecode by value in each function can lead to unexpected performance issues. Your experience further suggests that #3902 has taken a reasonable approach.
@gpauloski Here is the script to build the most recent cctools from GitHub; you can find details here:
I think this would be the fastest way to incorporate the features we just merged. If getting the newest version from conda is more comfortable for you, please let us know and we'll release a new version ASAP!
Thanks, @JinZhou5042! I'm hoping to give it a try later today.
@gpauloski Sounds good! Please feel free to email me ([email protected]) if you have any issues installing cctools or enabling serverless mode!
I reinstalled TaskVine and got serverless mode working. I'm getting ~120 tasks/s now. Thanks for the quick debugging and PRs, @dthain and @JinZhou5042. I appreciate it!
@gpauloski You are so welcome!
This is related to PR #3876, but different enough that I figured I'd open it as a separate issue to track.
I was using the vine.FuturesExecutor and getting pretty poor task latency (>1 s) with no-op tasks (using a compute-zen3 node at CHI@TACC). To figure out whether it was my use of the vine.FuturesExecutor, I also tested with the Parsl TaskVineExecutor and got similarly bad performance.

I wrote a script to test round-trip task time, and it seems the vine.FuturesExecutor and TaskVineExecutor are much slower than just creating PythonTask objects directly. Is there a reason for this performance difference, or have I configured something wrong? I also see quite a lot of variance in the results for the "executor" and "parsl" configurations (as low as 0.25 and as high as 10 tasks/s). The "basic" configuration is a pretty consistent 25 tasks/s.

@benclifford pointed me at issue #3432, which seems like a similar performance problem. It's unclear from the comments whether the issue was only resolved in Work Queue or also in TaskVine.