runQueryStream keeps loading entities into memory, not respecting stream backpressure #859
Labels
api: datastore
priority: p3
type: feature request
There is a `runQueryStream` method that returns a `Transform`, which can be used in a `pipeline`.

We have a script that needs to load a large number of entities from Datastore (>100K), where each entity has an `id` and a string field (`s`) that contains a rather long string (say, 500Kb). The script needs to load all entities and dump them into an ndjson file. For the reproduction, the ndjson part is not important; we can consume the data with a "void-consumer".

What we expect is that we create a Node.js `pipeline`, pipe the Datastore stream into our consumer, and Node.js handles backpressure for us. Backpressure means that if the consumer is slower than the producer (which is true in our case), Node.js pauses pulling data from the producer (the Datastore `runQueryStream` call), so that the process does not run out of memory.

The pipeline itself works, as proven by our logging. If I create an artificial consumer that just waits 100ms with `setTimeout` and then passes the data through (after which it is discarded and garbage-collected), our logging confirms that backpressure works and that no more objects flow in than the consumer can consume. We made our consumer sequential (non-parallel) for the minimal reproduction.

What happens in reality is that the Datastore library keeps downloading data from the Cloud (as seen from network activity at a rate of ~9Mb/second) while not passing this data to the Node.js stream, which leads to an OOM explosion.
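To illustrate the expectation described above, here is a minimal sketch that does not use Datastore at all. `createProducer` is a hypothetical stand-in for a well-behaved source that only creates an item when the stream asks for one, and `createSlowConsumer` plays the role of the sequential "void-consumer". All names, sizes, and delays are assumptions made for illustration; this is not the repro from this issue.

```javascript
const { Readable, Writable, pipeline } = require('stream');
const { promisify } = require('util');

const pipelineAsync = promisify(pipeline);

// Hypothetical stand-in for the Datastore stream: an object-mode producer
// that creates an item only when the stream machinery calls read().
function createProducer(total, stats) {
  return new Readable({
    objectMode: true,
    highWaterMark: 16, // internal buffer threshold, counted in objects
    read() {
      if (stats.produced >= total) {
        this.push(null); // signal end of data
        return;
      }
      stats.produced++;
      this.push({ id: stats.produced, s: 'x'.repeat(1024) });
    },
  });
}

// Sequential "void-consumer": waits delayMs per item, then discards it.
function createSlowConsumer(delayMs, stats) {
  return new Writable({
    objectMode: true,
    highWaterMark: 1,
    write(item, _encoding, callback) {
      stats.consumed++;
      // Items already produced but not yet handed to the consumer.
      const inFlight = stats.produced - stats.consumed;
      stats.maxInFlight = Math.max(stats.maxInFlight, inFlight);
      setTimeout(callback, delayMs);
    },
  });
}

// Runs the pipeline to completion and reports how far the producer ran ahead.
async function runDemo(total = 200, delayMs = 1) {
  const stats = { produced: 0, consumed: 0, maxInFlight: 0 };
  await pipelineAsync(
    createProducer(total, stats),
    createSlowConsumer(delayMs, stats)
  );
  return stats;
}
```

Calling `runDemo().then(console.log)` should report a `maxInFlight` value that stays close to the producer's `highWaterMark` instead of growing with `total`. That bounded gap between produced and consumed items is the behavior expected from `runQueryStream` here.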
Expected behavior is that `runQueryStream` should pause downloading data from the Cloud to respect Node's backpressure.

Here I'm sharing a somewhat minimal repro that I created. Please tell me if you need it to be even more minimal.
Importantly, I've tried it both with the legacy `grpc` library (which we use by default) and with `@grpc/grpc-js`. The latter gives a different network rate (~5Mb/s instead of ~9Mb/s), but behaves similarly in terms of not respecting backpressure.

Minimal repro:
Runtime logs:
The Network tab consistently shows a "maximum" rate of ~9Mb/sec, regardless of backpressure or delay settings in the consumer (e.g., setting the timeout to 200ms or 1000ms doesn't change the network rate):
Environment details
`@google-cloud/datastore` version: 6.4.6

Steps to reproduce
See the minimal-repro code.