-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Actor.charge() #346
feat: Actor.charge() #346
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I don't have any major complain, just a few notes.
eventName, | ||
eventTitle: pricingInfo.title, | ||
eventPriceUsd: pricingInfo.price, | ||
timestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think allowing the dev to pass in custom data connected to the event (like url
) is useful for both debugging and user validation. Do you see any issue with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not part of the whitepaper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @mhamas This is not blocking this PR but something to think about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As written in another comment, I have concerns about creating a special dataset just for logs (because of size, charging, retention, permissions). We actually discussed this with @mtrunkat and @fnesveda and we decided against that, and instead we store the system log elsewhere (although I was actually in pro-dataset camp :D, I think it was mostly @fnesveda against and Mara decided at the end). But if you've learnt by the practice that a log dataset is always necessary, maybe it should be an integral part of the platform itself? Let's discuss offline @metalwarrior665 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@janbuchar @B4nan @netmilk Ok, we had a chat with @mhamas and there are several serious concerns with the debug dataset that didn't cross my mind previously.
-
Consider there can be use-cases where developer would charge per LLM token so each charge would be in thousands like
Actor.charge('llm-token', 4562)
. This would create a huge pressure on the dataset, namely
a) Pushing thousands takes some time
b) push items cost for developer
c) timed storage cost for user -
I didn't fully realize that after stripping the developer provided metadata for each event, the chargingLogDataset now only contains a timestamp field that gives any extra information over just the plain chargingState (which is live visible on the platform). If the developer cannot put things like url there, then it is not very useful for both the user and the dev (except for local testing).
So we agreed (feel free to put your arguments in) that we should remove the current chargingLogDataset functionality before we figure out how to do it in a more bulletproof way. It shouldn't be any breaking adding it in later and it is not really required to make it work, devs can always to it themselves.
a) There are some ideas in the platform team about making this as platform feature that would not require storing it on user's account. Moving this forward depends of the feedback from both creators and users.
b) We might want to keep this functionality (ideally with the dev-provided metadata) for devs only to help with building the Actor. Now the question is how to do that in a way that doesn't escape to production.
In any way, I think we can ship this without the dataset for now so we have it out and think about the solution together. But happy to hear other arguments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some consideration, I'm in favor of making this opt-in on the code level, probably with some experimental flag. There are hundreds of other ways a Creator may waste money of their users, so I wouldn't stress over this to the point of removing this functionality entirely.
If the platform team is interested in supporting this on a higher level, all they would have to do would be to provide a charging dataset ID via environment variables.
Also, did you consider not making a separate dataset row for each occurence of an event when somebody does Actor.charge('asdf', 33333)
? It doesn't fix everything, but it's something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the platform team is interested in supporting this on a higher level, all they would have to do would be to provide a charging dataset ID via environment variables.
If we were to do this in the future, we would do it automatically in the API, it'd be on for all PPE runs and hence no special dataset ID would be passed to the Actor. Actor wouldn't be even aware that this is happening behind the scenes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Authoritative as it sounds, I don't think that's the only viable solution. Bear in mind that this would not help at all with local develoment, and we'd probably have to keep the current implementation of the charging log dataset anyways...
Huh, what else can opt-in mean? I mean a flag in code, Actor developer would have to enable it. I can see it would be useful for local development. Surely they can make it part of input schema too. |
It can mean "End user can toggle it" and "Actor developer can toggle it", and those have different implications. But on the SDK level, you can't really distinguish those that well - the developer can always enable it for all end users (and get them billed for a big dataset, not sure how much money that actually means). |
Oh I see, since we are talking SDK support I wasn't thinking about anyone else than a developer of the Actor. In that case, we could print a warning when this is enabled, saying this shouldn't be used in production since it can cause additional price bumps and/or performance issues. |
We can print a warning, or we can just remove it to be super sure somebody doesn't check it in by mistake :-). There is currently really no reason why this should be ever enabled on production for end users. If you really want to keep it in as general opt-in, is it possible to know in SDK "I'm running in production and I'm a public Actor" or "I'm running in production and I'm not being run by the owner of the Actor" and then throw from the SDK at that point to stop the Actor? |
You keep talking about a checkbox, I am talking about an option in
That's quite a weird reason to remove some functionality, you don't just run code in production, you need to first develop it. Wasn't this always meant for development? If you don't see a value during development, that would be another story. |
Btw recently we added periodic dump of chargingState to KV store locally. That feels like more conscise solution than bunch of individual dataset items files if it is only for local dev and we are not adding any other context data for each event. |
@@ -253,15 +265,17 @@ export class ChargingManager { | |||
throw new Error('ChargingManager is not initialized'); | |||
} | |||
|
|||
if (!this.pricingInfo[eventName]) { | |||
const price = this.isAtHome ? this.pricingInfo[eventName].price : 1; // Use a nonzero price for local development so that the maximum budget can be reached |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels weird and a step down from our reference implementation. We allowed passing in run ID so you can use real pricing for that project (I guess that would still work if we pass both ACTOR_RUN_ID
and IS_AT_HOME
), here we hardcode it at $1. I guess you would be ok to pass in the pricing somehow but we just didn't figure how?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really didn't like the idea that you first need to publish the actor to figure things out, we need to do better. What about having default pricing info in the configuration too. Would be used locally only, and we could figure out a way to set it up on platform if its not there (and override it with whatever is set up on the platform otherwise).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a hard pass on copying the pricing from a platform run from me as well - it's just too involved and convoluted. But I can imagine that any alternative I'd implement would end up in endless bikeshedding, so I'd prefer to resolve this in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, lets resolve later, not a huge deal and there is workaround. Config file is good but Im afraid of support questions why it didnt propagate to platform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could have an info log when we use the platform pricing and config has one. It would be right at the top, so quite easy to see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PPE related matters look good to me, left some final thoughts on the log dataset.
if (this.purgeChargingLogDataset) { | ||
const dataset = await Dataset.open(this.LOCAL_CHARGING_LOG_DATASET_NAME); | ||
await dataset.drop(); | ||
} | ||
|
||
this.chargingLogDataset = await Dataset.open(this.LOCAL_CHARGING_LOG_DATASET_NAME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just state one more time that I think that we really want to make sure that the log dataset will never be used in production for Actors deployed to the store. Only for the testing by the developer. There is currently 0 value to use it for the end users, it'll only bring them harm, confusion, and potentially huge costs.
I'd be more conservative here and don't even allow the creation of the dataset in the cloud, only locally. I'll not block this PR on it, it's ultimately up to you and was already discussed a lot, but please let's make sure the developers are really aware of this, and let's minimize the risk of this them getting this wrong.
Also, if anytime in the future, we have a reason to start using this dataset for end users, let's discuss this internally as that functionality would then probably belong to the platform itself, not SDK. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just state one more time that I think that we really want to make sure that the log dataset will never be used in production for Actors deployed to the store. Only for the testing by the developer. There is currently 0 value to use it for the end users, it'll only bring them harm, confusion, and potentially huge costs.
I'd be more conservative here and don't even allow the creation of the dataset in the cloud, only locally. I'll not block this PR on it, it's ultimately up to you and was already discussed a lot, but please let's make sure the developers are really aware of this, and let's minimize the risk of this them getting this wrong.
Also, if anytime in the future, we have a reason to start using this dataset for end users, let's discuss this internally as that functionality would then probably belong to the platform itself, not SDK. Thanks!
I think this is the METADATA_DATASET
in @metalwarrior665 repository. If so, that specific implementation did not provide any value to our tests except for confusion. It doesn't get purged properly between runs (locally), and we do not utilize it. The end users will see thousands of this dataset and they will ask the reason for the customer service.
However, I can understand the main intention for this specific implementation. I believe users would like to see a detailed cost report and see who paid what with which amount. If this can be shown differently in the platform, we can only use this on the development as far as it gets purged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can disable it in the cloud. This implementation should support purging locally.
const timestamp = new Date().toISOString(); | ||
|
||
if (this.chargingLogDataset !== undefined) { | ||
for (let i = 0; i < chargedCount; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: why not push one item for the whole chargedCount
? If the developer wants to simultaneously charge N things, I would expect a single item would suffice for debugging. Charging line by line will potentially create a lot of data with no real value added. Also, doing it like this rapidly increases the potential cost incurred by using this (by mistake) in production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is taken from the PoC by @metalwarrior665, but I agree with your suggestion. It's small enough of a change to add that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea was that there will be mapping of items and events and if you push a list, each event might have different metadata. But we diverged from the idea anyway so feel free to change
"test:e2e": "npm run test:e2e:scrapers && npm run test:e2e:sdk", | ||
"test:e2e:scrapers": "node test/e2e/runScraperTests.mjs", | ||
"test:e2e:sdk": "npm run test:e2e:sdk:tarball && node test/e2e/runSdkTests.mjs", | ||
"test:e2e:sdk:tarball": "npm run build && cd packages/apify && mv $(npm pack | tail -n 1) ../../test/e2e/apify.tgz", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not very portable but c'est la vie
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it works in CI and on my laptop, it's good enough for me.
*/ | ||
export class ChargingManager { | ||
readonly LOCAL_CHARGING_LOG_DATASET_NAME = 'charging_log'; | ||
readonly PLATFORM_CHARGING_LOG_DATASET_ID_KEY = 'CHARGING_LOG_DATASET_ID'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it should come from apify/consts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I admit I have deep disdain for that package. But I honestly believe that it is justified not to use it here since this is all very adhoc and may get removed. Might be a reason to make these private, huh...
TODO