How to import a file #11
Looks ok to me.
I guess this touches on two important concepts that need to be explained (possibly somewhere else):
- Importing a file or file_bundle into kiara: this means copying the byte blob into kiara's process memory, or into the physical kiara data store (if using 'store'). We don't use the file on disk directly because we can't rely on it not being changed externally, which would make any cached or past operations that used it as input invalid or irrelevant. The consequence is extra disk usage, which could be relevant for very large files (not that we can do anything about that).
- Storing values: this is not always necessary (in fact, I think we have never done it in our example notebooks), but if you want to keep a value after you restart your Python process without having to re-compute it, it needs to be stored while it is still available (in your current Python process). This also implies storing every input and intermediate value that was used to create the value you intend to store. Aliases also play into this, but are technically not necessary; they are only used as human-readable references to values, so I'm not sure how/where to explain that.
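To make the first point concrete, here is a minimal sketch of import-by-copy using only the Python standard library. It is not kiara's API: `import_file`, the `store` directory and the file name `journals.csv` are hypothetical names made up for illustration, and naming the copy by its SHA-256 digest is an assumption about how a store might be organised.

```python
import hashlib
import tempfile
from pathlib import Path


def import_file(source: Path, store_dir: Path) -> Path:
    """Hypothetical helper: copy the bytes of `source` into `store_dir`."""
    data = source.read_bytes()  # the byte blob now lives in process memory
    digest = hashlib.sha256(data).hexdigest()
    target = store_dir / digest  # assumption: the copy is named by its hash
    if not target.exists():
        target.write_bytes(data)  # the persisted copy is the extra disk usage
    return target


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    store = tmp_path / "store"
    store.mkdir()

    original = tmp_path / "journals.csv"
    original.write_text("id,title\n1,First\n")

    imported = import_file(original, store)

    # Someone edits the original file after the import.
    original.write_text("id,title\n1,Changed\n")

    # The imported copy still holds the bytes that were read at import time.
    imported_text = imported.read_text()
    original_text = original.read_text()
```

After the external edit, `imported_text` still holds the content seen at import time, so results computed from it remain valid even though the file on disk has changed.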
Ah, we might also want to have a section about 'file_bundles' (basically folders, but they can also be archives or anything else where we have more than one file and the files belong together in an important way). Not sure if that needs to be referenced here, but personally I'd probably be curious what to do if I have that instead of a single file.
Yes, I think we need to get a lot of clarity on what the store is, how it works, why it exists, and what onboarding/import means on a technical level. I don't think that content belongs here (although a link would probably be useful), as I imagine these bits of docs as short things you come to when you have a specific problem and need a specific answer to get your work done. I'll open a discussion issue and write up what I understand about the store and aliases, but my knowledge is very incomplete.
Sure, I can add that, but I don't know the answer. What is a file bundle? When would you use that rather than just importing lots of files individually? Is there anything you can do with a file bundle that you can't do with a file, or vice versa?
A data type that contains one or several files, each identified by an internal (relative) sub-path within the bundle. The contained files are usually related in some way that is relevant to the computations that will be done on them (for example, multiple text files belonging to the same corpus).
Whenever you have files that have that shared context and would be fed into a downstream operation at the same time. Otherwise the downstream operation would need an input field for every individual file, which would be inefficient and only possible if you know exactly how many (sub-)files you will be dealing with.
Technically not, I guess, but the real question is which operations that make sense for a single file also make sense for a file bundle. The only thing I can think of is doing the same operation on every sub-file of a bundle, which would be very inefficient and painful to do manually, so it'd be nice to have a module that takes a file bundle and runs that operation on all included files. But we haven't had a use case like that so far, if I remember right. For kiara's purposes, a file and a file_bundle are two different data types, and a module that takes one as input can't be used with the other. You'd have to use a 'pick.file' operation on a file bundle first, for example, if the operation you want to use has a single file input. Or you'd have to 'augment' a single file with an internal relative path (which basically means adding information to data) if you wanted to convert a single file to a file_bundle (but that's not something we have had to do so far, I think).
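The distinction between the two data types can be sketched in plain Python. This is an illustration only, not kiara's implementation: `FileBundle`, `pick_file` and `count_words` are hypothetical names, and the bundle is modelled as nothing more than a mapping from relative sub-path to file content.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FileBundle:
    """Hypothetical stand-in for a bundle: files keyed by relative sub-path."""
    files: Dict[str, bytes] = field(default_factory=dict)


def pick_file(bundle: FileBundle, rel_path: str) -> bytes:
    """Extract one file's content so it can feed a single-file operation."""
    return bundle.files[rel_path]


def count_words(file_content: bytes) -> int:
    """An operation with a single-file input; it cannot take a bundle."""
    return len(file_content.decode("utf-8").split())


corpus = FileBundle(
    files={
        "1917/issue_01.txt": b"first issue text",
        "1917/issue_02.txt": b"second issue",
    }
)

# A single-file operation needs a pick step first.
single = pick_file(corpus, "1917/issue_01.txt")
words_single = count_words(single)

# Applying the same operation to every sub-file is the manual loop
# that a bundle-aware module would otherwise do for you.
per_file = {path: count_words(content) for path, content in corpus.files.items()}
```

The bundle works here without knowing in advance how many files it contains, which is the reason given above for not using one input field per file.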
Are there currently any examples or user stories of using a file bundle? Are there any operations that currently use them? If not, I'll pull this file-bundle discussion into a low-priority issue for now, clean up the single-file docs and move on.
If I remember right, it's an important topic in language analysis, since there you usually have loads of separate text files.
Ah, also, often when you import from an external archive service, they deliver a zip file, which kiara would treat as a bundle, and then pick the single file in it. So it's part of pipelines in that way.
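The zip case can be sketched with the standard library's `zipfile` module. This only illustrates the idea of treating an archive as a bundle and then picking its single file; it does not use or reproduce kiara's operations, and the archive content and the path `export/data.csv` are made up for the example.

```python
import io
import zipfile

# Build a small zip in memory to stand in for an archive service download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("export/data.csv", "id,title\n1,First\n")

# Treat the archive as a bundle: relative sub-path -> file content.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    bundle = {
        name: archive.read(name)
        for name in archive.namelist()
        if not name.endswith("/")  # skip directory entries
    }

# The archive holds exactly one file, so pick it for a single-file step.
only_path = next(iter(bundle))
picked = bundle[only_path]
```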
Don't merge yet! This is work in progress until #10 is resolved, and the code updated to reflect those decisions.
This is an attempt to write up the discussion from #9 in a how-to style format, hopefully accessible to a Python-curious historian, but meaningful to a Python expert wanting to just achieve the result.
I'm looking for feedback from someone on the team about how well this reads as the target audience, and from @makkus as to whether it's technically correct code, comments and prose.
I put the code in plain markdown code blocks, as it won't actually run correctly unless the example relative file path exists. Is that OK for now, or is a fully functional example in Jupyter or another format preferable (see also #4)?