How to import a file #11

Merged
merged 3 commits into from
Dec 7, 2023

Conversation

caro401
Collaborator

@caro401 caro401 commented Nov 16, 2023

Don't merge yet! This is a work in progress until #10 is resolved and the code is updated to reflect those decisions.

This is an attempt to write up the discussion from #9 in a how-to style format, hopefully accessible to a Python-curious historian, but meaningful to a Python expert wanting to just achieve the result.

I'm looking for feedback from someone on the team about how well this reads for the target audience, and from @makkus as to whether the code, comments and prose are technically correct.

I've put the code in plain markdown code blocks, as it won't actually run correctly unless the example relative file path exists. Is that OK for now, or is a fully functional example in Jupyter or another format preferable (see also #4)?

@caro401 caro401 marked this pull request as draft November 16, 2023 11:53
Collaborator

@makkus makkus left a comment

Looks ok to me.

I guess this touches on two important concepts that need to be explained (possibly somewhere else):

  • importing a file or file_bundle into kiara: this means copying the byte blob into kiara's process memory, or into the physical kiara data store (if using 'store'). We don't use the file on disk directly because we can't rely on it not being changed externally, which would make any cached or past operations that used it as input invalid or irrelevant. The cost is extra disk usage, which could be relevant for very large files (not that we can do anything about that).

  • storing values: this is not always necessary (in fact, I don't think we have ever done it in our example notebooks), but if you want to keep a value after you restart your Python process without having to re-compute it, it needs to be stored while it is still available (i.e. in your current Python process). This also implies storing every input and intermediate value that was used to create the value you intend to store. Aliases also play into this, but they are technically not necessary: they are only used as human-readable references to values, so I'm not sure how or where to explain that.
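
The first point can be illustrated with a minimal sketch in plain Python. This is not kiara's API or its actual store layout: `import_file`, the hash-named store folder and the example file are all hypothetical, chosen only to show why working on a copy protects earlier results from later edits to the original file.

```python
import hashlib
import tempfile
from pathlib import Path


def import_file(source: Path, store_dir: Path) -> Path:
    """Copy the bytes of `source` into `store_dir`, named by their SHA-256 hash.

    Hypothetical helper: it only mimics the idea of importing by copying.
    """
    data = source.read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    target = store_dir / digest
    if not target.exists():
        target.write_bytes(data)
    return target


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    store_dir = tmp_path / "store"
    store_dir.mkdir()

    original = tmp_path / "letters.csv"
    original.write_text("id,text\n1,hello\n")

    imported = import_file(original, store_dir)

    # Someone edits the original file after the import.
    original.write_text("id,text\n1,changed\n")

    # The imported copy still holds the bytes as they were at import time,
    # so anything computed from it stays valid.
    snapshot_text = imported.read_text()
    original_text = original.read_text()
```

The trade-off is the one mentioned above: the copy doubles the disk space used by the file.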

@makkus
Collaborator

makkus commented Nov 16, 2023

Ah, we might also want to have a section about 'file_bundles' (basically folders, but they can also be archives or anything else where several files belong together in an important way). Not sure if that needs to be referenced here, but personally I'd probably be curious what to do if I have that instead of a single file.

@caro401
Collaborator Author

caro401 commented Nov 16, 2023

> two important concepts that need to be explained (possibly somewhere else)

Yes, I think we need to get a lot of clarity on what the store is, how it works, why it exists, and what onboarding/import means on a technical level. I don't think that content belongs here (although a link would probably be useful), as I imagine these bits of docs as short things you come to when you have a specific problem and need a specific answer to get your work done. I'll open a discussion issue and write up what I understand about the store and aliases, but my knowledge is very incomplete.

> we might also want to have a section about 'file_bundles'

Sure, I can add that, but I don't know the answer. What is a file bundle? When would you use that rather than just importing lots of files individually? Is there anything you can do with a file bundle that you can't do with a file, or vice versa?

@makkus
Collaborator

makkus commented Nov 16, 2023

> What is a file bundle

A data type that contains one or several files, each identified by an internal (relative) sub-path within the bundle. The contained files are usually related in some way that is relevant to the computations that will be done on them (for example, multiple text files belonging to the same corpus).

> when would you use that rather than just importing lots of files individually

Whenever you have files that share that context and would be fed into a downstream operation at the same time. Otherwise the downstream operation would need an input field for every individual file, which would be inefficient and only possible if you know exactly how many (sub-)files you will be dealing with.
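
As a rough illustration in plain Python (not kiara's API), a bundle can be thought of as a mapping from relative sub-path to file content. `load_bundle` and `count_words` are hypothetical names; the point of the sketch is that an operation with one bundle input works for any number of files.

```python
import tempfile
from pathlib import Path


def load_bundle(folder: Path) -> dict[str, bytes]:
    """Map each file's path relative to `folder` to its content (hypothetical helper)."""
    return {
        path.relative_to(folder).as_posix(): path.read_bytes()
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def count_words(bundle: dict[str, bytes]) -> dict[str, int]:
    """A downstream operation with a single bundle input, however many files it holds."""
    return {
        rel_path: len(content.decode("utf-8").split())
        for rel_path, content in bundle.items()
    }


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny example corpus with a nested folder.
    corpus = Path(tmp) / "corpus"
    (corpus / "1920").mkdir(parents=True)
    (corpus / "1920" / "january.txt").write_text("first issue of the year")
    (corpus / "readme.txt").write_text("a small corpus")

    bundle = load_bundle(corpus)
    word_counts = count_words(bundle)
```

Without the bundle, `count_words` would need one parameter per file, which only works when the number of files is known in advance.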

> Is there anything you can do with a file bundle you can't do with a file or vice versa?

Technically not, I guess, but the real question is which operations that make sense for a single file also make sense for a file bundle. The only thing I can think of is doing the same operation on every sub-file of a bundle, which would be very inefficient and painful to do manually, so it'd be nice to have a module that can take a file bundle and do that operation for all included files. But we haven't had a use case like that so far, if I remember right.

For kiara's purposes, a file and a file_bundle are two different data types, and a module that takes one as input can't be used with the other. If an operation you want to use has a single file input, for example, you'd have to use a 'pick.file' operation on the file bundle first. Or, if you wanted to convert a single file to a file_bundle, you'd have to 'augment' the file with an internal relative path (which basically means adding information to the data), but that's not something we have had to do so far, I think.
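
The type distinction can be sketched in plain Python. The classes and functions below (`File`, `FileBundle`, `pick_file`, `to_bundle`, `count_bytes`) are hypothetical stand-ins for illustration, not kiara's data types or its 'pick.file' operation.

```python
from dataclasses import dataclass


@dataclass
class File:
    name: str
    content: bytes


@dataclass
class FileBundle:
    files: dict[str, File]  # relative sub-path within the bundle -> file


def pick_file(bundle: FileBundle, rel_path: str) -> File:
    """Select one file from a bundle by its relative sub-path."""
    return bundle.files[rel_path]


def to_bundle(file: File, rel_path: str) -> FileBundle:
    """'Augment' a single file with a relative path to turn it into a bundle."""
    return FileBundle(files={rel_path: file})


def count_bytes(file: File) -> int:
    """An operation with a single-file input; it rejects bundles."""
    if not isinstance(file, File):
        raise TypeError("count_bytes expects a File, not a FileBundle")
    return len(file.content)


bundle = FileBundle(files={"a/notes.txt": File("notes.txt", b"hello")})

# Passing the bundle directly fails, because the input type is different.
try:
    count_bytes(bundle)
    rejected = False
except TypeError:
    rejected = True

# Picking a file first makes the single-file operation usable.
size = count_bytes(pick_file(bundle, "a/notes.txt"))

# Going the other way needs extra information: the relative path.
wrapped = to_bundle(File("solo.txt", b"x"), "solo.txt")
```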

@caro401
Collaborator Author

caro401 commented Nov 16, 2023

Are there currently any examples or user stories of using a file bundle? Are there any operations that currently use them? If not, I'll pull this file-bundle discussion into a low-priority issue for now, clean up the single-file docs and move on.

@caro401 caro401 marked this pull request as ready for review November 16, 2023 13:07
@caro401 caro401 changed the title DRAFT: Initial draft of how to import a file How to import a file Nov 16, 2023
@caro401 caro401 requested a review from makkus November 16, 2023 13:22
@makkus
Collaborator

makkus commented Nov 16, 2023

> Are there currently any examples or user stories of using a file bundle

If I remember right, it's an important topic in language analysis, since there you usually have loads of separate text files.

@makkus
Collaborator

makkus commented Nov 16, 2023

Ah, also, when you import from an external archive service, it often delivers a zip file, which kiara would treat as a bundle before the single file in it is picked. So bundles are part of pipelines in that way.
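
A minimal sketch of that zip case, using only Python's standard library rather than kiara itself: the archive is read as a mapping of sub-paths to contents, and the single file is then picked out. `bundle_from_zip`, `pick_only_file` and the archive contents are made up for illustration.

```python
import io
import zipfile


def bundle_from_zip(archive_bytes: bytes) -> dict[str, bytes]:
    """Read a zip archive into a mapping of internal sub-path -> content."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {
            info.filename: archive.read(info)
            for info in archive.infolist()
            if not info.is_dir()
        }


def pick_only_file(bundle: dict[str, bytes]) -> bytes:
    """Return the content of the single file in the bundle, or fail loudly."""
    if len(bundle) != 1:
        raise ValueError(f"expected exactly one file, found {len(bundle)}")
    (content,) = bundle.values()
    return content


# Stand-in for a download from an archive service: a zip built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("export/records.csv", "id,title\n1,Example\n")
download = buffer.getvalue()

bundle = bundle_from_zip(download)
records = pick_only_file(bundle)
```

Failing when the archive holds more than one file is a design choice of this sketch; a real pipeline might instead pick by sub-path.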

@caro401
Collaborator Author

caro401 commented Nov 17, 2023

@makkus is this acceptable enough to merge, or what specific changes would you like? I've separated out the discussion of file_bundles and the store into separate issues #12 and #13

@caro401 caro401 merged commit 64c143e into main Dec 7, 2023
@caro401 caro401 deleted the how-to-load-file branch December 7, 2023 10:19