Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using File('<finename>').commit_shas results in error due to missing .tch file "/da5_fast/f2cFullU.13.tch" #50

Open
jnareb opened this issue Oct 15, 2022 · 7 comments
Assignees

Comments

@jnareb
Copy link

jnareb commented Oct 15, 2022

When trying to find all commits that changed given file (and then all projects that included given file at some point), I have tried to follow the example from oscar.py documentation: https://ssc-oscar.github.io/oscar.py/#oscar.File

from oscar import File
commits = File('minicms/templatetags/minicms_tags.py').commit_shas

Unfortunately, it does not work, and instead returns the following error:

OSError: Failed to close .tch "b'/da5_fast/f2cFullU.13.tch'": file not found
Exception ignored in: 'oscar.Hash.__del__'
OSError: Failed to close .tch "b'/da5_fast/f2cFullU.13.tch'": file not found
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "oscar.pyx", line 344, in oscar.cached_property.wrapper
  File "oscar.pyx", line 1578, in oscar.File.commit_shas
  File "oscar.pyx", line 574, in oscar._Base.read_tch
  File "oscar.pyx", line 523, in oscar._get_tch
  File "oscar.pyx", line 459, in oscar.Hash.__cinit__
OSError: Failed to open .tch file "b'/da5_fast/f2cFullU.13.tch'": file not found

In the discussion for woc-hack/mining-challenge-msr-2023#2 (where I originally created an issue for this problem) @audrism wrote:

f2c is available for version T only.
oscar.pyc only uses current version (U) which is absent.
getValues, on the other hand, checks if prior versions exist.

I propose that oscar use the same technique, that is use prior versions of f2c file if current version does not exist (possibly giving also some warning).

@audrism
Copy link
Contributor

audrism commented Oct 17, 2022

Copying discussion from challenge as it helps clear misconceptions:

I don't quite understand your objections, and why you want to remove functionality from oscar.py (and limit areas of the research).

No functionality is removed, just misleading and badly thought of interfaces.

First, one can also find purpose for finding rarely encountered files, like .gitattributes.
and
P.S. As for the problem of generating excessive load on the server, isn't it what job schedulers (also known as workload managers) are for?
Second, I would assume that those File accessors that produce generators, like .commits, would not require storing all the results in the memory, but would generate it on the fly. Or am I wrong here?

Common files (README.md) produce giant values. To handle those one needs to return iterator on iterators (over keys and then over values). This is not the job of a key value database as none allow unlimited value size.
Unfortunately the common filenames are the typical use case for File class in oscar.py, hence it needs to be deprecated.

A better alternative is to

zcat /da?_data/basemaps/gz/c2fFull${ver}*.s | grep README\.

as a substantial fraction of the entire collection of commits is returned, there is no expectation of immediate
result nor a massive amount of memory is used up by a single value (consisting of millions of commit sha1s).

Running shell scripts in python processes is, generally, a very bad bad idea that goes against the core design of unix tools in general and stream processing used in WoC in particular. Python filters could be use on the command line if one wants to utilize python for processing of these streams, but sed/awk/sort/join are much more effective and transparent alternatives in most cases.

It is important to note that other maps, such as a2c or b2c may return huge values as well, but in such cases this is because of bad data (generic author ID, empty file, etc) and keys for such values are not a typical use case (as in f2X). getValues actually handles such cases by storing these giant values using different means, but I am not sure if python classes actually are able to access them (this needs to be documented, especially if the answer is negative).

audrism added a commit to woc-hack/tutorial that referenced this issue Oct 17, 2022
@jnareb
Copy link
Author

jnareb commented Oct 19, 2022

@audrism , could you please update https://ssc-oscar.github.io/oscar.py/, so that the information that oscar.File(path) is deprecated is visible in the documentation (and possibly add information about what to use instead)? Thanks in advance.

@user2589
Copy link
Member

@jnareb I'm planning to fix this instead of deprecating. I cannot give a timeline, but I'll update the issue once it's done.

@jnareb
Copy link
Author

jnareb commented Oct 19, 2022

@user2589 thanks a lot in advance.

As I see it (but might be wrong, because I have not take a look at the code) it could look like this:

  1. Based on <path>, File constructor finds correct shard or shards where the filename is in f2c basemap
  2. Find most recent version of basemap (U, or T, or...), like lookup's getValues does
  3. Read from said file, unzipping it, until <path>
  4. Read line by line, lazily (i.e. use generator / iterator construction), until the end of <path> entries

Does this set of steps look right?

@audrism
Copy link
Contributor

audrism commented Oct 19, 2022

Yes, this is correct. There are two (or maybe three) independent tasks, however

  1. finding the latest version for the database
  2. getting the value out via iterator

Potentially third task concerns handling multiple backends
3) In fact version T exists in tch format for f2c and common filenames are stored separately as the value is too big for tch. E.g., commits for package.json are in /da5_fast/f2cFullT.13.tch.large.22038b6d
I am not entirely sure if the current implementation of oscar would/could utilize such separate files

@jnareb
Copy link
Author

jnareb commented Oct 19, 2022

Would it use *.s basemap files, or *.tch Tokyo Cabinet files? For sequential access (generators) you don't need random access of a database... I think.

@audrism
Copy link
Contributor

audrism commented Oct 20, 2022

  1. finding the latest version for the database could be useful for both, also version T tch is already there for f2c and may already work
  2. Yes, any pattern search would use *.s files

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants