Using `File('<finename>').commit_shas` results in error due to missing .tch file "/da5_fast/f2cFullU.13.tch" #50

jnareb · 2022-10-15T09:17:35Z

When trying to find all commits that changed given file (and then all projects that included given file at some point), I have tried to follow the example from oscar.py documentation: https://ssc-oscar.github.io/oscar.py/#oscar.File

from oscar import File
commits = File('minicms/templatetags/minicms_tags.py').commit_shas

Unfortunately, it does not work, and instead returns the following error:

OSError: Failed to close .tch "b'/da5_fast/f2cFullU.13.tch'": file not found
Exception ignored in: 'oscar.Hash.__del__'
OSError: Failed to close .tch "b'/da5_fast/f2cFullU.13.tch'": file not found
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "oscar.pyx", line 344, in oscar.cached_property.wrapper
  File "oscar.pyx", line 1578, in oscar.File.commit_shas
  File "oscar.pyx", line 574, in oscar._Base.read_tch
  File "oscar.pyx", line 523, in oscar._get_tch
  File "oscar.pyx", line 459, in oscar.Hash.__cinit__
OSError: Failed to open .tch file "b'/da5_fast/f2cFullU.13.tch'": file not found

In the discussion for woc-hack/mining-challenge-msr-2023#2 (where I originally created an issue for this problem) @audrism wrote:

f2c is available for version T only.
oscar.pyc only uses current version (U) which is absent.
getValues, on the other hand, checks if prior versions exist.

I propose that oscar use the same technique, that is use prior versions of f2c file if current version does not exist (possibly giving also some warning).

The text was updated successfully, but these errors were encountered:

audrism · 2022-10-17T12:53:30Z

Copying discussion from challenge as it helps clear misconceptions:

I don't quite understand your objections, and why you want to remove functionality from oscar.py (and limit areas of the research).

No functionality is removed, just misleading and badly thought of interfaces.

First, one can also find purpose for finding rarely encountered files, like .gitattributes.
and
P.S. As for the problem of generating excessive load on the server, isn't it what job schedulers (also known as workload managers) are for?
Second, I would assume that those File accessors that produce generators, like .commits, would not require storing all the results in the memory, but would generate it on the fly. Or am I wrong here?

Common files (README.md) produce giant values. To handle those one needs to return iterator on iterators (over keys and then over values). This is not the job of a key value database as none allow unlimited value size.
Unfortunately the common filenames are the typical use case for File class in oscar.py, hence it needs to be deprecated.

A better alternative is to

zcat /da?_data/basemaps/gz/c2fFull${ver}*.s | grep README\.

as a substantial fraction of the entire collection of commits is returned, there is no expectation of immediate
result nor a massive amount of memory is used up by a single value (consisting of millions of commit sha1s).

Running shell scripts in python processes is, generally, a very bad bad idea that goes against the core design of unix tools in general and stream processing used in WoC in particular. Python filters could be use on the command line if one wants to utilize python for processing of these streams, but sed/awk/sort/join are much more effective and transparent alternatives in most cases.

It is important to note that other maps, such as a2c or b2c may return huge values as well, but in such cases this is because of bad data (generic author ID, empty file, etc) and keys for such values are not a typical use case (as in f2X). getValues actually handles such cases by storing these giant values using different means, but I am not sure if python classes actually are able to access them (this needs to be documented, especially if the answer is negative).

jnareb · 2022-10-19T12:37:40Z

@audrism , could you please update https://ssc-oscar.github.io/oscar.py/, so that the information that oscar.File(path) is deprecated is visible in the documentation (and possibly add information about what to use instead)? Thanks in advance.

user2589 · 2022-10-19T17:09:00Z

@jnareb I'm planning to fix this instead of deprecating. I cannot give a timeline, but I'll update the issue once it's done.

jnareb · 2022-10-19T17:30:27Z

@user2589 thanks a lot in advance.

As I see it (but might be wrong, because I have not take a look at the code) it could look like this:

Based on <path>, File constructor finds correct shard or shards where the filename is in f2c basemap
Find most recent version of basemap (U, or T, or...), like lookup's getValues does
Read from said file, unzipping it, until <path>
Read line by line, lazily (i.e. use generator / iterator construction), until the end of <path> entries

Does this set of steps look right?

audrism · 2022-10-19T18:54:13Z

Yes, this is correct. There are two (or maybe three) independent tasks, however

finding the latest version for the database
getting the value out via iterator

Potentially third task concerns handling multiple backends
3) In fact version T exists in tch format for f2c and common filenames are stored separately as the value is too big for tch. E.g., commits for package.json are in /da5_fast/f2cFullT.13.tch.large.22038b6d
I am not entirely sure if the current implementation of oscar would/could utilize such separate files

jnareb · 2022-10-19T19:16:16Z

Would it use *.s basemap files, or *.tch Tokyo Cabinet files? For sequential access (generators) you don't need random access of a database... I think.

audrism · 2022-10-20T19:59:43Z

finding the latest version for the database could be useful for both, also version T tch is already there for f2c and may already work
Yes, any pattern search would use *.s files

jnareb mentioned this issue Oct 15, 2022

Missing .tch file '/da5_fast/f2cFullU.13.tch', also no python3 on da5 woc-hack/mining-challenge-msr-2023#2

Open

audrism assigned loconous Oct 17, 2022

audrism added a commit to woc-hack/tutorial that referenced this issue Oct 17, 2022

removed f2X and py File references ssc-oscar/oscar.py#50

33e8ed4

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using `File('<finename>').commit_shas` results in error due to missing .tch file "/da5_fast/f2cFullU.13.tch" #50

Using `File('<finename>').commit_shas` results in error due to missing .tch file "/da5_fast/f2cFullU.13.tch" #50

jnareb commented Oct 15, 2022

audrism commented Oct 17, 2022 •

edited

Loading

jnareb commented Oct 19, 2022

user2589 commented Oct 19, 2022

jnareb commented Oct 19, 2022

audrism commented Oct 19, 2022

jnareb commented Oct 19, 2022

audrism commented Oct 20, 2022

Using File('<finename>').commit_shas results in error due to missing .tch file "/da5_fast/f2cFullU.13.tch" #50

Using File('<finename>').commit_shas results in error due to missing .tch file "/da5_fast/f2cFullU.13.tch" #50

Comments

jnareb commented Oct 15, 2022

audrism commented Oct 17, 2022 • edited Loading

jnareb commented Oct 19, 2022

user2589 commented Oct 19, 2022

jnareb commented Oct 19, 2022

audrism commented Oct 19, 2022

jnareb commented Oct 19, 2022

audrism commented Oct 20, 2022

Using `File('<finename>').commit_shas` results in error due to missing .tch file "/da5_fast/f2cFullU.13.tch" #50

Using `File('<finename>').commit_shas` results in error due to missing .tch file "/da5_fast/f2cFullU.13.tch" #50

audrism commented Oct 17, 2022 •

edited

Loading