Potentially useful data in WoC #1

Open
audrism opened this issue Nov 18, 2020 · 7 comments

audrism commented Nov 18, 2020

The 128 tables b2cPtaPkgRPY.*.s in /da0_data/play/PYthruMaps/ contain the API import data for all versions of all Python files.

zcat /da0_data/play/PYthruMaps/b2cPtaPkgRPY.24.s | head -1
180000048a7ec70ed3e798f936b3cf63a696630e;3e0d624850d30ac75eb1bcfbf8f71b827a44d464;ppp0_openbroadcast;1386864075;ohrstrom <[email protected]>;django.forms;postman.utils.WRAP_WIDTH;django.db.transaction;__future__.unicode_literals;django.conf.settings;postman.models.Message;django.utils.translation.ugettext

The format is:

blob;commit;deforked project;time;author;pkg1;...;pkgn
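For anyone reading these tables from Python, here is a minimal sketch of splitting one line into the fields listed above. The names `ImportRecord`, `parse_record`, and `iter_records` are hypothetical, not part of WoC. The sketch assumes that the `.s` files are gzip-compressed (they are read with zcat above), that the time field is an integer timestamp, and that none of the first five fields contains a `;`.

```python
import gzip
from typing import Iterator, List, NamedTuple


class ImportRecord(NamedTuple):
    """One line of a b2cPtaPkgRPY table (hypothetical helper type)."""
    blob: str
    commit: str
    project: str         # deforked project name
    time: int            # assumed to be an integer timestamp
    author: str
    packages: List[str]  # pkg1 ... pkgn, possibly empty


def parse_record(line: str) -> ImportRecord:
    """Split a 'blob;commit;project;time;author;pkg1;...;pkgn' line."""
    fields = line.rstrip("\n").split(";")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    blob, commit, project, time, author = fields[:5]
    return ImportRecord(blob, commit, project, int(time), author, fields[5:])


def iter_records(path: str) -> Iterator[ImportRecord]:
    """Stream records from one gzip-compressed table without loading it whole."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield parse_record(line)
```

Calling `iter_records("/da0_data/play/PYthruMaps/b2cPtaPkgRPY.24.s")` would then yield one record per blob/commit pair. Author strings that happen to contain a semicolon would shift the fields, so a real run may need an extra sanity check on the time field.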

audrism commented Nov 22, 2020

As discussed, repo-to-package mappings in PyPI are quite unreliable, so I created them by parsing all versions of setup.py and setup.cfg files:
/da0_data/play/PYthruMaps/PkgName2PFullS.s /da0_data/play/PYthruMaps/P2PkgNameFullS.s

  1. A few package names still cannot be resolved by parsing alone, for example when the name is given as a variable or a function call; resolving these would require running the scripts.

  2. The package may be implemented in multiple places: while P takes care of the forks, there are often still multiple repos that implement the same package, for example if the repo does not rely on PyPI and copies/version-controls external code. I am not sure how many such instances there are, but they could be identified by
    a) the unusual number of packages they implement
    b) low centrality (in terms of, e.g., authors shared with other repos)
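Heuristic (a) could be sketched roughly as follows. The layout of the mapping files is not described in this thread, so the function takes plain (project, package) pairs rather than reading the files directly; the function names and the `max_packages` threshold are hypothetical and would need tuning against the real data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def packages_per_project(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect the distinct package names each project claims to implement."""
    claimed: Dict[str, Set[str]] = defaultdict(set)
    for project, package in pairs:
        claimed[project].add(package)
    return dict(claimed)


def flag_unusual_projects(pairs: Iterable[Tuple[str, str]],
                          max_packages: int = 5) -> List[str]:
    """Return projects implementing more than max_packages distinct packages.

    The threshold is an illustrative assumption, not a value from the thread.
    """
    claimed = packages_per_project(pairs)
    return sorted(p for p, names in claimed.items() if len(names) > max_packages)
```

Projects flagged this way are only candidates for vendored or copied code; heuristic (b) would need author data and is not covered by this sketch.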

Collaborator

SAMFYB commented Nov 22, 2020

Hi Audris, thank you for the help! In addition to Python projects, we are now also running co-occurrence on JS projects from these tables: /da0_data/play/JSthruMaps/b2cPtaPkgJJS.*.gz.

Collaborator

SAMFYB commented Nov 22, 2020

How much data can we store on the server?

Collaborator

SAMFYB commented Nov 22, 2020

Also, is it the case that for these tables, entries from the same project will only appear in one of the tables, not multiple?


audrism commented Nov 22, 2020

  • Version J is more than two years old; perhaps use the more recent version R: /da0_data/play/JSthruMaps/b2cPtaPkgRJS.0.s

  • Storage: I created a folder on da0 where you can store project data
    /data/play/diversity-innovation
    Please let me know if you plan to use more than 1TB of disk space so that I can arrange it.

  • Tables: tables are grouped by blob, so projects will be distributed over all tables
    You might want to use PtaPkgRJS.*.s if you want to group by project.
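Because the blob-grouped tables spread each project over all shards, any per-project statistic computed from them has to be accumulated across every table. A minimal sketch of that accumulation is below. It assumes the JS tables use the same blob;commit;project;time;author;pkg1;...;pkgn layout described earlier in this thread, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def imports_by_project(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Merge imported package names per project from lines of any shard(s)."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 5:
            continue  # skip lines that do not match the assumed layout
        project = fields[2]
        # Fields after the fifth are package names; ignore empty ones.
        grouped[project].update(name for name in fields[5:] if name)
    return dict(grouped)
```

Holding every project in memory may not scale to the full data set, which is one reason the project-grouped PtaPkgRJS.*.s tables are likely the easier starting point.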

Collaborator

SAMFYB commented Nov 22, 2020

Thank you! From running the script on a small sample, we estimate the storage we need is just about 50G.

Collaborator

SAMFYB commented May 7, 2021

@audrism
Hi Audris, just a heads-up: we are currently using ~720G of disk space on the server. It is unlikely that we will use much more than that.
