How to connect db entries from the table "sites" to a belonging warc-file? #156

Open
mxnx1 opened this issue Jun 17, 2019 · 2 comments
@mxnx1

mxnx1 commented Jun 17, 2019

Hi brozzler-team,

I want to export the database entries that belong to a specific WARC file from the tables jobs, sites and pages.
I know how to connect those tables to each other, but I couldn't find a connection to the captures table or directly to the corresponding WARC file.

Does it work via the "WARC-Date" in the warcinfo record of the WARC file and "last_claimed" in the sites table?

A hint would be great. Thanks.

@nlevitt
Contributor

nlevitt commented Jun 17, 2019

You can set the warc prefix using warcprox-meta as shown here: https://github.com/internetarchive/brozzler/blob/master/job-conf.rst#using-warcprox-meta

If you don't, captures from all your jobs and sites will be mixed together in the same warcs.
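
For illustration, here is a minimal Python sketch of what that setting looks like once serialized. It assumes, following the warcprox API documentation, that the value travels as a JSON `Warcprox-Meta` request header. The helper name `warcprox_meta_header` and the prefix `my-job-prefix` are hypothetical and not part of brozzler or warcprox.

```python
import json


def warcprox_meta_header(warc_prefix):
    """Build a Warcprox-Meta request header (hypothetical helper for illustration)."""
    # "warc-prefix" is the key shown in the sites entry under "warcprox_meta".
    meta = {"warc-prefix": warc_prefix}
    return {"Warcprox-Meta": json.dumps(meta)}


headers = warcprox_meta_header("my-job-prefix")
print(headers["Warcprox-Meta"])
```

Using a distinct prefix per job or site is what keeps their captures in separate WARC files.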

@mxnx1
Author

mxnx1 commented Jun 18, 2019

Thank you for your reply.
I use the warc-prefix, but I have several WARC files with the same warc-prefix, distinguished by a timestamp and an ID, both of which are created automatically.

Example WARC file names:
Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz
Chile_Google_Search_Countries-20190526122808538-7aek1ud9-00000.warc.gz
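
A rough Python sketch of how such names could be split into their parts. The pattern (prefix, 17-digit timestamp, random token, serial number) is inferred from the two names above and is an assumption, not a documented format; the timestamp is also assumed to be UTC. `parse_warc_name` is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout, inferred from the example names:
# <prefix>-<17-digit timestamp>-<random token>-<serial>.warc.gz
WARC_NAME = re.compile(
    r"^(?P<prefix>.+)-(?P<ts>\d{17})-(?P<token>[0-9a-z]+)-(?P<serial>\d+)\.warc(\.gz)?$"
)


def parse_warc_name(name):
    """Split a WARC file name into prefix, creation time, token and serial."""
    match = WARC_NAME.match(name)
    if match is None:
        raise ValueError("unexpected WARC file name: " + name)
    ts = match.group("ts")
    # First 14 digits: YYYYMMDDHHMMSS; last 3 digits: milliseconds.
    created = datetime.strptime(ts[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    created += timedelta(milliseconds=int(ts[14:]))
    return {
        "prefix": match.group("prefix"),
        "created": created,
        "token": match.group("token"),
        "serial": int(match.group("serial")),
    }


info = parse_warc_name(
    "Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz"
)
print(info["prefix"], info["created"].isoformat())
```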


On the brozzler dashboard, navigation to the captured content goes
via jobs - sites - pages - wayback,
so the table entries are explicitly connected to the corresponding WARC files.

I understand how the tables jobs, sites and pages are connected: via job_id and site_id.
But I am wondering how brozzler connects the WARC files to its table entries (jobs, sites, pages).
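
To make the part that is already clear concrete, here is a small Python sketch of the job_id / site_id join on rows that have already been loaded from the database. The function `export_job` and the sample rows are hypothetical; only the field names `id`, `job_id` and `site_id` come from the tables described here.

```python
def export_job(job_id, jobs, sites, pages):
    """Collect a job with its sites and pages via job_id and site_id."""
    job = next(j for j in jobs if j["id"] == job_id)
    job_sites = [s for s in sites if s["job_id"] == job_id]
    site_ids = {s["id"] for s in job_sites}
    job_pages = [p for p in pages if p["site_id"] in site_ids]
    return {"job": job, "sites": job_sites, "pages": job_pages}


# Hypothetical sample rows; real entries carry many more fields.
jobs = [{"id": "google_search_countries_09062019"}]
sites = [
    {"id": "7133eeeb-9e57-4ccf-837d-08e427c1a4fa",
     "job_id": "google_search_countries_09062019"},
    {"id": "some-other-site", "job_id": "some-other-job"},
]
pages = [
    {"id": "page-1", "site_id": "7133eeeb-9e57-4ccf-837d-08e427c1a4fa"},
    {"id": "page-2", "site_id": "some-other-site"},
]

export = export_job("google_search_countries_09062019", jobs, sites, pages)
print(len(export["sites"]), len(export["pages"]))
```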

  • I can imagine an improvised connection, but it is not very explicit:
    a combination of the warc-prefix and the "last_claimed" date from the sites table, used to find the matching WARC file via its file name or its WARC-Date.
    However, the WARC-Date (warcinfo record) and last_claimed (sites table) are not identical; they differ by one second.

  • I am missing an explicit corresponding field.
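
The improvised connection from the first bullet could be sketched in Python roughly as follows. This is only the heuristic described above, not how brozzler itself links entries to WARC files. The function `candidate_warcs` and the 5-second tolerance are assumptions, and the date of the second file is taken from its file name rather than from a warcinfo record.

```python
from datetime import datetime, timedelta, timezone


def candidate_warcs(site, warcs, tolerance=timedelta(seconds=5)):
    """Return WARC names whose prefix matches and whose date is near last_claimed.

    warcs is an iterable of (filename, warc_date) pairs, warc_date being a
    timezone-aware datetime.
    """
    prefix = site["warcprox_meta"]["warc-prefix"]
    claimed = site["last_claimed"]
    return [
        name
        for name, warc_date in warcs
        if name.startswith(prefix + "-") and abs(warc_date - claimed) <= tolerance
    ]


site = {
    "last_claimed": datetime(2019, 6, 9, 20, 26, 49, tzinfo=timezone.utc),
    "warcprox_meta": {"warc-prefix": "Chile_Google_Search_Countries"},
}
warcs = [
    ("Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz",
     datetime(2019, 6, 9, 20, 26, 50, tzinfo=timezone.utc)),
    ("Chile_Google_Search_Countries-20190526122808538-7aek1ud9-00000.warc.gz",
     datetime(2019, 5, 26, 12, 28, 8, tzinfo=timezone.utc)),
]

matches = candidate_warcs(site, warcs)
print(matches)
```

A tolerance window absorbs the one-second difference, but the match stays a guess: it only finds files that start close to the last claim, so it can miss earlier claims of the same site or pick up another site that shares the prefix.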

I need this connection to export the information (in jobs, sites, pages) that belongs to the WARC files from the database.

Can you tell me how brozzler connects the WARC files to its table entries in jobs, sites and pages?


Part of a sites entry:

"active_brozzling_time": 31.814205646514893 ,
"claimed": false ,
"cookie_db": <binary, 20.0KB, "53 51 4c 69 74 65..."> ,
"id": "7133eeeb-9e57-4ccf-837d-08e427c1a4fa" ,
"ignore_robots": true ,
"job_id": "google_search_countries_09062019" ,
"last_claimed": Sun Jun 09 2019 20:26:49 GMT+00:00 ,
"last_claimed_by": "xxxxxxx" ,
"last_disclaimed": Sun Jun 09 2019 20:27:21 GMT+00:00 ,
....

"warcprox_meta": {
    "warc-prefix": "Chile_Google_Search_Countries"
}

Example of a warcinfo record:

WARC/1.0
WARC-Record-ID: urn:uuid:4cd61096-662e-442f-ad1a-fe5eb870cae4
WARC-Type: warcinfo
WARC-Filename: Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz
WARC-Date: 2019-06-09T20:26:50Z
Content-Type: application/warc-fields
Content-Length: 99

software: warcprox 2.4b6
hostname: xxxxxxx
ip: xxxxxxxx
format: WARC File Format 1.0
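
For the comparison with last_claimed, the relevant fields of such a record can be pulled out with a few lines of Python. This is a minimal sketch that parses the header block as plain text; `parse_warc_headers` is a hypothetical helper, and a WARC library would be the more robust choice for reading real files.

```python
from datetime import datetime, timezone


def parse_warc_headers(record_text):
    """Return (version line, header dict) for the header block of one WARC record."""
    lines = record_text.strip().splitlines()
    version = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers


record = """WARC/1.0
WARC-Record-ID: urn:uuid:4cd61096-662e-442f-ad1a-fe5eb870cae4
WARC-Type: warcinfo
WARC-Filename: Chile_Google_Search_Countries-20190609202650584-13pemhtq-00000.warc.gz
WARC-Date: 2019-06-09T20:26:50Z
Content-Type: application/warc-fields
Content-Length: 99

software: warcprox 2.4b6
"""

version, headers = parse_warc_headers(record)
# WARC-Date ends in "Z", so the value is treated as UTC.
warc_date = datetime.strptime(headers["WARC-Date"], "%Y-%m-%dT%H:%M:%SZ").replace(
    tzinfo=timezone.utc
)
print(headers["WARC-Filename"], warc_date.isoformat())
```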
