Update to latest Heritrix #345
Comments
Using 3.4.0-X prevents the status update in Advanced>Heritrix |
Initial integration in issue-345 branch. |
5cf1067 writes WARCs from the new Heritrix but for some reason they are not indexed and accessible with the previously configured OpenWayback. |
Completed more integration steps in ...despite the indexing scripts running periodically. The resulting CDX is still blank. |
It looks like a |
Adding the ...but index.cdx remains at 0 bytes. |
It seems like the WAIL logic is tied to |
to .warc files from WAIL code. Relevant to #345.
There still appears to be an issue with the CDXJ merging procedure in 1df85bc. The new CDXZ.GZ file is created but reset to 0 when merging with the existing index.cdx. |
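One way to merge a newly generated CDX into an existing index.cdx without zeroing it is sketched below. This is an illustration only, not the code in 1df85bc: the function name `merge_cdx` and the file names are hypothetical. It assumes the index fits in memory and that OpenWayback expects byte-wise sort order (as produced by `LC_ALL=C sort`).

```python
from pathlib import Path


def merge_cdx(inputs, output):
    """Merge CDX files into one byte-sorted, de-duplicated index.

    Hypothetical helper: every input is read completely before the output
    is written, so an input that is also the output is never truncated
    while it is still being read.
    """
    lines = set()
    for path in inputs:
        path = Path(path)
        if path.exists():
            # Skip blank lines; duplicate lines (e.g. repeated CDX headers)
            # collapse because a set is used.
            lines.update(l for l in path.read_bytes().splitlines() if l.strip())
    # Sorting bytes objects compares byte values, like `LC_ALL=C sort`.
    merged = sorted(lines)
    Path(output).write_bytes(b"".join(l + b"\n" for l in merged))
    return len(merged)
```

Calling `merge_cdx(["index.cdx", "new.cdx"], "index.cdx")` would fold the new records into the existing index in place.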
The current master 7027b9b with OpenWayback 2.4.0 seems to work fine with regard to indexing, so maybe let's hold off on integrating the latest Heritrix until we figure out the delta between 7027b9b and 3020f99 that is preventing WARC responses from being replayable. An interesting aside: despite the prefix for WARCs in the config in 3020f99 being |
Tried this again by pulling the latest release of Heritrix into the latest WAIL master (distrib package at https://github.com/internetarchive/heritrix3/releases/tag/3.4.0-20200518), started a crawl from the WAIL UI, and Heritrix never started. WAIL CLI (run with Python) reported:
heritrix_out.log reported an issue:
Asked about Java version on the #heritrix channel of the IIPC Slack. |
@ldko noted on the IIPC Slack #heritrix channel that Heritrix dropped support for Java 7 (August 2019) and suggested trying Java 8-11.
In Java 11 there are some larger files in the JDK like Contents/Home/lib/modules (137.4MB) and Contents/Home/lib/src.zip (57.5 MB) that don't play well with git. Removing these for testing so the branches can be merged. |
The issue-345-java11 branch was never pushed to GitHub. The files were too big and, when they were removed, Heritrix segfaulted on launch. I added the full JDK back into the WAIL source locally without pushing to GitHub, and issues still occur when trying to launch Heritrix via what is essentially the current WAIL master branch with paths adjusted for the new Heritrix and JDK version.
I attempted to update to the latest Heritrix and use the Zulu JDK zulu16.30.19-ca-jdk16.0.1-macosx_aarch64 (a JDK 16 build rather than JDK 11) but again, the files are too large for GitHub. There should be a way to remotely pull in this asset at build time. The current issue-345 branch has the latest Heritrix but it requires a newer Java than 7. |
Prior version of tomcat would not work with Java 11. There is still the issue of newer versions of Java being too large for GitHub but a pull of OpenJDK allows all services to start. Re:#345
A related aside: by default, the indexing procedure in OpenWayback uses the filename of the WARC as part of the basis for the CDX filename. Newer versions of Heritrix compress using gzip by default (?), so .cdx.gz files are created. The indexing/coalescence procedure in WAIL is only checking for |
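A minimal sketch of collecting per-WARC index files under both naming patterns follows. The helper `find_index_files`, the suffix list, and the excluded merged-index name are assumptions for illustration, not WAIL's actual code.

```python
from pathlib import Path

# Assumed suffixes, based on the .cdx.gz names described above.
INDEX_SUFFIXES = (".cdx", ".cdx.gz")


def find_index_files(directory, suffixes=INDEX_SUFFIXES, exclude=("index.cdx",)):
    """Return per-WARC index files whose names end in an accepted suffix.

    The merged index (assumed here to be named index.cdx) is excluded so
    that it is not fed back into its own merge.
    """
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(suffixes) and p.name not in exclude
    )
```

Accepting a tuple of suffixes keeps the check in one place if further naming variants turn up.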
A new version was released on August 3, 2021: https://github.com/internetarchive/heritrix3/releases/tag/3.4.0-20210803 |
An alternative to bundling Heritrix might be to use the submodule approach and provide a reference in WAIL to the latest Heritrix, tweaking as needed for WAIL, or to pull the dist from the Heritrix releases and tweak if the package in the format is not needed. Preliminary experiments with versions of Heritrix newer than the 3.2.0 bundled in the most recent release indicate that problems seen with the same crawl configurations (e.g., issues like #458) would be remedied with a newer version of Heritrix, so this GH issue ought to be a priority (thus pinned). |
The submodule approach would work with Heritrix, but the issue is not the size of Heritrix but the size of the JDK that needs to be bundled for Heritrix to run. GitHub repos alone are insufficient for files the size of those contained in the JDK. |
This is a detail of packaging for release.
In the |
Attempting to replicate the current state of the EDIT: We only pull in the large |
After detecting architecture and serving the respective
Thus, more than just the |
Also, some bash logic for detecting platform and pulling in the respective platform's modules file is present in the |
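The platform detection described above is done in bash in the branch. The same idea can be sketched in Python as below. The labels returned by the hypothetical `jdk_variant` helper are illustrative (they mirror the `macosx_aarch64` style of the Zulu file name mentioned earlier) and are not necessarily the names WAIL uses.

```python
import platform

# Hypothetical labels for the bundled-JDK variants.
ARCH_ALIASES = {
    "arm64": "aarch64",    # reported by Apple Silicon Macs
    "aarch64": "aarch64",
    "x86_64": "x64",
    "amd64": "x64",        # reported by 64-bit Windows
}
OS_ALIASES = {"darwin": "macosx", "linux": "linux", "windows": "win"}


def jdk_variant(system=None, machine=None):
    """Map the running platform to a '<os>_<arch>' label for asset selection."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    os_label = OS_ALIASES.get(system)
    arch = ARCH_ALIASES.get(machine)
    if os_label is None or arch is None:
        raise ValueError(f"unsupported platform: {system}/{machine}")
    return f"{os_label}_{arch}"
```

A build script could use the returned label to decide which platform-specific file to fetch, so that the large files never need to live in the Git repository.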
OpenWayback has issues starting w/ the alternative JDK/JRE in the ...
% export JAVA_HOME=/Applications/WAIL.app/bundledApps/Java/JavaVirtualMachines/temurin-17.jdk/Contents/Home/
% /Applications/WAIL.app/bundledApps/tomcat/bin/startup.sh
produces a log in ...
-Djava.endorsed.dirs=/Applications/WAIL.app/bundledApps/tomcat/endorsed is not supported. Endorsed standards and standalone APIs
in modular form will be supported via the concept of upgradeable modules.
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
EDIT: ...
EDIT2: This flag is used multiple times in catalina.sh
EDIT3: Around line 397 in catalina.sh, the command:
eval \"$_RUNJAVA\" \"$LOGGING_CONFIG\" $JAVA_OPTS $CATALINA_OPTS \
-Djava.endorsed.dirs=\"$JAVA_ENDORSED_DIRS\" -classpath \"$CLASSPATH\" \
-Dcatalina.base=\"$CATALINA_BASE\" \
-Dcatalina.home=\"$CATALINA_HOME\" \
-Djava.io.tmpdir=\"$CATALINA_TMPDIR\" \
org.apache.catalina.startup.Bootstrap "$@" start \
>> "$CATALINA_OUT" 2>&1 "&"
is the culprit, namely the
...appears to start the tomcat service (it's displayed in
HTTP Status 500 - Unable to compile class for JSP:
type Exception report
message Unable to compile class for JSP:
description The server encountered an internal error that prevented it from fulfilling this request.
exception org.apache.jasper.JasperException: Unable to compile class for JSP:
An error occurred at line: 1 in the generated java file
An error occurred at line: 1 in the generated java file
An error occurred at line: 1 in the generated java file
An error occurred at line: 24 in the generated java file
An error occurred at line: 29 in the generated java file
An error occurred at line: 30 in the generated java file
An error occurred at line: 55 in the generated java file
An error occurred at line: 56 in the generated java file
An error occurred at line: 1 in the jsp file: /index.jsp
An error occurred at line: 1 in the jsp file: /index.jsp
An error occurred at line: 1 in the jsp file: /index.jsp
An error occurred at line: 1 in the jsp file: /index.jsp
An error occurred at line: 5 in the jsp file: /index.jsp
An error occurred at line: 9 in the jsp file: /index.jsp
An error occurred at line: 12 in the jsp file: /index.jsp
An error occurred at line: 12 in the jsp file: /index.jsp
An error occurred at line: 14 in the jsp file: /index.jsp
An error occurred at line: 14 in the jsp file: /index.jsp
An error occurred at line: 15 in the jsp file: /index.jsp
An error occurred at line: 15 in the jsp file: /index.jsp
An error occurred at line: 15 in the jsp file: /index.jsp
An error occurred at line: 18 in the jsp file: /index.jsp
An error occurred at line: 19 in the jsp file: /index.jsp
An error occurred at line: 19 in the jsp file: /index.jsp
An error occurred at line: 20 in the jsp file: /index.jsp
An error occurred at line: 20 in the jsp file: /index.jsp
An error occurred at line: 20 in the jsp file: /index.jsp
An error occurred at line: 25 in the jsp file: /index.jsp
An error occurred at line: 25 in the jsp file: /index.jsp
An error occurred at line: 25 in the jsp file: /index.jsp
An error occurred at line: 27 in the jsp file: /index.jsp
29: <jsp:include page="/WEB-INF/template/UI-footer.jsp" flush="true" />
An error occurred at line: 27 in the jsp file: /index.jsp
29: <jsp:include page="/WEB-INF/template/UI-footer.jsp" flush="true" />
An error occurred at line: 27 in the jsp file: /index.jsp
29: <jsp:include page="/WEB-INF/template/UI-footer.jsp" flush="true" />
An error occurred at line: 29 in the jsp file: /index.jsp
29: <jsp:include page="/WEB-INF/template/UI-footer.jsp" flush="true" />
Stacktrace:
Apache Tomcat/7.0.30
It might be that this newer version of Java and older version of Tomcat are not compatible. Some details:
/Applications/WAIL.app/bundledApps/tomcat/bin/bootstrap.jar:/Applications/WAIL.app/bundledApps/tomcat/bin/tomcat-juli.jar
Server version: Apache Tomcat/7.0.30
Server built: Sep 2 2012 09:50:47
Server number: 7.0.30.0
OS Name: Mac OS X
OS Version: 13.2.1
Architecture: aarch64
JVM Version: 17.0.4.1+1
JVM Vendor: Eclipse Adoptium
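The JVM rejects `-Djava.endorsed.dirs` because the endorsed-standards override mechanism was removed in Java 9. A launcher could pass the flag only to older JVMs, as in the sketch below. The helper names `java_major` and `endorsed_dirs_args` are hypothetical, and this is not how catalina.sh or WAIL currently decides.

```python
import re


def java_major(version):
    """Return the major Java version: '1.7.0_79' -> 7, '17.0.4.1+1' -> 17."""
    parts = re.findall(r"\d+", version)
    if not parts:
        raise ValueError(f"unrecognized Java version: {version!r}")
    major = int(parts[0])
    # Versions before Java 9 are written as 1.x (1.7, 1.8).
    if major == 1 and len(parts) > 1:
        return int(parts[1])
    return major


def endorsed_dirs_args(version, endorsed_dir):
    """Return the endorsed-dirs JVM argument only where the JVM accepts it."""
    if java_major(version) < 9:
        return [f"-Djava.endorsed.dirs={endorsed_dir}"]
    return []
```

With the JVM version reported above (17.0.4.1+1) this returns an empty list, so the unsupported flag would not be passed.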
I updated to Tomcat 10 in the heritrix-2022 branch, but starting Tomcat using startup.sh (after setting execute permissions on catalina.sh) shows a 404 in the browser. Tomcat starts fine but the OpenWayback app is not initialized. |
The error in catalina.out:
is quite similar to iipc/openwayback#169. Was this change merged into the main branch of OpenWayback? |
Per the change in iipc/openwayback#198 -- within WAIL's OpenWayback, the EDIT: correction, the PR has the paths prefixed with a "/" to imply absolute. This is how it is in WAIL's OpenWayback. A complete absolute path relative to the file system does not remedy the issue. |
Note that, as described in internetarchive/heritrix3#237, some of the beans have been renamed (for example, BDBCookieStorage is now BDBCookieStore). The crawler configs generated by Heritrix ought to match what it expects, but WAIL generates its own crawl configs, which won't match the expectations of the newer crawler. |
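One possible way to keep WAIL's generated configs in step with renamed beans is a rename table applied to the config text before it is written. The sketch below assumes this approach. The helper `rename_bean_classes` is hypothetical, and the names in the example are spelled as in the comment above, so the exact class names in Heritrix may differ in capitalization.

```python
import re


def rename_bean_classes(cxml_text, renames):
    """Replace whole-word occurrences of old class names with new ones.

    `renames` maps old name -> new name. Word boundaries prevent partial
    matches inside longer identifiers.
    """
    for old, new in renames.items():
        cxml_text = re.sub(r"\b" + re.escape(old) + r"\b", new, cxml_text)
    return cxml_text


# Example table, using the spelling from the comment above (an assumption).
RENAMES = {"BDBCookieStorage": "BDBCookieStore"}
```

Only the names listed in the table are touched. Anything else in the config that changed between Heritrix versions would still need separate handling.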
A lot was done in the heritrix-update-oct-2024 branch. The latest Heritrix is working there. We also mitigated the large files in Git from JDK 17. One issue remains: cdx-indexer generates a CDX file in the right location from the WARCs that Heritrix created, and an entry is displayed in the calendar view, but on access OpenWayback says the resource is not available. Both JDK 1.7 and JDK 17 are included to account for the limitations of OpenWayback and Heritrix, respectively. |
I believe there is a race condition: a process first writes filename information to the index.cdx file, and only afterwards is the CDX data actually populated. This confuses OpenWayback. Keep the index.cdx file open in VS Code to see it toggle between one line and many. |
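A common way to avoid readers seeing a half-written index is to write the complete file under a temporary name and then rename it into place. The sketch below illustrates that pattern. The helper `publish_index` is hypothetical, and this is not necessarily how the problem is resolved in WAIL.

```python
import os
import tempfile


def publish_index(index_path, lines):
    """Write all CDX lines to a temp file, then atomically swap it in.

    A reader (such as OpenWayback) sees either the old index or the
    complete new one, never a partially populated file. The temp file is
    created in the same directory so the rename stays on one filesystem.
    """
    directory = os.path.dirname(os.path.abspath(index_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.writelines(line.rstrip(b"\n") + b"\n" for line in lines)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, index_path)  # atomic rename
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```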
fdd839b fixes this issue. Heritrix-provided URLs are crawled and replayable! 🙌 |
Tried this on another Apple MX machine from source fdd839b and upon attempting to start a job, Heritrix 3.4.0-20240909 reports:
SEVERE Failed to start bean 'bdb'; nested exception is java.lang.RuntimeException: com.sleepycat.je.DiskLimitException: (JE 7.5.11) Disk usage is not within je.maxDisk or je.freeDisk limits and write operations are prohibited: maxDiskLimit=0 freeDiskLimit=5,368,709,120 adjustedMaxDiskLimit=0 maxDiskOverage=0 freeDiskShortage=396,890,112 diskFreeSpace=4,971,819,008 availableLogSize=-396,890,112 totalLogSize=1,266 activeLogSize=1,266 reservedLogSize=0 protectedLogSize=0 protectedLogSizeMap={} (in thread '1728049223 launchthread')
EDIT: related discussion from the Heritrix repo: internetarchive/heritrix3#340 |
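The numbers in the exception line up: freeDiskLimit (5,368,709,120 bytes, i.e. 5 GiB) minus diskFreeSpace (4,971,819,008) gives the reported freeDiskShortage of 396,890,112 bytes. A pre-flight check along these lines could warn before a job is started. The function names are hypothetical, and the 5 GiB default is taken from the log above rather than from WAIL's configuration.

```python
import shutil

# freeDiskLimit reported in the log above: 5 GiB.
JE_FREE_DISK_LIMIT = 5 * 1024**3


def free_disk_shortage(free_bytes, free_disk_limit=JE_FREE_DISK_LIMIT):
    """Return how many bytes short of the limit the disk is (0 if enough)."""
    return max(0, free_disk_limit - free_bytes)


def shortage_for_path(path, free_disk_limit=JE_FREE_DISK_LIMIT):
    """Check the filesystem holding `path` (e.g. the Heritrix jobs directory)."""
    return free_disk_shortage(shutil.disk_usage(path).free, free_disk_limit)
```

For the machine in the report, `free_disk_shortage(4_971_819_008)` returns 396890112, matching freeDiskShortage in the exception.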
New releases beyond the 3.4.0 version that was minted by the National Library of Iceland are being pushed to https://github.com/internetarchive/heritrix3/releases. It would be good to get a newer version in WAIL beyond 3.2.0.
EDIT: Newer releases are important with regard to crawl quality and other tickets. The barrier has been the bundled Java. See comments below for mitigation plans relative to development vs. release.
EDIT 20220826: Use the heritrix-2022 branch for a newer JDK and Heritrix. OpenWayback cannot use the newer JDK, it seems, but the newer Heritrix appears to run well there.
NOTE: Heritrix releases are now available at Maven Central instead of GitHub.