This problem has already been fixed several times: either by increasing the amount of memory (#796, #807) or by refactoring the Pig script to minimize its memory footprint during the RANK operation (CeON/CoAnSys#425).
After the recent increase in the number of publications (to 37M), we are again struggling with a memory-related problem:
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:401)
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
Apparently there is an alternative processing path within the CoAnSys document similarity module that uses custom ranking. At some point that script became incompatible with the rest of the document similarity algorithm (as described in CeON/CoAnSys#427), but we could still use its custom rank serializer to replace the following line, which causes the problem:
wc_ranked = rank wc by count asc;
with a custom ranking solution:
wc_tmp = order wc by count asc parallel 1;
STORE wc_tmp INTO '$outputPath$WORD_RANK_HR' using pl.edu.icm.coansys.similarity.pig.serializers.RankStorage();
wc_ranked = LOAD '$outputPath$WORD_RANK_HR' as (rank_num:long, count:long, term:chararray);
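The idea behind this replacement can be sketched outside Pig. Once the records are globally ordered by a single task (`parallel 1`), a rank number can be prepended to each record as it is written, so the ranking needs no in-memory buffering of the whole relation. The snippet below is only an illustration under assumptions: the internals of `RankStorage` are not shown in this issue, and the function name `write_ranked`, the tab-separated layout, and the 1-based counter are hypothetical. It also gives tied counts distinct consecutive numbers, which may differ from what Pig's `RANK` does.

```python
import io
from typing import Iterable, TextIO, Tuple


def write_ranked(
    sorted_records: Iterable[Tuple[int, str]],
    out: TextIO,
    start: int = 1,  # assumption: ranks start at 1
) -> int:
    """Stream (count, term) records that are already sorted by count
    and write them as rank_num<TAB>count<TAB>term lines.

    Only one record is held in memory at a time.
    Returns the number of lines written.
    """
    written = 0
    for rank_num, (count, term) in enumerate(sorted_records, start=start):
        out.write(f"{rank_num}\t{count}\t{term}\n")
        written += 1
    return written


# Usage: sort a tiny word-count relation ascending by count, then rank it.
records = sorted([(5, "pig"), (1, "rank"), (3, "memory")])
buf = io.StringIO()
write_ranked(records, buf)
print(buf.getvalue(), end="")
```

The three-column output matches the schema used in the `LOAD` statement above (`rank_num`, `count`, `term`), which is why the ranked relation can be read back without any further transformation.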
The first tests on 37M (and on 103M) documents showed that the proposed solution eliminates the memory-related issue, so we could:
This issue was reported in the CoAnSys project: CeON/CoAnSys#432.
A pull request including the fix was issued: CeON/CoAnSys#433.
Until a new version of the CoAnSys document similarity module is released, a manual patch will be applied within IIS.
The full log is available here: https://pastebin.com/dk2C8wLF
The Pig execution plan is available here:
https://pastebin.com/bAUsCNjb
It again indicates that the RANK operation is the phase in which the map task failed.