Replace buggy pig rank function with custom solution #927

Closed
marekhorst opened this issue Dec 3, 2018 · 2 comments

This problem has already been fixed several times: either by increasing the amount of memory (#796, #807) or by refactoring the Pig script to minimize the memory footprint of the RANK operation (CeON/CoAnSys#425).

After the recent increase in the number of publications (to 37M), we are again struggling with a memory-related problem:

java.lang.OutOfMemoryError
    at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:401)
    at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)

The full log is available here: https://pastebin.com/dk2C8wLF

The Pig execution plan is available here:
https://pastebin.com/bAUsCNjb

The job summary again points to the RANK operation (alias wc_ranked) as the phase in which the map task failed:

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_1524597382992_21544 wc_ranked   ORDER_BY    Message: Job failed!   

marekhorst commented Dec 3, 2018

Apparently there is an alternative processing path within the CoAnSys documents similarity module that involves custom ranking. At some point that script became incompatible with the rest of the documents similarity algorithm (as described in CeON/CoAnSys#427), but we could still use its custom rank serializer to replace the following problematic line:

wc_ranked = rank wc by count asc;

with a custom ranking solution:

wc_tmp = order wc by count asc parallel 1;
STORE wc_tmp INTO '$outputPath$WORD_RANK_HR' using pl.edu.icm.coansys.similarity.pig.serializers.RankStorage();
wc_ranked = LOAD '$outputPath$WORD_RANK_HR' as (rank_num:long, count:long, term:chararray);
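
To illustrate the idea behind this replacement, here is a minimal Java sketch of sequential ranking over an already sorted stream. It is not the actual RankStorage implementation: the class name SequentialRanker, the Row type, ranks starting at 1, and the tab-separated output are all assumptions made for illustration. The point is that once `order ... parallel 1` has produced a single globally ordered output, a rank can be assigned with nothing more than a running counter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only; not the CoAnSys RankStorage class.
public class SequentialRanker {

    /** One (count, term) row of the already sorted relation. */
    public static final class Row {
        final long count;
        final String term;

        public Row(long count, String term) {
            this.count = count;
            this.term = term;
        }
    }

    /**
     * Streams over rows that are already sorted by count and emits
     * "rank_num<TAB>count<TAB>term" lines to the given sink.
     * Only a single counter is held as state between rows.
     */
    public static void rank(Iterable<Row> sortedRows, Consumer<String> sink) {
        long rankNum = 1; // assumption: ranks start at 1
        for (Row row : sortedRows) {
            sink.accept(rankNum + "\t" + row.count + "\t" + row.term);
            rankNum++;
        }
    }

    public static void main(String[] args) {
        List<Row> sorted = Arrays.asList(
                new Row(1, "rare"),
                new Row(3, "common"),
                new Row(3, "usual"));
        List<String> lines = new ArrayList<>();
        rank(sorted, lines::add);
        lines.forEach(System.out::println);
    }
}
```

In this sketch, rows with equal counts receive different rank numbers, which differs from how Pig's built-in `rank ... by` treats ties. Whether the actual serializer behaves the same way is not stated in this thread.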

First tests on 37M (and on 103M) documents showed that the proposed solution eliminates the memory-related issue, so we could:

  1. issue a pull request to https://github.com/CeON/CoAnSys
  2. apply a manual patch in IIS until we switch to a new documents similarity release

@marekhorst

This issue was reported in the CoAnSys project: CeON/CoAnSys#432.
A pull request including the fix was issued: CeON/CoAnSys#433.
Until a new version of the CoAnSys documents similarity module is released, a manual patch will be applied within IIS.
