TransformerCLI fails for two records #84

Open
legolego opened this issue Mar 17, 2019 · 4 comments

@legolego

Hello,
I found a couple more bugs: TransformerCLI failed on these patents and dropped out to the command prompt. The two source XML files are attached.

patents.zip

2019-03-15 17:49:05,394 INFO [main] TransformerCli - Record: 'US8299092B2' from D:\patents\ipg121030.zip:2659
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: begin 0, end 2, length 1
    at java.base/java.lang.String.checkBoundsBeginEnd(Unknown Source)
    at java.base/java.lang.String.substring(Unknown Source)
    at gov.uspto.patent.doc.xml.items.DocumentIdNode.read(DocumentIdNode.java:60)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.readPatCitations(CitationNode.java:144)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.read(CitationNode.java:63)
    at gov.uspto.patent.doc.xml.GrantParser.parse(GrantParser.java:113)
    at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:90)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:82)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:187)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:129)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:307)

and
2019-03-16 09:34:18,090 INFO [main] TransformerCli - Record: 'USPP022671P2' from D:\patents\ipg120417.zip:435
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 0
    at java.base/java.lang.StringLatin1.charAt(Unknown Source)
    at java.base/java.lang.String.charAt(Unknown Source)
    at gov.uspto.common.text.StringCaseUtil.toTitleCase(StringCaseUtil.java:102)
    at gov.uspto.patent.doc.xml.GrantParser.parse(GrantParser.java:69)
    at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:90)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:82)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:187)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:129)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:307)

@bgfeldm
Contributor

bgfeldm commented Mar 18, 2019

I am not able to reproduce the errors above. The second one looks similar to the previously fixed issue #81.
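
For reference, the second trace ends in `String.charAt` called from `StringCaseUtil.toTitleCase` with "String index out of range: 0", which is the message `charAt(0)` produces on an empty string. The following is only a sketch of a title-casing routine that tolerates empty or null input. It makes no claim about how the real `StringCaseUtil` is written, and the class name `TitleCaseSketch` is hypothetical.

```java
// Hypothetical illustration only; not the project's StringCaseUtil.
final class TitleCaseSketch {

    /** Title-cases each whitespace-separated word; returns "" for null or empty input. */
    static String toTitleCase(String text) {
        // Guard first: charAt(0) on an empty string throws
        // StringIndexOutOfBoundsException ("String index out of range: 0").
        if (text == null || text.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder(text.length());
        boolean startOfWord = true;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                startOfWord = true;
                out.append(c);
            } else if (startOfWord) {
                out.append(Character.toTitleCase(c));
                startOfWord = false;
            } else {
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }
}
```

With this guard, an empty title passes through as an empty string instead of aborting the run.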

@legolego
Author

OK, I got the latest version and tried it with the files I sent, and it didn't fail. I tried again with the large source zip files (ipg121030.zip and ipg120417.zip), and it did fail. I then made small XML files containing the previously reported patents (US8299092B2 and USPP022671P2) together with the patent that follows each one in the large source zip files, and they failed again. The new XML files are attached.
patents2.zip

@legolego
Author

Here's one more place where the latest transformer code fails, file attached.
I think US9524869 is the file failing.
161220.zip
2019-03-28 18:30:27,349 INFO [main] TransformerCli - --- Start ---
2019-03-28 18:30:41,630 INFO [main] TransformerCli - Dump File[1]: D:\patents\161220.xml
2019-03-28 18:30:41,631 INFO [main] PatentDocFormatDetect - PatentDocFormat fromFileName: CpcMasterFile
2019-03-28 18:30:41,635 INFO [main] PatentDocFormatDetect - PatentType fromContent: RedbookGrant
2019-03-28 18:30:42,300 INFO [main] TransformerCli - Record: 'US9524868B2' from D:\patents\161220.xml:2
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: begin 0, end 2, length 1
    at java.base/java.lang.String.checkBoundsBeginEnd(Unknown Source)
    at java.base/java.lang.String.substring(Unknown Source)
    at gov.uspto.patent.doc.xml.items.DocumentIdNode.read(DocumentIdNode.java:63)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.readPatCitations(CitationNode.java:144)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.read(CitationNode.java:63)
    at gov.uspto.patent.doc.xml.GrantParser.parse(GrantParser.java:113)
    at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:90)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:82)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:187)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:129)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:307)

@bgfeldm bgfeldm self-assigned this Apr 1, 2019
@bgfeldm bgfeldm added the bug label Apr 1, 2019
@bgfeldm
Contributor

bgfeldm commented Apr 1, 2019

Fixed the current issue: the index-out-of-bounds error on short document numbers (early patent numbers) with length below 3.

I still need to look at the trailing-document issue you noted above; I believe it may be due to enclosed XML tags.
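
For anyone hitting the same trace: "begin 0, end 2, length 1" is what `substring(0, 2)` reports when the string has only one character. Below is a minimal sketch of a length guard for that case. It is not the actual change made to `DocumentIdNode`, and the names `ShortDocNumber` and `leadingPrefix` are hypothetical.

```java
// Hypothetical helper illustrating a bounds-safe prefix read; not the project's code.
final class ShortDocNumber {

    /**
     * Returns the first two characters of a document number, or the whole
     * (possibly empty) string when it has fewer than two characters.
     */
    static String leadingPrefix(String docNumber) {
        if (docNumber == null) {
            return "";
        }
        String trimmed = docNumber.trim();
        // substring(0, 2) throws StringIndexOutOfBoundsException when length < 2,
        // so clamp the end index to the actual length.
        int end = Math.min(2, trimmed.length());
        return trimmed.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(leadingPrefix("US8299092"));
        System.out.println(leadingPrefix("7"));
    }
}
```

Clamping the end index keeps very short citation numbers parseable instead of aborting the whole run.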
