`
    List<String> stopWords = stopWordConfig.getStopWords();
    CharArraySet stopSet = new CharArraySet(stopWords, true);
    // addAll, not add: add(Object) would insert the set's toString() as a single entry
    // (assuming getDefaultStopSet() returns a collection of stop words).
    stopSet.addAll(getDefaultStopSet());
    Tokenizer tokenizer = new JapaneseTokenizer(userDictionary, true, false, JapaneseTokenizer.Mode.SEARCH);
    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    stream = new JapanesePartOfSpeechStopFilter(stream, getDefaultStopTags());
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(stream, stopSet);
    stream = new JapaneseReadingFormFilter(stream);
    // stream = new JapaneseKatakanaStemFilter(stream);
    stream = new JapaneseNumberFilter(stream);
    stream = new LowerCaseFilter(stream);
    this.tokenStreamComponents = new TokenStreamComponents(tokenizer, stream);
}

@Test
public void test() throws SQLException, IOException {
    String ss = "<span style=\"font-weight: bolder; color: var(--theme-color-black) !important; background: var(--module-background) !important;\">背景</span>";
    MultiLanguageAnalyzer.AnalyzerResult analyzerResult = analyzer.analyze(ss);
    System.out.println(analyzerResult.getOriginalTexts());
}
`
The relevant code is included in this report. I am using lucene-analyzers-kuromoji 8.11.4.
If I do not strip the HTML with jsoup first, the originalTexts output is 背景</span>, so the closing tag is still present. Is this the expected behavior?
If I strip the HTML with jsoup before analysis, the output is 背景.
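To make the question concrete, here is a toy model of an offset mapping that would produce the output I see. It is only a sketch under the assumption that a token's offsets are mapped back to the original input by adding the length of all markup removed at or before that position. `OffsetSketch`, `strip`, and `correctOffset` are hypothetical names for this illustration; this is not Lucene's HTMLStripCharFilter code, and the stripper ignores entities, comments, and scripts.

```java
import java.util.Arrays;

public class OffsetSketch {

    /** Stripped text plus, per stripped position, the count of markup chars removed at or before it. */
    public static final class Stripped {
        public final String text;
        public final int[] removedUpTo; // length is text.length() + 1

        Stripped(String text, int[] removedUpTo) {
            this.text = text;
            this.removedUpTo = removedUpTo;
        }

        /** Maps an offset in the stripped text back to an offset in the original input. */
        public int correctOffset(int strippedOffset) {
            return strippedOffset + removedUpTo[strippedOffset];
        }
    }

    /** Toy tag stripper (assumption: everything between '<' and '>' is markup). */
    public static Stripped strip(String html) {
        StringBuilder out = new StringBuilder();
        int[] removed = new int[html.length() + 1];
        int total = 0;
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            }
            if (inTag) {
                total++; // this char is dropped from the output
                if (c == '>') {
                    inTag = false;
                }
            } else {
                removed[out.length()] = total; // markup removed before this kept char
                out.append(c);
            }
        }
        removed[out.length()] = total; // markup removed up to the end of the output
        return new Stripped(out.toString(), Arrays.copyOf(removed, out.length() + 1));
    }

    public static void main(String[] args) {
        String original = "<span>背景</span>";
        Stripped s = strip(original);
        int start = s.correctOffset(0);             // skips the opening tag
        int end = s.correctOffset(s.text.length()); // also counts the closing tag
        System.out.println(s.text);
        System.out.println(original.substring(start, end));
    }
}
```

Under this model the end offset of the last token lands after the closing tag, so a substring taken from the original input includes `</span>`, while a substring taken from pre-cleaned text does not.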
Description
`
public AnalyzerResult analyze(String text) throws IOException {
    // Pre-clean with jsoup; removing this line reproduces the 背景</span> output.
    text = HtmlExtractor.extractTextFromHtml(text);
    List<String> tokens = new ArrayList<>();
    List<String> originalTexts = new ArrayList<>();
    try (TokenStream stream = tokenStream("*", text)) {
        // Attributes are registered before reset(), then read once per token.
        CharTermAttribute charTermAttribute = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = stream.addAttribute(OffsetAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            tokens.add(charTermAttribute.toString());
            // Offsets index into the text passed to tokenStream(), before any CharFilter runs.
            originalTexts.add(text.substring(offsetAttribute.startOffset(), offsetAttribute.endOffset()));
        }
        stream.end();
    }
    return AnalyzerResult.builder().tokens(tokens).originalTexts(originalTexts).build();
}
`
public static String extractTextFromHtml(String content) {
    Document document = Jsoup.parseBodyFragment(content);
    return document.body().text().replace(" ", "").trim();
}

protected Reader initReader(String fieldName, Reader reader) {
    reader = new HTMLStripCharFilter(reader);
    reader = new JapaneseIterationMarkCharFilter(reader);
    return reader;
}
`
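@PostConstruct
@Scheduled(cron = "0 0 0 * * *")
public void init() {
    Logged.L.info("load Japanese config.");
    String dict = japaneseDictConfig.getDict();
    UserDictionary userDictionary = null;
    try {
        userDictionary = UserDictionary.open(new StringReader(dict));
    } catch (Exception e) {
        // On failure userDictionary stays null and the tokenizer is built without it.
        Logged.L.error("load japanese dict error", e);
    }
    // init() continues with the stop-word set and filter chain from the first snippet in this report.
`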
Version and environment details
org.apache.lucene lucene-analyzers-kuromoji 8.11.4