HTMLStripCharFilter can not remove html tags #14089

cangkuren · 2024-12-30T03:17:36Z

Description

```java
public AnalyzerResult analyze(String text) throws IOException {
    // Strip HTML with jsoup before running the analyzer.
    text = HtmlExtractor.extractTextFromHtml(text);
    List<String> tokens = new ArrayList<>();
    List<String> originalTexts = new ArrayList<>();
    try (TokenStream stream = tokenStream("*", text)) {
        CharTermAttribute charTermAttribute = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = stream.addAttribute(OffsetAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            tokens.add(charTermAttribute.toString());
            // Slice the input text using the token's start and end offsets.
            originalTexts.add(text.substring(offsetAttribute.startOffset(), offsetAttribute.endOffset()));
        }
        stream.end();
    }
    return AnalyzerResult.builder().tokens(tokens).originalTexts(originalTexts).build();
}

// HtmlExtractor: jsoup-based HTML stripping.
public static String extractTextFromHtml(String content) {
    Document document = Jsoup.parseBodyFragment(content);
    return document.body().text().replace(" ", "").trim();
}

// Char filters applied to the input before tokenization.
protected Reader initReader(String fieldName, Reader reader) {
    reader = new HTMLStripCharFilter(reader);
    reader = new JapaneseIterationMarkCharFilter(reader);
    return reader;
}
```

```java
@PostConstruct
@Scheduled(cron = "0 0 0 * * *")
public void init() {
    Logged.L.info("load Japanese config.");
    String dict = japaneseDictConfig.getDict();
    UserDictionary userDictionary = null;
    try {
        userDictionary = UserDictionary.open(new StringReader(dict));
    } catch (Exception e) {
        Logged.L.error("load japanese dict error", e);
    }

    List<String> stopWords = stopWordConfig.getStopWords();
    CharArraySet stopSet = new CharArraySet(stopWords, true);
    // addAll merges the default stop words; add() would insert the set as a single entry.
    stopSet.addAll(getDefaultStopSet());

    Tokenizer tokenizer = new JapaneseTokenizer(userDictionary, true, false, JapaneseTokenizer.Mode.SEARCH);

    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    stream = new JapanesePartOfSpeechStopFilter(stream, getDefaultStopTags());
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(stream, stopSet);
    stream = new JapaneseReadingFormFilter(stream);
    // stream = new JapaneseKatakanaStemFilter(stream);
    stream = new JapaneseNumberFilter(stream);
    stream = new LowerCaseFilter(stream);
    this.tokenStreamComponents = new TokenStreamComponents(tokenizer, stream);
}

@Test
public void test() throws SQLException, IOException {
    String ss = "<span style=\"font-weight: bolder; color: var(--theme-color-black) !important; background: var(--module-background) !important;\">背景</span>";
    MultiLanguageAnalyzer.AnalyzerResult analyzerResult = analyzer.analyze(ss);
    System.out.println(analyzerResult.getOriginalTexts());
}
```

The code is shown above. I am using lucene-analyzers-kuromoji 8.11.4.

If I do not strip the HTML with jsoup first, the printed `originalTexts` is `背景</span>`, so part of the HTML is still present. Is this the expected behavior?

If I strip the input with jsoup first, the output is `背景`.
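One way to reason about the difference is sketched below, as a toy model and not as Lucene's implementation. It assumes that the offsets reported for a token refer to positions in the original, unstripped input (char filters correct offsets back to the original text), and that the end of the filtered text maps to the end of the original input. The class `OffsetSketch` and its `strip` helper are hypothetical and use only the Java standard library. Under those assumptions, slicing the original string by the corrected offsets can include a trailing tag, while the term text itself is clean.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model; not part of Lucene and not how HTMLStripCharFilter is written.
class OffsetSketch {

    // Removes <...> tags and records, for each kept character, its index in the original.
    // A final entry maps the end of the stripped text to the end of the original
    // (an assumption made for this sketch).
    static String strip(String html, List<Integer> originalIndex) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                out.append(c);
                originalIndex.add(i);
            }
        }
        originalIndex.add(html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<span>背景</span>";
        List<Integer> map = new ArrayList<>();
        String stripped = strip(html, map);

        // Pretend the tokenizer emitted one token covering the whole stripped text.
        int start = 0;
        int end = stripped.length();

        // Term text, analogous to reading the term attribute.
        System.out.println(stripped.substring(start, end));
        // Slice of the original input using offsets mapped back to it.
        System.out.println(html.substring(map.get(start), map.get(end)));
    }
}
```

In this model the first line printed is the clean term and the second still carries the closing tag. If the real analyzer behaves the same way, reading the token text from `CharTermAttribute` (the `tokens` list in `analyze`) would avoid the markup without the jsoup pre-pass.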

Version and environment details

org.apache.lucene lucene-analyzers-kuromoji 8.11.4


cangkuren added the type:bug label Dec 30, 2024

cangkuren changed the title ~~HTMLStripCharFilter~~ HTMLStripCharFilter can not remove all the html Dec 30, 2024

cangkuren changed the title ~~HTMLStripCharFilter can not remove all the html~~ HTMLStripCharFilter can not remove html tags Dec 30, 2024
