HTMLStripCharFilter can not remove html tags #14089

Open
cangkuren opened this issue Dec 30, 2024 · 0 comments

cangkuren commented Dec 30, 2024

Description

```java
public AnalyzerResult analyze(String text) throws IOException {
    // Pre-strip HTML with jsoup; this is the step the question below toggles.
    text = HtmlExtractor.extractTextFromHtml(text);
    List<String> tokens = new ArrayList<>();
    List<String> originalTexts = new ArrayList<>();
    try (TokenStream stream = tokenStream("*", text)) {
        stream.reset();
        CharTermAttribute charTermAttribute = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = stream.addAttribute(OffsetAttribute.class);
        while (stream.incrementToken()) {
            tokens.add(charTermAttribute.toString());
            // Offsets index into the text passed to tokenStream(), i.e. before any CharFilter runs.
            originalTexts.add(text.substring(offsetAttribute.startOffset(), offsetAttribute.endOffset()));
        }
        stream.end();
    }
    return AnalyzerResult.builder().tokens(tokens).originalTexts(originalTexts).build();
}
```

```java
// HtmlExtractor: strips markup with jsoup before analysis.
public static String extractTextFromHtml(String content) {
    Document document = Jsoup.parseBodyFragment(content);
    return document.body().text().replace(" ", "").trim();
}

// Analyzer: char filters applied to the input before tokenization.
protected Reader initReader(String fieldName, Reader reader) {
    reader = new HTMLStripCharFilter(reader);
    reader = new JapaneseIterationMarkCharFilter(reader);
    return reader;
}
```

```java
@PostConstruct
@Scheduled(cron = "0 0 0 * * *")
public void init() {
    Logged.L.info("load Japanese config.");
    String dict = japaneseDictConfig.getDict();
    UserDictionary userDictionary = null;
    try {
        userDictionary = UserDictionary.open(new StringReader(dict));
    } catch (Exception e) {
        Logged.L.error("load japanese dict error", e);
    }

    List<String> stopWords = stopWordConfig.getStopWords();
    CharArraySet stopSet = new CharArraySet(stopWords, true);
    // addAll adds each default stop word; add() would insert the whole set as a single entry.
    stopSet.addAll(getDefaultStopSet());

    Tokenizer tokenizer = new JapaneseTokenizer(userDictionary, true, false, JapaneseTokenizer.Mode.SEARCH);

    TokenStream stream = new JapaneseBaseFormFilter(tokenizer);
    stream = new JapanesePartOfSpeechStopFilter(stream, getDefaultStopTags());
    stream = new CJKWidthFilter(stream);
    stream = new StopFilter(stream, stopSet);
    stream = new JapaneseReadingFormFilter(stream);
    // stream = new JapaneseKatakanaStemFilter(stream);
    stream = new JapaneseNumberFilter(stream);
    stream = new LowerCaseFilter(stream);
    this.tokenStreamComponents = new TokenStreamComponents(tokenizer, stream);
}

// Test that reproduces the output described below.
@Test
public void test() throws SQLException, IOException {
    String ss = "<span style=\"font-weight: bolder; color: var(--theme-color-black) !important; background: var(--module-background) !important;\">背景</span>";
    MultiLanguageAnalyzer.AnalyzerResult analyzerResult = analyzer.analyze(ss);
    System.out.println(analyzerResult.getOriginalTexts());
}
```

The code is shown above. I am using lucene-analyzers-kuromoji 8.11.4.

If I do not strip the HTML with jsoup first, the printed originalTexts is `背景</span>`: the closing tag is still present. Is this the expected result?

If I strip the input with jsoup first, the output is `背景`.
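
One possible reading of this output, offered as a simplified model and not as Lucene's actual implementation: a char filter strips the tags from what the tokenizer sees, but the offsets reported by `OffsetAttribute` are corrected so that they point back into the original, unstripped input. If each correction is "stripped offset plus the number of characters removed up to and including that position", the end offset of the last token lands after the closing tag. The class name `OffsetCorrectionSketch`, the naive `<...>` stripper, and the shortened `style` value are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of offset correction after tag stripping (hypothetical, not Lucene code). */
final class OffsetCorrectionSketch {

    // Parallel lists: from stripped offset offsets.get(i) onward,
    // diffs.get(i) characters have been removed from the input in total.
    private final List<Integer> offsets = new ArrayList<>();
    private final List<Integer> diffs = new ArrayList<>();
    private final StringBuilder stripped = new StringBuilder();

    OffsetCorrectionSketch(String input) {
        int removed = 0;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            // Naive assumption: anything from '<' to the next '>' is a tag.
            int close = (c == '<') ? input.indexOf('>', i) : -1;
            if (close >= 0) {
                removed += close - i + 1;
                record(stripped.length(), removed);
                i = close + 1;
            } else {
                stripped.append(c);
                i++;
            }
        }
    }

    // Remember the cumulative removal at this stripped offset (last write wins).
    private void record(int offset, int diff) {
        int last = offsets.size() - 1;
        if (last >= 0 && offsets.get(last) == offset) {
            diffs.set(last, diff);
        } else {
            offsets.add(offset);
            diffs.add(diff);
        }
    }

    /** Maps an offset in the stripped text back to an offset in the original input. */
    int correct(int strippedOffset) {
        int diff = 0;
        for (int k = 0; k < offsets.size() && offsets.get(k) <= strippedOffset; k++) {
            diff = diffs.get(k);
        }
        return strippedOffset + diff;
    }

    String strippedText() {
        return stripped.toString();
    }

    public static void main(String[] args) {
        String input = "<span style=\"color: red\">背景</span>";
        OffsetCorrectionSketch sketch = new OffsetCorrectionSketch(input);
        int start = sketch.correct(0);
        int end = sketch.correct(sketch.strippedText().length());
        System.out.println(sketch.strippedText());          // prints 背景
        System.out.println(input.substring(start, end));    // prints 背景</span>
    }
}
```

Under this model the token text itself is tag-free; only the slice of the raw input taken with the corrected offsets picks up the closing tag. That would also explain why stripping with jsoup before analysis gives `背景`: the string being sliced no longer contains any tags.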

Version and environment details

org.apache.lucene lucene-analyzers-kuromoji 8.11.4
cangkuren changed the title from "HTMLStripCharFilter" to "HTMLStripCharFilter can not remove all the html" on Dec 30, 2024
cangkuren changed the title from "HTMLStripCharFilter can not remove all the html" to "HTMLStripCharFilter can not remove html tags" on Dec 30, 2024