different result with tesseract cli and leptess wrapper #41

Open
tcastelly opened this issue Jan 25, 2022 · 5 comments

@tcastelly

Hello,

Thank you for this work!

I'm seeing curious behavior. When I try to extract the text from the image below on the command line:

time tesseract image.jpg output  

I get this result:

Coco Adel

But when I use the wrapper:

fn main() {
    let mut lt = leptess::LepTess::new(Some("./tests"), "eng").unwrap();
    // let mut lt = leptess::LepTess::new(None, "eng").unwrap();
    lt.set_image("image.jpg");
    println!("{}", lt.get_utf8_text().unwrap());
}

I get:

rh

I've tried using the traineddata from this repository, and also passing no data path at all, but the result is the same.

Maybe the command line uses different default parameters.

Thanks in advance

image

@houqp
Owner

houqp commented Jan 25, 2022

How did you install tesseract and libtesseract? What version of tesseract do you have?

@tcastelly
Author

Thank you for your answer.

I'm on Arch Linux, and I installed:

pacman -S tesseract leptonica tesseract-data-eng
tesseract 4.1.1
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.5.2 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.0

@ccouzens
Collaborator

My tesseract was installed through Fedora's dnf install tesseract command:

tesseract 4.1.3
 leptonica-1.81.1
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.0) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE

My tesseract command gives the expected Coco Adel output.
Through leptess, I also get rh\n.

Converting the image to a PNG changed leptess's output slightly, to "Nr\n".

I created a new image with the same resolution and similarly sized text, and leptess was able to parse it correctly.
issue_41

I don't know why the command and API have different behaviour on your image. It may be worth checking to see if the command sets any additional options.

@houqp
Owner

houqp commented Jan 30, 2022

Yeah, most likely the command line uses a different set of default options :(

@ongchi

ongchi commented May 18, 2022

The default page seg mode for leptess is 6 (single block), while the default for the tesseract CLI is 3 (auto).

Setting this variable manually gives the same result as the CLI:

lt.set_variable(Variable::TesseditPagesegMode, "3").unwrap();

So maybe leptess should default the page seg mode to 3, to be consistent with the tesseract CLI and to keep people from getting unexpected results.

FYI, the CLI sets the default page seg mode to PSM_AUTO:

https://github.com/tesseract-ocr/tesseract/blob/be15b46c609e6d50f1665345d6e6fc128462593c/src/tesseract.cpp#L650

But the library defaults to PSM_SINGLE_BLOCK:

https://github.com/tesseract-ocr/tesseract/blob/be15b46c609e6d50f1665345d6e6fc128462593c/include/tesseract/publictypes.h#L166
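One way to check the page seg mode explanation without going through leptess is to run the CLI twice, once with each mode, and compare the output. The sketch below assumes the `tesseract` binary is on `PATH` and that `image.jpg` is in the working directory. The helper names `tesseract_args` and `run_tesseract` are made up for illustration. Nobody in this thread ran the CLI with `--psm 6`, so whether it reproduces the `rh` output is something to verify, not a given.

```rust
use std::process::Command;

// Values from tesseract's PageSegMode enum, as discussed above.
const PSM_AUTO: u8 = 3; // default used by the tesseract CLI
const PSM_SINGLE_BLOCK: u8 = 6; // default used by the library

/// Build the arguments for `tesseract <image> stdout --psm <mode>`.
/// (Hypothetical helper, not part of any crate.)
fn tesseract_args(image: &str, psm: u8) -> Vec<String> {
    vec![
        image.to_string(),
        "stdout".to_string(), // print recognized text instead of writing a file
        "--psm".to_string(),
        psm.to_string(),
    ]
}

/// Run the CLI and return the recognized text, or an error message.
fn run_tesseract(image: &str, psm: u8) -> Result<String, String> {
    let output = Command::new("tesseract")
        .args(tesseract_args(image, psm))
        .output()
        .map_err(|e| format!("could not start tesseract: {e}"))?;
    if !output.status.success() {
        return Err(String::from_utf8_lossy(&output.stderr).into_owned());
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}

fn main() {
    // Compare the CLI default with the library default on the same image.
    for psm in [PSM_AUTO, PSM_SINGLE_BLOCK] {
        match run_tesseract("image.jpg", psm) {
            Ok(text) => println!("--psm {psm}: {text:?}"),
            Err(err) => eprintln!("--psm {psm} failed: {err}"),
        }
    }
}
```

If the two runs differ in the same way as the CLI and leptess did, that would support the explanation that only the default mode differs.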
