How do I use trained data with pytesseract?

Problem description:

Using this tool, http://trainyourtesseract.com/, I would like to be able to use new fonts with pytesseract. The tool gives me a file called *.traineddata.

Right now I'm using this simple script:

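try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract as tes

# boxes=True is accepted by older pytesseract releases only;
# newer releases provide pytesseract.image_to_boxes() instead.
results = tes.image_to_string(Image.open('./test.jpg'), boxes=True)

# Append the OCR output to a file, closing it when done.
with open('parsing.text', 'a') as out_file:
    out_file.write(results)
print(results)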

How do I use my traineddata file so that the Python script can read the new font?

Thanks!

Edit #1: I understand that *.traineddata can be used with Tesseract as a command-line program, so my question is still the same: how do I use traineddata with Python?

Edit #2: the answer to my question is here: How to access the command line for Tesseract from Python?

Below is a sample call to pytesseract.image_to_string() with options.

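from PIL import Image
import pytesseract

# The config string is split across two adjacent literals so each line is valid Python.
pytesseract.image_to_string(Image.open("./images*/xyz-small-gray.png"),
                            lang="eng",
                            config="--psm 4 --oem 3 "
                                   "-c tessedit_char_whitelist=-01234567890XYZ:")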

To use your own trained language data, replace "eng" in lang="eng" with the name of your language, that is, the name of your .traineddata file without the extension.
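
The substitution described above can be sketched as follows. This is a minimal sketch, assuming the trained file is named myfont.traineddata and lives in a local ./tessdata directory (both names are hypothetical). Tesseract only finds a language whose .traineddata file is in its tessdata directory, so the sketch passes that directory explicitly with --tessdata-dir; copying the file into the default tessdata directory should work as well. The helpers build_tesseract_args and ocr_with_custom_font are illustrative names, not part of pytesseract.

```python
import os


def build_tesseract_args(traineddata_path, psm=4):
    """Derive the lang name and config string from a .traineddata path.

    Hypothetical helper: the lang value is the file name without its
    extension, and the config points Tesseract at the file's directory.
    """
    directory, filename = os.path.split(os.path.abspath(traineddata_path))
    lang, ext = os.path.splitext(filename)
    if ext != ".traineddata":
        raise ValueError("expected a .traineddata file, got: " + filename)
    # Quote the directory so that paths containing spaces survive splitting.
    config = '--tessdata-dir "{}" --psm {}'.format(directory, psm)
    return lang, config


def ocr_with_custom_font(image_path, traineddata_path):
    """Run OCR on image_path using the custom trained data (illustrative)."""
    # Imported here so the helper above can be used without these packages.
    from PIL import Image
    import pytesseract

    lang, config = build_tesseract_args(traineddata_path)
    return pytesseract.image_to_string(Image.open(image_path),
                                       lang=lang, config=config)
```

Calling ocr_with_custom_font("./test.jpg", "./tessdata/myfont.traineddata") would then pass lang="myfont" to pytesseract, which is the same replacement of "eng" described above.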