IRIS MT

Introduction

IRIS MT is a manual training tool for IRIS Tesseract-OCR. It can be deployed with minimal changes to manually train and extend the Tesseract-OCR database. Manual training is required when Tesseract-OCR returns incorrect characters while converting an image to a string. Retraining on the affected fonts improves recognition accuracy.

Prerequisites

  1. Tesseract-OCR should be present in the Lib folder.

  2. Tesseract-OCR must be installed along with Tesseract training tools.

  3. IRIS MT folder containing IRISMT.exe and config.json must be present in the assets folder.
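
The prerequisites above can be checked with a short script before starting. This is a minimal sketch, not part of IRIS MT: it assumes the script runs from the project root and that the Lib and assets folders sit directly under it. The function name missing_prerequisites is hypothetical. The Tesseract training tools (item 2) are not checked here, because this list does not state where they are installed.

```python
from pathlib import Path

# Paths taken from the prerequisites list. Their location relative to the
# project root is an assumption.
REQUIRED_PATHS = [
    "Lib/Tesseract-OCR",
    "assets/IRIS MT/IRISMT.exe",
    "assets/IRIS MT/config.json",
]


def missing_prerequisites(root="."):
    """Return the required paths that do not exist under root."""
    base = Path(root)
    return [p for p in REQUIRED_PATHS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All prerequisite paths found.")
```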

Configure IRIS MT

To configure IRIS MT:

  1. Open ./assets/IRIS MT/config.json.

  2. Make the desired changes in the JSON data:

{
    "active_model": "Lib/Tesseract-OCR/testdata/eng.traineddata",
    "all_versions": [
        "5.0.0",
        "5.0.1"
    ],
    "checkpoint_dir": "assets/checkpoints",
    "current_version": "5.0.0",
    "iterations": "400",
    "lang": "eng",
    "langdata_dir": "Lib/Tesseract-OCR/langdata",
    "rollback_dir": "assets/model_backup.7z",
    "tessdata_dir": "Lib/Tesseract-OCR/tessdata",
    "trained_result": "Lib/Tesseract-OCR/tessdata/eng.traineddata",
    "traineddata_dir": "Lib/Tesseract-OCR/testdata/eng.traineddata",
    "trainingfiles_dir": "assets/training_files",
    "trainingtool_dir": "/Lib/Tesseract-OCR/training",
    "logFile_Path": "assets/logs/TestautoV2.log"
}

Note:

  • Only English data is trained, and the trained data model is available at the location given by active_model.

  • The number of training iterations can be increased or decreased by changing the value of iterations. Increasing the iterations improves accuracy, but training time also increases.
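
The iterations value can be edited by hand, or with a small helper like the one below. This is a minimal sketch under stated assumptions: the function name set_iterations is hypothetical, and the value is written back as a string because the sample configuration stores it as "400". Whether IRIS MT also accepts a bare number is not documented here.

```python
import json
from pathlib import Path


def set_iterations(config_path, iterations):
    """Rewrite the "iterations" value in config.json and return the new config."""
    if iterations <= 0:
        raise ValueError("iterations must be a positive integer")
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # The sample config stores the value as a string, so the same type is kept.
    config["iterations"] = str(iterations)
    # All other keys are written back unchanged.
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

For example, set_iterations("assets/IRIS MT/config.json", 800) would double the iteration count from the sample value, at the cost of a longer training run.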

Training IRIS

The following steps need to be followed to train IRIS:

  1. Start IRISMT.bat from the assets folder. The following window opens. To select training data files, click on the Select button.

  2. A Windows file browser opens. Navigate to the training data and select the files (multiple training files can be selected). Training files must be in .ttf format.

  3. Once the files are selected, click on the Train button to start training Tesseract-OCR.

  4. Once the training is complete, the following message is displayed.
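
Because only .ttf files are supported, it can help to screen a set of candidate files before selecting them in the file browser. The following is a minimal sketch of such a check. The function name split_by_ttf is hypothetical, and the case-insensitive match on the extension is an assumption; IRIS MT's own validation is not described here.

```python
from pathlib import Path


def split_by_ttf(paths):
    """Split paths into (.ttf files, everything else), ignoring extension case."""
    supported, rejected = [], []
    for p in map(Path, paths):
        if p.suffix.lower() == ".ttf":
            supported.append(p)
        else:
            rejected.append(p)
    return supported, rejected
```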

Tesseract Version Rollback

Training Tesseract-OCR can sometimes reduce accuracy for particular fonts. Training further with more iterations can improve this. In some cases, however, it may be necessary to revert to an older version of the trained files (depending on the output that IRIS was previously providing).

  1. Click on the Revert button. A previously hidden drop-down and a Rollback button are displayed.

  2. From the drop-down, select the version to roll back to.

  3. Click on the Rollback button. The selected version of the trained data replaces the current one in Tesseract-OCR.
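
The sample config.json contains all_versions and current_version keys, which suggests how the available rollback targets could be listed. The sketch below assumes that the drop-down reflects all_versions and that the current version is not a useful target. Neither point is confirmed by this document, and the function name rollback_candidates is hypothetical.

```python
def rollback_candidates(config):
    """Return the versions in all_versions other than current_version."""
    current = config.get("current_version")
    return [v for v in config.get("all_versions", []) if v != current]


if __name__ == "__main__":
    # Values copied from the sample config.json in this document.
    sample = {"all_versions": ["5.0.0", "5.0.1"], "current_version": "5.0.0"}
    print(rollback_candidates(sample))
```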
