Cropping Image into Pages

When dealing with a multiple-page document that has been scanned as a single image, cutting or segmenting the document into individual pages might be essential for OCR and information extraction tasks, as well as training ML models with libraries like spaCy.

By cutting a multiple-page document into individual pages, you can optimize OCR accuracy, create targeted training data, improve ML model performance, identify and rectify errors effectively, and achieve scalability and reusability in the OCR and information extraction processes.Tesseract OCR performs text recognition on a per-page basis. By cutting the scanned image into separate pages, you can apply OCR to each page individually, allowing Tesseract to focus on one page at a time. Different pages within a document may have different layouts, such as different column arrangements, headers, footers, or graphics. The visual separation between sentences and paragraphs can be lost and spaCy may struggle to accurately detect sentence boundaries.

To split a multi-page image into individual pages you can use ImageMagick with the following commands:

convert /d/crop_example1.jpg -crop 25%x100% +repage /d/page_%d.jpg

It crops the image into 4 pages to the left/right sides but keeps the full vertical height. +repage is used to trim any excess canvas. The %d is a placeholder that will be replaced by a sequential number, like page_1.jpg, page_2.jpg etc.

Similarly you can use -crop 50%x100% to split the image into 2 pages and keeping the vertical height, -crop 50%x50% to split the image into 4 quadrants.