Google’s ML Kit library finds text in images. This works well for simple text, but not for tabular data such as receipts or tables. If you’ve ever used an OCR app, you may have noticed that it works reasonably well on a page of uniform text but falls flat on tables or any other formatted text.
This article describes an algorithm, used in a mobile application I wrote, that post-processes recognized text so that words on the same line stay together. We’ll take a common example: a store receipt. The example below shows exactly the problem I’m trying to solve. Google Lens does not favor lines of text; instead, it groups text that is physically close together. This sometimes works, but it does not work for receipts and other tabular data. Note how Google Lens put the price 1,49€ on a new line because it was physically far from the text on the left.
Line Text Scanner is an application developed specifically to scan tabular data, and its result for the same receipt is shown in the top right corner. You can see that it respected the horizontal layout of the receipt and kept each price with the correct article.
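InputImage image = InputImage.fromBitmap(mSelectedImage, 0);
// Newer ML Kit releases expect options here, e.g. TextRecognizerOptions.DEFAULT_OPTIONS.
TextRecognizer recognizer = TextRecognition.getClient();
recognizer.process(image)
        .addOnSuccessListener(
                new OnSuccessListener<Text>() {
                    @Override
                    public void onSuccess(Text texts) {
                        // Hand the raw OCR result to our own line-aware post-processing.
                        processTextRecognitionResult(texts);
                    }
                })
        .addOnFailureListener(
                new OnFailureListener() {
                    @Override
                    public void onFailure(Exception e) {
                        // Recognition failed; report the error.
                        e.printStackTrace();
                    }
                });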
ENTRETIEN/BAZAR......
MPX FILM FRAICH30M
EPICERIE/BOISSONS.......
1,49€
265G POIS CHICHES HARICOTS ROUGES TR
0,95€
PECHE FRAICH FRTS
COOKIES SABLES CHO
1,25€
3,69€
2,39€
HYGIENE/BEAUTE.......
2,99€
20 ELAST COURTS WA
2 X BATISTE SHP SEC 6,59€ 13,18€
2è à -60%
-3,95€
MPX FIL DENTAIR 50 SOIN BLANCHEUR GOU
2,09€
7,29€
SURGELES/PRODUITS FRAIS...........
LAITIERE PV VANILL
1,69€
MALO F.FRAIS FRUIT COMTE AOP 6M 450G
2,35€
5,29€
QUICHE LORRAINE ESCAL.PANEES DE BL
2,79€
OEUFS PLEIN AIRX12
2,85€
MORIN MOUSSE CHOCO
3,25€
2,09€
LONGLEY FARM COT.C
1,99€
TOTAL HORS AVANTAGES
57,62€
18
NOMBRE D'ARTICLES
Now to our algorithm. First, we go through all the TextBlocks in the Text result and extract their individual lines. Each line has one or more Text.Element objects; we extract them all and put them into a single list.
// texts is the Text result passed to onSuccess above.
List<Text.TextBlock> blocks = texts.getTextBlocks();
List<Text.Element> textElements = new ArrayList<>();
for (int i = 0; i < blocks.size(); i++) {
    List<Text.Line> lines = blocks.get(i).getLines();
    for (int j = 0; j < lines.size(); j++) {
        List<Text.Element> elements = lines.get(j).getElements();
        for (int k = 0; k < elements.size(); k++) {
            // Flatten block -> line -> element into one list.
            Text.Element e = elements.get(k);
            textElements.add(e);
        }
    }
}
At this point, we have a list of Text.Element objects, each containing some text. In our example, the four Text.Elements in block 0 are ‘ENTRETIEN/BAZAR......’, ‘MPX’, ‘FILM’, and ‘FRAICH30M’.
So now that we have all these Text.Element objects in a list, how do we determine what goes after what? As seen above, the OCR algorithm just recognized blocks of text but made no effort to arrange them into correct sentences. First, note that every Text.Element object is bounded by a Rect object, which has coordinates and dimensions. We use this fact to sort the Text.Elements.
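public int compare(Text.Element t1, Text.Element t2) {
    int diffOfTops = t1.getBoundingBox().top - t2.getBoundingBox().top;
    int diffOfLefts = t1.getBoundingBox().left - t2.getBoundingBox().left;
    // Average height of the two elements, used to scale the vertical tolerance.
    int height = (t1.getBoundingBox().height() + t2.getBoundingBox().height()) / 2;
    int verticalDiff = (int) (height * 0.35);
    // By default, treat the elements as being on the same row and order them left to right.
    int result = diffOfLefts;
    // If the tops differ by more than the tolerance, they are on different rows: order top to bottom.
    if (Math.abs(diffOfTops) > verticalDiff) {
        result = diffOfTops;
    }
    return result;
}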
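To see the ordering rule in action without an Android device, here is a minimal sketch in plain Java. The `Word` class is a hypothetical stand-in for Text.Element that keeps only the text and the parts of the bounding box the comparator reads; the coordinates are made up for illustration and are not taken from the receipt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RowOrderSketch {
    // Hypothetical stand-in for Text.Element: text plus left, top and height of its box.
    static final class Word {
        final String text;
        final int left;
        final int top;
        final int height;

        Word(String text, int left, int top, int height) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.height = height;
        }
    }

    // Same rule as the comparator above: a small vertical gap means "same row",
    // so order by left edge; a larger gap means "different row", so order by top edge.
    static final Comparator<Word> ROW_ORDER = (a, b) -> {
        int diffOfTops = a.top - b.top;
        int diffOfLefts = a.left - b.left;
        int avgHeight = (a.height + b.height) / 2;
        int verticalDiff = (int) (avgHeight * 0.35);
        return Math.abs(diffOfTops) > verticalDiff ? diffOfTops : diffOfLefts;
    };

    // Sorts a copy of the input and joins the texts with single spaces.
    static String sortedText(List<Word> words) {
        List<Word> copy = new ArrayList<>(words);
        copy.sort(ROW_ORDER);
        List<String> texts = new ArrayList<>();
        for (Word w : copy) {
            texts.add(w.text);
        }
        return String.join(" ", texts);
    }

    public static void main(String[] args) {
        // Three words on one slightly uneven row, one word on the next row, in scrambled order.
        List<Word> words = Arrays.asList(
                new Word("1,49", 400, 103, 20),
                new Word("FILM", 60, 100, 20),
                new Word("EPICERIE", 10, 130, 20),
                new Word("MPX", 10, 101, 20));
        System.out.println(sortedText(words));
    }
}
```

The price ends up right after the words on its row even though its top edge is a few pixels lower. One caveat to keep in mind: "same row" is decided pair by pair, so on a strongly skewed photo the rule may not be transitive (A matches B, B matches C, but A does not match C), and Java's sort can then reject the comparator on longer lists. Straightening the image first, or grouping rows before sorting within them, are possible ways around that.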
Overall, in our algorithm, we prefer horizontal lines along which we can find the Text.Element objects. The following helper decides whether two elements sit on the same line.
private boolean isSameLine(Text.Element t1, Text.Element t2) {
    int diffOfTops = t1.getBoundingBox().top - t2.getBoundingBox().top;
    // Vertical tolerance; the explicit cast is needed because the product is a double.
    int height = (int) ((t1.getBoundingBox().height() + t2.getBoundingBox().height()) * 0.35);
    return Math.abs(diffOfTops) <= height;
}
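With the elements sorted and a same-line test available, the output lines can be assembled in a single pass. The sketch below is one possible way to do it, not necessarily how the app does it: it walks an already sorted list and starts a new output line whenever the current word is not on the same line as the previous one. `Word` and `joinLines` are hypothetical names, and the coordinates are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LineJoinSketch {
    // Hypothetical stand-in for Text.Element: text plus left, top and height of its box.
    static final class Word {
        final String text;
        final int left;
        final int top;
        final int height;

        Word(String text, int left, int top, int height) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.height = height;
        }
    }

    // Same test as isSameLine above, applied to the stand-in type.
    static boolean isSameLine(Word a, Word b) {
        int diffOfTops = a.top - b.top;
        int tolerance = (int) ((a.height + b.height) * 0.35);
        return Math.abs(diffOfTops) <= tolerance;
    }

    // Assumes the input is already sorted row by row, left to right.
    static List<String> joinLines(List<Word> sorted) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Word previous = null;
        for (Word w : sorted) {
            // A word that is not on the previous word's line closes the current line.
            if (previous != null && !isSameLine(previous, w)) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(w.text);
            previous = w;
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Word> sorted = Arrays.asList(
                new Word("MPX", 10, 101, 20),
                new Word("FILM", 60, 100, 20),
                new Word("FRAICH30M", 120, 102, 20),
                new Word("1,49", 400, 103, 20),
                new Word("EPICERIE/BOISSONS", 10, 130, 20),
                new Word("265G", 10, 160, 20),
                new Word("POIS", 70, 161, 20),
                new Word("0,95", 400, 162, 20));
        for (String line : joinLines(sorted)) {
            System.out.println(line);
        }
    }
}
```

Comparing each word with its immediate predecessor, rather than with the first word of the line, lets a line drift slightly up or down across the width of the receipt, which is common in hand-held photos.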
Here is the full text again, this time with each price kept on the same line as its article.
ENTRETIEN/BAZAR.
MPX FILM FRAICH3OM 1,49€
EPICERIE/BO1SSONS.
265G POIS CHICHES 0,95€
HARICOTS ROUGES TR 1,25e
PECHE FRAICH FRTS 3,69e
COOKIES SABLES CHO 2,39e
HYGIENE/BEAUTE.....
20 ELAST COURTS WA 2,99€
2 X BATISTE SHP SEC 6,59€ 13,18€
2è à -60% -3,95€
MPX FIL DENTAIR 50 2,09e
SOIN BLANCHEUR GOU 7,29€
SURGELES/PRODUITS FRAIS....
LAITIERE PV VANILL 1,69€
MALO F.FRAIS FRUIT 2,35€
COMTE AOP 6M 450G 5,29€
QUICHE LORRAINE 2,79€
ESCAL.PANEES DE BL 2,85e
OEUFS PLEIN AIRX12 3,25€
MORIN MOUSSE CHOCO 2,09e
LONGLEY FARM COT.C 1,99e
TOTAL HORS AVANTAGES 57,62e
NOMBRE D'ARTICLES 18