
In the name of God

Hi guys, welcome back to my blog.

Building an Optical Character Recognition (OCR) system is hard and comes with a lot of challenges. Wait, what is OCR? OCR is basically a technology that reads text from images. One of the challenges is that several different fields have to be combined and used together. The most important of these fields are digital image processing and machine learning.

OCR software requires a lot of resources and energy to run, so it usually runs on powerful machines to get better output. For a mobile developer, this is a challenge or, even worse, a blocker. Our phones are very limited and weak by comparison. Even flagships might have difficulty performing these kinds of tasks.

Despite all these drawbacks, there are some promising tools that can help us significantly. With Google's help, creating an OCR application no longer requires a huge engineering team.


Working with some Android APIs and components is not fun at all. The camera API definitely sits at the top of that list. For that exact reason, the Jetpack team announced the CameraX library at Google I/O last year to ease this frustration. With CameraX you can implement camera-based features like capturing, analyzing, and previewing images a lot more easily.

Luckily, Firebase has added a new SDK called ML Kit (Machine Learning Kit). ML Kit lets us do some fancy machine learning, mostly on images, including face detection, text recognition, labeling, and barcode scanning.

I decided to build a live text detector with these two libraries. It might sound very complicated, but these two made it approachable. Extracting text from images requires machine learning models, so OCR applications usually do the heavy calculations on the backend. With the help of Firebase and its free ML Kit, we are able to do everything offline and on the device. This approach has a lot of benefits, including better security, faster results, and no need for an internet connection.

Since this is not a tutorial on how to get started with these libraries, I highly suggest reading their samples and docs before going any further.

Final Goal

First, let’s see what the final goal of this post is, and then begin to build it.

As you can see, I’m reading text from the live camera preview and then drawing a box around each piece of text I managed to find, on top of that same preview. Users can touch each box to see the extracted text.
Finally, I add a capture button to allow reading text from a frozen image.

Setting up CameraX

To begin, I built a simple camera preview screen. The docs and guides got me most of the way there. CameraX is very helpful and does all the heavy lifting. When I looked at the sample code for the first time it seemed very complicated, but as I played around and added more code I got the hang of it.

When I compared my final code with the sample, I noticed they looked almost identical. So I suggest copying from there rather than wasting your time figuring it out on your own. They also cover a few edge cases which

Here is the bare-bones code to show the preview:

Cool, so now we can see through our camera!

Processing each Frame

The user can now see each frame produced by the camera. I also need access to that frame to run my processing. CameraX provides the ImageAnalysis use case for this. ImageAnalysis gives us the Analyzer callback to get hold of the frames (images) passing through. Creating it and hooking it up to the camera is pretty much the same as for the Preview use case.

Here you can see I’m passing the image to the extractText() function to (you guessed it!) process it and extract the text. More on that function in a bit.

Now that the camera setup is done, we can move on to text recognition.

Firebase has done a good job of getting us started with text recognition. Here is an overview of all the code required to get the text blocks:

As you can see here, the result of this operation is a FirebaseVisionText. We can call .text on this object to get the full text extracted from the frame. We will come back to this class later.

A few notes here:

1. I created an offline and on-device recognizer (the online version is more powerful but costs money)

2. Adding the metadata to the manifest is very important. It tells the Play Store to download the OCR model along with the app. Otherwise, the model download starts the first time you run the recognizer, and reading images will fail until it finishes.

3. There are a few ways to create a FirebaseVisionImage. The easiest one here is fromMediaImage(image, firebaseRotation), but we are going to use one more variant in this project.

4. Since our extractText() already runs on a background thread (using the Executor we passed when creating the analyzer), I wanted to remove the callback and make everything linear. With the help of Kotlin Coroutines and their integration library for the Google Task API, I managed to flatten everything in our function.

5. Just before we return from extractText(), we have to make sure our image object is closed. Otherwise, the Analyzer will not give us any more images and will keep waiting for our close call. Now that our extractText() function is linear, we can simply call image.close() at the end of the function, but for more complex usage make sure every code path closes the image.
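
To illustrate note 5, here is a small sketch that runs without the Android SDK. It is not code from the project: FakeFrame is a made-up stand-in for CameraX's ImageProxy, and analyzeAndClose is a hypothetical helper name. The point is only that a finally block closes the frame on every path, including when the processing throws.

```kotlin
// Made-up stand-in for CameraX's ImageProxy, which also has a close() method.
class FakeFrame : AutoCloseable {
    var closed = false
        private set

    override fun close() {
        closed = true
    }
}

// Hypothetical helper: runs the analysis and always closes the frame afterwards.
// Returns null when the analysis throws, so one bad frame does not stop the stream.
fun <T> analyzeAndClose(frame: AutoCloseable, analyze: () -> T): T? =
    try {
        analyze()
    } catch (e: Exception) {
        null
    } finally {
        // Runs on success and on failure, so the analyzer can receive the next frame.
        frame.close()
    }
```

In the real analyzer, the frame would be the image handed to the callback and the lambda would be the text extraction step.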

A few performance tips on TextRecognizer:

1. It’s a good idea to throttle the images passed to TextRecognizer. Calling the recognizer too often hurts performance and adds overhead.

2. The size and resolution of the image can significantly improve or reduce the performance of TextRecognizer. So make sure you are using a proper width and height when setting up ImageAnalysis. Here, for simplicity, I just set the aspect ratio and left the image size up to CameraX.

You can read more about these optimizations in the TextRecognizer docs.
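
As a rough sketch of the first tip, frames can be throttled by time. FrameThrottler below is a hypothetical helper written for illustration (it is not part of CameraX or ML Kit), and the interval you pass in is something you would tune yourself. It assumes it is called from a single thread, such as the analyzer's executor.

```kotlin
// Hypothetical helper: lets a frame through only if enough time has passed
// since the last frame that was let through.
class FrameThrottler(private val minIntervalMs: Long) {
    private var lastAcceptedMs: Long? = null

    // nowMs is the current time in milliseconds, passed in so the logic is easy to test.
    fun shouldProcess(nowMs: Long): Boolean {
        val last = lastAcceptedMs
        if (last != null && nowMs - last < minIntervalMs) {
            return false // too soon after the last processed frame, skip this one
        }
        lastAcceptedMs = nowMs
        return true
    }
}
```

In the analyzer you would call shouldProcess() with the current time and, when it returns false, close the image right away instead of sending it to the recognizer.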

Circling extracted text

Alright, so far we have managed to show the preview and read the text from every frame. It’s time to show the boxes on screen.

There are definitely multiple ways to achieve this. The simplest way I could think of was to add a ViewGroup on top of our Preview and add the boxes to it in the correct places. To achieve this, I added a FrameLayout as the ViewGroup, and I add simple ImageViews to it and remove them as needed.

The ImageView has a box drawable, so it will draw a box wherever I put it. Now the most important questions:

  1. How big should the ImageViews be?
  2. How do we place them right on top of the correct text?

Luckily, the FirebaseVisionText class has a property called textBlocks. The Firebase recognizer reads a frame as a set of blocks and returns these blocks to you. Every block represents a paragraph of text extracted from the frame. These blocks contain full information about the position and size of the text in the image. The boundingBox property has all that information.

With the help of the .toRectF() extension function from the AndroidX Core library, we can easily convert it to a RectF. Now we can just take the .width() and .height() of this class and set them on our ImageView, right? That’s what I thought too. But the answer is no!

The image frame used by the TextRecognizer and our ViewGroup have different sizes and rotations. Our numbers have to account for that. Even though it sounds simple, it took me some time to get right. I created these functions to correctly generate the ImageView size and coordinates based on the image and our FrameLayout
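
For readers who want to see the idea in code, here is a minimal sketch of this kind of mapping. It is not the original code from the project, and it rests on a few assumptions: Box is a plain stand-in for RectF so the snippet runs without the Android SDK, mapToOverlay is a hypothetical function name, the bounding box is expressed in the upright (already rotated) image, and the overlay shows the whole image without cropping.

```kotlin
// Plain stand-in for android.graphics.RectF.
data class Box(val left: Float, val top: Float, val right: Float, val bottom: Float) {
    val width: Float get() = right - left
    val height: Float get() = bottom - top
}

// Hypothetical helper: maps a box from analysis-image coordinates to overlay coordinates.
fun mapToOverlay(
    box: Box,
    imageWidth: Int,
    imageHeight: Int,
    rotationDegrees: Int,
    overlayWidth: Int,
    overlayHeight: Int
): Box {
    // A 90 or 270 degree rotation swaps the width and height of the image.
    val rotated = rotationDegrees == 90 || rotationDegrees == 270
    val uprightWidth = if (rotated) imageHeight else imageWidth
    val uprightHeight = if (rotated) imageWidth else imageHeight

    // Scale factors from upright image pixels to overlay pixels.
    val scaleX = overlayWidth.toFloat() / uprightWidth
    val scaleY = overlayHeight.toFloat() / uprightHeight

    return Box(
        left = box.left * scaleX,
        top = box.top * scaleY,
        right = box.right * scaleX,
        bottom = box.bottom * scaleY
    )
}
```

The width and height of the mapped box would give the ImageView size, and its left and top would give the position inside the FrameLayout. If the preview crops the image to fill the screen, an extra offset would be needed on top of this.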


Finally, with those numbers fixed, we can add our ImageViews to the FrameLayout.

And voilà!

Our goal is achieved

Recognizing text from live preview

You can find the source code for this article on GitHub. Make sure to check out MoreFeatures, as I have added capture and scan as well as animated box translation.

Happy coding

