Text-extraction-app

I created an app using Streamlit and Tesseract OCR that lets a user upload an image (jpg) of text and have the OCR extract the text. Streamlit and Tesseract run in separate Docker containers, linked via a Flask API, and my plan is to add a docker-compose file so the app can be deployed easily. This is currently just a demonstrator, but there are many features that could be added, such as named entity recognition, sentiment analysis, etc. Anyone interested in contributing is welcome to do so; otherwise, fork it and use it as the foundation for your own projects.
Cheers

That’s really cool! What’s the Flask part for, serving tesseract?

Hi Randy, that's correct. It allows reusing the Tesseract serving in other applications, and I wanted to try something microservices-like with Streamlit as the frontend.
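For readers wondering what "Flask serving Tesseract" can look like, here is a minimal sketch. It is not the code from the repo: the `/extract` route, the `file` form field, and the `create_app` and `tesseract_ocr` names are all assumptions made for illustration. The third-party imports are deferred so the module loads even where Flask or Tesseract is missing.

```python
import io


def tesseract_ocr(image_bytes):
    """Run Tesseract on raw image bytes via pytesseract."""
    import pytesseract  # needs the tesseract binary installed in the container
    from PIL import Image

    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))


def create_app(ocr=None):
    """Build a small Flask app exposing a single OCR endpoint.

    `ocr` is any callable that takes image bytes and returns text. It is
    injectable so the service can be exercised without Tesseract.
    """
    from flask import Flask, jsonify, request

    if ocr is None:
        ocr = tesseract_ocr

    app = Flask(__name__)

    @app.route("/extract", methods=["POST"])  # hypothetical route name
    def extract():
        upload = request.files.get("file")  # hypothetical form field name
        if upload is None:
            return jsonify(error="no file uploaded"), 400
        return jsonify(text=ocr(upload.read()))

    return app
```

Inside the Tesseract container this could be started with something like `create_app().run(host="0.0.0.0", port=5000)`, and any frontend (Streamlit or otherwise) would then POST the uploaded image to it.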

Hi Robo,

Really cool. I am a newbie and mostly do Python scripting. I can use this at my work, where there is a need for it. Basic questions (I am still learning Streamlit and data science in Python): can’t Streamlit directly provide this ‘upload -> extract text’ app? As in, why is Docker needed (I have no knowledge of Docker)? I also thought Streamlit sort of obviated the need for Flask (I don’t want to learn Flask if I don’t have to).

Appreciate it!

Hi Raz
Sure, you can create a monolith and run it without Docker, but in the long run these are decisions you might regret :)
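For comparison, the monolith route could be sketched roughly as below: one Streamlit script that calls pytesseract directly, with no Flask and no Docker. This assumes Tesseract is installed on the same machine as Streamlit; `extract_text` and `main` are illustrative names, not anything from the repo.

```python
import io


def extract_text(image_bytes, ocr=None):
    """Return the text found in an image, with surrounding whitespace trimmed.

    By default this calls pytesseract directly; `ocr` lets you swap in
    another callable that takes image bytes and returns text.
    """
    if ocr is None:
        import pytesseract  # assumes the tesseract binary is installed locally
        from PIL import Image

        def ocr(data):
            return pytesseract.image_to_string(Image.open(io.BytesIO(data)))

    return ocr(image_bytes).strip()


def main():
    """Single-page app: upload an image, show the extracted text."""
    import streamlit as st

    st.title("Text extraction")
    uploaded = st.file_uploader("Upload an image of text", type=["jpg", "jpeg"])
    if uploaded is not None:
        st.text(extract_text(uploaded.getvalue()))
```

To try it, add a final `main()` call to the file and start it with `streamlit run app.py`. The trade-off is that the OCR step cannot be reused by other applications, and the host running Streamlit must also have Tesseract installed.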

Davide Fiocco has published a nice write-up on deploying Streamlit with a FastAPI backend, taking a similar approach to mine with Docker etc. Worth a read!

Can we do this for PDFs instead of images?

Not as it stands; that would require a pull request. You could, however, convert the PDF to a series of jpg images.
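One possible way to do that conversion is sketched below. It assumes the third-party `pdf2image` package (which in turn needs poppler installed) and is not part of the app; `pdf_to_jpgs` and the output file naming are made up for illustration.

```python
from pathlib import Path


def pdf_to_jpgs(pdf_path, out_dir, convert=None, dpi=200):
    """Save each page of a PDF as a JPG and return the written paths.

    `convert` defaults to pdf2image.convert_from_path, which returns one
    PIL image per page. It is injectable for testing.
    """
    if convert is None:
        from pdf2image import convert_from_path  # requires poppler

        convert = convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem

    paths = []
    for number, page in enumerate(convert(pdf_path, dpi=dpi), start=1):
        # Hypothetical naming scheme: <pdf name>_page001.jpg, _page002.jpg, ...
        target = out_dir / f"{stem}_page{number:03d}.jpg"
        page.convert("RGB").save(target, "JPEG")  # JPEG has no alpha channel
        paths.append(target)
    return paths
```

Each resulting jpg can then be uploaded to the app one page at a time.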

Hey!
I have also created an app along the same lines.
While deploying my app on Streamlit, I am getting a permission error.
My question is: how did you put the .exe file in Streamlit?
I am talking about “pytesseract.pytesseract.tesseract_cmd”.

Thanks!

@AJ_Rawat I separated Tesseract into a different container: text-insights-app/tesseract-engine at master · robmarkcole/text-insights-app · GitHub

@robmarkcole Is this the only way of doing it?
Can’t we just keep the Tesseract files in the GitHub repo and point pytesseract.pytesseract.tesseract_cmd to tesseract.exe?

I guess that depends on where and how you plan to deploy the app, as a .exe is for Windows environments whereas most hosted environments will be Linux.
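A hedged sketch of a platform-neutral alternative to hard-coding a path to tesseract.exe: look the binary up at runtime. The `TESSERACT_CMD` environment variable and both function names are assumptions for illustration. On a Linux host, Tesseract would typically be installed through the system package manager (the Debian/Ubuntu package is `tesseract-ocr`) rather than committed to the repo.

```python
import os
import shutil


def resolve_tesseract_cmd(env=None):
    """Return a path to the tesseract binary, or None if it cannot be found.

    Lookup order: the TESSERACT_CMD environment variable (a hypothetical
    name chosen for this sketch), then a search of PATH.
    """
    env = os.environ if env is None else env
    override = env.get("TESSERACT_CMD")
    if override:
        return override
    # On Windows, shutil.which also matches tesseract.exe via PATHEXT.
    return shutil.which("tesseract")


def configure_pytesseract():
    """Point pytesseract at the resolved binary, if one was found."""
    import pytesseract

    cmd = resolve_tesseract_cmd()
    if cmd:
        pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

With this, the same code runs on a Windows laptop and a Linux host, as long as Tesseract is installed on each by that platform's own means.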