pytesseract.pytesseract.TesseractNotFoundError without using pytesseract in the code

I have two apps that use the same code and the same ‘requirements.txt’ and ‘packages.txt’ files. One is working well, but the other has an error while I don’t use pytesseract in the code.
pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it’s not in your PATH. See README file for more information.

This is my file requirements.txt for 2 applications :
#pip>=23.3.1
langchain
langchain-community
pysqlite3-binary
#streamlit==1.28.0
streamlit
requests
#llama_index
openai
#docx2txt
unstructured
unstructured[docx]
unstructured[pdf]
opencv-python-headless
chromadb
tiktoken
pytesseract==0.3.8

I tried to add tesseract-ocr to packages.txt file but it takes 1 day run and after error again

This is my code :
def load_data():
with st.spinner(text=“Loading information – hang tight! This should take 1-2 minutes.”):
loader = DirectoryLoader(“SOURCE_DOCUMENTS/”)
index = VectorstoreIndexCreator().from_loaders([loader])
return index

index = load_data()
if question := st.chat_input(“Your question”):
st.session_state.messages.append({“role”: “user”, “content”: question})

for message in st.session_state.messages:
with st.chat_message(message[“role”]):
st.write(message[“content”])

if st.session_state.messages[-1][“role”] != “assistant”:
with st.chat_message(“assistant”):
with st.spinner(“Thinking…”):
prompt = f"User Query: {question}\n\nContext: Give long and more detailed answer "
response = index.query(prompt, llm = ChatOpenAI(model=“gpt-4-1106-preview”))
engine_link = “https://api.openai.com/v1/chat/completions
headers = {“Authorization”: f"Bearer {api_key}“}
prompt = f”… \n\nQuestion: {question}\nAnswer:"
payload = {
“messages”: [
{“role”: “system”, “content”: f"OpenAI/gpt-4-1106-preview"},
{“role”: “user”, “content”: prompt}
],
“model”: “gpt-4-1106-preview”,
}
response2 = requests.post(engine_link, headers=headers, json=payload)
if response2.status_code == 200:
response_data = response2.json()
response = response_data[“choices”][0][“message”][“content”]
st.write(response)
message = {“role”: “assistant”, “content”: response}
else:
st.error(f"Error: {response.status_code} - {response.reason}")
st.write(response)
message = {“role”: “assistant”, “content”: response}
st.session_state.messages.append(message)

1 Like

Hi @Jane1702

You may want to have a look at @snehankekre’s post:

Best,
Charly

1 Like

Hi @Charly_Wargnier ,

I followed it but when I add 2 of them in my packages.txt file , it took so long time (several hours to a day ) and after that it render error. Besides that , I don’t use tesseract in my code , why this error exist ? And why the other application works well without adding tesseract-ocr to packages.txt file while I use the same code and requirements.txt file for 2 applications?

1 Like

After waiting 3 hours , I have this after adding tesseract-ocr and tesseract-ocr-por …
Screenshot 2024-04-05 133053

1 Like

Successfully installed MarkupSafe-2.1.5 Pillow-10.3.0 PyYAML-6.0.1 SQLAlchemy-2.0.29 aiohttp-3.9.3 aiosignal-1.3.1 altair-5.3.0 annotated-types-0.6.0 antlr4-python3-runtime-4.9.3 anyio-4.3.0 asgiref-3.8.1 async-timeout-4.0.3 attrs-23.2.0 backoff-2.2.1 bcrypt-4.1.2 beautifulsoup4-4.12.3 blinker-1.7.0 build-1.2.1 cachetools-5.3.3 certifi-2024.2.2 cffi-1.16.0 chardet-5.2.0 charset-normalizer-3.3.2 chroma-hnswlib-0.7.3 chromadb-0.4.24 click-8.1.7 coloredlogs-15.0.1 contourpy-1.2.1 cryptography-42.0.5 cycler-0.12.1 dataclasses-json-0.6.4 dataclasses-json-speakeasy-0.5.11 deprecated-1.2.14 distro-1.9.0 effdet-0.4.1 emoji-2.11.0 exceptiongroup-1.2.0 fastapi-0.110.1 filelock-3.13.3 filetype-1.2.0 flatbuffers-24.3.25 fonttools-4.50.0 frozenlist-1.4.1 fsspec-2024.3.1 gitdb-4.0.11 gitpython-3.1.43 google-auth-2.29.0 googleapis-common-protos-1.63.0 greenlet-3.0.3 grpcio-1.62.1 h11-0.14.0 httpcore-1.0.5 httptools-0.6.1 httpx-0.27.0 huggingface-hub-0.22.2 humanfriendly-10.0 idna-3.6 importlib-metadata-7.0.0 importlib-resources-6.4.0 iopath-0.1.10 jinja2-3.1.3 joblib-1.3.2 jsonpatch-1.33 jsonpath-python-1.0.6 jsonpointer-2.4 jsonschema-4.21.1 jsonschema-specifications-2023.12.1 kiwisolver-1.4.5 kubernetes-29.0.0 langchain-0.1.14 langchain-community-0.0.31 langchain-core-0.1.40 langchain-text-splitters-0.0.1 langdetect-1.0.9 langsmith-0.1.40 layoutparser-0.3.4 lxml-4.9.4 markdown-it-py-3.0.0 marshmallow-3.21.1 matplotlib-3.8.4 mdurl-0.1.2 mmh3-4.1.0 monotonic-1.6 mpmath-1.3.0 multidict-6.0.5 mypy-extensions-1.0.0 networkx-3.2.1 nltk-3.8.1 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.19.3 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.1.105 oauthlib-3.2.2 omegaconf-2.3.0 onnx-1.16.0 onnxruntime-1.15.1 openai-1.16.2 opencv-python-4.9.0.80 opencv-python-headless-4.9.0.80 opentelemetry-api-1.24.0 opentelemetry-exporter-otlp-proto-common-1.24.0 opentelemetry-exporter-otlp-proto-grpc-1.24.0 opentelemetry-instrumentation-0.45b0 opentelemetry-instrumentation-asgi-0.45b0 opentelemetry-instrumentation-fastapi-0.45b0 opentelemetry-proto-1.24.0 opentelemetry-sdk-1.24.0 opentelemetry-semantic-conventions-0.45b0 opentelemetry-util-http-0.45b0 orjson-3.10.0 overrides-7.7.0 packaging-23.2 pandas-2.2.1 pdf2image-1.17.0 pdfminer.six-20231228 pdfplumber-0.11.0 pikepdf-8.14.0 pillow-heif-0.16.0 portalocker-2.8.2 posthog-3.5.0 protobuf-4.25.3 pulsar-client-3.4.0 pyarrow-15.0.2 pyasn1-0.6.0 pyasn1-modules-0.4.0 pycocotools-2.0.7 pycparser-2.22 pydantic-2.6.4 pydantic-core-2.16.3 pydeck-0.8.1b0 pygments-2.17.2 pyparsing-3.1.2 pypdf-4.1.0 pypdfium2-4.28.0 pypika-0.48.9 pyproject_hooks-1.0.0 pysqlite3-binary-0.5.2.post3 pytesseract-0.3.8 python-dateutil-2.9.0.post0 python-docx-1.1.0 python-dotenv-1.0.1 python-iso639-2024.2.7 python-magic-0.4.27 python-multipart-0.0.9 pytz-2024.1 rapidfuzz-3.7.0 referencing-0.34.0 regex-2023.12.25 requests-2.31.0 requests-oauthlib-2.0.0 rich-13.7.1 rpds-py-0.18.0 rsa-4.9 safetensors-0.4.2 scipy-1.13.0 setuptools-69.2.0 shellingham-1.5.4 six-1.16.0 smmap-5.0.1 sniffio-1.3.1 soupsieve-2.5 starlette-0.37.2 streamlit-1.33.0 sympy-1.12 tabulate-0.9.0 tenacity-8.2.3 tiktoken-0.6.0 timm-0.9.16 tokenizers-0.15.2 toml-0.10.2 tomli-2.0.1 toolz-0.12.1 torch-2.2.2 torchvision-0.17.2 tornado-6.4 tqdm-4.66.2 transformers-4.39.3 triton-2.2.0 typer-0.12.1 typing-extensions-4.10.0 typing-inspect-0.9.0 tzdata-2024.1 unstructured-0.13.2 unstructured-client-0.18.0 unstructured-inference-0.7.25 unstructured.pytesseract-0.3.12 urllib3-2.2.1 uvicorn-0.29.0 uvloop-0.19.0 watchdog-4.0.0 watchfiles-0.21.0 websocket-client-1.7.0 websockets-12.0 wrapt-1.16.0 yarl-1.9.4 zipp-3.18.1
Checking if Streamlit is installed
Found Streamlit version 1.33.0 in the environment

────────────────────────────────────────────────────────────────────────────────────────

[10:59:40] :snake: Python dependencies were installed from /mount/src/support-client/requirements.txt using pip.
[10:59:40] :package: Processed dependencies!
Stopping…

[10:59:53] :arrows_counterclockwise: Updated app!
[11:24:05] :exclamation: The service has encountered an error while checking the health of the Streamlit app: Get “http://localhost:8501/healthz”: read tcp 10.12.205.179:55056->10.12.205.179:8501: read: connection reset by peer
Stopping…
[11:25:43] :exclamation: Streamlit server consistently failed status checks
[11:25:43] :exclamation: Please fix the errors, push an update to the git repo, or reboot the app.

1 Like

You’re using langchain loaders which probably are loading pdf files (?) and those probably need pytesseract to extract the content of the PDFs through ocr. Before installing pytesseract you need to be sure you installed tesseract in your system, otherwise the pytesseract installation won’t work. But this isn’t related to streamlit directly.

1 Like

But I have already installed pytesseract. There is not error pytesseract anymore. The problem is the app took more than 3 hours to deploy and after it stopped.

1 Like

pytesseract is the python library to use Tesseract. Tesseract is the engine. They are two different things. To use pytesseract you need to install the Tesseract engine in your system. I wouldn’t know why it is taking so long

1 Like

I have two applications with the same code , requirements.txt , packages.txt files. When I install pytesseract, one application works fine, but the other encounters the issue I mentioned.

This is my file packages.txt :
libgl1
poppler-utils
tesseract-ocr
tesseract-ocr-por

This is my file requirements.txt :
langchain
langchain-community
pysqlite3-binary
#streamlit==1.28.0
streamlit
#llama_index
openai
#docx2txt
unstructured
unstructured[docx]
unstructured[pdf]
opencv-python-headless
chromadb
tiktoken
pytesseract==0.3.8

1 Like