I am creating an app that parses PDFs using the pdftotext package. It works fine on Google Colab for any PDF, but it fails on Streamlit for any PDF when I upload the file with st.file_uploader(). I am not sure whether I need to transform the uploaded file before passing it to pdftotext.PDF() for parsing.
- The app is deployed on Streamlit Cloud: https://pdfparsingtest.streamlit.app/
- GitHub repo: https://github.com/sunshineeast/pdf_parsing_test/tree/main
- requirements.txt: https://github.com/sunshineeast/pdf_parsing_test/blob/main/requirements.txt
- packages.txt: https://github.com/sunshineeast/pdf_parsing_test/blob/main/packages.txt
Code I have tried:

    with st.container(border=True):
        st.session_state.uploaded_file = st.file_uploader('Choose your **.pdf** file to upload', type="pdf")
        if st.session_state.uploaded_file is not None:
            st.success("File is uploaded")

    if st.session_state.uploaded_file:
        st.session_state.doc_parsed = st.session_state.uploaded_file.getvalue()
        # st.session_state.doc_parsed = StringIO(st.session_state.uploaded_file.getvalue().decode("utf-8"))
        # st.session_state.doc_parsed = StringIO(st.session_state.uploaded_file.getvalue())
        st.write(st.session_state.doc_parsed)
        # Not working on streamlit but working on google colab
        with open(st.session_state.doc_parsed, "rb") as f:
            st.session_state.pdf = pdftotext.PDF(f,physical=True)
        st.write(st.session_state.pdf)

Error:
ValueError: This app has encountered an error. The original error message is redacted to prevent data leaks. Full error details have been recorded in the logs (if you're on Streamlit Cloud, click on 'Manage app' in the lower right of your app).
Traceback:
File "/home/adminuser/venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "/home/adminuser/venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
File "/mount/src/pdf_parsing_test/pdftotext_streamlit_test.py", line 29, in <module>
    with open(st.session_state.doc_parsed, "rb") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error from Manage App:
ValueError: embedded null byte
──────────────────── Traceback (most recent call last) ────────────────────
/home/adminuser/venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/exec_code.py:88 in exec_func_with_error_handling
/home/adminuser/venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/script_runner.py:590 in code_to_exec
/mount/src/pdf_parsing_test/pdftotext_streamlit_test.py:29 in <module>
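
The "embedded null byte" message points at the open() call on line 29 rather than at pdftotext itself. Below is a minimal sketch, using only the standard library, of what appears to be happening: getvalue() returns the raw PDF bytes, and open() tries to interpret those bytes as a file path. The pdf_bytes value and both helper functions are hypothetical stand-ins, not code from the app. The closing comment assumes that pdftotext.PDF accepts any binary file-like object, which has not been verified here.

```python
import io

# Hypothetical stand-in for uploaded_file.getvalue(): raw PDF bytes, not a path.
# Binary PDF content typically contains null bytes somewhere in the stream.
pdf_bytes = b"%PDF-1.7\n\x00\x01binary payload\n%%EOF"


def try_open_as_path(data: bytes) -> str:
    """Pass raw bytes to open() as if they were a filename and report the outcome."""
    try:
        with open(data, "rb"):
            return "opened"
    except ValueError as exc:
        return f"ValueError: {exc}"


def as_file_object(data: bytes) -> io.BytesIO:
    """Wrap raw bytes in an in-memory binary file object."""
    return io.BytesIO(data)


error_message = try_open_as_path(pdf_bytes)
print(error_message)  # same message as in the Streamlit Cloud log

buffer = as_file_object(pdf_bytes)
header = buffer.read(8)  # read the first bytes like any file opened in "rb" mode
buffer.seek(0)           # rewind so a parser would start from the beginning
print(header)

# Assumption: pdftotext.PDF accepts any binary file-like object, so something
# like pdftotext.PDF(buffer, physical=True) could take the place of open().
```

If that assumption holds, the uploaded bytes would not need to be written to disk or passed through open() at all.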
Package versions in use:
streamlit==1.38.0
pandas==2.2.2
pdftotext==2.2.2
Code Working on Google Colab:
UPDATE:
I have also tried loading a PDF directly from the repo instead of through the uploader, and that worked:

    with open("sample_pdfs/Nonlinear_Optimization_in_R_using_nlopt.pdf", "rb") as f:
        st.session_state.pdf = pdftotext.PDF(f,physical=True)

I would appreciate any help!