Unable to use uploaded PDF file for pdftotext parsing on Streamlit

I am building an app that parses PDFs with the pdftotext package. The code works in Google Colab for any PDF, but on Streamlit it fails for every PDF I upload through st.file_uploader(). I am not sure whether I need to transform the uploaded file before passing it to pdftotext.PDF() for parsing.

  1. App is deployed on streamlit cloud
    https://pdfparsingtest.streamlit.app/

  2. Github Repo link:
    https://github.com/sunshineeast/pdf_parsing_test/tree/main

  3. requirements.txt and packages.txt:
    https://github.com/sunshineeast/pdf_parsing_test/blob/main/requirements.txt
    https://github.com/sunshineeast/pdf_parsing_test/blob/main/packages.txt

Code I have tried:

with st.container(border=True):
    st.session_state.uploaded_file = st.file_uploader('Choose your **.pdf** file to upload', type="pdf")

    if st.session_state.uploaded_file is not None:
        st.success("File is uploaded")

if st.session_state.uploaded_file:
    st.session_state.doc_parsed = st.session_state.uploaded_file.getvalue()
    # st.session_state.doc_parsed = StringIO(st.session_state.uploaded_file.getvalue().decode("utf-8"))
    # st.session_state.doc_parsed = StringIO(st.session_state.uploaded_file.getvalue())
    st.write(st.session_state.doc_parsed)

    # Not working on streamlit but working on google colab
    with open(st.session_state.doc_parsed, "rb") as f:
        st.session_state.pdf = pdftotext.PDF(f,physical=True)

    st.write(st.session_state.pdf)

Error:

ValueError: This app has encountered an error. The original error message is redacted to prevent data leaks. Full error details have been recorded in the logs (if you're on Streamlit Cloud, click on 'Manage app' in the lower right of your app).
Traceback:
File "/home/adminuser/venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "/home/adminuser/venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
File "/mount/src/pdf_parsing_test/pdftotext_streamlit_test.py", line 29, in <module>
    with open(st.session_state.doc_parsed, "rb") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Error from Manage App:

ValueError: embedded null byte

────────────────────── Traceback (most recent call last) ───────────────────────
  /home/adminuser/venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/exec_code.py:88 in exec_func_with_error_handling
  /home/adminuser/venv/lib/python3.12/site-packages/streamlit/runtime/scriptrunner/script_runner.py:590 in code_to_exec
  /mount/src/pdf_parsing_test/pdftotext_streamlit_test.py:29 in <module>

Using streamlit:

streamlit==1.38.0
pandas==2.2.2
pdftotext==2.2.2

Code Working on Google Colab:

UPDATE:
I have also tried loading a PDF directly from the repo instead of through the uploader, and that worked:

    with open("sample_pdfs/Nonlinear_Optimization_in_R_using_nlopt.pdf", "rb") as f:
        st.session_state.pdf = pdftotext.PDF(f,physical=True)
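
For comparison, the working call above hands open() a path string, while the failing call hands it the bytes returned by getvalue(). The sketch below uses only the standard library and a made-up byte string in place of a real PDF. It suggests a likely cause of the "embedded null byte" error: open() treats bytes as a filesystem path, and a path cannot contain null bytes. The names pdf_bytes and try_open_as_path are illustrative only and are not part of the app.

```python
import io

# Stand-in for st.session_state.uploaded_file.getvalue(): raw PDF content.
# Real PDFs usually contain null bytes somewhere in their binary streams.
pdf_bytes = b"%PDF-1.7\n\x00\x01binary stream\n%%EOF"


def try_open_as_path(data):
    """Pass the bytes to open() as if they were a path and report the outcome."""
    try:
        with open(data, "rb"):
            return "opened"
    except ValueError as exc:
        # open() interprets bytes as a path and rejects null bytes in it.
        return f"ValueError: {exc}"


print(try_open_as_path(pdf_bytes))

# Wrapping the same bytes in BytesIO gives a file-like object instead,
# which can be read without touching the filesystem.
buffer = io.BytesIO(pdf_bytes)
print(buffer.read(8))  # b'%PDF-1.7'
```

If this is what is happening in the app, the content never needs to go through open() at all, because it is already in memory.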

Appreciate any help !!!

Hi … can anyone help me with this issue?

@Goyo, @blackary

Do you have any idea about this?

I would appreciate any help here!

Thanks

There is no need to call open() here: open() expects a path, and st.file_uploader already returns a file-like object. I think passing the returned object from st.file_uploader directly to pdftotext.PDF should be enough.

From the docs of st.file_uploader:

Returns
(None or UploadedFile or list of UploadedFile)

The UploadedFile class is a subclass of BytesIO, and therefore is “file-like”. This means you can pass an instance of it anywhere a file is expected.
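
A minimal sketch of that suggestion, assuming streamlit and pdftotext are installed as in the question. The helpers as_binary_stream, extract_pages and main are hypothetical names introduced for illustration, and the find_spec guard at the end is only there so the snippet also loads where those packages are missing. Rewinding with seek(0) is a precaution in case the stream was already read, not something the docs require.

```python
import importlib.util
import io


def as_binary_stream(source):
    """Return a rewound binary file-like object for raw bytes or an existing stream."""
    if isinstance(source, (bytes, bytearray)):
        return io.BytesIO(source)  # e.g. the result of uploaded_file.getvalue()
    source.seek(0)  # start from the beginning in case it was read before
    return source


def extract_pages(source, physical=True):
    """Return the text of each page; no open() call is involved."""
    import pdftotext  # imported lazily; needs the poppler system libraries

    return list(pdftotext.PDF(as_binary_stream(source), physical=physical))


def main():
    import streamlit as st

    uploaded_file = st.file_uploader("Choose your **.pdf** file to upload", type="pdf")
    if uploaded_file is not None:
        st.success("File is uploaded")
        # UploadedFile is file-like, so it is passed to the parser directly.
        pages = extract_pages(uploaded_file)
        st.write(f"{len(pages)} page(s) parsed")
        st.text("\n\n".join(pages))


# Run the app only where both third-party packages are available.
if importlib.util.find_spec("streamlit") and importlib.util.find_spec("pdftotext"):
    main()
```

Keeping the parsed pages in a plain list rather than the pdftotext.PDF object is a design choice for this sketch, since a list of strings is straightforward to store in st.session_state or display.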


Yes, this solved the problem. I was totally stuck on this.
Thanks a lot. I really appreciate your help!
