Uploaded PDF not parsing (returns blank output) through LlamaParse in Streamlit

I uploaded a PDF file using st.file_uploader() from Streamlit and am trying to parse it with LlamaParse(). The app is currently running on localhost.

The issue is that I get a blank list ([]) as output when I pass the uploaded_file object from the code below.

I am not sure whether the object returned by the uploader can be fed directly to llama_parse. The PDF parses fine when I give LlamaParse() the file's path in my local directory instead of the result of uploaded_file = st.file_uploader().

It looks like the file is not reaching LlamaParse. Is there something I need to do with uploaded_file before passing it to LlamaParse()?

I have also gone through the link below and tried StringIO, but I am still stuck with errors or no results.

https://discuss.streamlit.io/t/expected-str-bytes-or-os-pathlike-object-not-

I have been stuck on this for days and have also posted it on SO. I really need help, as this will otherwise jeopardize the whole project.

sample code:

import streamlit as st
import pandas as pd
import os
from io import StringIO
import nest_asyncio 

nest_asyncio.apply()
os.environ["LLAMA_CLOUD_API_KEY"] = my_secret_key
key_input = my_secret_key

from llama_parse import LlamaParse

############### Setting Configuration ###############

st.set_page_config(page_title="Pdf Parsing",
                    layout='wide',
                    initial_sidebar_state="expanded")

# title
st.markdown("<h1 style='text-align: center; font-size: 70px;color: black;letter-spacing: 4px;'>PDF Parsing</h1>", unsafe_allow_html=True)

st.write("checkpoint1")
with st.container(border=True):

    st.write("checkpoint2")
    st.session_state.uploaded_file = st.file_uploader('Choose your .pdf file to upload', type="pdf")

    if st.session_state.uploaded_file is not None:
        st.write(st.session_state.uploaded_file.name)
        st.success("File is uploaded")

        # https://docs.streamlit.io/develop/api-reference/widgets/st.file_uploader
        # https://discuss.streamlit.io/t/expected-str-bytes-or-os-pathlike-object-not-uploadedfile-for-pdf-file/10436
        st.session_state.uploaded_file_stringio = StringIO(st.session_state.uploaded_file.getvalue().decode("utf-8"))

if st.session_state.uploaded_file:

    # st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
    #                                 ).load_data(st.session_state.uploaded_file)

    st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                    ).load_data(st.session_state.uploaded_file_stringio)

    st.write('checkpoint3')
    st.write(st.session_state.doc_parsed)

Error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 11: invalid start byte
Traceback:
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
File "V:\1. R & Python work\Python\xxx\pdf parser llma\Testing_upload_llama.py", line 35, in <module>
    st.session_state.uploaded_file_stringio = StringIO(st.session_state.uploaded_file.getvalue().decode("utf-8"))

Version info

Python 3.12.4
streamlit 1.38.0
llama_parse 0.5.0
pandas 2.2.2

According to the documentation I found you can pass a file object or just bytes. Trying to decode the bytes to text doesn’t make sense, since a pdf file is made of much more than encoded text.
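
To illustrate the point, here is a minimal sketch. The byte string is a made-up stand-in for the start of a PDF, not the actual file: decoding it as UTF-8 fails, while io.BytesIO wraps the same bytes without decoding anything.

```python
from io import BytesIO

# Made-up stand-in for the first bytes of a PDF: a text header followed by
# raw binary bytes that are not valid UTF-8.
pdf_like = b"%PDF-1.7\n%\xaa\xab\xac\xad\n"

try:
    pdf_like.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # Same class of error as in the traceback above.
    decoded_ok = False

# BytesIO keeps the data as bytes, so nothing needs to be decoded.
buffer = BytesIO(pdf_like)
header = buffer.read(8)
```

If a file-like object is needed, BytesIO(uploaded_file.getvalue()) is the binary counterpart of the StringIO attempt.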

Thank you for the quick reply.

  1. When I pass the uploaded PDF file directly, I get a blank list ([]) as output. Code: st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input).load_data(st.session_state.uploaded_file)
  2. When I use StringIO on the uploaded file I get an error, and as I understand from your reply this step is not required.
  3. If I follow the llama_parse GitHub documentation and use SimpleDirectoryReader(), what path do I need to give for the Streamlit uploaded file?

I have tried the code below, based on SimpleDirectoryReader():

parser = LlamaParse(
    api_key=key_input,  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    verbose=True,
)

  1. Attempt
    file_extractor = {".pdf": parser}
    documents = SimpleDirectoryReader(
    "./st.session_state.uploaded_file", file_extractor=file_extractor
    ).load_data()

Error: ValueError: Directory ./st.session_state.uploaded_file does not exist.

  2. Attempt
    file_extractor = {".pdf": parser}
    documents = SimpleDirectoryReader(
    st.session_state.uploaded_file, file_extractor=file_extractor
    ).load_data()

Error:

Traceback:
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
File "V:\1. R & Python work\Python\xxx\pdf parser llma\Testing_upload_llama.py", line 80, in <module>
    documents = SimpleDirectoryReader(
                ^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\llama_index\core\readers\file\base.py", line 259, in __init__
    if not self.fs.isdir(input_dir):
           ^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\fsspec\implementations\local.py", line 134, in isdir
    path = self._strip_protocol(path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\fsspec\implementations\local.py", line 233, in _strip_protocol
    if path.startswith("file://"):
       ^^^^^^^^^^^^^^^

I am not sure what else I can do. Can you please elaborate on what you are expecting me to do here?
Thanks !!

  1. Try passing the bytes st.session_state.uploaded_file.getvalue() instead.
  2. Doesn’t make any sense to me and the error is expected (the file is not made of just utf-8-encoded text).
  3. Uploaded files are kept in memory. If you want to read them from the file system you need to save them first.
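
For point 3, a minimal sketch of saving the uploaded bytes first, using only the standard library. save_upload_to_temp and remove_temp are hypothetical helper names, not part of Streamlit or llama_parse. The sketch uses delete=False and closes the file before returning the path, because the Python documentation notes that on Windows a NamedTemporaryFile generally cannot be opened a second time by name while it is still open.

```python
import os
import tempfile


def save_upload_to_temp(data: bytes, suffix: str = ".pdf") -> str:
    """Write uploaded bytes to a temporary file and return its path."""
    # delete=False keeps the file on disk after the handle is closed, so
    # another reader can open it by name afterwards (including on Windows).
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
    return tmp.name


def remove_temp(path: str) -> None:
    """Delete the temporary file once it is no longer needed."""
    if os.path.exists(path):
        os.remove(path)
```

In the app this would be called with st.session_state.uploaded_file.getvalue(), the returned path passed to load_data, and remove_temp called once parsing is done.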

I tried the two snippets below and couldn’t get results:

  1. Getting a blank list as output instead of the parsed list of documents:
st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                ).load_data(st.session_state.uploaded_file.getvalue())

 st.write('checkpoint3')
 st.write(st.session_state.doc_parsed)

Output:

checkpoint3
[]

  2. Tried the code below:
st.session_state.uploaded_file_stringio = StringIO(st.session_state.uploaded_file.getvalue())

st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                    ).load_data(st.session_state.uploaded_file_stringio)

st.write('checkpoint3')
st.write(st.session_state.doc_parsed)

Error:

TypeError: initial_value must be str or None, not bytes
Traceback:
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
File "V:\1. R & Python work\Python\xxx\pdf parser llma\Testing_upload_llama.py", line 39, in <module>
    st.session_state.uploaded_file_stringio = StringIO(st.session_state.uploaded_file.getvalue())
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Well, that seems to be what your LlamaParse instance can do with your file. I have never used LlamaParse, so I don’t know what to expect or what could be wrong.

As a last resort, compare that to what you get by passing the path to the file instead of the file itself.

parser.load_data("path/to/your/file.pdf")

I already parsed this PDF document a few days ago using LlamaParse() without Streamlit, and it returned 20 elements, equivalent to its 20 pages. It's only when using Streamlit that I am facing issues.
Also, in Streamlit it returns the result (a blank list) instantly, as if it had nothing to parse, whereas it takes some time to process when I run LlamaParse without Streamlit.

See this screenshot:

I think the issue is that it requires a file in .pdf format, so passing a variable/object instead of a .pdf path doesn’t work. Maybe if I host this app somewhere, first save the uploaded file, and then give that path to LlamaParse, it could work. I could be wrong here, as I don’t have much experience with this.

That is not the comparison you need.

  • A few days is enough for something to change without you remembering or even noticing.
  • With / without streamlit can mean different environments. It is not uncommon for people to inadvertently switch environments, more so when running code using streamlit, ipython, jupyter, vscode…
  • The parser is instantiated with an extra argument in the notebook version.
  • …

You want to compare the result of passing the file name and the bytes in the same execution of the same program, using the same parser.

parsed_1 = parser.load_data(st.session_state.uploaded_file.getvalue())
parsed_2 = parser.load_data("path/to/your/file.pdf")

Passing the file name should have the same effect as passing the contents of the file. But the best way to know for sure is to compare the results while keeping everything else the same.

  1. Actually, I ran both Streamlit and the parser 6 days ago, on 13th Sep, and created this Stack Overflow post at that time:

  2. I just ran the parser again without Streamlit, and that gives results.

  3. I just ran the parser within Streamlit with the code below:

st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                ).load_data(st.session_state.uploaded_file.getvalue())

st.write('checkpoint3')
st.write(st.session_state.doc_parsed)

streamlit output:

I am still getting no results when I run it in Streamlit, whereas I get results when I work without Streamlit.

Can someone from the Streamlit team please check this with any of their PDFs?

Since you seem focused on Streamlit vs. without Streamlit, what if you use the file name in the Streamlit application?

st.session_state.doc_parsed = LlamaParse(
    result_type="markdown",
    api_key=key_input
).load_data("path/to/your/file.pdf")

Well, this was a good idea to try, and it worked this way. As you suggested, I loaded the file from my local directory in Streamlit instead of using the uploaded file:

st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                ).load_data(r"V:\my_path\2024-08-03_VINEET-tests.pdf")

st.write('checkpoint3')
st.write(st.session_state.doc_parsed)

P.S:

So now, how can I solve the puzzle of parsing a file uploaded by each user?

Without user uploads I won’t be able to do this in production with Streamlit.

Please help !!!

Hi.
Since it works when the file is loaded directly into the parser, and I came across this Streamlit link about creating temporary files, I tried the code below but am still getting blank results. Do you think anything else can be tried, or could any modification to this step help?

from tempfile import NamedTemporaryFile

if st.session_state.uploaded_file:

    with NamedTemporaryFile(suffix=".pdf") as temp:
        temp.write(st.session_state.uploaded_file.getvalue())
        temp.seek(0)
        st.write(temp.name)

        st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                    ).load_data(temp.name)
        st.write("step completed")
        st.write(st.session_state.doc_parsed)

Output

C:\Users\vin\AppData\Local\Temp\tmpffwd7k9o.pdf

step completed

[]

Appreciate any sort of help!!

I was unable to test this until right now.

When I pass a file-like object (the uploaded file)

st.session_state.doc_parsed = LlamaParse(
    result_type="markdown", api_key=key_input
).load_data(st.session_state.uploaded_file)

I get an empty list as well, along with a message in the terminal:

Error while parsing the file '<bytes/buffer>': file_name must be provided in extra_info when passing bytes

Passing extra_info={"file_name": "_"} to load_data, as suggested by the message, fixed it for me. As far as I can tell, any non-empty string works. This is absent from the examples and I have no idea why it is needed.
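
A minimal sketch of that fix, assuming load_data accepts the bytes plus an extra_info dict as the error message describes. build_extra_info and parse_upload are hypothetical helper names, not part of llama_parse or Streamlit, and using the upload's own name instead of "_" is only an assumption based on any non-empty string appearing to work.

```python
def build_extra_info(uploaded_file, fallback_name="upload.pdf"):
    """Return the extra_info dict for load_data, using the upload's name."""
    # Fall back to a placeholder when the object has no usable name.
    name = getattr(uploaded_file, "name", "") or fallback_name
    return {"file_name": name}


def parse_upload(parser, uploaded_file):
    """Pass the upload's bytes plus a file name to parser.load_data."""
    return parser.load_data(
        uploaded_file.getvalue(),
        extra_info=build_extra_info(uploaded_file),
    )
```

In the app, parser would be the LlamaParse(result_type="markdown", api_key=key_input) instance and uploaded_file would be st.session_state.uploaded_file. It is also worth watching the terminal, since the parsing error is printed there while the app only shows [].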


Thanks aloootttttt. This really solved the issue for me as well.

It was my bad that I was not looking at the terminal, since the app was only returning a blank list. Your attention to detail really helped.

I really, really appreciate your help. Thanks again :)
