I uploaded a PDF file with st.file_uploader() from Streamlit and am trying to parse it with LlamaParse(); the app is currently running on localhost.
The issue is that I get an empty list ([]) as output when I pass the uploaded_file object from the code below.
I am not sure whether the returned upload object can be fed to LlamaParse directly. The same PDF parses fine when I give LlamaParse() the file from my local directory instead of the object returned by uploaded_file = st.file_uploader().
It looks like the file never reaches LlamaParse. Is there something I need to do with uploaded_file before feeding it into LlamaParse()?
I have also been through the links below and tried StringIO, but I am still stuck with errors or no results.
I have been stuck on this for days and have also posted it on SO. I really need help, as this will otherwise jeopardize the whole project.
Sample code:
import streamlit as st
import pandas as pd
import os
from io import StringIO
import nest_asyncio
nest_asyncio.apply()
os.environ["LLAMA_CLOUD_API_KEY"] = my_secret_key
key_input = my_secret_key
from llama_parse import LlamaParse
############### Setting Configuration ###############
st.set_page_config(page_title="Pdf Parsing",
                   layout='wide',
                   initial_sidebar_state="expanded")
# title
st.markdown("<h1 style='text-align: center; font-size: 70px;color: black;letter-spacing: 4px;'>PDF Parsing</h1>", unsafe_allow_html=True)
st.write("checkpoint1")
with st.container(border=True):
    st.write("checkpoint2")
    st.session_state.uploaded_file = st.file_uploader('Choose your .pdf file to upload', type="pdf")
    if st.session_state.uploaded_file is not None:
        st.write(st.session_state.uploaded_file.name)
        st.success("File is uploaded")
        # https://docs.streamlit.io/develop/api-reference/widgets/st.file_uploader
        # https://discuss.streamlit.io/t/expected-str-bytes-or-os-pathlike-object-not-uploadedfile-for-pdf-file/10436
        st.session_state.uploaded_file_stringio = StringIO(st.session_state.uploaded_file.getvalue().decode("utf-8"))
if st.session_state.uploaded_file:
    # st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
    #                                          ).load_data(st.session_state.uploaded_file)
    st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                             ).load_data(st.session_state.uploaded_file_stringio)
    st.write('checkpoint3')
    st.write(st.session_state.doc_parsed)
Error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 11: invalid start byte
Traceback:
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
result = func()
^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
exec(code, module.__dict__)
File "V:\1. R & Python work\Python\xxx\pdf parser llma\Testing_upload_llama.py", line 35, in <module>
st.session_state.uploaded_file_stringio = StringIO(st.session_state.uploaded_file.getvalue().decode("utf-8"))
According to the documentation I found you can pass a file object or just bytes. Trying to decode the bytes to text doesn’t make sense, since a pdf file is made of much more than encoded text.
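To illustrate the point about bytes versus text, here is a minimal sketch. The `pdf_bytes` value is a made-up stand-in for what `uploaded_file.getvalue()` returns (a PDF header followed by a few binary bytes), and `is_utf8_text` is a hypothetical helper, not part of Streamlit or LlamaParse.

```python
from io import BytesIO, StringIO

# Hypothetical stand-in for uploaded_file.getvalue(): a PDF header followed by
# the kind of binary bytes that real PDF files usually contain.
pdf_bytes = b"%PDF-1.7\n%\xaa\xab\xac\xad\n"


def is_utf8_text(data: bytes) -> bool:
    """Return True only if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Binary content is not valid UTF-8, so .decode("utf-8") fails on it.
print(is_utf8_text(pdf_bytes))

# StringIO only accepts str, so handing it raw bytes raises TypeError.
try:
    StringIO(pdf_bytes)
except TypeError as exc:
    print(exc)

# BytesIO wraps the raw bytes without any decoding.
buffer = BytesIO(pdf_bytes)
print(buffer.read(5))
```

If a file-like object is needed at all, BytesIO is the wrapper that matches binary data; StringIO is meant for text only.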
When I pass the uploaded PDF file directly, I get an empty list ([]) as output. Code: st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input).load_data(st.session_state.uploaded_file)
When I use StringIO on the uploaded file I get an error, and as I understand from your reply, this step is not required.
If I follow the LlamaParse GitHub documentation and use SimpleDirectoryReader(), what path do I need to give for the Streamlit uploaded file?
I have tried the code below, based on SimpleDirectoryReader():
parser = LlamaParse(
    api_key=key_input,  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    verbose=True,
Traceback:
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
result = func()
^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
exec(code, module.__dict__)
File "V:\1. R & Python work\Python\xxx\pdf parser llma\Testing_upload_llama.py", line 80, in <module>
documents = SimpleDirectoryReader(
^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\llama_index\core\readers\file\base.py", line 259, in __init__
if not self.fs.isdir(input_dir):
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\fsspec\implementations\local.py", line 134, in isdir
path = self._strip_protocol(path)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\fsspec\implementations\local.py", line 233, in _strip_protocol
if path.startswith("file://"):
^^^^^^^^^^^^^^^
I am not sure what else I can do. Could you please elaborate on what you expect me to do here?
Thanks !!
TypeError: initial_value must be str or None, not bytes
Traceback:
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
result = func()
^^^^^^
File "C:\Users\xxx\anaconda3\envs\llma_py_3_12\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
exec(code, module.__dict__)
File "V:\1. R & Python work\Python\xxx\pdf parser llma\Testing_upload_llama.py", line 39, in <module>
st.session_state.uploaded_file_stringio = StringIO(st.session_state.uploaded_file.getvalue())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Well, that seems to be what your LlamaParse instance can do with your file. I have never used LlamaParse, so I don’t know what to expect or what could be wrong.
As a last resort, compare that to what you get by passing the path to the file instead of the file itself.
I already parsed this PDF document a few days ago using LlamaParse() without Streamlit, and it returned 20 elements, one for each of the 20 pages. It is only when using Streamlit that I run into issues.
When I use it in Streamlit it returns instantly with an empty list, as if it had nothing to parse, whereas it takes some time to process when I run LlamaParse without Streamlit.
I think the issue is that it requires a file in .pdf format, so passing the variable/object instead of a .pdf file doesn't work. Maybe if I host this app somewhere, save the uploaded file first, and then give that path to LlamaParse, it could work. I could be wrong here, as I don't have much experience with this.
A few days is enough time for something to change without you remembering or noticing.
With / without Streamlit can mean different environments. It is not uncommon for people to inadvertently switch environments, more so when running code using streamlit, ipython, jupyter, vscode…
The parser is instantiated with an extra argument in the notebook version.
…
You want to compare the result of passing the file name and the bytes in the same execution of the same program, using the same parser.
Passing the file name should have the same effect as passing the contents of the file. But the best way to know for sure is to compare the results while keeping everything else the same.
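A minimal sketch of that comparison, assuming `parser` is any object with a `load_data(file_input, extra_info=None)` method, as LlamaParse has. The helper name `compare_path_and_bytes` is made up for illustration, and the `file_name` entry in `extra_info` follows the error message quoted further down in this thread.

```python
from pathlib import Path


def compare_path_and_bytes(parser, pdf_path):
    """Parse the same PDF twice with one parser: by path, then by raw bytes."""
    path = Path(pdf_path)

    # First run: hand the parser the path on disk.
    from_path = parser.load_data(str(path))

    # Second run: hand the same parser the file contents as bytes.
    from_bytes = parser.load_data(
        path.read_bytes(),
        extra_info={"file_name": path.name},
    )

    # Return the number of documents from each run so they can be compared.
    return len(from_path), len(from_bytes)
```

Running both calls in the same script with the same parser instance rules out differences in environment or parser arguments.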
Well, this was a good idea to try, and it worked this way. As you suggested, I loaded the file from my local directory in Streamlit instead of using the uploaded file:
Hi.
Since it works when the file is passed directly to the parser, I followed this streamlit_link about creating temporary files and tried the code below, but I still get blank results. Do you think there is anything else I can try, or any modification to this step that could help?
from tempfile import NamedTemporaryFile
if st.session_state.uploaded_file:
    with NamedTemporaryFile(suffix=".pdf") as temp:
        temp.write(st.session_state.uploaded_file.getvalue())
        temp.seek(0)
        st.write(temp.name)
        st.session_state.doc_parsed = LlamaParse(result_type="markdown",api_key=key_input
                                                 ).load_data(temp.name)
        st.write("step completed")
        st.write(st.session_state.doc_parsed)
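One possible cause, which is an assumption and not confirmed in this thread: on Windows, a NamedTemporaryFile generally cannot be opened a second time by name while the original handle is still open, so a parser that opens temp.name itself may fail to read it. A sketch of a variant that closes the file before parsing is below; `parse_via_temp_file` is a hypothetical helper and `parser` is assumed to expose `load_data(path)`.

```python
import os
from tempfile import NamedTemporaryFile


def parse_via_temp_file(parser, data: bytes, suffix: str = ".pdf"):
    """Write bytes to a temp file, close it, parse it by path, then clean up."""
    # delete=False keeps the file on disk after the handle is closed.
    with NamedTemporaryFile(suffix=suffix, delete=False) as temp:
        temp.write(data)
        temp_path = temp.name
    try:
        # The handle is closed here, so another reader can open the path.
        return parser.load_data(temp_path)
    finally:
        # Remove the temp file whether or not parsing succeeded.
        os.remove(temp_path)
```

In the app this would be called as parse_via_temp_file(parser, st.session_state.uploaded_file.getvalue()).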
I get an empty list as well, along with a message in the terminal:
Error while parsing the file '<bytes/buffer>': file_name must be provided in extra_info when passing bytes
Passing extra_info={"file_name": "_"} to load_data, as suggested by the message, fixed it for me. As far as I can tell, any non-empty string works. This is absent from the examples and I have no idea why it is needed.
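Applied to the uploaded file, the fix could look roughly like the sketch below. `parse_uploaded_pdf` is a hypothetical helper; it only assumes that `uploaded_file` has `getvalue()` and `name` (as Streamlit's UploadedFile does) and that `parser` accepts bytes plus `extra_info` in `load_data`. Using the real file name instead of "_" is my own choice here, not something the error message requires.

```python
def parse_uploaded_pdf(parser, uploaded_file):
    """Send an uploaded file's bytes to the parser, naming the file explicitly."""
    return parser.load_data(
        uploaded_file.getvalue(),  # raw bytes, no decoding
        extra_info={"file_name": uploaded_file.name},
    )


# Usage inside the Streamlit app (not run here, needs an API key):
# from llama_parse import LlamaParse
# parser = LlamaParse(result_type="markdown", api_key=key_input)
# st.session_state.doc_parsed = parse_uploaded_pdf(parser, st.session_state.uploaded_file)
```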