Editing user-submitted PDF with PyMuPDF then making it available for download

gkrut · May 5, 2023, 6:58pm

Summary

I’m trying to take a user-uploaded PDF, edit it based on user inputs, then return several different copies for download. Using PyMuPDF, I was able to make this work as a plain local script (no Streamlit), but I’m having trouble with file management now that it runs in Streamlit.

I’m currently running this on localhost, if that changes things.

Steps to reproduce

Code snippet:
For user to upload PDF:

script = st.file_uploader("Upload Script (.pdf only)",type="pdf")

Edit and make download button:

if script is not None:
   with fitz.open(stream=script.read(), filetype="pdf") as pdf_file:
      pdf_page_count = pdf_file.page_count   
      for page in range(pdf_page_count):  
         page_obj = pdf_file[page] 
         content_of_page = pdf_file.get_page_text(page) 
         match_word = character_list[0] 
         content_of_page = page_obj.get_text("words",sort=False)  #get rect for all words
         for word in content_of_page:
            if word[4] == match_word:
               rect_comp = fitz.Rect(word[0],word[1],word[2],word[3])
               highlight = page_obj.add_highlight_annot(rect_comp)
               highlight.set_colors(stroke=[0, 1, 0.8])
               highlight.update()
      st.download_button(
         label="Download Script",
         data=pdf_file,
         file_name="Highlighted Script",
         mime="application/octet-stream"
         )

Expected behavior:

When running a modified version of the above script through command prompt, it works and spits out a highlighted script (or scripts, by running the highlight function once for each character/actor pair) using pdf_file.save()

Actual behavior:

The above code gives the following error after a PDF is uploaded:

RuntimeError: Invalid binary data format: <class 'fitz.fitz.Document'>

Traceback:

File "C:\Users\ME\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "C:\Users\ME\Desktop\Python Projects\highlighter_web.py", line 98, in <module>
    st.download_button(
File "C:\Users\ME\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\metrics_util.py", line 332, in wrapped_func
    result = non_optional_func(*args, **kwargs)
File "C:\Users\ME\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\elements\button.py", line 311, in download_button
    return self._download_button(
File "C:\Users\ME\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\elements\button.py", line 355, in _download_button
    marshall_file(
File "C:\Users\ME\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\elements\button.py", line 487, in marshall_file
    raise RuntimeError("Invalid binary data format: %s" % type(data))

Debug info

Streamlit version: 1.22.0
Python version: 3.10.4
OS version: Win10
Browser version: Brave 1.51.110 (up to date)

Additional information

I seem to be able to manipulate the uploaded PDF in some ways (like in this post), but things seem to fall apart when it comes time to download.

Thanks in advance for your help!

Goyo · May 5, 2023, 8:01pm

Indeed, pdf_file is a Document object, which is not what st.download_button expects at all. You can save the document to an io.BytesIO object instead.

gkrut · May 5, 2023, 8:35pm

Thanks for this lead! At the risk of sounding stupid, io.BytesIO is new to me. Can I ask how you’d recommend using it? Is it a matter of turning pdf_file into an io.BytesIO object and then putting that object into st.download_button()? (And, if so, how?)

gkrut · May 5, 2023, 9:16pm

Got it to work!! Added the following:

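      # Requires "import io" at the top of the script.
      # Save once, after the page loop, while pdf_file is still open.
      output_buffer = io.BytesIO()
      pdf_file.save(output_buffer)
      pdf_bytes = output_buffer.getvalue()
      st.download_button(
         label="Download Script",
         data=pdf_bytes,
         file_name="Highlighted Script.pdf",
         mime="application/pdf"
      )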

And it works like a charm! Thank you for your prompt help
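
For anyone landing here later, the save-to-buffer pattern can be wrapped in a small helper. This is only a minimal sketch: it assumes nothing more than that the document object has a save() method accepting a file-like object, as pdf_file.save(output_buffer) does in this thread. The names document_to_bytes and FakeDocument are hypothetical; FakeDocument merely stands in for a fitz document so the snippet runs without PyMuPDF installed.

```python
import io


def document_to_bytes(doc) -> bytes:
    """Serialize a document to bytes via an in-memory buffer."""
    buffer = io.BytesIO()
    doc.save(buffer)          # the document writes itself into the buffer
    return buffer.getvalue()  # full buffer contents, regardless of position


class FakeDocument:
    """Hypothetical stand-in for a fitz document; it only mimics save()."""

    def __init__(self, payload: bytes):
        self.payload = payload

    def save(self, target) -> None:
        target.write(self.payload)


pdf_bytes = document_to_bytes(FakeDocument(b"%PDF-1.7 example"))
print(type(pdf_bytes).__name__, len(pdf_bytes))
```

In the Streamlit script this would presumably be called as document_to_bytes(pdf_file) inside the with block, with the result passed as data= to st.download_button, which accepts plain bytes.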

gkrut · May 6, 2023, 2:32am

New dilemma- The above works fine on its own, but seems to break if I try to do it multiple times in a row. If I try to put something like:

for n in range(len(actor_list)):
   with fitz.open(stream=script.read(), filetype="pdf") as pdf_file:

It works for the first pass, but then I get EmptyFileError: cannot open empty document. Any idea why that may be?

gkrut · May 6, 2023, 2:46am

Gosh I’m sorry for blowing this thread up all by myself. I sorted this out with help from this thread by throwing

uploaded_file.seek(0,0)

before the with fitz.open(stream...

No clue why it works, but not complaining!
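
The likely reason it works: the uploaded file behaves like an in-memory binary stream, and read() leaves the stream position at the end. The next read() therefore returns empty bytes, and fitz.open has nothing to parse, hence the EmptyFileError on the second pass. Calling seek(0, 0) moves the position back to the start. (uploaded_file in the line above corresponds to script in the earlier snippets.) Here is a minimal sketch using a plain io.BytesIO as a stand-in for the upload; the variable names and payload are made up for illustration.

```python
import io

# Stand-in for the object returned by st.file_uploader.
uploaded = io.BytesIO(b"%PDF-1.7 example")

first = uploaded.read()      # reads everything; position is now at the end
second = uploaded.read()     # nothing left to read from this position

uploaded.seek(0, 0)          # offset 0 relative to the start of the stream
third = uploaded.read()      # full contents again

whole = uploaded.getvalue()  # ignores the position entirely

print(len(first), len(second), len(third), len(whole))
```

An alternative, assuming the upload object exposes getvalue() the way io.BytesIO does, is to grab the bytes once before the loop and pass those same bytes to fitz.open(stream=...) on every pass, so no rewinding is needed.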

system · May 8, 2023, 2:46am

This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.
