Editing user-submitted PDF with PyMuPDF then making it available for download

Summary

I’m trying to take a user-uploaded PDF, edit it based on user inputs, then offer several different copies for download. Using PyMuPDF, I got this working as a plain local script (no Streamlit), but I’m having trouble with file handling now that it runs in Streamlit.

I’m currently running this on localhost, if that changes things.

Steps to reproduce

Code snippet:
For user to upload PDF:

script = st.file_uploader("Upload Script (.pdf only)",type="pdf")

Edit and make download button:

if script is not None:
   with fitz.open(stream=script.read(), filetype="pdf") as pdf_file:
      pdf_page_count = pdf_file.page_count   
      for page in range(pdf_page_count):  
         page_obj = pdf_file[page] 
         content_of_page = pdf_file.get_page_text(page) 
         match_word = character_list[0] 
         content_of_page = page_obj.get_text("words",sort=False)  #get rect for all words
         for word in content_of_page:
            if word[4] == match_word:
               rect_comp = fitz.Rect(word[0],word[1],word[2],word[3])
               highlight = page_obj.add_highlight_annot(rect_comp)
               highlight.set_colors(stroke=[0, 1, 0.8])
               highlight.update()
      st.download_button(
         label="Download Script",
         data=pdf_file,
         file_name="Highlighted Script",
         mime="application/octet-stream"
         )

Expected behavior:

When I run a modified version of the above script from the command prompt, it works and writes out a highlighted script (or scripts, by running the highlight function once for each character/actor pair) using pdf_file.save().

Actual behavior:

The above code gives the following error after a PDF is uploaded:

RuntimeError: Invalid binary data format: <class 'fitz.fitz.Document'>

Traceback:

File "C:\Users\ME\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
File "C:\Users\ME\Desktop\Python Projects\highlighter_web.py", line 98, in <module>
    st.download_button(
File "C:\Users\ME\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\runtime\metrics_util.py", line 332, in wrapped_func
    result = non_optional_func(*args, **kwargs)
File "C:\Users\ME\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\elements\button.py", line 311, in download_button
    return self._download_button(
File "C:\Users\ME\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\elements\button.py", line 355, in _download_button
    marshall_file(
File "C:\Users\ME\AppData\Local\Programs\Python\Python310\lib\site-packages\streamlit\elements\button.py", line 487, in marshall_file
    raise RuntimeError("Invalid binary data format: %s" % type(data))

Debug info

  • Streamlit version: 1.22.0
  • Python version: 3.10.4
  • OS version: Win10
  • Browser version: Brave 1.51.110 (up to date)

Additional information

I seem to be able to manipulate the uploaded PDF in some ways (like in this post), but things seem to fall apart when it comes time to download.

Thanks in advance for your help!

Indeed, pdf_file is a fitz Document object, not the str, bytes, or file-like object that st.download_button expects. You can save the document to an io.BytesIO object instead.

Thanks for this lead! At the risk of sounding stupid, io.BytesIO is new to me. Can I ask how you’d recommend using it? Is it a matter of turning pdf_file into an io.BytesIO object and then putting that object into st.download_button()? (And, if so, how?) :sweat_smile:

Got it to work!! Added the following:

      # Goes inside the with block, after the page loop (needs "import io" at the top of the script)
      output_buffer = io.BytesIO()
      pdf_file.save(output_buffer)
      pdf_bytes = output_buffer.getvalue()
      st.download_button(
         label="Download Script",
         data=pdf_bytes,
         file_name="Highlighted Script.pdf",
         mime="application/pdf"
      )

And it works like a charm! Thank you for your prompt help :slight_smile:
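
For anyone else new to io.BytesIO, here is a minimal sketch of the pattern used above: write into an in-memory buffer, then hand the resulting bytes to st.download_button. The names save_to_bytes and fake_save are made up for illustration; fake_save only stands in for pdf_file.save so the snippet runs without PyMuPDF or Streamlit installed.

```python
import io


def save_to_bytes(save):
    """Run save(buffer) against an in-memory buffer and return the bytes written."""
    buffer = io.BytesIO()
    save(buffer)               # e.g. pdf_file.save, which accepts a file-like object
    return buffer.getvalue()   # plain bytes, which st.download_button accepts


# Hypothetical stand-in for pdf_file.save, so the sketch runs on its own.
def fake_save(out):
    out.write(b"%PDF-1.7\n% pretend PDF body\n")


pdf_bytes = save_to_bytes(fake_save)
print(type(pdf_bytes).__name__, len(pdf_bytes))
```

In the app this would presumably be called as pdf_bytes = save_to_bytes(pdf_file.save) while the document is still open, with pdf_bytes then passed as data= to st.download_button. PyMuPDF's Document.tobytes() should give the same result without the explicit buffer.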

New dilemma: the above works fine on its own, but seems to break if I try to do it multiple times in a row. If I try to put something like:

for n in range(len(actor_list)):
   with fitz.open(stream=script.read(), filetype="pdf") as pdf_file:

It works for the first pass, but then I get EmptyFileError: cannot open empty document. Any idea why that may be?

Gosh I’m sorry for blowing this thread up all by myself. I sorted this out with help from this thread by throwing

script.seek(0, 0)

before the with fitz.open(stream...

No clue why it works, but not complaining!
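
For anyone landing here later, the likely reason is that the uploaded file behaves like an io.BytesIO stream: read() leaves the position at the end, so a second read() returns empty bytes and fitz.open sees an empty document. Rewinding with seek(0) fixes that. The sketch below uses a plain io.BytesIO as a stand-in for the uploaded file (an assumption; the contents are fake), so it runs without Streamlit.

```python
import io

# Stand-in for the uploaded file; the bytes here are made up.
upload = io.BytesIO(b"%PDF-1.7 fake contents")

first = upload.read()    # reads everything and leaves the position at the end
second = upload.read()   # nothing left to read, so this comes back empty

upload.seek(0)           # rewind to the start of the stream
third = upload.read()    # the full contents again

# Alternative: getvalue() returns the whole buffer regardless of position.
all_bytes = upload.getvalue()

print(len(first), len(second), len(third), len(all_bytes))
```

An alternative to calling seek(0) on every pass is to grab the bytes once before the loop (for example with script.getvalue()) and pass that same bytes object to fitz.open(stream=...) for each actor.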
