Embed pdf files that are larger than 2MB

Hi all,

I am trying to embed a PDF viewer using:

def displayPdf(self):
        # Read the whole PDF and base64-encode it for use in a data: URI
        with open(self.pathDisplayPdf, "rb") as pdfFile:
            base64Pdf = base64.b64encode(pdfFile.read()).decode("utf-8")
        pdfDisplay = StrParser.getEmbeddedPdf(base64Pdf=base64Pdf, height=1000)
        self.placeHolder.markdown(pdfDisplay, unsafe_allow_html=True)

where getEmbeddedPdf is defined as:

def getEmbeddedPdf(base64Pdf, height: int):
        return (
            f'<embed src="data:application/pdf;base64,{base64Pdf}" width="100%" height="{height}" type="application/pdf">'
        )

This method works for smaller files (up to roughly 1.5-2 MB) but fails to load larger ones (over roughly 2 MB): the screen just turns black.

So my questions are:

  1. Is there a more elegant way to embed a PDF that is not limited by file size?
  2. What could be the problem with my current method?

humbly,
Don
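
On question 2, one thing worth checking is how large the embedded string actually gets. Base64 turns every 3 bytes into 4 characters, so a 1.5 MB file becomes a data: URI of roughly 2 MB. That lines up with the threshold described above and would be consistent with a browser-side cap on data: URI length, though that cause is an assumption and not something confirmed here. A minimal sketch of the arithmetic (the function names are made up for illustration):

```python
import base64


def base64_length(raw_bytes: int) -> int:
    """Number of characters in the padded base64 encoding of raw_bytes bytes."""
    # Every group of up to 3 input bytes becomes exactly 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


def data_uri_length(raw_bytes: int, mime: str = "application/pdf") -> int:
    """Total length of a data: URI that carries raw_bytes bytes of payload."""
    prefix = f"data:{mime};base64,"
    return len(prefix) + base64_length(raw_bytes)


if __name__ == "__main__":
    one_mb = 1024 * 1024
    for size_mb in (1.0, 1.5, 2.0):
        raw = int(size_mb * one_mb)
        uri_mb = data_uri_length(raw) / one_mb
        print(f"{size_mb} MB file -> {uri_mb:.2f} MB data URI")
```

Comparing `data_uri_length(os.path.getsize(path))` for a file that loads and one that does not should show whether the failures cluster around one encoded size.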

I do a lot of work with big PDFs. It might be easier to split your big PDFs into individual pages and have the getEmbeddedPdf function cycle through them. Splitting is easy using Acrobat or pdftk (which can do it on the fly as you prepare your view).

Thanks for your reply!
Following your suggestion, I tried this:

self.doc = fitz.open(self.pathToPdf)

and

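def getPdfContent(self, page=1):
        # select() keeps only the listed pages (0-based) and modifies self.doc in place
        self.doc.select([page])
        # write() returns the PDF as bytes (named tobytes() in newer PyMuPDF releases)
        return self.doc.write()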

and pass the page-specific bytes content to:

def displayPdf(self):
        base64Pdf = base64.b64encode(self.getPdfContent()).decode("utf-8")
        pdfDisplay = StrParser.getEmbeddedPdf(base64Pdf=base64Pdf, height=1000)
        st.markdown(pdfDisplay, unsafe_allow_html=True)

But again, this only worked as expected with smaller PDFs, where it shows just the specified page. Larger files still don’t load.
I guess I could save each page as a temporary PDF file on the local drive and read its content back, but for various reasons I’d rather avoid that.
I suspect there is some timeout mechanism when loading PDFs, and I am trying to figure out how to get around it.

Again thanks,
Don
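
For reference, a single page can be copied into a fresh in-memory document without touching the open one and without a temp file. This is only a sketch, assuming a recent PyMuPDF where the methods are called `insert_pdf` and `tobytes` (older releases used `insertPDF` and `write`); `to_zero_based` and `get_page_bytes` are hypothetical helper names:

```python
def to_zero_based(page_number: int, page_count: int) -> int:
    """Convert a 1-based page number to the 0-based index PyMuPDF uses."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


def get_page_bytes(pdf_path: str, page_number: int = 1) -> bytes:
    """Return a one-page PDF, as bytes, holding the requested page."""
    import fitz  # PyMuPDF; imported lazily so to_zero_based works without it

    with fitz.open(pdf_path) as source:
        index = to_zero_based(page_number, len(source))
        single = fitz.open()  # new, empty PDF held in memory
        try:
            # Copy just the one page; the source document is left unchanged.
            single.insert_pdf(source, from_page=index, to_page=index)
            return single.tobytes()
        finally:
            single.close()
```

The result can be passed to `base64.b64encode` exactly as before. If a single page is itself large (a high-resolution scan, say), its encoded size may still be too big to display this way.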

Yes, I don’t think PDFs “stream”; you need to hold the whole structure in memory somehow. I fully understand that splitting the PDFs into many small page files is ugly, but it would probably work. You could also turn each page into an image.
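
A rough sketch of the image route, assuming PyMuPDF is installed and recent enough to have `load_page`, `get_pixmap` and `Pixmap.tobytes` (older releases used camelCase names). `zoom_for_dpi` and `render_page_png` are hypothetical names, and the default of 150 dpi is an arbitrary choice:

```python
POINTS_PER_INCH = 72  # PDF page coordinates are measured in points


def zoom_for_dpi(dpi: float) -> float:
    """Scale factor that renders a PDF page at the requested resolution."""
    return dpi / POINTS_PER_INCH


def render_page_png(pdf_path: str, page_number: int = 1, dpi: float = 150) -> bytes:
    """Render one page (1-based page_number) of a PDF to PNG bytes."""
    import fitz  # PyMuPDF; imported lazily so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_number - 1)  # PyMuPDF pages are 0-based
        pixmap = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        return pixmap.tobytes("png")
```

In Streamlit the bytes could then be shown with something like `st.image(render_page_png(path, 1))`, which avoids building the data: URI by hand. The trade-off is that text in the rendered page is no longer selectable.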