Adding a long PDF as a custom data source

I built a chatbot primarily from this (extremely helpful) Streamlit blog post. However, my custom data source is my PhD dissertation (a 227 page pdf). I changed the chat_mode to openai so it could pull from both the internet as well as my dissertation.

It seems like the app has not ingested the entire document, though. For example, when I ask:

Can you tell me about Study 3 in Adam Goodkind's thesis?

I get the reply:

I apologize, but without access to the specific details of Adam Goodkind's dissertation, I cannot provide information about Study 3 or any specific studies within the thesis. The summary provided earlier was a general overview of the dissertation's focus on predicting social dynamics using keystroke patterns. For detailed information about specific studies within the dissertation, it would be best to refer to the actual document or reach out to Adam Goodkind directly.

When I ask for a summary of my entire dissertation, it generates a summary that seems to be pulled entirely from the Abstract at the beginning of my thesis, or even from just the first page. I saw a similar issue on this forum and redeployed with Python 3.11, but it did not fix the issue.

How can I get the chatbot to ingest my entire document, both for answering specific questions and summarizing?

In answer to the questions:

  1. I have deployed the app on the Community Cloud, at
  2. Here is the GH repo: GitHub - angoodkind/thesis-chat-v0

Again, my goal is to create a chatbot that can “chat” with my entire thesis. Am I doing this wrong?

Oddly enough, today I came across this article. It may help you mine your thesis to generate a longer summary spanning content from multiple parts across it. I also think the default vector store representation doesn’t create a structure that’s rich enough to be of much use in the RAG stage. What you need is a better representation of the different topics and/or chapters of your thesis and then you can make a more nuanced RAG query to build the context sent to the LLM.
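To make the "better representation" idea concrete, here is a minimal sketch of tagging each piece of text with the chapter it came from, so a later retrieval step could filter or group by chapter. It is plain Python, not tied to any particular vector store. The `Chunk` class, the `split_by_chapter` helper, and the assumption that chapters start on their own line as "Chapter <number> ..." are all hypothetical and would need adapting to the real thesis layout.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    chapter: str  # heading the text was found under
    text: str     # body text belonging to that heading


# Assumption: each chapter heading sits on its own line, e.g. "Chapter 3 Study 2".
HEADING = re.compile(r"^(Chapter\s+\d+[^\n]*)$", re.MULTILINE)


def split_by_chapter(full_text: str) -> list[Chunk]:
    """Split extracted thesis text into chapter-tagged chunks."""
    # With one capture group, re.split returns
    # [text_before_first_heading, heading1, body1, heading2, body2, ...].
    parts = HEADING.split(full_text)
    chunks = []
    front = parts[0].strip()
    if front:
        chunks.append(Chunk("Front matter", front))
    for heading, body in zip(parts[1::2], parts[2::2]):
        chunks.append(Chunk(heading.strip(), body.strip()))
    return chunks


sample = (
    "Abstract text.\n"
    "Chapter 1 Introduction\nIntro body.\n"
    "Chapter 2 Study 1\nStudy body."
)
chunks = split_by_chapter(sample)
for chunk in chunks:
    print(chunk.chapter, "->", len(chunk.text), "characters")
```

The chapter label could then be stored as metadata next to each embedding, which is the kind of richer structure a question like "tell me about Study 3" needs in order to find the right part of the document.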

Have a look at my app “An LLM-based document and data Q&A App (with knowledge graph visualization)”, to see how I’ve used Weaviate as a Vector Store. In this particular release of my app the representation is very simple too, but in other private apps I’ve created for clients, Weaviate has allowed me to define quite sophisticated content class structures against which I am able to do some interesting semantic searches as part of the RAG stage. Feel free to extend the Weaviate class definition (the Weaviate docs are pretty good). I think Pinecone, and other Vector DBs, will have similar capabilities, but I haven’t got any experience with them.

P.S. You’re not doing it wrong. The (extremely helpful) tutorial is only the first step. Like most things, the devil is in the details.

My 2c.

Hi @asehmi , thank you so much for all of this! This stuff looks great, and I want to dive into it.

One quick question: Can all of this either a) be integrated into my Streamlit app, or b) be easily deployed to my personal website? Part of my motivation for the Streamlit app is to build a tool that I can show off to demonstrate my knowledge of chatbots/LLMs to employers.


EDIT: I just actually read your blog post and see you deployed it on Streamlit. Oops… Can the agent design also be deployed on Streamlit?

NEW QUESTION: I want to get started quickly on this, since I’m currently job-searching. Is forking/modifying your app the best way to get a quick start, or is there a better way to get a quick start?

You’re welcome… would love to see how far you go!

Yes, you can build a Streamlit app and embed it in any website that supports iFrames (you’ll have to add ?embed=true to the app URL, see: Deploy your app - Streamlit Docs)
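As a small illustration of the embedding step, the sketch below adds `embed=true` to an app URL and wraps it in an iframe tag. The app URL is a made-up placeholder and the helper names `embed_url` and `iframe_html` are hypothetical; only the `?embed=true` query parameter comes from the advice above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def embed_url(app_url: str) -> str:
    """Return app_url with embed=true set, keeping any existing query parameters."""
    parts = urlsplit(app_url)
    query = dict(parse_qsl(parts.query))
    query["embed"] = "true"
    return urlunsplit(parts._replace(query=urlencode(query)))


def iframe_html(app_url: str, height: int = 600) -> str:
    """Build an iframe snippet that can be pasted into a web page."""
    return f'<iframe src="{embed_url(app_url)}" width="100%" height="{height}"></iframe>'


# Hypothetical URL, used only for illustration.
APP_URL = "https://example-thesis-chat.streamlit.app/"
print(embed_url(APP_URL))
print(iframe_html(APP_URL))
```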

Thanks, yup, I’ve been playing around with the embedding as well. So is forking/modifying your (very nice) app the easiest way to get a “quick and dirty” start?

Yes, that will help. I think I can do much better with the knowledge graph implementation using something like PyVis, which I discovered recently. I wrapped an existing JS graph implementation as a Streamlit Component, and it’s a bit clunky. The layout of the app can also be improved. Still, you can use it as a starting point.

You’ll see I use _set_state_cb with most of my widgets. To understand that see the answer I gave to this user: Streamlit login solution need to click on "Login" button twice to login - #2 by asehmi

Send me a DM if you want some help with Weaviate, and I’ll try to help. I have a PhD thesis too, and other long docs that may get a new lease of life (for me) from your solution 🙂.

Interesting! Good to know! I’ll keep you updated on everything!

If you come across another easy way to ingest a long PDF, I’d love to know. Would it be better to upload multiple short PDFs, e.g. chapter by chapter?


I think LlamaIndex’s VectorStoreIndex will do the document chunking for you, although you may wish to experiment with different types of chunking. Also take a look at this, which has an ingestion pipeline (run externally from the command line) that would be useful in your case because it includes topic analysis. You may be able to combine ideas from it, my app and your own.
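To get a feel for what chunking does to a long document, here is a rough plain-Python sketch of fixed-size chunks with overlap. This is not LlamaIndex's splitter (as far as I know, that one works on tokens and sentence boundaries rather than raw characters); `chunk_text` and its default sizes are hypothetical and only meant for experimenting.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window reaches the end, so there is no redundant tail chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Stand-in for the text extracted from a long PDF.
long_text = "word " * 100_000
pieces = chunk_text(long_text)
print(f"{len(long_text)} characters -> {len(pieces)} chunks")
```

A document of this size ends up as hundreds of chunks, and a retrieval step normally hands only a few of the best-matching ones to the LLM for each question. That would be consistent with a "summarize the whole thesis" request coming back as a summary of the Abstract alone.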

If you’re just experimenting, then you have more than enough to get going. If you want to dive deeper, then your next stop should be spacy-llm.

For some reason the app is now not reading my thesis at all. When I ask a question about my thesis, it says that it doesn’t have access to that document. Is there something simple I can use here to test this?

# Import paths assume the pre-0.10 llama_index package layout used by the tutorial.
import streamlit as st
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import OpenAI

def load_data():
    with st.spinner(text="Loading and indexing Adam's thesis – hang tight! This should take 1-2 minutes."):
        reader = SimpleDirectoryReader(input_dir="./data", recursive=True)
        docs = reader.load_data()
        service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo", temperature=0.5, system_prompt="You are an expert on human-computer interaction and your job is to answer questions about Adam Goodkind's dissertation. Assume that all questions are related to the dissertation. Keep your answers based on facts from the dissertation or from general knowledge – do not hallucinate facts."))
        # Chunk, embed, and index the loaded documents in the default vector store.
        index = VectorStoreIndex.from_documents(docs, service_context=service_context)
        return index

  • Did you delete the doc from ./data?
  • Is it even entering the indexing function? (Use a debugger or print statements.)
  • Test with a small PDF, not your full thesis.
  • Remove the cache decorator for testing/debugging.
  • Use Path from pathlib to construct an absolute path for SimpleDirectoryReader.
  • I’m not sure where the vector store index is being stored or whether you can query it, so I’d replace the default with one that can be queried independently of the RAG pipeline through its own API. That way you can verify that your data has actually been indexed. Try a simple vector store like ChromaDB, and update load_data() to use it.
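For the first few checks in the list above, a quick sanity test that needs nothing beyond the standard library might look like the sketch below. It resolves the data folder to an absolute path and reports which PDFs are actually there. `describe_data_dir` is a hypothetical helper, and the relative `data` folder is an assumption based on the `./data` path in the snippet; inside the app, something like `Path(__file__).parent / "data"` would anchor the path to the script instead of the current working directory.

```python
from pathlib import Path


def describe_data_dir(data_dir: Path, pattern: str = "*.pdf") -> list[str]:
    """Return one human-readable line per matching file, or a single diagnostic line."""
    data_dir = Path(data_dir).resolve()  # absolute path, independent of how it was written
    if not data_dir.is_dir():
        return [f"MISSING directory: {data_dir}"]
    # rglob searches subfolders too, mirroring recursive=True in the reader.
    files = sorted(p for p in data_dir.rglob(pattern) if p.is_file())
    if not files:
        return [f"No files matching {pattern} under {data_dir}"]
    return [f"{p} ({p.stat().st_size} bytes)" for p in files]


# Assumption: the app is started from the folder that contains ./data.
for line in describe_data_dir(Path("data")):
    print(line)
```

If this prints the thesis with a plausible file size, the file is where the reader expects it and the problem is more likely in the indexing or caching step.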