Adding a long PDF as a custom data source

I think LlamaIndex’s VectorStoreIndex will do the document chunking for you, although you may wish to experiment with different types of chunking. Also take a look at this, which has an ingestion pipeline (run externally from the command line) that would be useful in your case because it includes topic analysis. You may be able to combine ideas from it, my app and your own.
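
To give a feel for what "different types of chunking" could mean, here is a minimal sketch in plain Python comparing two common strategies: fixed-size windows with overlap, and sentence-aware packing. This is not LlamaIndex's own splitter; the function names (`chunk_fixed`, `chunk_sentences`) and the default sizes are made up for illustration, and the sentence splitting is a rough regex rather than a proper tokenizer.

```python
import re


def chunk_fixed(text, size=200, overlap=50):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration, not a LlamaIndex API.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


def chunk_sentences(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    Sentence boundaries are guessed with a simple regex (an assumption).
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "One two. Three four. Five six."
    print(chunk_fixed(sample, size=12, overlap=4))
    print(chunk_sentences(sample, max_chars=20))
```

Fixed-size windows are simple and predictable but can cut a sentence in half; sentence packing keeps units of meaning together at the cost of uneven chunk sizes. For a long PDF, it may be worth trying both on a few pages and checking which gives better retrieval.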

If you’re just experimenting, then you have more than enough to get going. If you want to dive deeper, then your next stop should be spacy-llm.