DocDocGo: chatbot does "infinite" web research, creates KBs from websites or your files

Hi all!

I made an app that can do in-depth Internet research for you, generating a report with every research iteration, then answer any follow-up questions using all of the fetched sites as a knowledge base (KB).

It also allows you to create a KB out of your local files and chat with them. If you have some dense documentation files that you want to ask questions about, just upload them and chat away!

DocDocGo screenshot 1

By default the bot operates using GPT-3.5, but you can enter your own OpenAI API key, which unlocks the option to switch to GPT-4. Also, all users sharing the default key share the KBs they create, but if you use your own key, it serves as your unique user ID and your subsequent KBs won’t be visible to other users.

Not sure how to get started? No problem, just ask DocDocGo itself! It comes prepackaged with a default KB of its own docs, so it can answer questions about its own features.

This is a work in progress, and there’s definitely room for improvement. Any feedback is always welcome. Links:


Very cool!


Thanks! I recently redesigned the algorithm for combining results from many sites into a report - now the way to do “infinite” research is:

/research auto [number of desired iterations]

Or, as before, ask the bot itself for help using it :grin:

Thanks for sharing!

So much there. Great app. Thanks for sharing.

Thanks, @asehmi @WendyOngg, and please stay tuned for more improvements of the “infinite” research functionality and other features!

Promised improvements delivered - a new version of DocDocGo is out, with a much improved RAG algorithm and a more intuitive “infinite” research command.

“Infinite” research
The previous “/research auto N” command still works, but it led to some confusion about the right N to use. So now we have the more straightforward “/research deeper” command, which simply doubles the number of websites the bot “reads” and incorporates into its final report.

So the simplest way to do web research is now as follows:
a. “/research legal arguments for and against disqualifying Trump”
b. “/research deeper” to double the number of sources OR “/research deeper N” to double them N times
c. Read the final report and chat with the resulting knowledge base, e.g. "Which legal scholars have argued for disqualification?"

Improved RAG
I have revamped the algorithm that selects the document excerpts to submit to the LLM. The new algorithm is a custom version of parent document retrieval with some twists of my own, yielding more accurate answers and a lower chance of missing important information.
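For readers unfamiliar with the technique: the core idea of parent document retrieval is to index small chunks (which match queries precisely) but hand the LLM the larger parent passage each chunk came from (which preserves context). Here is a toy, self-contained sketch of that idea, using a naive word-overlap score instead of embeddings; all names are made up and this is not DocDocGo's actual code:

```python
# Toy sketch of parent-document retrieval (NOT DocDocGo's real implementation):
# small chunks are scored against the query, but the full parent passage
# of each top-scoring chunk is what gets returned for the LLM's context.

def chunk(text, size=40):
    """Split text into small fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def score(query, chunk_text):
    """Naive relevance score: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk_text.lower().split()))

def retrieve_parents(query, parents, top_k=2):
    """Score every child chunk, then return the distinct parent passages
    of the best-matching chunks, most relevant first."""
    scored = []
    for parent in parents:
        for child in chunk(parent):
            scored.append((score(query, child), parent))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    seen, results = set(), []
    for s, parent in scored:
        if s > 0 and parent not in seen:
            seen.add(parent)
            results.append(parent)
        if len(results) == top_k:
            break
    return results

docs = [
    "Parent document retrieval returns the full passage around a matching chunk.",
    "Chroma is a vector database that can run locally or in a Docker container.",
]
print(retrieve_parents("docker container database", docs))
```

A real implementation would use vector embeddings instead of word overlap, but the chunk-to-parent mapping shown here is the essential mechanism.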

Give it a try and remember - you can always ask the bot itself for help navigating all its features and commands (as long as the default collection docdocgo-documentation is selected).

[Edit: You explained this in the .env file example] I’m a little confused as to what the docker container is running. Can you explain please?

I have Python 3.9, so I can’t run the Streamlit app. It seems your Streamlit app uses Python 3.10+'s pipe operator for union types (and for dict merges). I will have to convert those to typing.Union syntax.

Looks like you already found the answer, but yes, the Docker container is running the Chroma database. However, I made this optional: if you copy .env.example to .env and fill in your OpenAI API key (the only required value, though a couple of others are recommended), you can run the app without Chroma running in a container, and the database will simply live on your local drive.
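For reference, a minimal way to try both setups (the image name and port below are Chroma's standard defaults; the project's own compose setup may differ):

```shell
# Option 1: run the Chroma vector database in a container (default port 8000)
docker run -d -p 8000:8000 chromadb/chroma

# Option 2: skip Docker entirely; copy the example env file, add your
# OpenAI API key, and the app persists the database on local disk
cp .env.example .env
```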

I am using 3.11 throughout, even outside the Streamlit component (which is optional). But I think if you make the modifications you mentioned then it may work with 3.9. I have not tested that though.
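The conversion mentioned above is mechanical; here's a hedged sketch of what it looks like (function names are made up for illustration):

```python
# Converting Python 3.10+ "X | Y" annotations to 3.9-compatible typing syntax.
from typing import Dict, Optional, Union

# Python 3.10+ only:  def lookup(table: dict[str, str], key: str) -> str | None
# Python 3.9-compatible equivalent:
def lookup(table: Dict[str, str], key: str) -> Optional[str]:
    return table.get(key)

# Python 3.10+ only:  def parse(raw: str | bytes) -> str
def parse(raw: Union[str, bytes]) -> str:
    return raw.decode() if isinstance(raw, bytes) else raw

# Dict merging: the "a | b" operator works from 3.9 onward, and the
# unpacking form below works even earlier.
def merge(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    return {**a, **b}

print(parse(b"hello"), merge({"x": 1}, {"y": 2}))
```

Note that the `X | Y` *annotation* syntax (PEP 604) requires 3.10+, while the dict merge operator `|` (PEP 584) already works on 3.9, so only the annotations strictly need converting.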

Yes, it worked! It’s a very nice application that I’m enjoying using. I spent a whole evening with it researching dynamic time warping. Once I get my head around its design, it’ll help me add conversational elements to apps I’m working on. I particularly like your short commands for switching modes and the care you took to handle failure events (formatting issues, exceptions, etc.), which are so common in LLM (networked) apps. Many thanks for sharing.


@asehmi I wasn’t familiar with dynamic time warping, and it sounded very science-fictiony, but I had ChatGPT cure me of my ignorance. I’m glad you’re finding the research feature useful!

A relatively simple but useful new feature is out - you can now:

  • get a summary of a website/PDF (run “/summarize https://example.com”)
  • ask follow-up questions
  • add another website/PDF to the knowledge base, and another… (run /summarize again, or “/ingest https://example2.com” if you don’t want a summary)

When you are done, just rename the resulting collection to whatever you want or delete it:

  • “/db rename my-cool-collection” OR
  • /db delete --current

Announcing an awesome new update: you can now ask DocDocGo to redo research reports, if the initial format or content wasn’t what you had in mind, while still keeping all of the content it has ingested into the KB.

This is really useful, because one annoyance with the “infinite” research has been that DDG finds and ingests lots of useful information, but writes a report that’s too long or too short, has a different format from what you had in mind, or emphasizes the wrong information.

Previously the only recourse was to start a new research run from scratch, but now you can quickly rewrite the report(s) using the already-ingested content, which is ~10x faster:

  1. Run /research claims Putin made in interview with Tucker Carlson - your original research query (just an example)
  2. Run /research view stats - to review your query and auto-generated report type
  3. Run /research set-report-type Numbered list, with brief description of claim, fact-check, and URL of source - set your custom report format
  4. Run /research startover - to quickly rewrite the report using already fetched content
  5. (Optional) Run /re deeper N - “infinite” research feature to expand the number of sources 2^N-fold
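To make the doubling math in the last step concrete (the numbers here are made up, not DocDocGo's actual defaults):

```python
def sources_after(initial: int, n: int) -> int:
    """Source count after running /research deeper N: each round doubles it."""
    return initial * 2 ** n

# E.g. a run that started with 6 sources, deepened 3 times:
print(sources_after(6, 3))
```

So even a small N grows the knowledge base quickly, which is why rewriting reports from already-fetched content is so much cheaper than re-fetching.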

Another new feature is the ability to ingest into an existing collection - useful if you want to supplement a KB with your local docs or specific websites.

Remember, if you don’t know how to do something or forgot a command, just tell the bot what you want to do, prefixed with /help, e.g.:

/help how can I ingest more content into a collection?