Multi-Modal RAG ChatBot: Your AI-Powered Knowledge Assistant (Streamlit + MindsDB + LangChain + FAISS)

My Quira Quest 14 submission
The Multi-Modal RAG ChatBot is an innovative application designed to enhance your knowledge retrieval experience using PDFs and YouTube videos. Our chatbot provides seamless access to relevant text, images, and video frames based on your queries.

Multi-Modal Retrieval :books::movie_camera:: Instantly fetches text, images, and video frames from static PDFs and YouTube videos to answer your queries.
Nice UI for User Interaction :art:: Enjoy a user-friendly interface that makes interacting with the chatbot smooth and intuitive.
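The multi-modal retrieval idea above can be sketched in a few lines: embed chunks from every modality (text, images, video frames) into one vector space and rank them all against the query. This is an illustrative toy using plain numpy with random vectors standing in for real embeddings; the actual app would use an embedding model and a FAISS index, and all names here are assumptions, not the project's code.

```python
import numpy as np

# Toy "index": pre-computed embeddings for chunks of each modality.
# In the real app these would come from an embedding model and live
# in FAISS; here they are random vectors for illustration only.
rng = np.random.default_rng(0)
DIM = 8
index = [
    {"modality": "text",  "ref": "page 3, paragraph 2", "vec": rng.normal(size=DIM)},
    {"modality": "image", "ref": "figure 1 (table)",    "vec": rng.normal(size=DIM)},
    {"modality": "frame", "ref": "video @ 02:15",       "vec": rng.normal(size=DIM)},
]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, k=2):
    """Return the k nearest chunks, regardless of modality."""
    ranked = sorted(index, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

for hit in retrieve(rng.normal(size=DIM)):
    print(hit["modality"], "->", hit["ref"])
```

Because text, image, and frame chunks share one index, a single query can surface the table image or video frame that best answers it, not just text.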

Future Enhancements:
Dynamic Multi-Modal RAG: Tackling the high computational cost of building a multi-modal vector database for dynamic data.
On-Device Privacy :lock:: Ensuring data never leaves your device for complete privacy and security.
Knowledge Graph Without LLMs: Moving towards a knowledge graph-based approach without relying on LLMs.
Open Source Collaboration :globe_with_meridians:: Encouraging contributions to push the boundaries of machine learning and privacy-centric technology.
On-Device GPU Access: Using local GPUs to build advanced knowledge graphs without relying on any cloud services.

For those interested in using their own resources, we have provided comprehensive Colab tutorials to get you started.
GitHub Repo (give it a star :star:)

Quira Voting Link (I'm requesting everyone's valuable feedback and suggestions for future improvements!)

Detailed YouTube Video (If you like the demo, please like, share, and subscribe!)

Live WebApp:


I’m requesting everyone to provide feedback and suggestions on my work :pray:

I think this needs to have function calling (tools) for internet access. I asked a Q and got a wild guess (non-contextual) A… it didn’t know about RAG.
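The "function calling (tools)" suggestion can be sketched as a simple fallback: answer from retrieved context when the index has something relevant, otherwise dispatch to a registered tool such as a web search instead of letting the model guess. This is a minimal sketch of the pattern, not any particular framework's API; the tool and function names are hypothetical.

```python
from typing import Optional

def web_search(query: str) -> str:
    # Placeholder: a real implementation would call a search API here.
    return f"[web results for: {query}]"

# Registry of available tools the chatbot may call.
TOOLS = {"web_search": web_search}

def answer(query: str, retrieved_context: Optional[str]) -> str:
    """Answer from the index if possible, otherwise fall back to a tool."""
    if retrieved_context:                        # grounded answer from RAG
        return f"Answer based on: {retrieved_context}"
    tool_output = TOOLS["web_search"](query)     # tool fallback for gaps
    return f"Answer based on: {tool_output}"

print(answer("What is zero-shot prompting for RAG?", None))
```

With this kind of fallback, the "Red, Amber, Green" hallucination below would instead trigger a search for the actual meaning of RAG.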

Q: What is zero-shot prompting and how can it be used for RAG?
A: Zero-shot prompting is the simplest technique used with chatbots where they are asked to answer a question or perform a task without being provided with any examples. This technique does not include any examples in the prompt, relying on the chatbot’s knowledge to generate accurate responses. Zero-shot prompting is advantageous as it requires minimal effort, is accessible to everyone without technical expertise, and can be applied to various tasks with versatility. However, it has limitations, such as limited performance compared to other prompting techniques and a reliance on carefully crafted prompts for optimal results.

In terms of using zero-shot prompting for RAG (Red, Amber, Green) analysis, chatbots can be prompted with questions related to assessing the status of a project, task, or any other relevant metric using the RAG rating system without providing specific examples. The chatbot would then generate responses indicating whether the status is red (critical issues), amber (caution), or green (on track). This allows for quick assessments without the need for extensive input examples, making it a convenient method for analyzing and monitoring various indicators in a straightforward manner.


@asehmi But we can't get the relevant image from function calling, right? My intention is to retrieve the relevant frames from the video to improve answer quality. For example, I asked a question about the PDF and the answer was about 90 percent correct and 10 percent wrong, but the interesting part was that it returned the exact image of the table relevant to the question, which gave me a fully satisfactory answer. So sometimes we can't trust text answers alone, right? I also mentioned that dynamic multi-modal RAG needs a strategic approach to building robust multi-modal vector databases for better retrieval.

Here is the proof of why we can't always rely on text alone.

I guess one can ask more precise questions if one knows the content uploaded. I’ve recently explored Agentic AI techniques and the problem of quality assurance and accuracy is often cited as a reason to use agents; some doing data extraction, others doing understanding, and others doing review and verification, before compiling the final report… perhaps with yet another agent. The agents can be given tools to assist in these tasks. (Caveat: Unless you’re using a local LLM, Agentic AI will get expensive pretty quickly because of the multiple agent collaborations and potentially many LLM calls as the intermediate results are refined.)
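The extract → understand → review pipeline described above can be sketched with each "agent" as a plain function; real agents would each wrap LLM calls (which is where the cost multiplies). The agent names and logic here are illustrative assumptions, not a specific framework.

```python
# Each "agent" is a stub function; a real version would make an LLM call.

def extraction_agent(document: str) -> list:
    """Agent 1: pull raw facts out of the source document."""
    return [line.strip() for line in document.splitlines() if line.strip()]

def understanding_agent(facts: list) -> str:
    """Agent 2: turn extracted facts into a draft answer."""
    return " ".join(facts)

def review_agent(draft: str) -> str:
    """Agent 3: verify the draft before it becomes the final report."""
    return draft if draft else "REJECTED: empty draft"

def compile_report(document: str) -> str:
    """Chain the agents; intermediate results flow from one to the next."""
    facts = extraction_agent(document)
    draft = understanding_agent(facts)
    return review_agent(draft)

print(compile_report("RAG grounds answers in context.\nAgents can verify each step."))
```

Note that every hop in the chain is a separate model invocation in practice, which is exactly why the caveat about cost applies unless a local LLM is used.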

Yes, you're right, @asehmi. I recently developed a project that gives a live summary of the top headline news article for the user's country of interest. Normally, to get a summary of a live top-headline article, you would need more than one application. So I used two agents: one agent crawls the web and returns the top headline article URL for the specified country in real time, and another agent takes that URL and produces a summary. I built the entire solution using an agent framework. I highly suggest you check out the deltav website.
I'm also attaching the demo for better understanding. Have a look and let me know any suggestions and feedback.
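The two-agent setup described above (a crawler that finds the headline URL, and a summarizer that condenses it) reduces to a short chain. This is a stubbed sketch under stated assumptions: both agents are placeholder functions, and a real version would use an agent framework with live web access.

```python
def crawler_agent(country: str) -> str:
    """Agent 1: return the top headline URL for a country (stubbed)."""
    # A real crawler would search news sources live; this URL is fake.
    return f"https://example.com/{country}/top-headline"

def summarizer_agent(url: str) -> str:
    """Agent 2: fetch and summarize the article at `url` (stubbed)."""
    return f"Summary of article at {url}"

def news_pipeline(country: str) -> str:
    """Hand the crawler's output straight to the summarizer."""
    url = crawler_agent(country)
    return summarizer_agent(url)

print(news_pipeline("india"))
```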

Demo :film_projector:

I think many use cases, like in the deltav demo, are not good use cases for agentic AI. Sure, you might be able to solve the problem with agents, but they are much more easily solved with less end-user friction using standard techniques. Agents are ideally suited to non-deterministic, fuzzy and unstructured problem solving use cases. Call me a purist, but it’s early days in this space and so many demo apps are really only proving their frameworks are capable of building agent apps, but aren’t actually doing this by solving a difficult fuzzy problem in an innovative way.

Yes, I agree with you, @asehmi.

Not sure why I am tagged here

The reason for tagging you is so that you can check out the project and suggest any improvements in your free time.