GitHub QA Chatbot with Langchain!

Hello, guys!

Our project team made a GitHub Repository ChatBot.
With GitHub QA ChatBot you can

  • Ask questions about a specific GitHub repository
  • Visualize the folder structure and file contents

ChatGPT can answer questions in many fields. However, ChatGPT is not trained on the latest frameworks or the latest code, so we have to study by reading the GitHub repository ourselves. Based on this, we developed the app.

We may have to close our demo app if the OpenAI API key costs too much :smiling_face_with_tear: But we want to share it with many people, who are our strength, and we want to keep this service running for a long time.
And here are the demo app link and GitHub repository link!

For Whom?

  • For anyone who wants to learn a new repository that ChatGPT cannot answer questions about
  • For anyone who wants to learn coding by exploring a GitHub repository


The tech stack is like this:

GitHub REST API - fetching repository info with a DFS algorithm
LangChain - communicating with GPT-3.5-turbo
Vector DB - FAISS (using the MMR algorithm)
Streamlit - deployment

How does it work?

  • QA Chat Bot Page
1. Get the user and repository information through the GitHub REST API, traversing the repository with a DFS algorithm. We end up with data shaped like
{file_name_1: file_content, file_name_2: file_content, file_name_3: file_content}
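The DFS traversal above can be sketched as follows. A real version would call the GitHub REST API contents endpoint per directory; here the repository listing is a mocked nested dict (a hypothetical stand-in) so the traversal logic is runnable offline:

```python
# Sketch of the DFS step: walk a repository tree depth-first and
# collect {file_path: file_content}. Directories are dicts, files are
# strings; in the real app each directory level comes from the GitHub
# REST API instead of an in-memory dict.

def collect_files(tree, path=""):
    """Depth-first walk: return {file_path: file_content}."""
    files = {}
    for name, node in tree.items():
        full = f"{path}/{name}" if path else name
        if isinstance(node, dict):   # a directory: recurse into it
            files.update(collect_files(node, full))
        else:                        # a file: keep its content
            files[full] = node
    return files

# Hypothetical repository listing for illustration
repo = {
    "README.md": "# demo",
    "src": {"app.py": "print('hi')", "utils": {"dfs.py": "def dfs(): ..."}},
}

result = collect_files(repo)
```

The same recursion works whether the children come from a dict or from paginated API responses; only the "list this directory" call changes.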

2. Convert the dictionary data to LangChain Document types.
   We also manually add a document containing the folder structure, built with Python's anytree,
   so the QA chat bot can also answer questions about the folder structure.
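The post uses anytree for this step; here is a dependency-free sketch of the same idea, rendering the file paths from step 1 into an indented tree string that can be appended as one extra document:

```python
# Build an indented folder-structure string from the {path: content}
# dict of step 1. The real app uses anytree; this sketch nests the
# paths into plain dicts and renders them by depth instead.

def render_tree(paths):
    # Nest "a/b/c" paths into {name: subtree} (empty dict = leaf file)
    root = {}
    for p in sorted(paths):
        node = root
        for part in p.split("/"):
            node = node.setdefault(part, {})

    # Walk the nested dict, indenting two spaces per level
    def walk(node, depth=0):
        lines = []
        for name, child in node.items():
            lines.append("  " * depth + name)
            lines.extend(walk(child, depth + 1))
        return lines

    return "\n".join(walk(root))

structure_doc = render_tree(["README.md", "src/app.py", "src/utils/dfs.py"])
```

Feeding this string in as one more document is what lets the retriever answer "what does the folder layout look like?" questions.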

3. Chunk the documents into 1000-token pieces with 0 overlap and index them with FAISS.
   - We wanted to use the Pinecone vector database, but because we are on the free tier,
     we can't handle multiple users with Pinecone.
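The chunking rule (fixed size, zero overlap) can be sketched like this. LangChain's splitters count tokens; this offline sketch splits on characters as an approximation, and `chunk` is a hypothetical helper, not the app's actual splitter:

```python
# Minimal fixed-size chunking with configurable overlap. With
# overlap=0 (as in the app) consecutive chunks share no text; a
# positive overlap would repeat the tail of each chunk at the head
# of the next.

def chunk(text, size=1000, overlap=0):
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

pieces = chunk("a" * 2500, size=1000, overlap=0)
```

Each piece would then be embedded and added to the FAISS index, which serves retrieval with `search_type="mmr"`.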

4. Answer questions with the LangChain QA retriever over the FAISS index.

  • Folder Structure Page
1. Get the data with pip's anytree package.
2. With DFS, we build a tree structure.
3. For visualization, we used the streamlit-agraph library.
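The visualization step boils down to turning the file paths into node and edge lists. streamlit-agraph takes `Node` and `Edge` objects; this sketch uses plain dicts as stand-ins so the traversal is runnable offline:

```python
# Turn file paths into the node/edge lists a graph view needs: one
# node per path segment, one edge from each parent directory to its
# child. Plain dicts stand in for streamlit-agraph's Node/Edge objects.

def to_graph(paths):
    nodes, edges = [], []
    seen = set()
    for p in paths:
        parts = p.split("/")
        for i, part in enumerate(parts):
            node_id = "/".join(parts[:i + 1])
            if node_id not in seen:       # add each node only once
                seen.add(node_id)
                nodes.append({"id": node_id, "label": part})
            if i > 0:                     # link parent dir -> child
                edges.append({"source": "/".join(parts[:i]),
                              "target": node_id})
    return nodes, edges

nodes, edges = to_graph(["src/app.py", "src/utils/dfs.py"])
```

In the app, each dict would become `Node(id=..., label=...)` / `Edge(source=..., target=...)` before being handed to `agraph`.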

What are the Limitations?

- We use the GitHub REST API, which is limited to 5,000 requests per hour.
- The OpenAI API key costs money.
- So far, we can attach a memory to the LangChain QA retriever, but the bot can't actually make use of what's in that memory.
- If the repository you want to analyze has lots of files, it can take a long time.
 

Finally, I want to say thanks to the Streamlit community.
When there was something I didn't know, I asked questions in the Streamlit community,
and the people there responded kindly.


Hi @aza1200,

Thanks for sharing! Pretty cool app.


Really cool, I like its simplicity.

Until now , we can get a memory in the LangChain QA Retriever, but it can’t utilize based on what’s in its memory.

What do you mean by this?


Oh, that means …

Although we followed the LangChain memory buffer approach (ConversationalRetrievalChain + Memory · Issue #2303 · langchain-ai/langchain · GitHub),
if I ask the chatbot what my most recent question was, it can't answer that.
That means my chatbot doesn't have the ability to utilize the memory of what I asked before.

I guess the reason is Streamlit's characteristic of rerunning the script every time a button is clicked.

We embedded the code for the LangChain memory, but it doesn't work right now … haha

And here is the code:

    # Keep the conversation memory in st.session_state so it survives
    # Streamlit's script reruns. Note: st.session_state['chat_memory']
    # raises a KeyError if the key was never set, so check membership first.
    if 'chat_memory' not in st.session_state:
        st.session_state['chat_memory'] = ConversationBufferMemory(
            memory_key="chat_history", return_messages=True
        )
    memory = st.session_state['chat_memory']

    qa_chain = ConversationalRetrievalChain.from_llm(
        llm=open_ai_model,
        memory=memory,
        retriever=retriever,
        get_chat_history=lambda h: h,
        verbose=True,
    )

Thanks for your comment! :grinning: