Finding your look-alikes with semantic search

Do you want to find like-minded people on Hacker News with a similar commenting history?

We've got you covered!

In this post, you'll learn how to build a Doppelgänger app in three simple steps:

  1. Create a vector database in Pinecone.
  2. Build an app in Streamlit.
  3. Combine the two together.

Can't wait and want to see how it works? Try the app right here.

But before we get into building it, let's answer one question...

Why a Doppelgänger app?

Searching for your celebrity doppelgänger isn’t a new idea. In fact, it’s so unoriginal that no one has updated the celebrity-face dataset in three years!

But we weren't looking for celebrities. We were looking for users with matching comment histories—Hacker News "celebrities" like patio11, tptacek, and pc.

At Pinecone, we've built a vector database that makes it easy to add semantic search to production applications. We were intrigued by the idea of making a semantic search app for Hacker News. Could it compare the semantic meaning of your commenting history with the histories of all the other users?

So we thought, "How about the doppelgänger idea but for Hacker News?"

It took a few hours to build it. Most of that time went to converting raw data into vector embeddings (more below) and to debating which users to feature as examples. The app got a lot of attention on Hacker News (Surprise!), getting thousands of visitors and 215 comments. Many people asked how it works, so here's an inside look at how we made it and how you can make your own version.

Step 1. Create a database in Pinecone

1. Retrieve the data

Collect the data from the publicly available dataset on BigQuery. Get every comment and story from the last three years that hasn't been deleted or labeled as "dead" (that is, killed by software, moderators, or user flags).
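As a rough sketch, the retrieval could be expressed as the query below. The table name `bigquery-public-data.hacker_news.full` and its columns (`by`, `type`, `title`, `text`, `deleted`, `dead`, `timestamp`) are assumptions about the public dataset's schema, and `build_query` is a hypothetical helper, so check both against the dataset before running anything:

```python
def build_query(years: int = 3) -> str:
    """Return SQL selecting live comments and stories from the last `years` years."""
    if years < 1:
        raise ValueError("years must be a positive integer")
    # Assumed table and column names; verify them against the public dataset.
    return f"""
SELECT `by` AS username, type, title, text
FROM `bigquery-public-data.hacker_news.full`
WHERE type IN ('comment', 'story')
  AND `by` IS NOT NULL
  AND deleted IS NOT TRUE
  AND dead IS NOT TRUE
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {years * 365} DAY)
""".strip()


print(build_query())
```

The function only builds the SQL string, so you can paste the output into the BigQuery console or pass it to whichever client you prefer.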

2. Prepare the data

Collect and merge all available data for each user—with no additional processing steps and no weights added to comments or stories.

You'll face two limitations:

  1. All comments and stories count equally, whether they're recent or old, popular or ignored.
  2. It's hard to tell exactly why a user was matched with someone else if their interests have changed in the last three years.
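The merge itself can be sketched in a few lines. The record layout (`username`, `title`, `text`) and the `merge_user_text` helper are assumptions made for illustration, not the exact code we ran:

```python
from collections import defaultdict


def merge_user_text(items):
    """Concatenate every story title and comment text per user, unweighted."""
    merged = defaultdict(list)
    for item in items:
        # A story has a title; a comment has text. Keep whichever is present.
        for field in ("title", "text"):
            value = item.get(field)
            if value:
                merged[item["username"]].append(value)
    return {user: " ".join(parts) for user, parts in merged.items()}


items = [
    {"username": "alice", "title": "Show HN: My vector index", "text": None},
    {"username": "bob", "title": None, "text": "Nice write-up."},
    {"username": "alice", "title": None, "text": "Thanks for the feedback!"},
]
documents = merge_user_text(items)
print(documents["alice"])  # Show HN: My vector index Thanks for the feedback!
```

Because every piece of text is appended with the same weight, this sketch has both limitations listed above.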

3. Create embeddings

Create a single embedding for each user by averaging word embeddings from the Komninos and Manandhar model (this took us three hours). Averaging word vectors is much faster than running heavier state-of-the-art models such as the commonly used BERT.
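To make "average word embeddings" concrete, here's a minimal pure-Python sketch. The toy vocabulary, the whitespace tokenizer, and the `embed_text` helper are illustrative assumptions; the real model ships its own vocabulary and tokenization:

```python
def embed_text(text, word_vectors, dim):
    """Average the vectors of known words; unknown words are skipped."""
    total = [0.0] * dim
    count = 0
    for token in text.lower().split():
        vector = word_vectors.get(token)
        if vector is None:
            continue
        total = [t + v for t, v in zip(total, vector)]
        count += 1
    # Text with no known words gets a zero vector.
    return [t / count for t in total] if count else total


# Toy 3-dimensional vectors for illustration only.
toy_vectors = {
    "vector": [1.0, 0.0, 0.0],
    "search": [0.0, 1.0, 0.0],
}
print(embed_text("Vector search rocks", toy_vectors, dim=3))  # [0.5, 0.5, 0.0]
```

There's no neural network forward pass per document, just lookups and an average, which is where the speed comes from.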

4. Insert the data

Create a new vector index and insert the data. Our total index size (the number of inserted embeddings) was around 230,000. We used cosine similarity as it's more intuitive and widely used with word vectors. Each data point was represented as a single tuple that contained a user ID and a corresponding vector. Each vector contained 300 dimensions or “features.”
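If cosine similarity is new to you, this small sketch shows what the index computes for a pair of vectors. The `cosine_similarity` helper and the toy two-dimensional data are assumptions for illustration; Pinecone does this for you on the server side:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


# Each data point is a (user ID, vector) tuple; toy 2-dimensional vectors here.
items = [("alice", [1.0, 0.0]), ("bob", [2.0, 0.0]), ("carol", [0.0, 3.0])]
print(cosine_similarity(items[0][1], items[1][1]))  # 1.0
print(cosine_similarity(items[0][1], items[2][1]))  # 0.0
```

Note that alice and bob score 1.0 even though their vectors differ in length: cosine similarity only cares about direction.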

5. Query Pinecone

Fetch the embedding stored for a user ID, then query Pinecone with that embedding. Pinecone will return the top 10 users with the most similar embeddings.
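Conceptually, the query is a nearest-neighbor lookup. The brute-force sketch below is only meant to show what "top k most similar" means; it is not how Pinecone searches internally, and `top_k_matches` and the toy vectors are assumptions for illustration:

```python
import heapq


def top_k_matches(query_id, vectors, k=10):
    """Return the k most similar users to query_id, excluding query_id itself."""
    query = vectors[query_id]

    def score(item):
        # Dot product; equals cosine similarity when vectors are unit length.
        return sum(x * y for x, y in zip(query, item[1]))

    candidates = ((uid, vec) for uid, vec in vectors.items() if uid != query_id)
    best = heapq.nlargest(k, candidates, key=score)
    return [(uid, round(score((uid, vec)), 3)) for uid, vec in best]


# Toy unit-length vectors for illustration.
vectors = {
    "alice": [1.0, 0.0],
    "bob": [0.8, 0.6],
    "carol": [0.0, 1.0],
}
print(top_k_matches("alice", vectors, k=2))  # [('bob', 0.8), ('carol', 0.0)]
```

Scanning every vector like this scales linearly with the number of users, which is the cost a vector index is designed to avoid.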

Note: The user experience can be improved by focusing on more recent data and by taking age and karma (points on Hacker News) into consideration when creating embeddings.

Step 2. Build the app in Streamlit

The above summarized the data preparation and the database configuration steps. See the Pinecone quickstart guide for the step-by-step instructions on setting up a vector database.

With the data vectorized and loaded into Pinecone, you can now build an app in Streamlit to let anyone query that database through the browser.

1. Install Streamlit

Install Streamlit by running:

pip install streamlit

To see some examples of what Streamlit is capable of, run:

streamlit hello

2. Create a base Streamlit app

Create a base class to represent your Streamlit app. It'll contain a store and an effect object. You'll use the effect object to initialize Pinecone and to save the index name in the store. Next, add a render method to handle the page layout.

In a Streamlit app, each user action prompts the screen to be cleared and the main function to be run. Create the app and call render. In render, use st.title to display a title, then call render on the home page.

import streamlit as st


class App:
	title = "Hacker News Doppelgänger"

	def __init__(self):
		# The store holds configuration; the effect object talks to Pinecone.
		self.store = AppStore()
		self.effect = AppEffect(self.store)
		self.effect.init_pinecone()

	def render(self):
		st.title(self.title)
		PageHome(self).render()


# Streamlit reruns this script from the top on every user action.
if __name__ == "__main__":
	App().render()

3. Create Store and Effects

The store will be used to hold all the data needed to connect to Pinecone. To connect to a Pinecone index, you'll need your API key and the name of your index. You'll take this data from environment variables.

To set these locally, run:

export PINECONE_API_KEY=<api-key> && export PINECONE_INDEX_NAME=<index-name>

In a published Streamlit app, you can set these during the creation process or by changing the settings on a running app. The store then reads them at startup:

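import os
from dataclasses import dataclass
from typing import Optional

API_KEY = os.getenv("PINECONE_API_KEY")
INDEX_NAME = os.getenv("PINECONE_INDEX_NAME")


@dataclass
class AppStore:
	# Annotations make these real dataclass fields; os.getenv returns None if unset.
	api_key: Optional[str] = API_KEY
	index_name: Optional[str] = INDEX_NAME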
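Since `os.getenv` quietly returns `None` for a missing variable, you may want to fail early with a clear message instead. Here's one possible sketch; `require_env` is a hypothetical helper, not part of Streamlit or Pinecone:

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable or raise a clear error."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Example with an explicit mapping so the sketch runs without real credentials.
fake_env = {"PINECONE_API_KEY": "demo-key", "PINECONE_INDEX_NAME": "demo-index"}
print(require_env("PINECONE_INDEX_NAME", fake_env))  # demo-index
```

In the app you'd call it without the second argument, for example `require_env("PINECONE_API_KEY")`.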

Use the AppEffect class to connect your app to Pinecone (with init) and to the index (docs):

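import pinecone


class AppEffect:
	def __init__(self, store: AppStore):
		self.store = store

	def init_pinecone(self):
		# Authenticate the Pinecone client with the API key from the store.
		pinecone.init(api_key=self.store.api_key)

	def init_pinecone_index(self):
		# Return a handle to the index named in the store.
		return pinecone.Index(self.store.index_name)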

4. Lay out the page

Create and fill out the render method of the PageHome class.

First, use st.markdown to display instructions. Under it, display the buttons for suggested usernames. Use st.beta_columns to organize Streamlit elements in columns and st.button to place a clickable button on the page.

If the app's last action was clicking on that button, then st.button will return True. Store that username in st.session_state so the value persists between renderings:

def render_suggested_users(self):
	st.markdown("Try one of these users:")
	columns = st.beta_columns(len(SUGGESTED_USERNAMES))
	for col, user in zip(columns, SUGGESTED_USERNAMES):
		with col:
			if st.button(user):
				st.session_state.username = user

Below the suggested users, show a text box where the user can enter any username, plus a submit button to start the search.

To do this, use st.form with st.text_input and st.form_submit_button. If you have a selected username saved in st.session_state.username, put that value in the text box. Otherwise, leave it empty for user input.

Now, return the value from st.form_submit_button. It'll return True if the user clicked the submit button on the last run:

def render_search_form(self):
	st.markdown("Or enter a username:")
	with st.form("search_form"):
		if st.session_state.get('username'):
			st.session_state.username = st.text_input("Username", value=st.session_state.username)
		else:
			st.session_state.username = st.text_input("Username")
		return st.form_submit_button("Search")

Once the user searches, render the results. Use st.spinner to show a progress indicator to the user while loading the results. Because of Pinecone's blazing-fast search speeds, the loading icon won't be visible for long!

To complete the search, fetch the user from your Pinecone index using the entered username as the ID. No vector for the user? That means they didn't have any activity on Hacker News in the last three years, so you'll see an error message.

If you find a user, query Pinecone for the closest matches. Use a Markdown table to display the results and include a link to their Hacker News comment history as well as the proximity score for each result:

def render_search_results(self):
	with st.spinner("Searching for " + st.session_state.username):
		# fetch takes a list of IDs
		result = self.index.fetch(ids=[st.session_state.username])
		has_user = len(result.vector) != 0
	if not has_user:
		return st.markdown("This user does not exist or does not have any recent activity.")
	with st.spinner("Found user history, searching for doppelgänger"):
		# Request 11 matches; the filter below drops the user's own entry
		closest = self.index.query(queries=result.vector, top_k=11)
	results = [{'username': id, 'score': round(score, 3)}
			for id, score in zip(closest.ids, closest.scores)
			if id != st.session_state.username][:10]
	result_strings = "\n".join([
		f"|[{result.get('username')}](<https://news.ycombinator.com/threads?id={result.get('username')}>)|{result.get('score')}|"
		for result in results
	])
	# Join the rows without leading whitespace so Markdown parses them as a table
	markdown = "\n".join([
		"| Username | Similarity Score |",
		"|----------|------:|",
		result_strings,
	])
	with st.beta_container():
		st.markdown(markdown)
Step 3. Combine the two together

You're almost done! All that's left is to tie it all together in a single render method:

class PageHome:
	def __init__(self, app):
		self.app = app

	@property
	def index(self):
		return self.app.effect.init_pinecone_index()

	def render(self):
		self.render_suggested_users()
		submitted = self.render_search_form()
		if submitted:
			self.render_search_results()

Congratulations! 🥳

You now have a fully functioning Hacker News Doppelgänger app. Run streamlit run app.py and navigate to localhost:8501 to see your app in action.

Wrapping up

Thank you for reading this post. We're very excited to have shared this with you and we hope this inspires you to build your own semantic search application with Pinecone and Streamlit.

Have questions or improvement ideas? Please leave them in the comments below or send them to info@pinecone.io or @pinecone_io.

Happy app-building! 🎈


This is a companion discussion topic for the original entry at https://blog.streamlit.io/p/0052727b-3e86-4cfd-9a78-0d212469f1a9/