Search Paul Graham essays with Siri — Building an embedding-powered product in a few lines of code


Searching Paul Graham essays with Siri

  1. Deploy Embedbase locally or on Google Cloud Run
  2. Build a crawler for Paul Graham essays using Crawlee that ingests the data into Embedbase
  3. Build an Apple Siri Shortcut that lets you search Paul Graham essays through Embedbase with voice & natural language

Build time!

Tech stack

Cloning the repo

git clone https://github.com/another-ai/embedbase
cd embedbase

Setting up Pinecone

Creating a Pinecone index
Pinecone index configuration
Getting Pinecone API key

Configuring OpenAI

Creating an OpenAI key
Getting OpenAI organization ID

Creating your Embedbase config

# embedbase/config.yaml
# https://app.pinecone.io/
pinecone_index: "my index name"
# replace this with your environment
pinecone_environment: "us-east1-gcp"
pinecone_api_key: ""

# https://platform.openai.com/account/api-keys
openai_api_key: "sk-xxxxxxx"
# https://platform.openai.com/account/org-settings
openai_organization: "org-xxxxx"

Running Embedbase

Starting Docker on Mac
docker-compose up

(Optional) Cloud deployment

# login to gcloud
gcloud auth login

# Get your Google Cloud project ID
PROJECT_ID=$(gcloud config get-value project)

# Enable container registry
gcloud services enable containerregistry.googleapis.com

# Enable Cloud Run
gcloud services enable run.googleapis.com

# Enable Secret Manager
gcloud services enable secretmanager.googleapis.com

# create a secret for the config
gcloud secrets create EMBEDBASE_PAUL_GRAHAM --replication-policy=automatic

# add a secret version based on your yaml config
gcloud secrets versions add EMBEDBASE_PAUL_GRAHAM --data-file=config.yaml

# Set your Docker image URL
IMAGE_URL="gcr.io/${PROJECT_ID}/embedbase-paul-graham:0.0.1"

# Build the Docker image for cloud deployment
docker buildx build . --platform linux/amd64 -t ${IMAGE_URL} -f ./search/Dockerfile

# Push the docker image to Google Cloud Docker registries
# Make sure to be authenticated https://cloud.google.com/container-registry/docs/advanced-authentication
docker push ${IMAGE_URL}

# Deploy Embedbase to Google Cloud Run
gcloud run deploy embedbase-paul-graham \
--image ${IMAGE_URL} \
--region us-central1 \
--allow-unauthenticated \
--set-secrets /secrets/config.yaml=EMBEDBASE_PAUL_GRAHAM:1
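
Once the service is deployed, it can help to sanity-check it from a short script before wiring up the crawler. This is a hedged sketch, not part of the original setup: the vault id "paul" and the `{ query }` payload mirror the shapes used elsewhere in this post, and `CLOUD_RUN_URL` stands for whatever URL `gcloud run deploy` printed.

```typescript
// Sanity-check a deployed Embedbase instance.
// Assumptions: the "paul" vault id and { query } payload shape
// mirror what the rest of this post uses; adjust to your setup.
const baseUrl = process.env.CLOUD_RUN_URL ?? "http://localhost:8000";

// Pure helper so the request shape can be inspected without a network call
const buildSearchRequest = (vaultId: string, query: string) => ({
  url: `${baseUrl}/v1/${vaultId}/search`,
  options: {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  },
});

// Only hit the network when a deployment URL is actually configured
if (process.env.CLOUD_RUN_URL) {
  const { url, options } = buildSearchRequest("paul", "how to start a startup");
  fetch(url, options)
    .then((r) => r.json())
    .then((data) => console.log(data))
    .catch((err) => console.error(err));
}
```

If the deploy worked, the response should be a JSON body rather than a connection error.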

Building the Paul Graham essays’ crawler

git clone https://github.com/another-ai/embedbase-paul-graham
cd embedbase-paul-graham
npm i
// src/main.ts
import { PlaywrightCrawler } from 'crawlee';
import { router } from './routes.js';

// Here we want to start from the page that lists all of Paul's essays
const startUrls = ['http://www.paulgraham.com/articles.html'];

const crawler = new PlaywrightCrawler({
    requestHandler: router,
});

await crawler.run(startUrls);
// src/routes.ts
import { createPlaywrightRouter, Dataset } from 'crawlee';

export const router = createPlaywrightRouter();

router.addDefaultHandler(async ({ enqueueLinks, log }) => {
    log.info(`enqueueing new URLs`);
    await enqueueLinks({
        // Here we tell the crawler to only accept pages under the
        // "http://www.paulgraham.com/" domain name;
        // for example, if we find a link on Paul's website to a URL
        // like "https://ycombinator.com/startups", it will be ignored
        globs: ['http://www.paulgraham.com/**'],
        label: 'detail',
    });
});

router.addHandler('detail', async ({ request, page, log }) => {
    // Here we run some logic on every page under the
    // "http://www.paulgraham.com/" domain name

    // for example, collecting the page title
    const title = await page.title();

    // getting the essay's content
    const blogPost = await page.locator('body > table > tbody > tr > td:nth-child(3)').textContent();
    if (!blogPost) {
        log.info(`no blog post found for ${title}, skipping`);
        return;
    }
    log.info(`${title}`, { url: request.loadedUrl });

    // Remember that AI models and databases usually limit input size,
    // so we split each essay into paragraph-sized chunks
    // by splitting the blog post on "\n\n"
    const chunks = blogPost.split(/\n\n/).filter((chunk) => chunk.trim().length > 0);
    if (chunks.length === 0) {
        log.info(`no content found for ${title}, skipping`);
        return;
    }
    // If you are not familiar with Promises, don't worry for now;
    // they are just a means to run these requests concurrently
    await Promise.all(chunks.map((chunk) => {
        const d = {
            url: request.loadedUrl,
            title: title,
            blogPost: chunk,
        };
        // Here we just want to send the page's interesting content
        // into Embedbase (don't mind Dataset, it's optional local storage)
        return Promise.all([Dataset.pushData(d), add(title, chunk)]);
    }));
});
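
Splitting on blank lines can produce very short or very long chunks. A possible refinement, which is my own addition and not part of the original crawler, is to merge consecutive paragraphs up to a character budget; the `maxChars` value here is an arbitrary assumption, not a documented Embedbase or OpenAI limit.

```typescript
// Merge paragraph-level chunks into larger chunks bounded by maxChars.
// maxChars = 1000 is an arbitrary assumption, not a documented limit.
const mergeChunks = (paragraphs: string[], maxChars = 1000): string[] => {
    const merged: string[] = [];
    let current = '';
    for (const p of paragraphs) {
        const trimmed = p.trim();
        if (!trimmed) continue; // skip empty paragraphs
        if (current && current.length + trimmed.length + 2 > maxChars) {
            // adding this paragraph would overflow the budget: flush
            merged.push(current);
            current = trimmed;
        } else {
            current = current ? `${current}\n\n${trimmed}` : trimmed;
        }
    }
    if (current) merged.push(current);
    return merged;
};
```

You could then call `mergeChunks(blogPost.split(/\n\n/))` in the handler above before sending the chunks to Embedbase.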
const add = (title: string, blogPost: string) => {
    // note "paul" in the URL; it can be anything you want
    // and helps you segment your data into isolated parts
    const url = `${baseUrl}/v1/paul`;
    const data = {
        documents: [{
            data: blogPost,
        }],
    };
    // send the data to Embedbase using the "node-fetch" library;
    // return the promise so callers can await it
    return fetch(url, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
        },
        body: JSON.stringify(data),
    }).then((response) => {
        return response.json();
    }).then((data) => {
        console.log('Success:', data);
    }).catch((error) => {
        console.error('Error:', error);
    });
};
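
Since the ingest endpoint takes a `documents` array, the chunks of one essay could also be sent in a single request instead of one request per chunk. The batching itself is my own suggestion; the payload shape mirrors the `add` function above.

```typescript
// Build one ingest payload for all chunks of an essay.
// The { documents: [{ data }] } shape mirrors the add() function above;
// sending chunks in one batch is an assumption, not shown in the original post.
const buildBatchPayload = (chunks: string[]) => ({
    documents: chunks.map((chunk) => ({ data: chunk })),
});

// Usage sketch: one POST per essay instead of one per chunk
// fetch(`${baseUrl}/v1/paul`, {
//     method: 'POST',
//     headers: { 'Content-Type': 'application/json' },
//     body: JSON.stringify(buildBatchPayload(chunks)),
// });
```

Fewer, larger requests are gentler on both the crawler and the server, at the cost of losing per-chunk error reporting.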
npm start
# you can get your cloud run URL like this:
CLOUD_RUN_URL=$(gcloud run services list --platform managed --region us-central1 --format="value(status.url)" --filter="metadata.name=embedbase-paul-graham")
npm run playground ${CLOUD_RUN_URL}

(Optional) Searching through Embedbase in your terminal

// src/playground.ts
const search = async (query: string) => {
    const url = `${baseUrl}/v1/paul/search`;
    const data = {
        query,
    };
    return fetch(url, {
        method: 'POST',
        headers: {
            'Content-Type': 'application/json',
        },
        body: JSON.stringify(data),
    }).then((response) => {
        return response.json();
    }).then((data) => {
        console.log('Success:', data);
    }).catch((error) => {
        console.error('Error:', error);
    });
};

const p = prompt();

// This is an interactive terminal that lets you search Paul Graham's
// blog posts using semantic search.
// It is an infinite loop that asks you for a query
// and shows you the results
const start = async () => {
    console.log('Welcome to the Embedbase playground!');
    console.log('This playground is a simple example of how to use Embedbase');
    console.log('Currently using Embedbase server at', baseUrl);
    console.log('This is an interactive terminal that lets you search Paul Graham blog posts using semantic search');
    console.log('Try running some queries such as "how to get rich"');
    console.log('or "how to pitch investors"');
    while (true) {
        const query = p('Enter a semantic query:');
        if (!query) {
            console.log('Bye!');
            return;
        }
        await search(query);
    }
};

start();
npm run playground
npm run playground ${CLOUD_RUN_URL}

(Optional) Building Apple Siri Shortcut

Starting Apple Shortcuts
Creating a new Apple Siri Shortcuts
The first part of the shortcut
  1. “Dictate text” lets you speak your search query (choose the English language)
  2. We store the Embedbase endpoint in a “Text” action for clarity; change it according to your setup (“http://localhost:8000/v1/paul/search” if you run Embedbase locally)
  3. We set the endpoint in a variable, again for clarity
  4. Same for the dictated text
  5. Now “Get contents of” does an HTTP POST request to Embedbase, using the “vault_id” we defined during crawling (“paul”) and the “query” variable for the “query” property
The last part of the shortcut
  1. “Get for in” will extract the “similarities” property from the Embedbase response
  2. “Repeat with each item in” will, for each similarity:
  3. Get the “document_path” property
  4. Add it to a “paths” variable (a list)
  5. “Combine” will “join” the results with a new line
  6. (Optional, shown below) A fun trick you can add to the shortcut for spiciness: use OpenAI GPT-3 to rewrite the result text so it sounds better when Siri pronounces it
  7. We assemble the result into a “Text” to be voice-friendly
  8. Ask Siri to speak it
A functional GPT3 shortcut to transform anything
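
For reference, the extraction the shortcut performs (steps 1–5 and 7 above) can be sketched in a few lines of code. The `similarities` and `document_path` field names follow the steps above; the exact response types are my assumption.

```typescript
// Minimal sketch of what the Siri Shortcut does with an Embedbase response:
// pull each similarity's document_path and join them on new lines for Siri to read.
// The interfaces below are assumed from the field names used in the shortcut steps.
interface Similarity {
    document_path: string;
}

interface SearchResponse {
    similarities: Similarity[];
}

const toSpeech = (response: SearchResponse): string =>
    response.similarities.map((s) => s.document_path).join('\n');
```

The optional GPT-3 step would then post-process this joined string before handing it to Siri.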

Conclusion


Founder | 🤖 Software 3.0 | Techstars 22 | OrangeDAO 23 | Building https://github.com/different-ai/embedbase
