In this tutorial series, we'll build a production-ready application end-to-end using Flask, React, OpenAI, and Pinecone. Along the way we'll implement the Retrieval-Augmented Generation (RAG) framework to give the LLM the context it needs from information it wasn't trained on.
What we're building
In this series, we'll build a simple web application that lets a user input a URL and ask questions about the content of that webpage. You can find the source code for part 1 on GitHub.
Components of the full application:
- Backend (Flask): This handles the logic to scrape the website and call OpenAI’s Embeddings API to create embeddings from the website’s text. It also stores these embeddings in the vector database (Pinecone) and retrieves relevant text to help the LLM answer the user’s question.
- OpenAI: We'll call two different APIs from OpenAI: (1) the Embeddings API to embed the text of the website as well as the user's question, and (2) the ChatCompletions API to get an answer from GPT-4 to send back to the user.
- Pinecone: This is the vector database that we’ll use to (1) send the embeddings of the website’s text to, and (2) retrieve the most similar text chunks for constructing the prompt to send to the LLM in step 3.
- Frontend (React): This is the interface that the user interacts with to input a URL and ask questions about the webpage.
Features of the final web app:
- URL Submission: Users can enter any website URL to initiate the process.
- Contextual Querying: After submitting a URL, users can ask any question related to the content of that website.
- Real-time Responses: The application retrieves and streams responses in real-time, ensuring a dynamic user experience.
- Performance Metrics: In the final stage, users can track various metrics to evaluate the LLM’s effectiveness and accuracy in handling queries.
Step 1: The Flask skeleton
We'll focus only on the backend (Flask / Python) for part 1 of the
tutorial. First we'll set up our dev environment, build our Flask
skeleton, and establish routes to expose to the frontend. You can find
the source code on GitHub.
Setting up the development environment
First, let's set up a virtual environment and install all requirements
for the application. Run the following commands in your terminal.
```bash
# Create the project directory and enter it
mkdir YourApp
cd ./YourApp

# Create and activate a virtual environment
# if virtualenv isn't installed, install it using
# pip install virtualenv
python -m venv venv
source venv/bin/activate

# Install dependencies from requirements.txt
# Make sure you have a requirements.txt file in your project directory
# See the GitHub repository here: <URL> for the file
pip install -r requirements.txt
```
Build Flask skeleton
Now let's set up the folder structure as follows. This neatly separates the
code to make it easier to maintain.
```
/YourApp
  /app
    __init__.py
    /api
      __init__.py
      routes.py
    /services
      __init__.py
      openai_service.py
      pinecone_service.py
      scraping_service.py
    /utils
      __init__.py
      helper_functions.py
  .env
  .gitignore
  requirements.txt
  run.py
```
See the contents of the run.py, requirements.txt, and __init__.py files in
the GitHub source code and copy them into the relevant files. We'll write
the code for the other files throughout the course of this tutorial.
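If you want a sense of what those boilerplate files contain before opening the repo, here's a minimal sketch of a typical layout for a Flask app with a blueprint. The create_app() factory and the exact file contents are assumptions on my part; defer to the GitHub source for the real versions.

```python
# Minimal sketch only -- the authoritative files live in the GitHub repo.

# YourApp/run.py
from app import create_app

app = create_app()

if __name__ == '__main__':
    app.run(debug=True)

# YourApp/app/__init__.py
from flask import Flask
from app.api import api_blueprint

def create_app():
    app = Flask(__name__)
    app.register_blueprint(api_blueprint)
    return app

# YourApp/app/api/__init__.py
from flask import Blueprint

api_blueprint = Blueprint('api', __name__)

from . import routes  # imported last so the routes register on the blueprint
```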
Set up the .env file
We’re going to store our API keys for OpenAI and Pinecone in a .env file.
Create this file in the root of your project directory and replace the
placeholders with your actual API keys. You can sign up for OpenAI and get
your key here, and get your Pinecone key here.
```
# YourApp/.env
OPENAI_API_KEY=<YOUR_OPENAI_KEY>
PINECONE_API_KEY=<YOUR_PINECONE_KEY>
```
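For these keys to show up in os.environ (which the service files below read from), the .env file has to be loaded at startup. One common approach is python-dotenv; whether the repo's run.py does exactly this is an assumption, so check the source, but a sketch looks like:

```python
# Sketch: load .env at startup (assumes python-dotenv is listed in requirements.txt).
# Call this before importing the app, e.g. at the very top of run.py, so the
# service modules can read the keys from os.environ when they're imported.
# The repo may handle this differently.
from dotenv import load_dotenv

load_dotenv()  # reads YourApp/.env and populates os.environ
```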
Create REST routes for the application
There are two routes for the React frontend to hit on our backend: (1)
embed_and_store, which handles the first phase of creating embeddings and
storing them in Pinecone, and (2) handle_query, which gets an answer to the
user’s question.
🚨 Remember to create and populate the run.py, requirements.txt, and
__init__.py files from the GitHub source code. Once populated, run pip install -r
requirements.txt to install all requirements.
```python
# YourApp/api/routes.py
from . import api_blueprint

@api_blueprint.route('/embed-and-store', methods=['POST'])
def embed_and_store():
    # handles scraping the URL, embedding the texts, and
    # uploading to the vector database.
    pass

@api_blueprint.route('/handle-query', methods=['POST'])
def handle_query():
    # handles embedding the user's question,
    # finding relevant context from the vector database,
    # building the prompt for the LLM,
    # and sending the prompt to the LLM's API to get an answer.
    pass
```
Step 2: Embeddings and vector DB (Pinecone)
Now that we have our basic backend skeleton in place, let’s build out the
initial flow when the frontend provides the URL to the backend.
Scraping the website using BeautifulSoup
We'll use the excellent BeautifulSoup library to scrape the URL and retrieve
the text contents of the page.
```python
# YourApp/services/scraping_service.py
import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.get_text(separator='\n')
    return text
```
Chunk the text, create embeddings, and upload to Pinecone
We want to split the text into chunks. This way we can create embeddings for
each chunk and only retrieve the most relevant chunks of text from the
vector DB when the user asks a question. Embeddings are critical for the RAG
framework to work; read more about embeddings here.
First, let’s create the chunk_text function in our helper_functions.py file.
```python
# YourApp/utils/helper_functions.py
def chunk_text(text, chunk_size=200):
    # Split the text by sentences to avoid breaking in the middle of a sentence
    sentences = text.split('. ')
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        # Check if adding the next sentence exceeds the chunk size
        if len(current_chunk) + len(sentence) <= chunk_size:
            current_chunk += sentence + '. '
        else:
            # If the chunk reaches the desired size, add it to the chunks list
            chunks.append(current_chunk)
            current_chunk = sentence + '. '
    # Add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk)
    return chunks
```
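As a quick sanity check (not part of the app itself), you can run chunk_text on a short made-up passage with a small chunk_size and inspect what comes back:

```python
# Illustration only: chunk a short passage with a small chunk_size
sample = (
    "Flask is a lightweight web framework. "
    "Pinecone stores embeddings for similarity search. "
    "RAG adds retrieved context to LLM prompts."
)
for i, chunk in enumerate(chunk_text(sample, chunk_size=80)):
    print(i, repr(chunk))
# Prints each chunk; sentences are kept intact and each chunk stays near the size budget.
```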
Now let’s build the function that gets an embedding for a given chunk using
OpenAI's Embeddings API. These embeddings are what we'll store in the vector
database, along with the text of each chunk.
```python
# YourApp/services/openai_service.py
import os
import json
import requests

OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
OPENAI_EMBEDDING_MODEL = 'text-embedding-ada-002'

def get_embedding(chunk):
    url = 'https://api.openai.com/v1/embeddings'
    headers = {
        'content-type': 'application/json; charset=utf-8',
        'Authorization': f"Bearer {OPENAI_API_KEY}"
    }
    data = {
        'model': OPENAI_EMBEDDING_MODEL,
        'input': chunk
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    response_json = response.json()
    embedding = response_json["data"][0]["embedding"]
    return embedding
```
Next, we’ll build the function to embed all these chunks and upload these
embeddings to a vector database (Pinecone).
```python
# YourApp/services/pinecone_service.py
import os
import pinecone
from app.services.openai_service import get_embedding

PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')

# make sure to enter your actual Pinecone environment
pinecone.init(api_key=PINECONE_API_KEY, environment='gcp-starter')

EMBEDDING_DIMENSION = 1536

def embed_chunks_and_upload_to_pinecone(chunks, index_name):
    # delete the index if it already exists,
    # as Pinecone's free plan only allows one index
    if index_name in pinecone.list_indexes():
        pinecone.delete_index(name=index_name)

    # create a new index in Pinecone
    # the EMBEDDING_DIMENSION is based on what the
    # OpenAI embedding model outputs
    pinecone.create_index(name=index_name,
                          dimension=EMBEDDING_DIMENSION,
                          metric='cosine')
    index = pinecone.Index(index_name)

    # embed each chunk and aggregate these embeddings
    embeddings_with_ids = []
    for i, chunk in enumerate(chunks):
        embedding = get_embedding(chunk)
        embeddings_with_ids.append((str(i), embedding, chunk))

    # upload the embeddings and relevant texts for each chunk
    # to the Pinecone index
    upserts = [(id, vec, {"chunk_text": text}) for id, vec, text in embeddings_with_ids]
    index.upsert(vectors=upserts)
```
That's it! Our application can now handle scraping the website, creating text
chunks from the content of the website, embedding these chunks, and uploading
them to the vector database. Let's put it all together in the embed_and_store
route.
# Updated "YourApp/api/routes.py" from. import api_blueprint from flask import request, jsonify from app.services import openai_service, pinecone_service, scraping_service from app.utils.helper_functions import chunk_text # Sample index name since we're only creating a single index PINECONE_INDEX_NAME = 'index237' @api_blueprint.route('/embed-and-store', methods=['POST']) def embed_and_store(): url = request.json['url'] url_text = scraping_service.scrape_website(url) chunks = chunk_text(url_text) pinecone_service.embed_chunks_and_upload_to_pinecone(chunks, PINECONE_INDEX_NAME) response_json = { "message": "Chunks embedded and stored successfully" } return jsonify(response_json) @api_blueprint.route('/handle-query', methods=['POST']) def handle_query(): # handles embedding the user's question, # finding relevant context from the vector database, # building the prompt for the LLM, # and sending the prompt to the LLM's API to get an answer. pass
Step 3: RAG, prompt construction, and GPT-4
Now that we’ve completed the first phase of our flow, let’s move on to the
next phase, which kicks in when the user asks a question about the website
in the chat interface.
Retrieve context from the vector database
Assuming the handle_query endpoint receives a question as part of its request
payload, let’s first find the relevant chunks of context to provide to the
LLM to help answer the question. This is the bulk of the work in implementing
the RAG framework; you can read a lot more about RAG here.
```python
# YourApp/services/pinecone_service.py

# ... leave the rest of the file unchanged
# and add the following:

def get_most_similar_chunks_for_query(query, index_name):
    question_embedding = get_embedding(query)
    index = pinecone.Index(index_name)
    query_results = index.query(question_embedding, top_k=3, include_metadata=True)
    context_chunks = [x['metadata']['chunk_text'] for x in query_results['matches']]
    return context_chunks
```
Here's what's going on in this code:
- First, we embed the question through OpenAI’s embedding model using the same get_embedding function that we’ve used before.
- Then we query the Pinecone vector database we created in Step 2 to find the top 3 chunks of text that are most similar to the embedded question from step 1. Pinecone does this using cosine similarity (there's a small illustration of the calculation after this list).
- The function returns the texts of those chunks as a list.
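Pinecone computes these similarity scores for us, but if you want intuition for what cosine similarity means here, this standalone snippet (illustrative only, not used by the app) shows the calculation on two tiny vectors:

```python
# Illustration only -- Pinecone computes these scores internally at scale.
import math

def cosine_similarity(a, b):
    # dot product of the two vectors divided by the product of their lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in nearly the same direction score close to 1.0
print(cosine_similarity([1.0, 0.0, 1.0], [0.9, 0.1, 1.1]))  # ~0.99
```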
Build the prompt for the LLM using this context
Now comes the fun part, commonly referred to as prompt engineering. We’re
going to combine the user’s question, the relevant context, and some
instructions for the LLM into a single text prompt.
```python
# YourApp/utils/helper_functions.py

# add this to the top of the file:
PROMPT_LIMIT = 3750

# ... keep all the current content of the file
# and add the following code

def build_prompt(query, context_chunks):
    # create the start and end of the prompt
    prompt_start = (
        "Answer the question based on the context below. If you don't know the answer "
        "based on the context provided below, just respond with 'I don't know' instead "
        "of making up an answer. Return just the answer to the question, don't add "
        "anything else. Don't start your response with the word 'Answer:'.\n\n"
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )

    # append context chunks until adding another one would push the
    # prompt past the character limit we want to send to the LLM
    selected_chunks = []
    for chunk in context_chunks:
        candidate = "\n\n---\n\n".join(selected_chunks + [chunk])
        if len(candidate) >= PROMPT_LIMIT:
            break
        selected_chunks.append(chunk)

    prompt = (
        prompt_start +
        "\n\n---\n\n".join(selected_chunks) +
        prompt_end
    )
    return prompt
```
Now we have a full prompt that includes:
- Instructions for the LLM
- Context to help the LLM answer the question
- The user's question itself
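To get a feel for the final shape, you can print build_prompt's output with a couple of made-up chunks (purely illustrative; the real chunks come from Pinecone):

```python
# Illustration only: inspect the assembled prompt with fake context chunks
from app.utils.helper_functions import build_prompt

fake_chunks = [
    "Flask is a lightweight Python web framework.",
    "Pinecone stores embeddings and supports similarity search.",
]
print(build_prompt("What does Pinecone do?", fake_chunks))
# Output: the instructions, then "Context:" with the chunks separated by "---",
# then "Question: ..." and a trailing "Answer:" for the LLM to complete.
```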
Getting an answer from the LLM (GPT-4)
Now comes the part we’ve been waiting for: we’re going to provide the prompt
to the LLM (OpenAI’s GPT-4) and get back the answer to the user’s question
to display in the front-end!
```python
# YourApp/services/openai_service.py

# add this to the top of the file under the imports
CHATGPT_MODEL = 'gpt-4-1106-preview'

# ... keep the rest of the file unchanged
# and add the following code:

def get_llm_answer(prompt):
    # Aggregate a messages array to send to the LLM
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    messages.append({"role": "user", "content": prompt})

    # Send the payload to the LLM to retrieve an answer
    url = 'https://api.openai.com/v1/chat/completions'
    headers = {
        'content-type': 'application/json; charset=utf-8',
        'Authorization': f"Bearer {OPENAI_API_KEY}"
    }
    data = {
        'model': CHATGPT_MODEL,
        'messages': messages,
        'temperature': 1,
        'max_tokens': 1000
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))

    # return the final answer
    response_json = response.json()
    completion = response_json["choices"][0]["message"]["content"]
    return completion
```
Here's what's going on in this code:
- First, we create a messages array to send to the LLM as part of our request. Since we’re using the ChatCompletions API, we’ll need to treat this as if we’re sending a list of chat conversations even though we only have one question we need answered. This becomes important when we want to provide the entire chat history to the LLM (this will be covered in later tutorials).
- Then we send this payload, including the messages of the constructed prompt, to the ChatCompletions API from OpenAI.
- We get back a text response from the LLM with the answer to the question.
That's it! Putting it all together, here’s our complete routes.py file:
# Updated "YourApp/api/routes.py" from. import api_blueprint from flask import request, jsonify from app.services import openai_service, pinecone_service, scraping_service from app.utils.helper_functions import chunk_text, build_prompt PINECONE_INDEX_NAME = 'index237' @api_blueprint.route('/handle-query', methods=['POST']) def handle_query(): question = request.json['question'] context_chunks = pinecone_service.get_most_similar_chunks_for_query(question, PINECONE_INDEX_NAME) prompt = build_prompt(question, context_chunks) answer = openai_service.get_llm_answer(prompt) return jsonify({ "question": question, "answer": answer }) @api_blueprint.route('/embed-and-store', methods=['POST']) def embed_and_store(): url = request.json['url'] url_text = scraping_service.scrape_website(url) chunks = chunk_text(url_text) pinecone_service.embed_chunks_and_upload_to_pinecone(chunks, PINECONE_INDEX_NAME) response_json = { "message": "Chunks embedded and stored successfully" } return jsonify(response_json)
Test the implementation using cURL
To make sure everything is working, let’s start the server locally
and test that both endpoints work. We'll use a sample URL to
test out the application (https://dev.to/bobur/how-to-build-a-custom-gpt-enabled-full-stack-app-for-real-time-data-38k8).
In a terminal window at the root of your application directory, run the
python run.py command to start the server.
Now let's test the embed-and-store endpoint. Open a new terminal window and
run the following command:
```bash
curl -X POST http://localhost:5000/embed-and-store \
  -H "Content-Type: application/json" \
  -d '{"url":"https://dev.to/bobur/how-to-build-a-custom-gpt-enabled-full-stack-app-for-real-time-data-38k8"}'

# ...after a few seconds, you should see this in the terminal:
{
  "message": "Chunks embedded and stored successfully"
}
```
We have our vector store populated! Now let’s try asking a question using the
handle-query endpoint:
```bash
curl -X POST http://localhost:5000/handle-query \
  -H "Content-Type: application/json" \
  -d '{"question":"Why do we provide ChatGPT with a custom knowledge base?"}'

# ...after a few seconds, you should see this:
{
  "answer": "To enhance its ability to deliver more accurate and context-specific information in its responses, which is particularly useful for specialized applications such as the sample app designed for finding real-time discount prices.",
  "question": "Why do we provide ChatGPT with a custom knowledge base?"
}
```
Conclusion and what's next
Our backend works! So far in this tutorial, we've implemented the following:
- Backend (Flask): We set up the entire backend for the web application. This handles the tasks below.
- Scrape, embed, and store vectors: We used OpenAI’s Embeddings API to embed the contents of a given website (in chunks) and store them in the Pinecone vector database.
- RAG: When the user asks a question, our backend finds the most relevant context chunks from the vector database, builds a prompt, and gets an answer from an LLM (OpenAI’s GPT-4).
You can find the source code for this tutorial at this branch on
GitHub. In
the next post, we'll build the React frontend to interact with the
backend we've built here. We’ll also learn how to stream responses from the
LLM to the user and incorporate the entire chat history to send to the LLM.