Build an LLM application with OpenAI, Flask, and Pinecone

We'll build a production-ready application end-to-end using Flask, OpenAI, and Pinecone, and implement the Retrieval Augmented Generation (RAG) framework to give the LLM context it wasn't trained on.

In this tutorial series, we'll build a production-ready application end-to-end using Flask, React, OpenAI, and Pinecone, and we'll implement the Retrieval Augmented Generation (RAG) framework to give the LLM the necessary context based on information it wasn't trained on.

What we're building

In this series, we'll create a simple web application that allows a user to input a URL and ask questions about the content of that webpage. You can find the source code for part 1 on GitHub.


Components of the full application:

  • Backend (Flask): This handles the logic to scrape the website and call OpenAI’s Embeddings API to create embeddings from the website’s text. It also stores these embeddings in the vector database (Pinecone) and retrieves relevant text to help the LLM answer the user’s question.
  • OpenAI: We'll call two different APIs from OpenAI: (1) the Embeddings API to embed the text of the website as well as the user's question, and (2) the ChatCompletions API to get an answer from GPT-4 to send back to the user.
  • Pinecone: This is the vector database that we’ll use to (1) send the embeddings of the website’s text to, and (2) retrieve the most similar text chunks for constructing the prompt to send to the LLM in step 3.
  • Frontend (React): This is the interface that the user interacts with to input a URL and ask questions about the webpage.

Features of the final web app:

  • URL Submission: Users can enter any website URL to initiate the process.
  • Contextual Querying: After submitting a URL, users can ask any question related to the content of that website.
  • Real-time Responses: The application retrieves and streams responses in real-time, ensuring a dynamic user experience.
  • Performance Metrics: In the final stage, users can track various metrics to evaluate the LLM’s effectiveness and accuracy in handling queries.

Step 1: The Flask skeleton


We'll only focus on the backend (Flask / Python) for part 1 of the tutorial. First, we'll set up our dev environment, build our Flask skeleton, and establish routes to expose to the frontend. You can find the source code for this part on GitHub.

Setting up the development environment

First, let's set up a virtual environment and install all requirements for the application. Run the following commands in your terminal.

    # Create the project directory and enter it
    mkdir YourApp
    cd ./YourApp
    
    # Create and activate a virtual environment
    # if virtualenv isn't installed, install it using
    # pip install virtualenv
    python -m venv venv
    source venv/bin/activate
    
    # Install dependencies from requirements.txt
    # Make sure you have a requirements.txt file in your project directory
    # See Github repository here: <URL> for the file
    pip install -r requirements.txt
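
The comments above reference the requirements.txt file from the repo. As a rough guide, the dependencies used over the course of this tutorial look something like the list below (python-dotenv is an assumption for loading the .env file, sketched later); prefer the pinned versions in the GitHub repository.

# YourApp/requirements.txt (illustrative -- see the repo for the exact file and versions)
flask
requests
beautifulsoup4
openai
pinecone-client
python-dotenv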
    
Build the Flask skeleton

Now let's set up the folder structure as follows. This neatly separates the code to make it easier to maintain.

/YourApp
    /app
        __init__.py
        /api
            __init__.py
            routes.py
        /services
            __init__.py
            openai_service.py
            pinecone_service.py
            scraping_service.py
        /utils
            __init__.py
            helper_functions.py
    .env
    .gitignore
    requirements.txt
    run.py
See the contents of run.py, requirements.txt, and the __init__.py files in the GitHub source code and copy them into the corresponding files. We'll write the code for the other files over the course of this tutorial.
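
If you want a sense of what those __init__.py files contain before copying them, here's a minimal sketch of how the blueprint could be wired into the app. This is an assumption based on how routes.py imports api_blueprint below; the repo's actual files may differ.

# app/api/__init__.py (sketch)
from flask import Blueprint

api_blueprint = Blueprint('api', __name__)

from . import routes  # imported after the blueprint is created to avoid a circular import


# app/__init__.py (sketch)
from flask import Flask
from .api import api_blueprint

def create_app():
    app = Flask(__name__)
    # registered without a URL prefix, so the endpoints are reachable
    # at /embed-and-store and /handle-query
    app.register_blueprint(api_blueprint)
    return app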

Set up the .env file

We’re going to store our API keys for OpenAI and Pinecone in a .env file. Create this file in the root of your project directory and make sure to replace the placeholders with your actual API keys. You can sign up for OpenAI and get your key here; you can do the same for Pinecone here.

# YourApp/.env
  
 OPENAI_API_KEY=<YOUR_OPENAI_KEY>
 PINECONE_API_KEY=<YOUR_PINECONE_KEY>
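
The services later in this tutorial read these keys with os.environ.get, so they need to be loaded into the environment before the app starts. One common approach, assuming the python-dotenv package, is to load them in run.py; the repo's actual run.py may do this differently.

# YourApp/run.py (a minimal sketch, assuming python-dotenv)
from dotenv import load_dotenv

load_dotenv()  # makes OPENAI_API_KEY and PINECONE_API_KEY available via os.environ

from app import create_app  # imported after load_dotenv so the services see the keys

app = create_app()

if __name__ == '__main__':
    app.run(debug=True)  # serves on http://localhost:5000 by default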
    

Create REST routes for the application

There are two routes for the React frontend to hit on our backend: (1) embed_and_store, which handles the first phase of creating embeddings and storing them in Pinecone, and (2) handle_query, which gets an answer to the user’s question.

🚨 Remember to create and populate run.py, requirements.txt, and the __init__.py files from the GitHub source code. Once populated, run pip install -r requirements.txt to install all requirements.

    # YourApp/api/routes.py
  
    from . import api_blueprint

    @api_blueprint.route('/embed-and-store', methods=['POST'])
    def embed_and_store():
        # handles scraping the URL, embedding the texts, and
        # uploading to the vector database.
        pass
    
    @api_blueprint.route('/handle-query', methods=['POST'])
    def handle_query():
        # handles embedding the user's question,
        # finding relevant context from the vector database,
        # building the prompt for the LLM,
        # and sending the prompt to the LLM's API to get an answer.
        pass
    

Step 2: Embeddings and vector DB (Pinecone)


Now that we have our basic backend skeleton in place, let’s build out the initial flow when the frontend provides the URL to the backend.

Scraping the website using BeautifulSoup

We'll use the excellent BeautifulSoup library to scrape the URL and retrieve the text contents of the page.

# YourApp/services/scraping_service.py
  
import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    text = soup.get_text(separator='\n')
    return text
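
As a quick, informal check, you can try the function from a Python shell at the project root; the URL here is just an example.

# Quick check from a Python shell (illustrative, not part of the app)
from app.services.scraping_service import scrape_website

text = scrape_website('https://example.com')
print(text[:200])  # print the first 200 characters of the extracted text
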
Chunk the text, create embeddings, and upload to Pinecone

We want to split the text into chunks. This way we can create embeddings for each chunk and only retrieve the most relevant chunks of text from the vector DB when the user asks a question. Embeddings are critical for the RAG framework to work; read more about embeddings here.

First, let’s create the chunk_text function in our helper_functions.py file.

# YourApp/utils/helper_functions.py
  
def chunk_text(text, chunk_size=200):
    # Split the text by sentences to avoid breaking in the middle of a sentence
    sentences = text.split('. ')
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        # Check if adding the next sentence exceeds the chunk size
        if len(current_chunk) + len(sentence) <= chunk_size:
            current_chunk += sentence + '. '
        else:
            # If the chunk reaches the desired size, add it to the chunks list
            chunks.append(current_chunk)
            current_chunk = sentence + '. '
    # Add the last chunk if it's not empty
    if current_chunk:
        chunks.append(current_chunk)
    return chunks
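
To get a feel for the output, here's a small illustrative example (not part of the app); the sample text and chunk size are arbitrary.

# Illustrative: split a repeated sentence into chunks of roughly 200 characters
sample_text = "Flask is a lightweight web framework for Python. " * 20
chunks = chunk_text(sample_text, chunk_size=200)
print(len(chunks))     # several chunks
print(len(chunks[0]))  # each chunk is roughly 200 characters or fewer
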
Now let’s build the function to get embeddings for a given chunk using OpenAI's Embeddings API. These embeddings are what we'll store in the vector database, along with the text for that chunk.

# YourApp/services/openai_service.py
  
import os
import json
import requests
from openai import OpenAI

OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
OPENAI_EMBEDDING_MODEL = 'text-embedding-ada-002'

def get_embedding(chunk):
    url = 'https://api.openai.com/v1/embeddings'
    headers = {
        'content-type': 'application/json; charset=utf-8',
        'Authorization': f"Bearer {OPENAI_API_KEY}"
    }
    data = {
        'model': OPENAI_EMBEDDING_MODEL,
        'input': chunk
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    response_json = response.json()
    embedding = response_json["data"][0]["embedding"]
    return embedding
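
A quick way to sanity-check this (and to see where the 1536 dimension used in the next step comes from) is to embed a short string and inspect the vector length; this assumes your OPENAI_API_KEY is set in the environment.

# Illustrative sanity check: text-embedding-ada-002 returns 1536-dimensional vectors,
# which is why the Pinecone index below is created with dimension 1536.
vector = get_embedding("Hello, world!")
print(len(vector))  # expected output: 1536
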
Next, we’ll build the function to embed all these chunks and upload these embeddings to a vector database (Pinecone).

# YourApp/services/pinecone_service.py
  
import pinecone
from app.services.openai_service import get_embedding
import os

PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')

# make sure to enter your actual Pinecone environment
pinecone.init(api_key=PINECONE_API_KEY, environment='gcp-starter')

EMBEDDING_DIMENSION = 1536

def embed_chunks_and_upload_to_pinecone(chunks, index_name):
    
    # delete the index if it already exists.
    # as Pinecone's free plan only allows one index
    if index_name in pinecone.list_indexes():
        pinecone.delete_index(name=index_name)

    # create a new index in Pinecone
    # the EMBEDDING_DIMENSION is based on what the
    # OpenAI embedding model outputs
    pinecone.create_index(name=index_name,
                        dimension=EMBEDDING_DIMENSION, metric='cosine')
    index = pinecone.Index(index_name)
    # embed each chunk and aggregate these embeddings
    embeddings_with_ids = []
    for i, chunk in enumerate(chunks):
        embedding = get_embedding(chunk)
        embeddings_with_ids.append((str(i), embedding, chunk))
    # upload the embeddings and relevant texts for each chunk
    # to the pinecone index
    upserts = [(id, vec, {"chunk_text": text}) for id, vec, text in embeddings_with_ids]
    index.upsert(vectors=upserts)
That's it! Our application can now handle scraping the website, creating text chunks from the content of the website, embedding these chunks, and uploading them to the vector database. Let's put it all together in the embed_and_store route.

# Updated "YourApp/api/routes.py"
  
from . import api_blueprint
from flask import request, jsonify
from app.services import openai_service, pinecone_service, scraping_service
from app.utils.helper_functions import chunk_text

# Sample index name since we're only creating a single index
PINECONE_INDEX_NAME = 'index237'

@api_blueprint.route('/embed-and-store', methods=['POST'])
def embed_and_store():
    url = request.json['url']
    url_text = scraping_service.scrape_website(url)
    chunks = chunk_text(url_text)
    pinecone_service.embed_chunks_and_upload_to_pinecone(chunks, PINECONE_INDEX_NAME)
    response_json = {
        "message": "Chunks embedded and stored successfully"
    }
    return jsonify(response_json)

@api_blueprint.route('/handle-query', methods=['POST'])
def handle_query():
    # handles embedding the user's question,
    # finding relevant context from the vector database,
    # building the prompt for the LLM,
    # and sending the prompt to the LLM's API to get an answer.
    pass

Step 3: RAG, prompt construction, and GPT-4

Now that we’ve completed the first phase of our flow, let’s move onto the next phase when the user asks a question in the chat interface about the website.

Retrieve context from the vector database

Assuming the handle_query endpoint receives a question as part of its request payload, let’s first find the relevant chunks of context to provide to the LLM to help answer the question. This is the bulk of the work to implement the RAG framework; you can read a lot more about RAG here.

# YourApp/services/pinecone_service.py
  
# ... leave the rest of the file unchanged
# and add the following.

def get_most_similar_chunks_for_query(query, index_name):
    question_embedding = get_embedding(query)
    index = pinecone.Index(index_name)
    query_results = index.query(question_embedding, top_k=3, include_metadata=True)
    context_chunks = [x['metadata']['chunk_text'] for x in query_results['matches']]
    return context_chunks
Here's what's going on in this code:
  1. First, we embed the question through OpenAI’s embedding model using the same get_embedding function that we’ve used before.
  2. Then we query our Pinecone vector database that we created in Step 2 to find the top 3 chunks of text that are most similar to the embedded question from step 1. This is done using cosine similarity (illustrated in the sketch after this list).
  3. The function returns the texts of those chunks as a list.
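
Pinecone computes the similarity search for us across all stored vectors, but for intuition, cosine similarity between two embedding vectors is just their dot product divided by the product of their magnitudes. A minimal, purely illustrative sketch:

# Illustrative only -- Pinecone does this internally across all stored vectors
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)  # close to 1.0 means very similar, close to 0 means unrelated
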
Build the prompt for the LLM using this context

Now comes the fun part, commonly referred to as prompt engineering. We’re going to combine the user’s question, the relevant context, and some instructions for the LLM into a single text prompt.

# YourApp/utils/helper_functions.py
  
# add this to the top of the file:
PROMPT_LIMIT = 3750

# ... keep all the current content of the file

# and add the following code
def build_prompt(query, context_chunks):

    # create the start and end of the prompt
    prompt_start = (
        "Answer the question based on the context below. If you don't know the answer based on the context provided below, just respond with 'I don't know' instead of making up an answer. Return just the answer to the question. Don't add anything else. Don't start your response with the word 'Answer:'."
        "\n\nContext:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )

    # append context chunks until we hit the character limit
    # (PROMPT_LIMIT, a rough proxy for tokens) we want to send in the prompt.
    prompt = ""
    for i in range(1, len(context_chunks) + 1):
        if len("\n\n---\n\n".join(context_chunks[:i])) >= PROMPT_LIMIT:
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(context_chunks[:i-1]) +
                prompt_end
            )
            break
        elif i == len(context_chunks):
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(context_chunks) +
                prompt_end
            )
    return prompt
Now we have a full prompt that includes the following (an example of the assembled prompt follows this list):
  1. Instructions for the LLM
  2. Context to help the LLM answer the question
  3. The user's question itself
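
For illustration, with two retrieved chunks the assembled prompt would look roughly like this (the chunk text depends entirely on the scraped page):

Answer the question based on the context below. If you don't know the answer based on the context provided below, just respond with 'I don't know' instead of making up an answer. Return just the answer to the question. Don't add anything else. Don't start your response with the word 'Answer:'.

Context:
<first relevant chunk of website text>

---

<second relevant chunk of website text>

Question: Why do we provide ChatGPT with a custom knowledge base?
Answer:
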
Getting an answer from the LLM (GPT-4)

Now comes the part we’ve been waiting for: we’re going to provide the prompt to the LLM (OpenAI’s GPT-4) and get back the answer to the user’s question to display in the front-end!

# YourApp/services/openai_service.py
  
# add this to the top of the file under the imports
CHATGPT_MODEL = 'gpt-4-1106-preview'

# ... keep the rest of the file unchanged
# and add the following code:

def get_llm_answer(prompt):
    # Aggregate a messages array to send to the LLM
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    messages.append({"role": "user", "content": prompt})
    # Send the payload to the LLM to retrieve an answer
    url = 'https://api.openai.com/v1/chat/completions'
    headers = {
        'content-type': 'application/json; charset=utf-8',
        'Authorization': f"Bearer {OPENAI_API_KEY}"
        }
    data = {
        'model': CHATGPT_MODEL,
        'messages': messages,
        'temperature': 1,
        'max_tokens': 1000
        }
    response = requests.post(url, headers=headers, data=json.dumps(data))

    # return the final answer
    response_json = response.json()
    completion = response_json["choices"][0]["message"]["content"]
    return completion
Here's what's going on in this code:
  1. First, we create a messages array to send to the LLM as part of our request. Since we’re using the ChatCompletions API, we’ll need to treat this as if we’re sending a list of chat conversations even though we only have one question we need answered. This becomes important when we want to provide the entire chat history to the LLM (this will be covered in later tutorials; a sketch of what that payload might look like follows this list).
  
  2. Then we send this payload, including the messages of the constructed prompt, to the ChatCompletions API from OpenAI.
  3. We get back a text response from the LLM with the answer to the question.
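
As a purely illustrative sketch of that future multi-turn payload (the history strings here are hypothetical placeholders, not from the repo):

# Hypothetical shape of a multi-turn messages array (covered in later tutorials)
previous_prompt = "...the prompt built for the user's earlier question..."
previous_answer = "...the LLM's earlier answer..."
prompt = "...the current RAG prompt built by build_prompt()..."

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": previous_prompt},       # an earlier user turn
    {"role": "assistant", "content": previous_answer},  # the LLM's earlier reply
    {"role": "user", "content": prompt},                # the current RAG prompt
]
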
That's it! Putting it all together, here’s our complete routes.py file:

# Updated "YourApp/api/routes.py"
  
from . import api_blueprint
from flask import request, jsonify
from app.services import openai_service, pinecone_service, scraping_service
from app.utils.helper_functions import chunk_text, build_prompt

PINECONE_INDEX_NAME = 'index237'

@api_blueprint.route('/handle-query', methods=['POST'])
def handle_query():
    question = request.json['question']
    context_chunks = pinecone_service.get_most_similar_chunks_for_query(question, PINECONE_INDEX_NAME)
    prompt = build_prompt(question, context_chunks)
    answer = openai_service.get_llm_answer(prompt)
    return jsonify({ "question": question, "answer": answer })

@api_blueprint.route('/embed-and-store', methods=['POST'])
def embed_and_store():
    url = request.json['url']
    url_text = scraping_service.scrape_website(url)
    chunks = chunk_text(url_text)
    pinecone_service.embed_chunks_and_upload_to_pinecone(chunks, PINECONE_INDEX_NAME)
    response_json = {
        "message": "Chunks embedded and stored successfully"
    }
    return jsonify(response_json)
Test the implementation using cURL

To make sure everything is working, let’s start the server locally and test that both endpoints behave as expected. We'll be using a sample URL to test out the application (https://dev.to/bobur/how-to-build-a-custom-gpt-enabled-full-stack-app-for-real-time-data-38k8).

In a terminal window at the root of your application directory, run the python run.py command to start the server.

Now let's test the embed_and_store endpoint. Open a new terminal window and run the following command:

        curl -X POST http://localhost:5000/embed-and-store \
        -H "Content-Type: application/json" \
        -d '{"url":"https://dev.to/bobur/how-to-build-a-custom-gpt-enabled-full-stack-app-for-real-time-data-38k8"}'


        # ...after a few seconds, you should see this in the terminal:
        {
          "message": "Chunks embedded and stored successfully"
        }
We have our vector store populated! Now let’s try asking a question using the handle-query endpoint:

        curl -X POST http://localhost:5000/handle-query \
        -H "Content-Type: application/json" \
        -d '{"question":"Why do we provide ChatGPT with a custom knowledge base?"}'


          # ...after a few seconds, you should see this:
          {
            "answer": "To enhance its ability to deliver more accurate and context-specific information in its responses, which is particularly useful for specialized applications such as the sample app designed for finding real-time discount prices.",
            "question": "Why do we provide ChatGPT with a custom knowledge base?"
          }
Conclusion and what's next

Our backend works! So far in this tutorial, we've implemented the following:
  • Backend (Flask): We set up the entire backend for the web application. This handles the tasks below.
  • Scrape, embed, and store vectors: We used OpenAI’s Embeddings API to embed the contents of a given website (in chunks) and store them in the Pinecone vector database.
  • RAG: When the user asks a question, our backend finds the most relevant context chunks from the vector database, builds a prompt, and gets an answer from an LLM (OpenAI’s GPT-4).
You can find the source code for this tutorial at this branch on GitHub. In the next post, we'll build the React frontend to interact with the backend we've built here. We’ll also learn how to stream responses from the LLM to the user and incorporate the entire chat history to send to the LLM.
