How to Build an AI-Powered Knowledge Base for Your Team in 48 Hours
Last month, our 35-person product team was drowning in documentation. We had wikis, Google Docs, Confluence pages, Notion databases, Slack threads with critical decisions buried in them, and a shared drive with hundreds of PDFs nobody could find anything in. Sound familiar?
I spent a weekend building an AI-powered knowledge base that lets anyone on the team ask questions in plain English and get accurate answers sourced from all of our documentation. Total build time: about 48 hours. Monthly running cost: $153. The team went from "I can't find that document" to "just ask the knowledge base" in less than a week.
Here's exactly how I built it, step by step, with time estimates so you can plan your own build.
What We're Building
The architecture is called RAG — Retrieval-Augmented Generation. Instead of asking an AI model to know everything (it doesn't), we feed it our own documents and let it search through them to answer questions. Think of it as giving ChatGPT access to your company's brain.
The stack:
- Python 3.11+ — the backbone
- LangChain — orchestration framework for connecting LLMs to data
- ChromaDB — vector database for storing document embeddings
- OpenAI Embeddings (text-embedding-3-small) — converts text to searchable vectors
- OpenAI GPT-4o-mini — generates answers from retrieved context
- Streamlit — quick and clean web interface
By the end of this tutorial, you'll have a working system where users type a question, the system finds the most relevant document chunks, and an LLM synthesizes an answer with source citations.
Hour 0-2: Environment Setup and Dependencies
First, let's get the project scaffolded. Create a new directory and set up a virtual environment:
mkdir team-knowledge-base
cd team-knowledge-base
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install langchain langchain-openai langchain-community
pip install chromadb
pip install streamlit
pip install python-dotenv
pip install pypdf docx2txt unstructured
pip install tiktoken
Create a .env file for your API key:
OPENAI_API_KEY=sk-your-key-here
And set up the project structure:
team-knowledge-base/
├── .env
├── app.py # Streamlit frontend
├── ingest.py # Document processing pipeline
├── query_engine.py # RAG query logic
├── config.py # Shared configuration
├── documents/ # Drop your docs here
│ ├── pdfs/
│ ├── markdown/
│ └── text/
└── chroma_db/ # Vector database storage
Time check: this should take about 30 minutes if you're familiar with Python environments, up to 2 hours if you need to install Python itself and get API keys set up.
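Before moving on, it's worth confirming the environment is actually wired up. Here's a small helper script (my own addition, not part of the stack above) that checks the API key is present and the expected folders exist. Run it with the venv active and the key exported (or after `source`-ing your .env):

```python
# check_setup.py - quick sanity check before ingesting anything
import os

REQUIRED_DIRS = ["documents/pdfs", "documents/markdown", "documents/text"]

def check_setup(env=os.environ):
    """Return a list of setup problems; an empty list means you're good to go."""
    problems = []
    # OpenAI keys start with "sk-"; anything else is a paste error
    if not env.get("OPENAI_API_KEY", "").startswith("sk-"):
        problems.append("OPENAI_API_KEY missing or malformed")
    for d in REQUIRED_DIRS:
        if not os.path.isdir(d):
            problems.append(f"missing directory: {d}")
    return problems

if __name__ == "__main__":
    issues = check_setup()
    print("\n".join(issues) if issues else "Setup looks good.")
```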
Hour 2-8: Document Ingestion Pipeline
This is the most critical part of the entire system. How you chunk your documents determines how good your answers will be. I learned this the hard way — my first attempt used naive 1000-character chunks and the answers were garbage.
The Chunking Strategy That Actually Works
After testing five different chunking approaches, here's what worked best for our mixed-format documentation:
# config.py
CHUNK_SIZE = 1500 # characters per chunk
CHUNK_OVERLAP = 200 # overlap between chunks
EMBEDDING_MODEL = "text-embedding-3-small"
LLM_MODEL = "gpt-4o-mini"
COLLECTION_NAME = "team_knowledge"
CHROMA_DIR = "./chroma_db"
Why 1500 characters with 200 overlap? Through testing, I found that:
- Too small (500 chars): Answers lack context. The model gets a sentence fragment and can't synthesize a useful response.
- Too large (3000+ chars): Retrieval precision drops. When a chunk covers multiple topics, it gets pulled for queries where only part of it is relevant, adding noise to the context.
- 1500 chars: Sweet spot for typical documentation paragraphs. Most complete ideas fit within this window.
- 200 char overlap: Prevents ideas that span chunk boundaries from being lost. Critical for numbered lists and step-by-step instructions.
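If you want a rough sense of what ingestion will cost before running it, a back-of-the-envelope estimate is enough. Two assumptions here: English prose averages about 4 characters per token, and text-embedding-3-small was priced around $0.02 per million tokens when I built this (check current pricing before trusting the number):

```python
def estimate_embedding_cost(chunks, price_per_million_tokens=0.02):
    """Return (approx_tokens, approx_dollars) for a list of chunk strings.
    Uses the ~4 chars/token heuristic for English text."""
    total_tokens = sum(len(c) for c in chunks) // 4
    return total_tokens, total_tokens / 1_000_000 * price_per_million_tokens

# 2,000 chunks of 1,500 characters each
tokens, dollars = estimate_embedding_cost(["x" * 1500] * 2000)
print(f"~{tokens:,} tokens, ~${dollars:.2f}")
```

Even a full re-ingestion of a few thousand chunks costs pennies; the recurring embedding cost in the table later comes mostly from embedding user queries plus nightly re-ingestion.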
Now here's the ingestion script:
# ingest.py
import os

from dotenv import load_dotenv
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    Docx2txtLoader,
    UnstructuredMarkdownLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

from config import *

load_dotenv()


def load_documents(docs_dir="./documents"):
    """Load all supported document types from the directory tree."""
    documents = []
    loaders_map = {
        ".pdf": PyPDFLoader,
        ".txt": TextLoader,
        ".docx": Docx2txtLoader,
        ".md": UnstructuredMarkdownLoader,
    }
    for root, _dirs, files in os.walk(docs_dir):
        for file in files:
            ext = os.path.splitext(file)[1].lower()
            if ext in loaders_map:
                file_path = os.path.join(root, file)
                try:
                    loader = loaders_map[ext](file_path)
                    docs = loader.load()
                    # Tag every page/section with its source file
                    for doc in docs:
                        doc.metadata["source"] = file_path
                        doc.metadata["filename"] = file
                    documents.extend(docs)
                    print(f"  Loaded: {file} ({len(docs)} pages/sections)")
                except Exception as e:
                    print(f"  Error loading {file}: {e}")
    return documents


def chunk_documents(documents):
    """Split documents into overlapping chunks."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_documents(documents)
    print(f"  Split {len(documents)} documents into {len(chunks)} chunks")
    return chunks


def create_vector_store(chunks):
    """Embed chunks and store in ChromaDB."""
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        collection_name=COLLECTION_NAME,
        persist_directory=CHROMA_DIR,
    )
    print(f"  Vector store created with {len(chunks)} embeddings")
    return vector_store


if __name__ == "__main__":
    print("Step 1: Loading documents...")
    docs = load_documents()
    print(f"  Total documents loaded: {len(docs)}")
    print("Step 2: Chunking documents...")
    chunks = chunk_documents(docs)
    print("Step 3: Creating vector store...")
    store = create_vector_store(chunks)
    print("Done! Knowledge base is ready.")
The RecursiveCharacterTextSplitter is key here. It tries to split on paragraph breaks first (\n\n), then line breaks, then sentences, then words. This means your chunks will almost always contain complete thoughts rather than cutting off mid-sentence.
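If you're curious what "recursive" means in practice, here's a simplified pure-Python sketch of the strategy (my own illustration; it omits the chunk overlap that the real splitter adds): try the coarsest separator first, merge pieces up to the size limit, and recurse into any oversized piece with the next finer separator.

```python
def recursive_split(text, chunk_size=1500, separators=("\n\n", "\n", ". ", " ", "")):
    """Sketch of recursive splitting: coarsest separator first,
    merge pieces up to chunk_size, recurse into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut at chunk_size
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate            # still fits: keep merging
        else:
            if current:
                chunks.append(current)     # flush what we have
            if len(part) > chunk_size:
                # This piece alone is too big: recurse with finer separators
                chunks.extend(recursive_split(part, chunk_size, rest))
                current = ""
            else:
                current = part
    if current:
        chunks.append(current)
    return chunks
```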
Handling Different Document Types
A quick note on document types, because this tripped me up:
- PDFs: PyPDF works for most documents, but scanned PDFs need OCR. If you have scanned docs, pip install pytesseract and use UnstructuredPDFLoader instead.
- Google Docs: Export as .docx first. The Google Drive API can automate this, but for the MVP, manual export is fine.
- Confluence: Export spaces as HTML, then use UnstructuredHTMLLoader. It handles nested formatting surprisingly well.
- Slack threads: This is the hardest one. I wrote a small script that uses the Slack API to export bookmarked threads as text files. That's a separate tutorial, but the key is to include the thread context (channel name, participants, date) as metadata.
Hour 8-16: Query Engine
Now the fun part — making the knowledge base actually answer questions.
# query_engine.py
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

from config import *

load_dotenv()

# Custom prompt that enforces source citation
PROMPT_TEMPLATE = (
    "You are a helpful assistant for a product team.\n"
    "Use the following context to answer the question.\n"
    "If you do not know the answer based on the context, say so honestly.\n"
    "Always cite which document(s) you are drawing from.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer (cite your sources):"
)


def get_query_engine():
    """Initialize and return the RAG query engine."""
    embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)
    vector_store = Chroma(
        collection_name=COLLECTION_NAME,
        persist_directory=CHROMA_DIR,
        embedding_function=embeddings,
    )
    retriever = vector_store.as_retriever(
        search_type="mmr",  # Maximum Marginal Relevance
        search_kwargs={
            "k": 6,              # return top 6 chunks
            "fetch_k": 20,       # consider top 20 before MMR filtering
            "lambda_mult": 0.7,  # diversity vs. relevance balance
        },
    )
    llm = ChatOpenAI(model=LLM_MODEL, temperature=0.1)
    prompt = PromptTemplate(
        template=PROMPT_TEMPLATE,
        input_variables=["context", "question"],
    )
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt},
    )
    return qa_chain


_engine = None  # build the chain once and reuse it across queries


def query(question):
    """Query the knowledge base and return answer + sources."""
    global _engine
    if _engine is None:
        _engine = get_query_engine()
    result = _engine.invoke({"query": question})
    sources = set()
    for doc in result.get("source_documents", []):
        sources.add(doc.metadata.get("filename", "Unknown"))
    return {
        "answer": result["result"],
        "sources": sorted(sources),
    }
Why MMR Instead of Simple Similarity Search
I want to highlight the search_type="mmr" setting because it makes a huge difference. Standard similarity search returns the 6 most similar chunks to your query — but those 6 chunks might all be from the same document saying roughly the same thing. That's redundant and wastes your context window.
MMR (Maximum Marginal Relevance) balances relevance with diversity. It picks the first chunk based purely on similarity, then for each subsequent chunk, it penalizes chunks that are too similar to what's already been selected. The result is 6 chunks that are all relevant but cover different aspects of the topic.
The lambda_mult=0.7 controls this balance: 1.0 is pure similarity (no diversity), 0.0 is pure diversity (might not be relevant). I found 0.7 hits the sweet spot for documentation queries.
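For intuition, here's a toy pure-Python version of the MMR selection loop over small vectors (my own illustration; the real implementation operates on embedding vectors inside the vector store, but the greedy trade-off is the same idea):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mmr_select(query_vec, candidates, k=3, lambda_mult=0.7):
    """Greedy MMR: each pick trades query relevance against redundancy
    with already-selected items. Returns indices into candidates."""
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query_vec, candidates[i])
            # How similar is this candidate to anything we already picked?
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a high lambda_mult, a near-duplicate of the first pick still wins on relevance; drop lambda_mult and the same near-duplicate gets penalized and a more diverse chunk is chosen instead.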
Hour 16-24: Streamlit Frontend
Time to make this usable by people who don't have a terminal open all day.
# app.py
import time

import streamlit as st

from query_engine import query, get_query_engine

st.set_page_config(
    page_title="Team Knowledge Base",
    page_icon="🔍",
    layout="wide",
)

st.title("Team Knowledge Base")
st.markdown("Ask anything about our documentation, processes, or decisions.")

# Initialize the engine once per session
if "engine" not in st.session_state:
    with st.spinner("Loading knowledge base..."):
        st.session_state.engine = get_query_engine()

# Chat history
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat history
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])
        if msg.get("sources"):
            with st.expander("Sources"):
                for src in msg["sources"]:
                    st.markdown(f"- {src}")

# Input
if prompt := st.chat_input("What would you like to know?"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    with st.chat_message("assistant"):
        with st.spinner("Searching knowledge base..."):
            start = time.time()
            result = query(prompt)
            elapsed = time.time() - start
        st.markdown(result["answer"])
        st.caption(f"Response time: {elapsed:.1f}s")
        if result["sources"]:
            with st.expander("Sources"):
                for src in result["sources"]:
                    st.markdown(f"- {src}")

    st.session_state.messages.append({
        "role": "assistant",
        "content": result["answer"],
        "sources": result["sources"],
    })

# Sidebar
with st.sidebar:
    st.header("About")
    st.markdown(
        "This knowledge base searches through all team "
        "documentation to answer your questions."
    )
    st.markdown("---")
    st.markdown("**Tips for better results:**")
    st.markdown("- Be specific with your questions")
    st.markdown("- Ask about one topic at a time")
    st.markdown("- If the answer seems off, try rephrasing")
Run it with streamlit run app.py and you've got a working knowledge base with a chat interface.
Hour 24-36: Making It Actually Good
The basic version works, but here's what I added to make it production-worthy for a 35-person team:
1. Metadata Filtering
Not all documents are created equal. A product spec from last week is more relevant than one from two years ago. I added date-based metadata to the chunks and modified the retriever to optionally filter by date range or document category.
# In ingest.py, when loading documents (needs: from datetime import datetime):
doc.metadata["ingested_date"] = datetime.now().isoformat()
doc.metadata["category"] = determine_category(file_path)  # based on folder structure
2. Feedback Loop
I added thumbs up/thumbs down buttons to each response and logged them to a SQLite database. After a week, I had enough data to identify which documents were causing poor answers (usually because they were outdated or poorly structured) and which query patterns needed better handling.
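The logging side is just a couple of SQLite calls. A minimal sketch (the table layout is illustrative, not my exact schema):

```python
import sqlite3
from datetime import datetime

def init_feedback_db(path="feedback.db"):
    """Create the feedback table if it doesn't exist and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS feedback (
               ts TEXT, question TEXT, answer TEXT, sources TEXT, rating INTEGER
           )"""
    )
    conn.commit()
    return conn

def log_feedback(conn, question, answer, sources, rating):
    """rating: +1 for thumbs up, -1 for thumbs down."""
    conn.execute(
        "INSERT INTO feedback VALUES (?, ?, ?, ?, ?)",
        (datetime.now().isoformat(), question, answer, ", ".join(sources), rating),
    )
    conn.commit()
```

A weekly query like SELECT sources, COUNT(*) FROM feedback WHERE rating = -1 GROUP BY sources is enough to surface the documents that keep producing bad answers.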
3. Automatic Re-ingestion
I set up a cron job that runs the ingestion script nightly on a synced Google Drive folder. When someone updates a document, the knowledge base picks up the changes within 24 hours. For the shared drive, I used rclone to sync to the server before ingestion runs.
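The nightly job is a single crontab entry; the rclone remote name and paths below are illustrative, so adjust to your own setup:

```
# Sync the Drive folder, then re-ingest, every night at 2 a.m.
# ("gdrive:TeamDocs" and /opt/team-knowledge-base are placeholders)
0 2 * * * cd /opt/team-knowledge-base && rclone sync gdrive:TeamDocs ./documents && ./venv/bin/python ingest.py >> ingest.log 2>&1
```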
4. Access Control
For our team, this wasn't a big concern (everyone has access to all docs), but if you need it, Streamlit supports authentication via streamlit-authenticator or you can put it behind an OAuth proxy like oauth2-proxy.
Hour 36-48: Testing, Tuning, and Deployment
Testing Methodology
I prepared a test set of 50 questions that I knew the answers to. Then I categorized the knowledge base's responses:
- Correct and well-sourced: 38 out of 50 (76%)
- Partially correct (right idea, missing details): 7 out of 50 (14%)
- Incorrect: 2 out of 50 (4%)
- Correctly said "I don't know": 3 out of 50 (6%)
76% accuracy on the first try was good enough to ship. After tuning the chunk size, overlap, and prompt, I got it up to 84%. The remaining errors were mostly due to ambiguous questions or documents that contained contradictory information (which is a documentation problem, not an AI problem).
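My categorization was manual, but a keyword-based version is easy to automate for regression testing after each tuning change. A sketch (the evaluate helper and its all/some/none keyword criteria are my own simplification, not a rigorous eval framework):

```python
def evaluate(test_set, ask):
    """Score a knowledge base against a hand-written test set.
    test_set: list of (question, expected_keywords); ask: question -> answer string.
    All keywords present counts as correct, some as partial, none as incorrect."""
    results = {"correct": 0, "partial": 0, "incorrect": 0}
    for question, keywords in test_set:
        answer = ask(question).lower()
        hits = sum(1 for kw in keywords if kw.lower() in answer)
        if hits == len(keywords):
            results["correct"] += 1
        elif hits > 0:
            results["partial"] += 1
        else:
            results["incorrect"] += 1
    return results
```

Pass in a wrapper around the real query() as ask and re-run the same 50 questions after every chunk-size or prompt change; the score deltas matter more than the absolute numbers.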
Deployment
I deployed on a small cloud VM (4GB RAM, 2 vCPUs — about $20/month). The setup:
- Streamlit app running behind nginx reverse proxy
- SSL via Let's Encrypt
- systemd service for auto-restart
- ChromaDB stored on disk (no separate database server needed)
Cost Breakdown
Here's what this costs to run for a 35-person team:
| Item | Monthly Cost |
|---|---|
| OpenAI API — Embeddings (re-ingestion + queries) | $23 |
| OpenAI API — GPT-4o-mini (query responses) | $85 |
| Cloud VM (4GB RAM) | $20 |
| Domain + SSL | $5 |
| Google Drive storage (shared docs) | $20 |
| Total | $153 |
That's about $4.37 per person per month. For comparison, commercial knowledge base AI tools like Guru AI or Glean cost $15-25 per user per month. We're saving roughly $370-720/month compared to off-the-shelf solutions, and we have full control over our data.
Lessons Learned (The Hard Way)
Chunk size matters more than model choice. I spent hours testing different LLMs when the real problem was my chunking strategy. Bad chunks in, bad answers out. No model can fix garbage retrieval.
Metadata is not optional. When a user asks "what's our refund policy?" and the knowledge base returns the refund policy from 2023 instead of the updated 2026 version, that's worse than returning nothing. Date metadata and source attribution are essential.
People ask questions differently than you expect. I assumed people would ask things like "What is the deployment process for service X?" Instead, they asked "how do I push code?" and "where does X run?" The prompt engineering needed to handle casual, vague questions was more work than I anticipated.
Start with 20 documents, not 2,000. I initially tried to ingest everything at once. The embedding costs spiked, the quality was inconsistent, and debugging was painful. Start small, validate the answers, then scale up.
The "I don't know" response is a feature. Train the model (via the system prompt) to say "I don't have information about that in the knowledge base" rather than guessing. Users trust the system more when it's honest about its limitations.
Common Issues and Fixes
"The answers are too generic." Increase the number of retrieved chunks (the retriever's k, which is 6 in the config above; try 8), and make sure your chunks are large enough to contain complete ideas. Also check if the relevant information is actually in your document set.
"It keeps citing the wrong document." Your embeddings might be stale. Re-run the ingestion pipeline. Also check if you have duplicate documents or different versions of the same doc in the system.
"It's slow." The main bottleneck is usually the LLM call, not the retrieval. Switch to GPT-4o-mini if you haven't already (it's faster and cheaper than GPT-4o with nearly the same quality for Q&A tasks). Also cache frequently asked questions.
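A minimal FAQ cache can sit in front of query() with nothing but the standard library. This is a sketch (the normalization rules are my own choice, tune them to your team's phrasing):

```python
import functools

def normalize(question):
    """Collapse case, extra whitespace, and trailing '?' so near-identical
    questions share one cache entry."""
    return " ".join(question.lower().split()).rstrip("?")

def make_cached_query(query_fn, maxsize=256):
    """Wrap a query function with an LRU cache keyed on the normalized question."""
    @functools.lru_cache(maxsize=maxsize)
    def _cached(normalized):
        return query_fn(normalized)

    def ask(question):
        return _cached(normalize(question))

    return ask
```

Note the cached function receives the normalized question, which is fine for retrieval since embeddings are largely case-insensitive anyway; invalidate the cache after each nightly re-ingestion.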
"It hallucinates." Lower the temperature to 0.0-0.1 and add explicit instructions in the prompt: "Only answer based on the provided context. If the context doesn't contain the answer, say so." This reduces hallucination dramatically.
What's Next
Since deploying this three weeks ago, our team has asked over 1,200 questions. The most popular queries are about deployment processes, API documentation, and "who decided this and why" (which is why ingesting Slack decision threads was so valuable).
Next up, I'm planning to add: conversational memory (so follow-up questions work naturally), integration with Slack (ask questions directly in a channel), and automatic document staleness detection (flag documents that haven't been updated in 6+ months but are still being cited).
Building this took a weekend, but the time savings compound every day. If your team is spending more than 30 minutes a week searching for information, this pays for itself almost immediately.