SAP Datasphere + ChatGPT: Building a Self-Service Analytics Layer
The Problem Nobody Talks About in SAP Analytics
SAP has been promising self-service analytics for over a decade. From BW queries to Lumira to Analytics Cloud, every generation of tools has made the same promise: business users will be able to answer their own questions without waiting for IT. And every generation has fallen short for the same reason — the tools still require users to understand the data model.
A regional sales manager should not need to know that sales revenue lives in the ACDOCA fact table, with posting dates in BUDAT, company codes in RBUKRS, and profit centers in PRCTR. They want to ask "what were our top 10 customers by revenue in Q1 2026 for the EMEA region?" and get an answer, not a tutorial on how to build a query.
SAP Datasphere (the rebranded Data Warehouse Cloud) has genuinely improved the data layer. Federation, replication, and the semantic business layer give you a unified view of SAP and non-SAP data. But the last mile, letting non-technical users query that data naturally, is still missing from the standard product.
That's the gap we're going to fill. We'll build a natural language interface on top of SAP Datasphere that uses OpenAI's GPT-4 (via SAP AI Core's Generative AI Hub, or directly) to translate plain English questions into SQL queries, execute them against Datasphere, and return formatted results. The entire system took three weeks to build and has been running in production since November 2025.
Architecture Overview
Here's what we're building:
- SAP Datasphere — the data platform, containing replicated and federated data from S/4HANA, BW/4HANA, and external sources
- A Python middleware service — handles the conversation, metadata lookup, and query generation
- OpenAI GPT-4 (or GPT-4o) — translates natural language to SQL
- A simple web frontend — chat-style interface where users type questions
- SAP Analytics Cloud (optional) — for users who want to visualize the results as charts
The trick that makes this actually work (as opposed to the dozens of "chat with your database" demos that fall apart on real data) is the semantic metadata layer. We don't just throw the raw table schema at GPT-4 and hope for the best. We build a curated metadata catalog that describes each table and column in business language, includes valid value examples, specifies join relationships, and provides query templates for common questions.
Setting Up SAP Datasphere for API Access
First, let's make sure your Datasphere tenant is accessible via API. You'll need:
- An SAP Datasphere tenant with the "Consumption API" capability enabled
- A database user with read access to your analytical models
- The ODBC/JDBC connection details from Datasphere's Database Access settings
In Datasphere, go to Space Management > your space > Database Access. Create a database user if you haven't already. Note the host, port, and schema name — you'll need these to connect from Python.
```python
# Connection details from Datasphere Database Access
DSP_HOST = "your-tenant.hana.prod-eu10.hanacloud.ondemand.com"
DSP_PORT = 443
DSP_USER = "your_db_user"
DSP_PASSWORD = "your_password"  # Use environment variables in production
DSP_SCHEMA = "your_space_schema"
```
For Python connectivity, we use the hdbcli library (SAP HANA client for Python):
```python
from hdbcli import dbapi

def get_datasphere_connection():
    conn = dbapi.connect(
        address=DSP_HOST,
        port=DSP_PORT,
        user=DSP_USER,
        password=DSP_PASSWORD,
        encrypt=True,
        sslValidateCertificate=True
    )
    return conn

def execute_query(query, params=None):
    conn = get_datasphere_connection()
    cursor = conn.cursor()
    try:
        cursor.execute(query, params or [])
        columns = [desc[0] for desc in cursor.description]
        rows = cursor.fetchall()
        return columns, rows
    finally:
        cursor.close()
        conn.close()
```
Building the Semantic Metadata Layer
This is the most important part of the entire system. Without good metadata, GPT-4 will generate syntactically correct SQL that returns wrong results. And wrong results that look right are far worse than an error message.
Create a metadata catalog file (I use YAML, but JSON works too):
```yaml
# metadata/sales_analytics.yaml
tables:
  - name: V_SALES_ORDERS
    description: "Sales order data including header and item details, replicated from S/4HANA tables VBAK/VBAP"
    business_context: "Contains all sales orders. Use this for revenue analysis, order volume tracking, and customer purchasing patterns."
    columns:
      - name: VBELN
        description: "Sales order number"
        business_name: "Order Number"
        data_type: "NVARCHAR(10)"
        example_values: ["0000012345", "0000067890"]
      - name: ERDAT
        description: "Date the order was created"
        business_name: "Order Date"
        data_type: "DATE"
        notes: "Format YYYY-MM-DD. For quarterly analysis, use QUARTER(ERDAT)"
      - name: KUNNR
        description: "Customer number, foreign key to V_CUSTOMERS"
        business_name: "Customer"
        data_type: "NVARCHAR(10)"
        join_to: "V_CUSTOMERS.KUNNR"
      - name: NETWR
        description: "Net order value in document currency"
        business_name: "Order Value"
        data_type: "DECIMAL(15,2)"
        aggregation: "SUM for totals, AVG for average order value"
        notes: "Always pair with WAERK (currency) when displaying. For consolidated reporting, use NETWR_LC (local currency) or NETWR_GC (group currency)"
      - name: WAERK
        description: "Document currency code"
        business_name: "Currency"
        data_type: "NVARCHAR(5)"
        example_values: ["EUR", "USD", "GBP"]
      - name: VKORG
        description: "Sales organization"
        business_name: "Sales Org"
        data_type: "NVARCHAR(4)"
        valid_values:
          "1000": "EMEA"
          "2000": "Americas"
          "3000": "APAC"
      - name: AUART
        description: "Sales document type"
        business_name: "Order Type"
        data_type: "NVARCHAR(4)"
        valid_values:
          "TA": "Standard Order"
          "KE": "Consignment Fill-up"
          "RE": "Returns"
      - name: NETWR_GC
        description: "Net order value in group currency (EUR)"
        business_name: "Order Value (EUR)"
        data_type: "DECIMAL(15,2)"
        aggregation: "SUM for totals. Use this for cross-region comparisons"
  - name: V_CUSTOMERS
    description: "Customer master data from KNA1/KNB1"
    business_context: "Customer details including name, region, industry, and credit data"
    columns:
      - name: KUNNR
        description: "Customer number"
        business_name: "Customer Number"
        data_type: "NVARCHAR(10)"
      - name: NAME1
        description: "Customer name"
        business_name: "Customer Name"
        data_type: "NVARCHAR(35)"
      - name: LAND1
        description: "Country code"
        business_name: "Country"
        data_type: "NVARCHAR(3)"
      - name: BRSCH
        description: "Industry sector code"
        business_name: "Industry"
        data_type: "NVARCHAR(4)"
        valid_values:
          "0001": "Manufacturing"
          "0002": "Retail"
          "0003": "Services"
          "0004": "Technology"

common_queries:
  - question: "top customers by revenue"
    template: |
      SELECT c.NAME1 as customer_name, SUM(s.NETWR_GC) as total_revenue
      FROM V_SALES_ORDERS s JOIN V_CUSTOMERS c ON s.KUNNR = c.KUNNR
      WHERE s.ERDAT BETWEEN '{start_date}' AND '{end_date}'
        AND s.AUART = 'TA'
      GROUP BY c.NAME1 ORDER BY total_revenue DESC LIMIT {limit}
  - question: "revenue by region"
    template: |
      SELECT s.VKORG as sales_org, SUM(s.NETWR_GC) as revenue
      FROM V_SALES_ORDERS s
      WHERE s.ERDAT BETWEEN '{start_date}' AND '{end_date}'
        AND s.AUART NOT IN ('RE')
      GROUP BY s.VKORG ORDER BY revenue DESC
```
Notice how much context we're providing. Every column has a business name, example values, valid value mappings, and notes about how to use it correctly. The common_queries section gives GPT-4 templates for frequently asked questions, reducing the chance of errors.
The Query Generation Engine
Here's the core of the system — the module that takes a natural language question, combines it with metadata, and asks GPT-4 to generate SQL:
```python
import json

import openai
import yaml

class QueryGenerator:
    def __init__(self, metadata_path, openai_api_key):
        with open(metadata_path) as f:
            self.metadata = yaml.safe_load(f)
        self.client = openai.OpenAI(api_key=openai_api_key)
        self.conversation_history = []

    def _build_system_prompt(self):
        schema_description = []
        for table in self.metadata['tables']:
            cols = []
            for col in table['columns']:
                col_desc = f"  - {col['name']} ({col['business_name']}): {col['description']}"
                if 'valid_values' in col:
                    col_desc += f" Valid values: {json.dumps(col['valid_values'])}"
                if 'notes' in col:
                    col_desc += f" Note: {col['notes']}"
                cols.append(col_desc)
            schema_description.append(
                f"Table: {table['name']}\n"
                f"Description: {table['description']}\n"
                f"Business context: {table['business_context']}\n"
                f"Columns:\n" + "\n".join(cols)
            )
        common_q = ""
        if 'common_queries' in self.metadata:
            for q in self.metadata['common_queries']:
                common_q += f"\nExample - '{q['question']}':\n{q['template']}\n"
        return f"""You are a SQL query generator for SAP Datasphere (HANA SQL syntax).
You translate natural language questions into SQL queries.

IMPORTANT RULES:
1. Only use tables and columns from the schema below. Never invent columns.
2. Use HANA SQL syntax; prefer LIMIT over TOP for restricting rows.
3. When comparing dates, use TO_DATE('YYYY-MM-DD') format.
4. Always exclude returns (AUART = 'RE') from revenue calculations unless asked.
5. Use NETWR_GC (group currency EUR) for cross-region comparisons.
6. When asked about "Q1", "Q2" etc., map to: Q1=Jan-Mar, Q2=Apr-Jun, Q3=Jul-Sep, Q4=Oct-Dec.
7. Return ONLY the SQL query, no explanation.
8. If the question cannot be answered with available tables, say "CANNOT_ANSWER:" followed by an explanation.

SCHEMA:
{chr(10).join(schema_description)}

EXAMPLE QUERIES:
{common_q}"""

    def generate_query(self, user_question):
        if not self.conversation_history:
            self.conversation_history.append({
                "role": "system",
                "content": self._build_system_prompt()
            })
        self.conversation_history.append({
            "role": "user",
            "content": user_question
        })
        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=self.conversation_history,
            temperature=0.1,
            max_tokens=500
        )
        sql = response.choices[0].message.content.strip()
        self.conversation_history.append({
            "role": "assistant",
            "content": sql
        })
        return sql
```
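Rule 6 of the prompt pins down the quarter-to-month mapping. The same mapping is useful on the Python side when pre-filling the `{start_date}`/`{end_date}` template parameters; a minimal helper (illustrative, not part of the system above) might look like:

```python
from datetime import date, timedelta

# Q1=Jan-Mar, Q2=Apr-Jun, Q3=Jul-Sep, Q4=Oct-Dec, matching rule 6 in the prompt
QUARTER_MONTHS = {"Q1": (1, 3), "Q2": (4, 6), "Q3": (7, 9), "Q4": (10, 12)}

def quarter_range(quarter, year):
    """Return (start, end) ISO date strings for a quarter label like 'Q1'."""
    first_month, last_month = QUARTER_MONTHS[quarter.upper()]
    start = date(year, first_month, 1)
    if last_month == 12:
        end = date(year, 12, 31)
    else:
        # Day before the first day of the following month
        end = date(year, last_month + 1, 1) - timedelta(days=1)
    return start.isoformat(), end.isoformat()

print(quarter_range("Q1", 2026))  # -> ('2026-01-01', '2026-03-31')
```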
Query Validation and Safety
You absolutely cannot execute AI-generated SQL without validation. GPT-4 is good, but it occasionally generates queries that would scan entire tables, produce cartesian products, or (in adversarial cases) attempt data modification.
```python
import sqlparse

class QueryValidator:
    BLOCKED_KEYWORDS = ['INSERT', 'UPDATE', 'DELETE', 'DROP', 'ALTER',
                        'CREATE', 'TRUNCATE', 'MERGE', 'GRANT', 'REVOKE']
    MAX_RESULT_ROWS = 10000

    def __init__(self, allowed_tables):
        self.allowed_tables = {t.upper() for t in allowed_tables}

    def validate(self, sql):
        errors = []

        # Check for data modification attempts
        parsed = sqlparse.parse(sql)
        for statement in parsed:
            stmt_type = statement.get_type()
            if stmt_type and stmt_type.upper() != 'SELECT':
                errors.append(f"Only SELECT statements are allowed, got {stmt_type}")

        # Check for blocked keywords
        sql_upper = sql.upper()
        for keyword in self.BLOCKED_KEYWORDS:
            if keyword in sql_upper.split():
                errors.append(f"Blocked keyword found: {keyword}")

        # Extract table names and validate against the whitelist
        tables_used = self._extract_tables(sql)
        unauthorized = tables_used - self.allowed_tables
        if unauthorized:
            errors.append(f"Unauthorized tables: {unauthorized}")

        # Ensure a LIMIT clause exists
        if 'LIMIT' not in sql_upper:
            sql = sql.rstrip(';') + f' LIMIT {self.MAX_RESULT_ROWS}'

        return errors, sql

    def _extract_tables(self, sql):
        # Simple table extraction - works for common patterns
        tables = set()
        tokens = sql.upper().split()
        for i, token in enumerate(tokens):
            if token in ('FROM', 'JOIN'):
                if i + 1 < len(tokens):
                    table_name = tokens[i + 1].strip('(),')
                    if table_name and not table_name.startswith('('):
                        tables.add(table_name)
        return tables
```
The validator checks three things: no data modification statements, only whitelisted tables, and a mandatory row limit. The database user we created in Datasphere also has read-only access, so even if a malicious query somehow passes validation, the database itself won't allow modifications.
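To see the behavior without installing sqlparse, here's a stripped-down, dependency-free sketch of the same three checks (the production validator above additionally uses sqlparse's statement-type detection):

```python
BLOCKED = {'INSERT', 'UPDATE', 'DELETE', 'DROP', 'ALTER',
           'CREATE', 'TRUNCATE', 'MERGE', 'GRANT', 'REVOKE'}

def validate_simple(sql, allowed_tables, max_rows=10000):
    """Return (errors, possibly-rewritten sql) using plain string checks."""
    errors = []
    tokens = sql.upper().replace(';', ' ').split()

    # 1. Blocked keywords anywhere in the statement
    for kw in BLOCKED & set(tokens):
        errors.append(f"Blocked keyword found: {kw}")

    # 2. Tables after FROM/JOIN must be whitelisted
    allowed = {t.upper() for t in allowed_tables}
    for i, tok in enumerate(tokens):
        if tok in ('FROM', 'JOIN') and i + 1 < len(tokens):
            table = tokens[i + 1].strip('(),')
            if table and table not in allowed:
                errors.append(f"Unauthorized table: {table}")

    # 3. Force a row limit if the query lacks one
    if 'LIMIT' not in tokens:
        sql = sql.rstrip().rstrip(';') + f' LIMIT {max_rows}'

    return errors, sql

errs, safe_sql = validate_simple("SELECT * FROM V_SALES_ORDERS", ["V_SALES_ORDERS"])
print(errs, safe_sql)  # -> [] SELECT * FROM V_SALES_ORDERS LIMIT 10000
```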
Building the Conversation Layer
A single-query system is useful, but real self-service analytics requires conversational context. Users want to ask a question, see the results, then ask a follow-up: "now break that down by quarter" or "exclude the government sector."
```python
class AnalyticsAssistant:
    def __init__(self, metadata_path, openai_key, datasphere_config):
        self.generator = QueryGenerator(metadata_path, openai_key)
        self.validator = QueryValidator(
            allowed_tables=[t['name'] for t in self.generator.metadata['tables']]
        )
        self.dsp_config = datasphere_config
        self.query_history = []

    def ask(self, question):
        # Generate SQL from natural language
        sql = self.generator.generate_query(question)

        # Check if the model said it can't answer
        if sql.startswith('CANNOT_ANSWER:'):
            return {
                'status': 'cannot_answer',
                'message': sql.replace('CANNOT_ANSWER:', '').strip(),
                'sql': None,
                'data': None
            }

        # Validate the generated SQL
        errors, validated_sql = self.validator.validate(sql)
        if errors:
            return {
                'status': 'validation_error',
                'message': f"Query validation failed: {'; '.join(errors)}",
                'sql': sql,
                'data': None
            }

        # Execute against Datasphere
        try:
            columns, rows = execute_query(validated_sql)
            self.query_history.append({
                'question': question,
                'sql': validated_sql,
                'row_count': len(rows)
            })
            return {
                'status': 'success',
                'sql': validated_sql,
                'columns': columns,
                'data': rows,
                'row_count': len(rows)
            }
        except Exception as e:
            # If the query fails, ask GPT-4 to fix it
            fix_response = self.generator.generate_query(
                f"The previous query failed with error: {str(e)}. "
                f"Please fix the SQL query to answer the original question."
            )
            errors2, fixed_sql = self.validator.validate(fix_response)
            if not errors2:
                try:
                    columns, rows = execute_query(fixed_sql)
                    return {
                        'status': 'success',
                        'sql': fixed_sql,
                        'columns': columns,
                        'data': rows,
                        'row_count': len(rows),
                        'note': 'Query was auto-corrected after initial failure'
                    }
                except Exception:
                    pass
            return {
                'status': 'error',
                'message': f"Query execution failed: {str(e)}",
                'sql': validated_sql,
                'data': None
            }
```
Notice the auto-correction loop. When a generated query fails (usually due to a column name mismatch or syntax issue), the system sends the error back to GPT-4 and asks it to fix the query. In my experience, this auto-correction succeeds about 60% of the time, which means users see errors less frequently.
The Web Interface
The frontend is a simple Flask application with a chat-style interface. Nothing fancy — the value is in the backend, not the UI:
```python
from flask import Flask, render_template, request, jsonify, session
import uuid

app = Flask(__name__)
app.secret_key = 'your-secret-key'  # Load from an environment variable in production

assistants = {}  # session_id -> AnalyticsAssistant

@app.route('/ask', methods=['POST'])
def ask():
    session_id = session.get('id')
    if not session_id:
        session_id = str(uuid.uuid4())
        session['id'] = session_id
    if session_id not in assistants:
        assistants[session_id] = AnalyticsAssistant(
            metadata_path='metadata/sales_analytics.yaml',
            openai_key=OPENAI_API_KEY,
            datasphere_config=DSP_CONFIG
        )
    question = request.json.get('question', '')
    result = assistants[session_id].ask(question)

    # Format the response
    if result['status'] == 'success':
        # Convert to an HTML table (format_as_html_table defined elsewhere)
        html_table = format_as_html_table(result['columns'], result['data'])
        return jsonify({
            'answer': html_table,
            'sql': result['sql'],
            'row_count': result['row_count'],
            'status': 'success'
        })
    else:
        return jsonify({
            'answer': result['message'],
            'sql': result.get('sql'),
            'status': result['status']
        })
```
Real-World Accuracy and Performance
After running this system with 34 active users across sales, finance, and supply chain departments for five months, here are the actual numbers:
| Metric | Value |
|---|---|
| Total questions asked | 4,847 |
| Successful queries (correct results) | 3,974 (82%) |
| Auto-corrected queries | 387 (8%) |
| Failed queries (user notified) | 486 (10%) |
| Average response time | 3.2 seconds |
| Questions per user per week | 7.1 |
| Previous IT tickets for ad-hoc reports (monthly) | 45 |
| Current IT tickets for ad-hoc reports (monthly) | 8 |
The 82% accuracy rate might sound low, but context matters. Before this system, those 34 users had to either build their own SAC stories (which most couldn't do) or submit IT tickets and wait 3-5 business days for a response. The 10% failure rate mostly comes from questions that reference data not yet available in Datasphere (HR data, for example) or questions that are genuinely ambiguous ("how are we doing?" is not a query any system can answer).
The Types of Questions That Work Best
- Aggregation queries: "What's the total revenue by sales org in Q1 2026?" — 95% accuracy
- Ranking queries: "Top 20 customers by order count this year" — 93% accuracy
- Comparison queries: "Compare Q1 2025 vs Q1 2026 revenue by region" — 88% accuracy
- Filtering queries: "Show all orders over €100K from German customers" — 91% accuracy
- Trend queries: "Monthly revenue trend for the past 12 months" — 85% accuracy
- Complex joins: "Revenue by customer industry for APAC region" — 78% accuracy
The Types That Struggle
- Calculations involving multiple steps: "Year-over-year growth rate by product category" — 62% accuracy (requires nested queries)
- Questions with ambiguous time references: "Recent orders" — 55% accuracy (how recent is recent?)
- Questions requiring business knowledge not in metadata: "Which customers are at risk of churning?" — 40% accuracy (the model doesn't know your churn definition)
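For reference, a correct year-over-year query against the schema above needs a conditional-aggregation (or self-join) pattern. Something like this hand-written HANA SQL sketch (my illustration, not actual model output) is what the model has to reproduce in one shot:

```sql
-- YoY revenue growth by sales org, 2025 vs 2026, excluding returns
SELECT
    VKORG,
    SUM(CASE WHEN YEAR(ERDAT) = 2025 THEN NETWR_GC ELSE 0 END) AS revenue_2025,
    SUM(CASE WHEN YEAR(ERDAT) = 2026 THEN NETWR_GC ELSE 0 END) AS revenue_2026,
    ROUND(
        (SUM(CASE WHEN YEAR(ERDAT) = 2026 THEN NETWR_GC ELSE 0 END)
         - SUM(CASE WHEN YEAR(ERDAT) = 2025 THEN NETWR_GC ELSE 0 END))
        / NULLIF(SUM(CASE WHEN YEAR(ERDAT) = 2025 THEN NETWR_GC ELSE 0 END), 0) * 100,
    1) AS growth_pct
FROM V_SALES_ORDERS
WHERE AUART <> 'RE'
GROUP BY VKORG
```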
Improving Accuracy Over Time
The metadata layer is the primary lever for improving accuracy. Every time a query fails or returns incorrect results, I analyze the failure and add information to the metadata:
- Add more example queries. When users consistently ask questions in a pattern the system struggles with, I add a template to the common_queries section. This improved year-over-year comparison accuracy from 62% to 84%.
- Expand valid value mappings. Users say "Germany" but the database stores "DE". Adding country name-to-code mappings in the metadata lets GPT-4 make the translation automatically.
- Add business rule annotations. "Revenue" should always exclude returns (AUART = 'RE') and internal transfers (AUART = 'ZIV'). Documenting this in the metadata means GPT-4 applies these filters automatically.
- Create view definitions for complex patterns. When a question requires a complex multi-join query that GPT-4 struggles with, create a Datasphere view that pre-joins the tables and add that view to the metadata as a single, simple table.
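As a concrete example of point 2, the country-name mapping is just another entry in the catalog, expressed with the same `valid_values` and `notes` fields used earlier (the specific mappings shown are illustrative):

```yaml
- name: LAND1
  description: "Country code"
  business_name: "Country"
  data_type: "NVARCHAR(3)"
  valid_values:
    "DE": "Germany"
    "FR": "France"
    "US": "United States"
  notes: "Users say country names; always filter on the ISO code, e.g. LAND1 = 'DE' for Germany"
```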
Cost Analysis
Let's talk about what this costs to run:
- OpenAI API costs: Average 4,800 queries/month x ~2,000 tokens per query x $0.03/1K tokens (GPT-4) = ~$288/month. Switching to GPT-4o drops this to ~$48/month with comparable accuracy for SQL generation.
- SAP Datasphere: Already paid for as part of the organization's SAP license. The incremental cost for the API user is negligible.
- Infrastructure: A small VM for the Python middleware (~$50/month) and hosting for the web frontend (~$20/month).
- Total monthly cost: ~$120/month with GPT-4o, ~$360/month with GPT-4.
Compare this to the cost of 45 IT tickets per month for ad-hoc reports. At an internal IT rate of $120/ticket (including analyst time, queue time, and review), that's $5,400/month in IT labor. The system saves over $5,000/month while giving users instant answers instead of a 3-5 day wait.
Security Considerations
Connecting an AI system to your SAP data raises legitimate security concerns. Here's how we handle them:
- Data doesn't go to OpenAI. Only the question and the metadata schema are sent to the API. The actual query results stay within your network. GPT-4 generates SQL; it never sees the data.
- Row-level security. The Datasphere views use analytic privileges to restrict data based on the logged-in user's authorization profile. A sales manager in EMEA only sees EMEA data, even if the generated SQL doesn't filter by region.
- Query audit logging. Every question, generated SQL, and result set size is logged with the user's identity and timestamp. This creates a full audit trail.
- Input sanitization. The SQL validator prevents injection attacks, and the database user has read-only permissions. Even if someone tries to be clever, the worst outcome is a failed query.
Integration with SAP Analytics Cloud
Some users prefer visual results over tables. For these users, we added an optional integration with SAP Analytics Cloud (SAC). When a query returns time-series or categorical data, the system can automatically generate a SAC story via the SAC API:
```python
from datetime import datetime

def push_to_sac(columns, data, chart_type='bar'):
    # SACClient: a custom wrapper (not shown) around the SAC OAuth flow and REST APIs
    sac_client = SACClient(
        tenant_url=SAC_TENANT_URL,
        client_id=SAC_CLIENT_ID,
        client_secret=SAC_CLIENT_SECRET
    )
    # Upload results as a SAC dataset
    dataset_id = sac_client.create_dataset(
        name=f"NL_Query_{datetime.now().strftime('%Y%m%d_%H%M%S')}",
        columns=columns,
        data=data
    )
    # Generate a simple story with the chart
    story_url = sac_client.create_story(
        dataset_id=dataset_id,
        chart_type=chart_type,
        title="Analytics Query Result"
    )
    return story_url
```
This is the weakest part of the system, honestly. The SAC API for programmatic story creation is limited, and the results look generic. Most users end up preferring the HTML tables because they load instantly, while the SAC integration adds 8-10 seconds of latency. But for users who need to embed results in presentations, having the SAC option is valuable.
Lessons from Five Months in Production
The metadata is never done. I spend about 2 hours per week maintaining and expanding the metadata catalog. New tables get added to Datasphere, business rules change, users ask questions about data that wasn't previously mapped. This is ongoing operational work, not a one-time setup.
Users need training on asking good questions. "Show me everything" is not a useful query. We ran a 30-minute training session showing users how to ask specific, answerable questions. The failure rate dropped from 18% to 10% after training.
GPT-4 is better at HANA SQL than you'd expect. The model correctly handles HANA-specific functions like ADD_DAYS, DAYSBETWEEN, ISOWEEK, and QUARTER without special prompting. It also handles hierarchies and window functions well, though LAG/LEAD calculations sometimes produce syntax errors that the auto-correction catches.
Caching repeated questions saves money and time. About 30% of questions are variants of the same query (just with different date ranges or filters). A simple cache layer that recognizes semantic similarity reduced API costs by 25% and response time by 60% for cached queries.
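Our cache key comes from an embedding-similarity check, but the core idea can be sketched with a much simpler normalized-question key (illustrative only; `normalize` here is a crude stand-in for semantic matching, and the TTL default is an assumption):

```python
import re
import time

class QueryCache:
    """Cache results per normalized question, with a TTL so stale data expires."""
    def __init__(self, ttl_seconds=900):
        self.ttl = ttl_seconds
        self._store = {}  # normalized question -> (timestamp, value)

    @staticmethod
    def normalize(question):
        # Lowercase, collapse whitespace, strip punctuation: a crude stand-in
        # for the embedding-based similarity matching used in production.
        return re.sub(r'[^a-z0-9 ]', '', ' '.join(question.lower().split()))

    def get(self, question):
        hit = self._store.get(self.normalize(question))
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]
        return None

    def put(self, question, value):
        self._store[self.normalize(question)] = (time.time(), value)

cache = QueryCache()
cache.put("Top 10 customers by revenue?", {"sql": "SELECT 1"})
print(cache.get("top 10 customers by revenue"))  # -> {'sql': 'SELECT 1'}
```

In the assistant's `ask()` method, a cache hit short-circuits both the OpenAI call and the Datasphere round trip, which is where the cost and latency savings come from.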
Don't underestimate the political value. When a CFO can type "revenue vs budget by cost center for Q1" and get an instant answer during a board meeting, the credibility of the data team goes through the roof. Several of our users specifically mentioned this capability when advocating for the data team's budget during annual planning. The system pays for itself in organizational credibility alone.
What's Next
We're working on three improvements:
- Multi-turn query refinement with data preview. Show the first 5 rows of results and ask "is this what you expected?" before returning the full dataset. This catches wrong-column and wrong-filter issues early.
- Automated metadata generation. Using GPT-4 to read Datasphere table definitions and generate the initial metadata catalog, which is then reviewed and corrected by a human. This would cut the setup time for new tables from hours to minutes.
- Voice input via Microsoft Teams. Several users asked for the ability to ask questions via voice in Teams meetings. We're prototyping a Teams bot that accepts voice, transcribes via Whisper, generates the query, and returns results as a Teams card.
The core pattern — metadata-enriched natural language to SQL — is applicable far beyond SAP. Any structured data source with good metadata can be queried this way. But SAP is where the pattern delivers the most value because SAP data models are notoriously hard for non-technical users to navigate. When you remove that barrier, the demand for analytics doesn't just increase — it fundamentally changes who in the organization can make data-driven decisions.