How is KSQL different from SQL

Apollo Software Labs
3 min readSep 30, 2022

--

Although on the surface KSQL looks similar, there are many differences to be aware of.

To understand what is involved in joining data from multiple topics, Checkout https://medium.com/@krishna-thotakura/joining-data-in-real-time-in-kafka-using-ksql-c9e048b2d20c

KSQL is now KSQLDB. I will use KSQL hereafter because it is shorter.

KSQL Table
KSQL Table has distinct keys, but when you look at the Kafka Topic created as a result of the KSQL Table, you will see changelog where you may have multiple rows with same primary key. But when you query the KSQL Table using a SELECT statement, only the latest record with the given primary key is returned. i.e. KSQL Table is a Materialized View of the SELECT statement executed to create the KSQL Table.

Creating Table and Streams
CREATE STREAM AS SELECT and CREATE TABLE AS SELECT statements create Kafka Topics. But CREATE STREAM and CREATE TABLE statements do not create Kafka Topics.

Table-Table Join — The Right Side
The right side Table can only be joined on its primary key. Almost always, you will be joining primary key of one table with a foreign(non-primary) key on another table. Make sure you write the join such that the table with foreign key column is on right side.

Offset
By default, KSQL, just like any other Kafka client, only shows latest data from Kafka Topic. i.e. Data that your KSQL client session has not seen. If you want to see data from the beginning of the Kafka Topic:
SET ‘auto.offset.reset’ = ‘earliest’;

Push-Pull
When executing SELECT statement, note that only Push Queries(those that end with EMIT CHANGES clause) allow GROUP BY clause. But when executing a CREATE TABLE AS SELECT statement, you can have the SELECT statement with GROUP BY and OMIT the EMIT CHANGES clause.

DROP TABLE
By Default, When you delete a table using DROP TABLE statement, the underlying Kafka Topic that was created continues to exist. i.e. You are basically dropping the materialized view. As a consequence, if you DROP TABLE and re-execute CREATE TABLE AS SELECT with same Table and Topic name, you will have some data based on older logic and some data based on newer logic.

Aggregation Functions
Functions such as LATEST_BY_OFFSET cannot be applied to KSQL Tables. They can only be applied to KSQL Streams.

If you are performing a GROUP BY on a non-key column in a Table, you can apply function such as COLLECT_LIST but not COLLECT_SET or LATEST_BY_OFFSET.

Although COUNT and SUM functions are supported in KSQL Tables, MIN, MAX functions cannot also be applied to KSQL Tables due to inefficient re-compute costs. However, You can workaround by ARRAY_MIN(COLLECT_LIST(value))

Cannot Join
Stream-Table joins are supported. Table-Stream joins are not.
Table-Table joins are not supported as part of n-way joins. i.e. You can join 2 tables at a time, but not 3 at a time. You would have to perform this over 2 operations joining the 1st two tables and then joining the 3rd table.

GROUP BY
You cannot create a KSQL Stream with GROUP BY. However, you can create a KSQL Table with GROUP BY expression and this automatically becomes the table’s primary key.

Joining Streams
Stream-Stream joins must have a WITHIN clause. Only when events occur on both sides within specified window, the join will be applied.

Rowkey
When you create a stream out a topic, each record in the stream will have a ROWKEY and few other attributes such as original offset of the record in the topic.

Stream KEY
When creating a Stream, if you dont specify KEY, the stream column is extracted from the Topic Message Content. When KEY is specified, the column is extracted from the Topic Message Key.

Unwrap
When dealing with AVRO keys, especially when the source of your data is Debezium, make sure you unwrap the data to allow for joins. Use STRUCT to read the wrapped value and derefernce using -> operator.

Join Data
To join data from 4 topics, you would expect SQL that would look like

select u.usr_id, u.email, uf.fnctn_cd, l.org_nm
from usr u
inner join usr_fnctn uf on u.usr_id = uf.usr_id
inner join fnctn_lctn ufl on ufl.fnctn_id = uf.fnctn_id
inner join org.org l on ufl.loctn_id = l.loctn_id;

But checkout the KSQL implementation of the same — https://medium.com/@krishna-thotakura/joining-data-in-real-time-in-kafka-using-ksql-c9e048b2d20c

--

--

Apollo Software Labs
Apollo Software Labs

Written by Apollo Software Labs

Hands-on Solution Architect passionate about building secure, scalable, and high performance solutions in the cloud.

No responses yet