Welcome to Chango Cloud.
Chango is a SQL Data Lakehouse Platform based on Trino, a popular open source query engine. Anyone who has used Trino before can pick up Chango easily.
Let’s get started with Chango.
Chango Architecture
...
Chango has a common architecture, like any other data platform based on the lambda architecture. Data flows from left to right in Chango.
External CSV, JSON, and Excel data can be inserted into Chango through the Chango Data API.
External data sources such as MongoDB, Hive tables, and Redshift can be queried, joined, and inserted into Chango.
External streaming applications can insert large volumes of incoming events into Chango.
All data is saved as Iceberg tables in Chango.
Trino clients such as Redash, Metabase, and the Python Trino client can access the Iceberg tables in Chango through Trino Gateway, which routes queries to the upstream backend Trino clusters.
Understand Trino Gateway Concept
Chango introduces the concept of the Trino Gateway, which dynamically routes Trino queries to the upstream backend Trino clusters. If one of the backend Trino clusters is exhausted, Trino Gateway routes queries to a Trino cluster that is currently executing fewer queries.
Trino itself does not support high availability, because the Trino coordinator is a single point of failure. To make Trino highly available, Trino Gateway is needed.
Let's say there is only one large Trino cluster (100 to 1000 workers) in the company. Many people, such as BI experts, data scientists, and data engineers, run Trino queries intensively on this large cluster. These queries can be interactive or ETL. For example, when long-running ETL queries occupy most of the resources the cluster needs to execute other queries, there is little left for interactive queries, and the people running the interactive queries have to wait until the long-running ETL queries finish. The same conflict can also happen in the reverse case.
Such a monolithic approach with one large Trino cluster can become problematic. It is better to split a large Trino cluster into smaller Trino clusters for individual groups, such as the BI team, the data science team, and the data engineering team.
Let's say the BI team has 3 backend Trino clusters. If one of the clusters needs to be scaled out, needs new Trino catalogs for new external data sources, or needs to be reinstalled with a new Trino version, first deactivate that cluster so that no queries are routed to it; query execution continues on the other clusters without downtime. After scaling the workers, updating the catalogs, or reinstalling the cluster, activate the cluster again so that queries are routed to it once more. With Trino Gateway, activating and deactivating backend Trino clusters can be done with ease.
Chango supports all the functionalities of trino gateway mentioned above.
Initialize Cluster
The first step in using Chango is to create the initial cluster. Go to Clusters → Create Initial Cluster.
Trino Cluster Group Name is the name of the logical cluster group, which will be created for a specific team or organization.
Trino Cluster Name is the name of the Trino cluster that belongs to the cluster group named by Trino Cluster Group Name. A Trino cluster with this name will be created.
Trino Node Memory in GB is the reference value used to calculate the memory of the Kubernetes node on which Trino will be installed. For example, if Trino Node Memory in GB is 16, the actual memory of the provisioned Kubernetes node will be 16/0.8, that is, 20 GB. The memory-related Trino configurations are also calculated and set from this reference value.
Trino Node CPU is the CPU count of the Kubernetes node on which Trino will be installed.
Trino Worker is the number of Trino workers.
Trino Version is the version of Trino that will be installed. Currently only versions 403 and 406 appear in the select box, but other versions will also be supported by Chango (there is no reason not to support them, because Chango uses the released official Trino Docker image). That is, you can reinstall the Trino cluster with another version whenever you want.
After clicking the Create Cluster button, a Kubernetes cluster is provisioned, and the initial Trino cluster is installed on it along with additional Chango components and Trino Gateway. This takes a while, about an hour.
Administer Cluster Group
Overview of Cluster Group
After the cluster has been initialized successfully, go to Clusters → Cluster Group to administer all the Trino clusters and Trino Gateway.
Note that a cluster group can have many Trino Gateway users and many Trino clusters. All the Trino Gateway users in a group can connect to Trino Gateway and send queries, and Trino Gateway routes those queries to the backend Trino clusters that belong to this cluster group.
After the initial cluster is installed, the Trino Gateway user trino is created automatically with the default password trino123. You need to change this default password to another one by clicking the Update Password button.
In the Trino Clusters section, the details of the installed Trino clusters are shown.
A green circle next to Trino Gateway Routable means that Trino Gateway will route queries to this Trino cluster. Worker is the number of Trino workers, which can be scaled by clicking the Scale button.
Pods are listed with their Host IP, the private address of the Kubernetes node on which each Trino pod is running. As mentioned before, Node Memory is calculated from the reference value, that is, the Trino Node Memory in GB entered when the cluster initialization was submitted.
Scale Trino Workers
To scale Trino workers out or in, adjust the worker count and click the Scale button.
Let’s scale out trino workers from 3 to 5. It looks like this.
...
This provisions 2 new Kubernetes nodes, and the Trino workers are then deployed on those new nodes. When scaling in, the Trino worker replicas are decreased first, and then the Kubernetes nodes on which no Trino workers are running are removed.
After the workers have been scaled out successfully, it looks like the following picture.
...
The picture below shows the process of scaling the workers in from 5 to 1.
...
After the workers have been scaled in from 5 to 1, it looks like this.
...
Update Memory Properties
Memory properties in the Trino cluster are configured automatically by Chango, but you can adjust them to suit your needs. To update memory properties such as max memory per node, max memory, and max total memory, click the Update button in the Memory Properties in GB section of the Trino cluster.
Chango currently supports updating the following memory properties.
query.max-memory-per-node
memory.heap-headroom-per-node
query.max-memory
query.max-total-memory
Take a look at the value 12 for Max Heap Size in the picture above. The Max Heap Size with which the Trino pods run is determined by Chango. The sum of query.max-memory-per-node and memory.heap-headroom-per-node must be less than Max Heap Size. For example, with a Max Heap Size of 12 GB, values such as 8 GB for query.max-memory-per-node and 3 GB for memory.heap-headroom-per-node satisfy this rule, since 8 + 3 is less than 12.
Create New Trino Cluster
Let's create a new Trino cluster in the cluster group bi for our example. Go to Clusters → Create New Cluster.
...
Set Trino Cluster Name to etl and select the cluster group bi. Submit by clicking the Create Cluster button.
You can see the progress message in the Trino Clusters section of the Cluster Group page.
...
It takes about 15 minutes to create a new Trino cluster. Chango provisions new Kubernetes nodes, 2 in this case, on which the new Trino cluster will be installed.
The list of Trino clusters is shown after the new cluster has been created successfully. The cluster etl has been created as follows.
...
Activate and Deactivate Trino Clusters in Trino Gateway
To deactivate a Trino cluster, click the Deactivate button next to the green circle in Trino Gateway Routable; Trino Gateway will then stop routing queries to this Trino cluster.
The green circle changes to a yellow circle in the Trino Gateway Routable label.
...
To activate the Trino cluster again, click the Activate button next to the yellow circle; Trino Gateway will then route queries to this Trino cluster again.
...
You now have 2 Trino clusters to which Trino Gateway can route queries.
If you want to scale out the workers in one of the clusters, first deactivate that cluster. Even while one cluster is deactivated in Trino Gateway, the other activated Trino cluster keeps handling the queries routed by Trino Gateway. After scaling out the workers of the deactivated cluster, activate it again; all clusters then execute the queries routed by Trino Gateway without downtime.
Update User Password of Trino Gateway
After the cluster has been initialized, the default user trino with the default password trino123 is created automatically by Chango. You need to update the password of the user trino.
Create New Cluster Group
You can create additional cluster groups for other teams, for example data-scientist. After creating a cluster group, new Trino Gateway users and Trino clusters that belong to this cluster group can be created. As mentioned above, the newly created Trino Gateway users can connect to Trino Gateway and run queries, and Trino Gateway routes those queries only to the newly created Trino clusters.
...
This separation into cluster groups, each with its own Trino clusters to which Trino Gateway routes queries, avoids the conflict problems of a single monolithic large Trino cluster.
Destroy Trino Cluster
You can destroy a Trino cluster whenever you want by clicking the Destroy Trino Cluster link. After a Trino cluster has been destroyed, it is deactivated and deregistered from Trino Gateway automatically.
Chango Services
After the cluster has been initialized successfully, several Chango components are installed that can be accessed publicly. You can see the Chango service URLs in Services → Services.
Admin URL is the Chango admin server, to which a user needs to log in in order to get an access token.
Data API URL for JSON is the Chango Data API server for JSON, used to ingest JSON data into Chango.
Data API URL for Excel is the Chango Data API server for Excel, used to ingest Excel data into Chango.
Trino Gateway URL is the endpoint of Trino Gateway, to which Trino clients such as Redash, Trino CLI, and Metabase can connect to run queries.
Redash is the URL of Redash, a BI tool installed by Chango.
Metabase is the URL of Metabase, a BI tool installed by Chango.
BI Tools
Chango provides popular open source BI tools like Redash and Metabase.
Redash
To connect to Trino in Redash, a Trino data source needs to be added like this. In the Host field, enter the host name of the Trino Gateway URL.
The Trino Gateway URL convention is:
Code Block |
---|
https://chango-trino-gateway-oci-<user>.cloudchef-labs.com |
<user> is the Chango user name.
...
To run queries in Redash, let's create an Iceberg schema and table.
Code Block |
---|
-- create schema.
CREATE SCHEMA IF NOT EXISTS iceberg.iceberg_db |
Code Block |
---|
-- ctas.
CREATE TABLE IF NOT EXISTS iceberg.iceberg_db.test_ctas
AS
SELECT
*
FROM tpch.sf1000.lineitem limit 100000 |
With this CTAS query, a new Iceberg table test_ctas has been created, and the data from the table tpch.sf1000.lineitem has been inserted into it.
Run a select query on the Iceberg table.
Code Block |
---|
select * from iceberg.iceberg_db.test_ctas limit 1000 |
The following picture shows the result of the query run by Redash.
...
Metabase
As in the Redash configuration, in Metabase a Starburst database needs to be added to connect to Trino Gateway.
...
Let's run a query on the Iceberg table. The following picture shows the result of the query run by Metabase.
...
Trino Catalogs
Trino supports many connectors for accessing external data sources. By adding Trino catalogs, external data sources can be accessed by Trino and external data can be ingested into Chango.
Create / Update / Delete Catalogs
Go to Clusters → Trino Catalogs to create, update, and delete Trino catalogs in a Trino cluster.
The following shows the reserved names of the catalogs already created by Chango. When creating catalogs, you need to use names other than these reserved catalog names.
...
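If you want to double-check which catalog names are already taken on a cluster, one option (standard Trino, not specific to Chango) is to run the statement below from any client connected to Trino Gateway; it lists every catalog registered in the cluster, including the reserved ones created by Chango.
Code Block |
---|
-- list all catalogs registered in the trino cluster,
-- including the reserved catalogs created by chango.
SHOW CATALOGS |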
Let's create a new Trino catalog with the name mysql.
...
After clicking the Create Catalog button, the following message is shown.
...
After the catalog mysql has been created successfully, it is listed with its property values.
...
You can update the catalog by clicking the Update Catalog button.
...
Note that we now have 2 Trino clusters to which Trino Gateway routes queries. In order to use the mysql catalog, you need to create the same catalog in the other Trino cluster as well. Let's create the catalog mysql in the other cluster.
After creating the catalog mysql for the other cluster, you will see the catalog list in the Trino Clusters section.
...
Now you are ready to use the mysql catalog through Trino Gateway; a quick example query is sketched below. If you plan to remove or update catalogs in the Trino clusters, use the activation / deactivation functionality of Trino Gateway to avoid query execution failures in the Trino clusters.
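As a hedged example, the query below reads a table through the newly created mysql catalog; the schema and table names are placeholders and must be replaced with ones that actually exist in your MySQL database.
Code Block |
---|
-- hypothetical example: query a table exposed through the newly created
-- mysql catalog via trino gateway. Replace the schema and table names with your own.
SELECT * FROM mysql.some_db.some_table LIMIT 10 |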
To remove catalogs from a Trino cluster, just click the Remove button.
Open Table Format Catalog
Trino supports several open table formats, namely Iceberg, Delta Lake, and Hudi.
Because the Iceberg catalog iceberg has already been created by Chango when the Trino cluster was created, you do not have to create an Iceberg catalog in Chango. However, if you have an external Hive metastore from which the table metadata for external Iceberg tables needs to be retrieved, another Iceberg catalog that connects to this external Hive metastore can be added to Chango. In this case, you have to choose a name other than iceberg for your catalog, because the catalog name iceberg already exists in the Trino clusters in Chango.
Because Iceberg is the most popular table format, the default table format in Chango is Iceberg, and when you insert data into Chango, all of it is saved as Iceberg tables.
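For illustration, here is a minimal sketch of creating an Iceberg table directly in the default iceberg catalog; the schema iceberg_db was created in the Redash example above, and the table and column names are placeholders only.
Code Block |
---|
-- hypothetical example: explicitly create an iceberg table in the default
-- iceberg catalog. Table and column definitions are placeholders.
CREATE TABLE IF NOT EXISTS iceberg.iceberg_db.events (
  event_id bigint,
  event_time timestamp(6),
  payload varchar
) |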
Scale Data Ingestion
An ingestion group is a set of data ingestion components provided by Chango, such as the Chango Data API, a Kafka cluster, and a Spark streaming job, which collect incoming external data like CSV, JSON, and Excel and save it to Chango.
As your incoming data volume increases, you may consider scaling out ingestion groups to handle more incoming data.
To scale data ingestion, go to Data Ingestion → Scale Ingestion Group in the menu.
Create Ingestion Groups
For example, creating 2 ingestion groups with a node count of 3 for each group means that each ingestion group is deployed on 3 newly created nodes, so the 2 ingestion groups are deployed on 6 newly created nodes in total.
To create ingestion groups, click Create Ingestion Groups.
...
As seen in the picture above, 3 nodes with a capacity of 4 CPUs and 18 GB of memory each are created for every ingestion group. The data ingestion components for an ingestion group, such as the Chango Data API, Kafka, and the Spark streaming job, are deployed on those 3 newly created nodes. Because Group Count is 1, just one ingestion group is installed. If Group Count were 2, two sets of data ingestion components would be deployed on 6 newly created nodes.
After the ingestion groups have been created, you will see the installed ingestion groups.
...
The details of the installed data ingestion components for every ingestion group, such as the Chango Data API, NGINX, the Kafka cluster, and the Spark streaming job, are also listed.
...
Scale Ingestion Groups
If you want to increase or decrease the number of ingestion groups, adjust the value of Group Count and click the Scale button.
...
Delete Ingestion Groups
You can also delete the ingestion groups you have created. Click Delete Ingestion Groups.
...
Even if you have deleted ingestion groups, the default ingestion group will remain to keep handling incoming data.
Data Ingestion in Chango
There are three ways to insert external data into Chango.
Insert local CSV, JSON, and Excel data, as well as CSV and JSON data located on S3, into Chango using the Chango client.
Insert streaming JSON data into Chango using the Chango client library.
Insert data from external data sources into Chango using Trino catalogs.
Info |
---|
Before you go ahead, you have to Create Ingestion Groups. |
Insert Local and S3 Data to Chango using Chango Client
Chango provides the Chango client to insert local CSV, JSON, and Excel data into Chango. The data is saved as Iceberg tables in Chango and can be explored by Trino clients such as Redash and Metabase provided by Chango.
The Chango client is a CLI written in Java, so Java 11 needs to be installed beforehand to use it.
Please see the details about ingesting CSV, JSON, and Excel into Chango using the Chango client here: Chango CLI
Upload Excel to Chango
Instead of using the Chango CLI to insert Excel data into Chango, you can also use the Upload Excel page in the Data Ingestion menu, as shown in the picture below.
...
After uploading an Excel file, you can run queries against the uploaded Excel data in Chango using the BI tools Redash and Metabase provided by Chango, or any other tool that can connect to the Chango Trino Gateway. For example, run the following query to explore the uploaded Excel data.
Code Block |
---|
select * from iceberg.excel_db.tbl_excel |
Insert Streaming Data to Chango using Chango Client Library
External applications such as streaming applications can send streaming JSON data to Chango using the Chango client library, written in Java. An application just needs to call the add(json) method of a ChangoClient instance; the ChangoClient instance queues the incoming JSON internally and sends the queued JSON to Chango in batches, for instance 100 JSON rows at once in gzip format.
Please see the details of how to use the Chango client library in streaming applications here: Chango Client API Library
Insert External Data Source to Chango using Trino Catalogs
Trino supports many connectors for joining external data sources. To use them, Trino catalogs need to be added to Trino. After adding Trino catalogs for external data sources, you can query those sources. To ingest data from external data sources into Chango, you can use queries such as CTAS or INSERT INTO [TABLE] SELECT, as sketched below. There are many resources out there on how to write Trino queries.
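As a minimal sketch, assuming the mysql catalog created earlier and a hypothetical source table mysql.shop_db.orders (replace the schema and table names with ones that exist in your source database), a CTAS query can copy external data into an Iceberg table in Chango like this:
Code Block |
---|
-- hypothetical example: ingest rows from an external mysql table into a
-- new iceberg table in chango. Adjust catalog, schema and table names.
CREATE TABLE IF NOT EXISTS iceberg.iceberg_db.orders_from_mysql
AS
SELECT
*
FROM mysql.shop_db.orders |
If the target Iceberg table already exists, an INSERT INTO iceberg.iceberg_db.orders_from_mysql SELECT ... query works in the same way.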
Monitoring
Monitor Cluster
You can monitor the resources of all Chango components with Grafana, which is provided by Chango. Go to Monitoring → Monitor Cluster.
...
Billing
Usage Cost
To see the Chango usage cost, go to Billing → Usage Cost.
Here you can see the current Chango usage cost. To see previous usage history, click the input box and select the desired month.
...
You can estimate monthly settlement easily.
Settings
Delete Chango Cluster
If you want to delete your Chango cluster, go to Settings → Delete Chango Cluster.
...
Note that all Chango resources such as the Trino clusters, Trino Gateway, operators, and the Chango Data API will be deleted. You can initialize a Chango cluster again whenever you want later; to do so, go to Clusters → Create Initial Cluster.
The default value of the Delete All Lakehouse Data option is No, which means that even if you delete the Chango cluster, your lakehouse data in Chango will not be deleted. The next time you initialize a Chango cluster, you can explore your lakehouse data in Chango again. However, the metadata of Trino Gateway users and Trino cluster groups will be lost, because that metadata depends on the current Trino clusters in the Chango cluster.
If you check the Delete All Lakehouse Data option as Yes, all your lakehouse data in Chango will be lost. It is recommended to back up your lakehouse data to another place beforehand.
Info |
---|
Chango Cloud Document Site was moved to https://cloudcheflabs.github.io/chango-cloud-docs . |