AIR典型使用场景
创建一个普通的逻辑视图
第一步 :进入“视图”界面,如果首次进入,可能没有任何已有的数据空间,这时可以选择“新建数据空间”
输入“demo”,新建一个名为demo的数据空间,数据空间的名称必须输入英文名,便于编写SQL。
第二步: 如果是新建了一个数据空间,接下来还需要在该数据空间下再新建一个文件夹,点击“新建文件夹”,输入“demo”的文件夹,到此,创建视图的前期准备工作就完成了。到此我们已经创建了一个名为“demo”的数据空间和在该空间下建了一个叫“demo”的文件夹。
第三步: 创建一个视图之前,需先选择一个文件夹,点击“新建视图”
第四步: 输入以下SQL,可以点击查询进行数据查询预览。需要注意以下事项:
-
名称:视图名称必须在文件夹下唯一
-
查询:可试运行查看执行结果
-- 注意对应的数据源名和表名根据实际情况进行修改。如:gauss.tpcds
SELECT item.i_brand,ss_ext_sales_price,store_sales.ss_item_sk,item.i_item_sk,item.i_manufact_id
FROM gauss.tpcds.store_sales, gauss.tpcds.item
WHERE store_sales.ss_item_sk = item.i_item_sk
第五步: 视图创建成功后,即可在工作簿中查询该视图了。如下:

创建一个明细投影加速
第一步: 选择某个视图,或者是某个基础视图
第二步: 在视图详情页面,进一步选择“投影”标签页。
第三步: 创建一个明细投影 出现以下状态,代表RP已经构建成功了。
第四步: 查看加速命中效果,输入如下SQL(可以看到sql1和sql2都可以命中加速):
--sql1
select * from demo.demo.store_sales_join_item limit 10;
--sql2
SELECT item.i_brand,ss_ext_sales_price,store_sales.ss_item_sk,item.i_item_sk,item.i_manufact_id
FROM gauss.tpcds.store_sales, gauss.tpcds.item
WHERE store_sales.ss_item_sk = item.i_item_sk
AND item.i_manufact_id >= 128
limit 1000
备注:在AIR中的RP命中不局限用户是否SQL直接查询RP对应的View,即便是其它用户写的SQL,如果引擎推导出来该SQL查询所查的数据范围能够从RP中获取,那么也会命中该RP。这种命中能力,可以充分利用咱们大数据业务当中的热点数据特点,当关键的热点数据加速后,其它用户创建的View或者提交的查询SQL,都可以充分利用该RP的加速价值。
创建一个聚合投影加速
聚合投影加速典型场景 :AGG RP技术在典型应用场景中,主要用于提升对特定事实表进行多维度、不同粒度的数据统计分析的效率。该技术针对大规模数据集,能够预先加速好的轻度汇总聚合计算(维度压缩比越高,加速效果越好),让用户对于不同维度进行统计分析的场景进行高效率的查询加速。例如,我们以TPCDS中的store_sales和item为例,如下图,销售表(store_sales)和产品表(item)存在主外键关系,且不为空。
第一步: 我们先选中一张store_sales的表,进入表详情页面,点击“投影”标签页,点击“新建投影”
第二步: 选择“聚合投影”,填入聚合投影的名称(我们一般聚合投影都以agg_加上表名作为聚合投影的名称),勾选ss_item_sk字段作为维度字段,如下图 勾选ss_wholesale_cost, ss_list_price, ss_sales_price作为度量字段,并勾选度量的统计算子,我们这里选择sum、max、min等统计算子,最后一切准备完毕点击“创建”按钮。如下图
第三步: 等待投影构建完成,出现如下界面。
第四步: 验证聚合投影的使用场景,我们基于store_sales(销售记录)来join item(产品)表,按产品生产厂家,和按产品品牌统计总销售金额。生成以下SQL(内部表名,schema名请根据实际情况稍作修改)
--SQL1 按i_brand_id,i_manufact_id统计
select i_brand_id,i_manufact_id, sum(ss_list_price), sum(ss_list_price)
from gauss_retail.tpcds_100w.store_sales ss, gauss_retail.tpcds_100w.item item
where ss.ss_item_sk=item.i_item_sk
group by i_brand_id,i_manufact_id
limit 100
--SQL2 按item表的任意字段作为group by维度统计
select i_manager_id,i_category, sum(ss_wholesale_cost)
from gauss_retail.tpcds_100w.store_sales ss, gauss_retail.tpcds_100w.item item
where ss.ss_item_sk=item.i_item_sk
group by i_manager_id,i_category
limit 100
_
SQL1查询结果如下,可以看到,当前查询会命中之前的轻力度汇总表RP数据。
SQL2查询结果如下,可以看到,SQL2仍然会命中之前的轻力度汇总表RP数据。
多层嵌套视图
在传统数仓场景中,进行数据整合一般会进行分层的数据资产建设,例如贴源层,dwd、dws、adm、ads等不同的分层资产建设,在逻辑化数据平台中,我们也可以通过视图的方式来创建多层的数据资产。下面就以一个具体的例子,可以跟随下面的步骤来创建一个分层的资产目录结构。
创建DWD层明细表
一般情况下,明细层视图主要用于将底层不同源的同一类数据进行平行的整合,也包括部分原始数据的字段补齐等操作。例如,我们在下面的例子里面就会先创建一个销售相关的DWD表,将不同渠道的销售记录汇总成一张总的销售明细表。新建dwd_sales表,SQL代码如下:
SELECT
"cs_sold_date_sk" AS "sold_date_sk",
"cs_sold_time_sk" AS "sold_time_sk",
"cs_ship_date_sk" AS "ship_date_sk",
"cs_item_sk" AS "item_sk",
"cs_ship_addr_sk" AS "ship_addr_sk",
"cs_ship_mode_sk" AS "ship_mode_sk",
"cs_warehouse_sk" AS "warehouse_sk",
"cs_promo_sk" AS "promo_sk",
"cs_bill_addr_sk" AS "bill_addr_sk",
"cs_quantity" AS "quantity",
"cs_wholesale_cost" AS "wholesale_cost",
"cs_list_price" AS "list_price",
"cs_sales_price" AS "sales_price",
"cs_ext_discount_amt" AS "ext_discount_amt",
"cs_ext_sales_price" AS "ext_sales_price",
"cs_ext_wholesale_cost" AS "ext_wholesale_cost",
"cs_ext_list_price" AS "ext_list_price",
"cs_ext_tax" AS "ext_tax",
"cs_coupon_amt" AS "coupon_amt",
"cs_ext_ship_cost" AS "ext_ship_cost"
FROM
"hive_retail"."tpcds"."catalog_sales"
UNION ALL
SELECT
"ss_sold_date_sk" AS "sold_date_sk",
"ss_sold_time_sk" AS "sold_time_sk",
0 AS "ship_date_sk",
"ss_item_sk" AS "item_sk",
0 AS "ship_addr_sk",
0 AS "ship_mode_sk",
0 AS "warehouse_sk",
"ss_promo_sk" AS "promo_sk",
0 AS "bill_addr_sk",
"ss_quantity" AS "quantity",
"ss_wholesale_cost" AS "wholesale_cost",
"ss_list_price" AS "list_price",
"ss_sales_price" AS "sales_price",
"ss_ext_discount_amt" AS "ext_discount_amt",
"ss_ext_sales_price" AS "ext_sales_price",
"ss_ext_wholesale_cost" AS "ext_wholesale_cost",
"ss_ext_list_price" AS "ext_list_price",
"ss_ext_tax" AS "ext_tax",
"ss_coupon_amt" AS "coupon_amt",
0 AS "ext_ship_cost"
FROM
"hive_retail"."tpcds"."store_sales"
UNION ALL
SELECT
"ws_sold_date_sk" AS "sold_date_sk",
"ws_sold_time_sk" AS "sold_time_sk",
"ws_ship_date_sk" AS "ship_date_sk",
"ws_item_sk" AS "item_sk",
"ws_ship_addr_sk" AS "ship_addr_sk",
"ws_ship_mode_sk" AS "ship_mode_sk",
"ws_warehouse_sk" AS "warehouse_sk",
"ws_promo_sk" AS "promo_sk",
"ws_bill_addr_sk" AS "bill_addr_sk",
"ws_quantity" AS "quantity",
"ws_wholesale_cost" AS "wholesale_cost",
"ws_list_price" AS "list_price",
"ws_sales_price" AS "sales_price",
"ws_ext_discount_amt" AS "ext_discount_amt",
"ws_ext_sales_price" AS "ext_sales_price",
"ws_ext_wholesale_cost" AS "ext_wholesale_cost",
"ws_ext_list_price" AS "ext_list_price",
"ws_ext_tax" AS "ext_tax",
"ws_coupon_amt" AS "coupon_amt",
"ws_ext_ship_cost" AS "ext_ship_cost"
FROM
"hive_retail"."tpcds"."web_sales"
新建dwd_sales_with_ship_address 视图,join 发货地址表,进行逻辑打宽。
SELECT
"ds".*,
CASE
WHEN "ca_country" IS NULL THEN 'China'
ELSE "ca_country"
END AS "country",
"ca"."ca_state" AS "state",
"ca"."ca_city" AS "city"
FROM
"demo"."demo2"."dwd_sales" AS "ds",
"hive_retail"."tpcds_100g_parquet"."customer_address" AS "ca"
WHERE
"ds"."ship_addr_sk" = "ca"."ca_address_sk"
创建DWS层细粒度销售统计表
新建dws_sales_sum_by_weeks表,按周统计每周7天,按周从周一到周日的每天打平成一列的销售金额汇总。
SELECT
"dwd_sales_with_ship_address"."country",
"dwd_sales_with_ship_address"."state",
"dwd_sales_with_ship_address"."city",
"d_week_seq",
SUM(
CASE
WHEN "d_day_name" = 'Sunday' THEN "sales_price"
ELSE NULL
END
) AS "sun_sales",
SUM(
CASE
WHEN "d_day_name" = 'Monday' THEN "sales_price"
ELSE NULL
END
) AS "mon_sales",
SUM(
CASE
WHEN "d_day_name" = 'Tuesday' THEN "sales_price"
ELSE NULL
END
) AS "tue_sales",
SUM(
CASE
WHEN "d_day_name" = 'Wednesday' THEN "sales_price"
ELSE NULL
END
) AS "wed_sales",
SUM(
CASE
WHEN "d_day_name" = 'Thursday' THEN "sales_price"
ELSE NULL
END
) AS "thu_sales",
SUM(
CASE
WHEN "d_day_name" = 'Friday' THEN "sales_price"
ELSE NULL
END
) AS "fri_sales",
SUM(
CASE
WHEN "d_day_name" = 'Saturday' THEN "sales_price"
ELSE NULL
END
) AS "sat_sales"
FROM
"demo"."demo2"."dwd_sales_with_ship_address",
"hive_retail"."tpcds"."date_dim"
WHERE
"d_date_sk" = "sold_date_sk"
GROUP BY
"d_week_seq",
"country",
"state",
"city"
查看视图血缘关系
当我们按照前面的步骤依次创建了3层视图嵌套之后,我们打开dws_sales_sum_by_weeks视图,查看血缘关系,我们可以看到如下视图的血缘依赖关系。

参数化视图场景
主要应用场景
- 实现传统数仓里的数据历史快照表功能,使用 AIR 提供的参数化视图功能,可以构建一份快照表。
- 某些场景下的逻辑无法通过一个视图的逻辑来实现,或者实现起来会比较麻烦。例如:涉及到大表跨天聚合统计就 难以通过view进行定义 且 无法按分区增量构建 。典型案例,最近7天交易额计算,且按天滚动累计,每天都是一份最近七天累计的交易额数据,然后看整体的7天累计交易变化趋势;其它还有很多,类似每天统计同环比,且每天都有一份全量的同环比数据等等场景:
select merchant_id, sum(amount) as sum_amount_7d
from dw.fct_order_di
where dt> '${yyyymmdd-7}' and dt<= '${yyyymmdd}'
group by merchant_id;
跨源生成历史快照数据场景
场景假设: 假设要抽取的dwd_sales_snapshot表的数据来源于3个不同的源上的3张表,需要定期抽取一份全量的快照明细。该表新增dt字段作为快照表的分区字段。创建视图的原始SQL如下,具体场景可以根据实际的数据源和schema名称做相应修改:
SELECT
"cs_sold_date_sk" AS "sold_date_sk",
"cs_sold_time_sk" AS "sold_time_sk",
"cs_ship_date_sk" AS "ship_date_sk",
"cs_item_sk" AS "item_sk",
"cs_ship_addr_sk" AS "ship_addr_sk",
"cs_ship_mode_sk" AS "ship_mode_sk",
"cs_warehouse_sk" AS "warehouse_sk",
"cs_promo_sk" AS "promo_sk",
"cs_bill_addr_sk" AS "bill_addr_sk",
"cs_quantity" AS "quantity",
"cs_wholesale_cost" AS "wholesale_cost",
"cs_list_price" AS "list_price",
"cs_sales_price" AS "sales_price",
"cs_ext_discount_amt" AS "ext_discount_amt",
"cs_ext_sales_price" AS "ext_sales_price",
"cs_ext_wholesale_cost" AS "ext_wholesale_cost",
"cs_ext_list_price" AS "ext_list_price",
"cs_ext_tax" AS "ext_tax",
"cs_coupon_amt" AS "coupon_amt",
"cs_ext_ship_cost" AS "ext_ship_cost",
CAST("d_date" AS DATE) AS "dt"
FROM
"hive_retail"."tpcds"."catalog_sales" AS "cs",
"hive_retail"."tpcds"."date_dim"
WHERE
"d_date_sk" = "cs"."cs_sold_date_sk"
AND "date_dim"."d_date" = '$biz_date'
UNION ALL
SELECT
"ss_sold_date_sk" AS "sold_date_sk",
"ss_sold_time_sk" AS "sold_time_sk",
0 AS "ship_date_sk",
"ss_item_sk" AS "item_sk",
0 AS "ship_addr_sk",
0 AS "ship_mode_sk",
0 AS "warehouse_sk",
"ss_promo_sk" AS "promo_sk",
0 AS "bill_addr_sk",
"ss_quantity" AS "quantity",
"ss_wholesale_cost" AS "wholesale_cost",
"ss_list_price" AS "list_price",
"ss_sales_price" AS "sales_price",
"ss_ext_discount_amt" AS "ext_discount_amt",
"ss_ext_sales_price" AS "ext_sales_price",
"ss_ext_wholesale_cost" AS "ext_wholesale_cost",
"ss_ext_list_price" AS "ext_list_price",
"ss_ext_tax" AS "ext_tax",
"ss_coupon_amt" AS "coupon_amt",
0 AS "ext_ship_cost",
CAST("d_date" AS DATE) AS "dt"
FROM
"mysql_retail"."tpcds"."store_sales" AS "ss",
"mysql_retail"."tpcds"."date_dim"
WHERE
"d_date_sk" = "ss"."ss_sold_date_sk"
AND "date_dim"."d_date" = '$biz_date'
UNION ALL
SELECT
"ws_sold_date_sk" AS "sold_date_sk",
"ws_sold_time_sk" AS "sold_time_sk",
"ws_ship_date_sk" AS "ship_date_sk",
"ws_item_sk" AS "item_sk",
"ws_ship_addr_sk" AS "ship_addr_sk",
"ws_ship_mode_sk" AS "ship_mode_sk",
"ws_warehouse_sk" AS "warehouse_sk",
"ws_promo_sk" AS "promo_sk",
"ws_bill_addr_sk" AS "bill_addr_sk",
"ws_quantity" AS "quantity",
"ws_wholesale_cost" AS "wholesale_cost",
"ws_list_price" AS "list_price",
"ws_sales_price" AS "sales_price",
"ws_ext_discount_amt" AS "ext_discount_amt",
"ws_ext_sales_price" AS "ext_sales_price",
"ws_ext_wholesale_cost" AS "ext_wholesale_cost",
"ws_ext_list_price" AS "ext_list_price",
"ws_ext_tax" AS "ext_tax",
"ws_coupon_amt" AS "coupon_amt",
"ws_ext_ship_cost" AS "ext_ship_cost",
CAST("d_date" AS DATE) AS "dt"
FROM
"gauss"."tpcds"."web_sales" AS "ws",
"gauss"."tpcds"."date_dim"
WHERE
"d_date_sk" = "ws"."ws_sold_date_sk"
AND "date_dim"."d_date" = '$biz_date'
第一步: 选择“参数视图”,dwd_sales_snapshot,输入上面的SQL,详细操作如下图:
第二步: 视图创建成功后,预览数据
第三步: 如何在SQL中使用参数化视图,输入以下查询SQL:
select * from demo.demo2.dwd_sales_snapshot('1999-12-12') limit 10
第四步:参数化视图进行RP加速,生成真实物理快照数据 ,上面提到的参数化视图,在没有物化之前,如果上游本身没有历史快照,则当前视图快照数据也会没有,为了在上游没有历史快照数据的情况下生成历史快照数据,必须在当前参数化视图上创建RP。选中前面创建的参数化视图,选择“投影”标签页,点击新建投影。(注意:参数化视图的投影必须是视图的全量字段,这个跟普通视图创建投影不太一样,普通视图投影可以选择视图中的部分字段来创建投影),操作如下图:
第五步: 参数化视图的RP创建成功后,会显示如下图状态,此时RP状态是不可用的,
第六步: 参数化视图补数,参数化视图创建成功后,默认只会从创建之后的日期自动调度并构建RP,但是历史数据的RP需要用户手工去补齐(如果有的话)。下面介绍如何对参数化视图进行补数,首先进入参数化视图的详情页面(点击上图红框中的三个竖点),进入如下页面。
点击上图页面右上角红框部分,展开下拉菜单: 选择“投影补数”菜单项,如下图,输入要补数的日期 查看生成快照的数据有哪些?下图代表当前快照已经存了1999-12-12这一天的数据了。
一旦物理快照数据有一个以上的分区数据,当前快照对应的RP会自动显示为可用状态。
第七步: 在其他场景中使用参数化视图:
-- 参数化视图嵌套参数化视图场景
select * from demo.demo2.dwd_sales_snapshot('$biz_date') --通过传入参数,实现参数化视图的嵌套使用,传入的变量值会用最外层视图传入的变量值进行替换。
-- 普通视图嵌套参数化视图场景 方式一:
select * from demo.demo2.dwd_sales_snapshot('1999-12-12') limit 100 -- 写死参数
-- 普通视图嵌套参数化视图场景 方式二:
select * from demo.demo2.dwd_sales_snapshot.raw_dwd_sales_snapshot where dt='1999-12-12' limit 100 -- 可以通过访问参数化视图的物理表表来使用(前提是参数化视图创建RP,且RP状态可用情况下可用,raw_dwd_sales_snapshot是参数化视图对应的RP名称)
select * from demo.demo2.dwd_sales_snapshot.raw_dwd_sales_snapshot limit 100 -- 查找快照表的所有数据。当数据量特别大的场景下,不建议如此使用,这个相当于分区表的全表扫描。
-- 方式一无法在视图参数传入函数或者另外一张join表的字段,因此在复杂场景下,一般推荐使用方式二。
select * from demo.demo2.dwd_sales_snapshot.raw_dwd_sales_snapshot where dt=date_add('day', -1, current_date()) limit 100
注意:参数化视图处于哪一层没有特别限制,实际应用场景中,有可能贴源层是参数化视图,下游套普通视图。 也有可能上游是普通视图,下游是参数化视图,然后再嵌套普通视图,无论哪种情况都是支持的
时间窗口聚合累计快照场景
下面的例子创建一个参数化视图dws_sales_7day_rolling,用于滚动保存最近七天的各类统计值,并新增dt字段。
SELECT
sum("cs_quantity") AS "quantity_7day",
sum("cs_wholesale_cost") AS "wholesale_cost_7day",
sum("cs_list_price") AS "list_price_7day",
sum("cs_sales_price") AS "sales_price_7day",
sum("cs_ext_discount_amt") AS "ext_discount_amt_7day",
sum("cs_ext_sales_price") AS "ext_sales_price_7day",
sum("cs_ext_wholesale_cost") AS "ext_wholesale_cost_7day",
sum("cs_coupon_amt") AS "coupon_amt_7day",
sum("cs_ext_ship_cost") AS "ext_ship_cost_7day",
CAST('$biz_date' AS DATE) AS "dt"
FROM
"hive_retail"."tpcds"."catalog_sales" as cs,
"hive_retail"."tpcds"."date_dim"
WHERE "d_date_sk" = cs."cs_sold_date_sk" AND "date_dim"."d_date" >= DATE_ADD('day', -7, CAST('$biz_date' AS date))
AND "date_dim"."d_date" < '$biz_date'
创建名为raw_dws_sales_7day_rolling的RP,形成的快照数据是,dt对应的每条记录都是保存该日期前7天的统计值,如果将整个表的数据汇总起来就是所有过去每天的最近7天滚动快照数据。
select * from demo.demo2.dws_sales_7day_rolling.raw_dws_sales_7day_rolling order by dt desc limit 100

自动分区推导视图场景
本场景主要展示:当我们上游表存在分区字段的情况下(例如数据源表采集到分区),下游的视图进行自动的分区构建推导的案例,假设我们存在一张分区事实表和一张普通全量维度表进行Join关联计算场景,新建dwd_web_sales_par视图,SQL如下:
SELECT
"ws_sold_date_sk" AS "sold_date_sk",
"ws_ext_sales_price" AS "sales_price",
"d_week_seq",
"d_day_name",
"date_dim"."d_date" AS "dt"
FROM
"hive_retail"."tpcds_3g_parquet_partition"."web_sales_par" AS "ws",
"hive_retail"."tpcds"."date_dim"
WHERE
"d_date" = "ws_sold_date_sk"
全量RP构建
当我们对视图创建了一个RP,如果视图既无法推导出分区构建也无法推导出增量构建,则默认会进行全量的数据构建,例如,咱们前面介绍的demo.demo.store_sales_join_item视图创建的RP,即为一个全量构建RP:

-
对视图新增字段* :当我们对demo.demo.store_sales_join_item逻辑进行修改,但新的修改如果没有改动已有RP勾选的字段逻辑,只是新增了一些新的字段,如下图:我们新增了i_rec_start_date,则RP不会执行任何动作,只是新增字段在RP中处于未勾选状态。当用户修改RP,比如勾上i_rec_start_date,那么会触发RP的重新全量构建。

-
对视图已有字段的计算口径进行修改* ,例如我们删除了视图中的某个字段,则RP会自动呈现为以下状态:
遇到上述状态出现时,查询会不命中RP,需要手工“停用”并重新“启用”RP,待RP构建成功后,才能被查询正常命中。 -
全量RP的补数:如果想对全量RP进行重新跑数,存在以下两条路径: ①路径一、进入RP详情界面,点击“任务依赖”,在依赖关系图中,可以通过任何一张源头表,点击“...”,弹出菜单选择“更新投影”,操作如下图: ②路径二、可以通过投影的停用和重新启用来强制触发RP的重新构建。
- RP构建失败手工重跑:如果RP构建失败,则进入RP详情界面,点击“更新记录”,进入更新记录列表,选择最近一套记录,点击“重跑”按钮,即可以触发RP任务的重新构建。
增量RP 构建
增量构建一般用于贴源层进行增量数据构建的场景,例如,我们在mysql业务库中有一张表,如果我希望增量同步该表的数据,那么我们可以通过以下步骤来实现:
- 第一步,在“数据集成”,选择某个数据源的贴源表,进入表详情界面,点击“编辑”,然后进入如下操作界面: 备注:支持增量构建的字段,必须是带自增属性的字段,例如,业务上如果只会新增的数据库自增id且没有更新记录,或者是表中的更新时间又或是创建时间字段等,上述字段可作为增量构建字段。 增量构建的原理是,通过每次构建都记录下当前该字段的值,下次构建时,通过获取">上次获取值"的方式来获取上次更新到当前需要同步的增量数据,目前AIR支持增量构建的字段类型“INTEGER,VARCHAR,DATE,DATETIME,LONG”等类型。
常见问题Q&A
- Q:为何HIVE2, HIVE3, HIVE+ICEBERG表有时候查询会很慢,查过一次之后就变快了?
- A:底层引擎在首次缓存加载Hive Meta的时候会存在一定耗时,一旦元数据缓存加载后,后续查询会命中缓存。