Add support for component-based custom op (#430)

* add custom op & kd loss & add support for rank distillation
alibaba · Sep 12, 2024 · commit c1147b2
1 parent cbec539

Showing 51 changed files with 1,511 additions and 88 deletions.
diff --git a/docs/source/component/custom_op.md b/docs/source/component/custom_op.md
@@ -0,0 +1,134 @@
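+# Using Custom OPs
+
+When the built-in TensorFlow operators cannot meet your business requirements, or when composing existing operators performs poorly, consider implementing a custom TensorFlow OP.
+
+1. Implement the custom operator and compile it into a dynamic library
+   - Refer to the official example: [TensorFlow Custom Op](https://github.com/tensorflow/custom-op/)
+   - Note: the TensorFlow version used to compile the custom Op must match the TensorFlow version used at runtime
+   - You may need to compile two dynamic libraries against different dependency environments, one for offline training and one for the online inference service
+     - On the PAI platform, compile against tf 1.12 (download the official pai-tf image first)
+     - To use a custom Op in the [EasyRec Processor](https://help.aliyun.com/zh/pai/user-guide/easyrec) on EAS, compile against tf 2.10.1
+1. Steps for using a custom Op in `EasyRec`
+   1. Download the latest EasyRec [source code](https://github.com/alibaba/EasyRec)
+   1. Put the dynamic library compiled in the previous step into the `easy_rec/python/ops/${tf_version}` directory; the TensorFlow version must match the subdirectory name
+   1. Develop a component that uses the custom Op
+      - Add the new component's code to `easy_rec/python/layers/keras/custom_ops.py`
+      - `custom_ops.py` provides an example of a custom Op component
+      - Declare the new component by adding an export statement to `easy_rec/python/layers/keras/__init__.py`
+   1. Write the model configuration file, building the model from components and including the newly defined component (see below)
+   1. Run the `pai_jobs/deploy_ext.sh` script to package EasyRec, then upload the resulting resource package (`easy_rec_ext_${version}_res.tar.gz`) to the MaxCompute project
+   1. Train, evaluate, and export the model (in DataWorks or with the odpscmd client tool)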
+
+## Exporting the Custom Op Dynamic Library to the saved_model assets Directory
+
+```bash
+pai -name easy_rec_ext
+-Dcmd='export'
+-Dconfig='oss://cold-start/EasyRec/custom_op/pipeline.config'
+-Dexport_dir='oss://cold-start/EasyRec/custom_op/export/final_with_lib'
+-Dextra_params='--asset_files oss://cold-start/EasyRec/config/libedit_distance.so'
+-Dres_project='pai_rec_test_dev'
+-Dversion='0.7.5'
+-Dbuckets='oss://cold-start/'
+-Darn='acs:ram::XXXXXXXXXX:role/aliyunodpspaidefaultrole'
+-DossHost='oss-cn-beijing-internal.aliyuncs.com'
+;
+```
+
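+**Notes**:
+
+1. In the train, evaluate, and export commands, use `-Dres_project` to specify the name of the MaxCompute project to which the EasyRec resource package was uploaded
+1. In the train, evaluate, and export commands, use `-Dversion` to specify the version of the resource package
+1. The dynamic library specified by the asset_files parameter is loaded by the online inference service, so it must be compiled against the same TensorFlow version as that service (currently, the EasyRec Processor on the EAS platform depends on tf 2.10.1).
+   - If asset_files also needs to list other file paths (for example fg.json), separate the paths with commas.
+1. To emphasize once more: **the TensorFlow version the exported dynamic library depends on must match the TensorFlow version the inference service depends on**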
+
+## Custom Op Example
+
+Use a custom OP to compute the term match ratio between two input texts
+
+```protobuf
+feature_config: {
+  ...
+  features: {
+    feature_name: 'raw_genres'
+    input_names: 'genres'
+    feature_type: PassThroughFeature
+  }
+  features: {
+    feature_name: 'raw_title'
+    input_names: 'title'
+    feature_type: PassThroughFeature
+  }
+}
+model_config: {
+  model_class: 'RankModel'
+  model_name: 'MLP'
+  feature_groups: {
+    group_name: 'text'
+    feature_names: 'raw_genres'
+    feature_names: 'raw_title'
+    wide_deep: DEEP
+  }
+  feature_groups: {
+    group_name: 'features'
+    feature_names: 'user_id'
+    feature_names: 'movie_id'
+    feature_names: 'gender'
+    feature_names: 'age'
+    feature_names: 'occupation'
+    feature_names: 'zip_id'
+    feature_names: 'movie_year_bin'
+    wide_deep: DEEP
+  }
+  backbone {
+    blocks {
+      name: 'text'
+      inputs {
+        feature_group_name: 'text'
+      }
+      raw_input {
+      }
+    }
+    blocks {
+      name: 'match_ratio'
+      inputs {
+        block_name: 'text'
+      }
+      keras_layer {
+        class_name: 'OverlapFeature'
+        overlap {
+          separator: " "
+          default_value: "0"
+          methods: "query_common_ratio"
+        }
+      }
+    }
+    blocks {
+      name: 'mlp'
+      inputs {
+        feature_group_name: 'features'
+      }
+      inputs {
+        block_name: 'match_ratio'
+      }
+      keras_layer {
+        class_name: 'MLP'
+        mlp {
+          hidden_units: [256, 128]
+        }
+      }
+    }
+  }
+  model_params {
+    l2_regularization: 1e-5
+  }
+  embedding_regularization: 1e-6
+}
+```
+
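+1. If the custom Op needs to process raw input features, specify `feature_type: PassThroughFeature` when defining those features
+   - Features of any type other than `PassThroughFeature` are transformed during preprocessing, so the component code cannot access their raw values
+1. Place the raw input features that the custom Op processes, in order, in the same `feature group`
+1. Configure an input component of type `raw_input` to obtain the raw input features
+   - This is currently the only way EasyRec supports reading raw input features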
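
To make the `query_common_ratio` idea in the example config concrete, here is a minimal pure-Python sketch that is not part of this commit. It assumes the ratio is the share of distinct terms in the first text that also occur in the second; the function name `common_term_ratio` and its parameters are hypothetical, and the compiled `OverlapFeature` op may define the metric differently.

```python
def common_term_ratio(query, title, separator=" ", default_value=0.0):
    """Share of distinct query terms that also appear in the title.

    Assumption: this mirrors the intent of `query_common_ratio`; the real
    custom Op may count duplicates or normalize differently.
    """
    # Split on the configured separator and drop empty tokens.
    query_terms = {term for term in query.split(separator) if term}
    title_terms = {term for term in title.split(separator) if term}
    if not query_terms:
        # Nothing to match against: fall back to the default value.
        return default_value
    return len(query_terms & title_terms) / len(query_terms)


ratio = common_term_ratio("action comedy", "comedy night")
```

A reference implementation like this can serve as a test oracle when validating the output of a compiled custom Op on a few sample rows.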
diff --git a/docs/source/kd.md b/docs/source/kd.md
@@ -20,7 +20,7 @@
 
 - label_is_logits: whether the target is logits or probs; defaults to logits
 
-- loss_type: the type of loss; either CROSS_ENTROPY_LOSS or L2_LOSS
+- loss_type: the type of loss; one of CROSS_ENTROPY_LOSS, L2_LOSS, BINARY_CROSS_ENTROPY_LOSS, KL_DIVERGENCE_LOSS, PAIRWISE_HINGE_LOSS, LISTWISE_RANK_LOSS, etc.
 
 - loss_weight: the weight of the loss; defaults to 1.0
 
@@ -63,6 +63,45 @@ model_config {
 }
 ```
 
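+Besides conventional distillation of knowledge from the teacher model's predictions into the student model, in search and recommendation scenarios it is preferable to learn the teacher model's ranking of different items
+in a pairwise or listwise manner (that is, to learn the partial order of its predictions over items). Examples follow:
+
+- Pairwise knowledge distillation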
+
+```protobuf
+  kd {
+    loss_name: 'ctcvr_rank_loss'
+    soft_label_name: 'pay_logits'
+    pred_name: 'logits'
+    loss_type: PAIRWISE_HINGE_LOSS
+    loss_weight: 1.0
+    pairwise_hinge_loss {
+      session_name: "raw_query"
+      use_exponent: false
+      use_label_margin: true
+    }
+  }
+```
+
+- Listwise knowledge distillation
+
+```protobuf
+  kd {
+    loss_name: 'ctcvr_rank_loss'
+    soft_label_name: 'pay_logits'
+    pred_name: 'logits'
+    loss_type: LISTWISE_RANK_LOSS
+    loss_weight: 1.0
+    listwise_rank_loss {
+      session_name: "raw_query"
+      temperature: 3.0
+      label_is_logits: true
+    }
+  }
+```
+
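+Parameters can be configured for the loss function; see the parameter descriptions in [Loss Functions](models/loss.md).
+
 ### Training command
 
 The training command is unchanged; see [Model Training](./train.md) for details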
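
As an illustration of the pairwise distillation described in the kd.md hunk above, the following is a minimal pure-Python sketch that is not part of this commit. It assumes a textbook hinge loss over pairs drawn from the same session, with the teacher's label difference used as the margin when `use_label_margin` is true. The function name is hypothetical, and `ohem_ratio` and `use_exponent` are deliberately left out, so this should not be read as EasyRec's implementation.

```python
def pairwise_hinge_distill_loss(student_logits, teacher_logits, sessions,
                                margin=1.0, use_label_margin=True,
                                temperature=1.0):
    """Mean hinge loss over ordered pairs within each session.

    A pair (i, j) is used when both rows share a session and the teacher
    scores i strictly higher than j.
    """
    total, num_pairs = 0.0, 0
    n = len(student_logits)
    for i in range(n):
        for j in range(n):
            if sessions[i] != sessions[j]:
                continue  # pairs are only built inside one session
            label_diff = teacher_logits[i] - teacher_logits[j]
            if label_diff <= 0:
                continue  # keep only pairs the teacher ranks as i > j
            pair_margin = label_diff if use_label_margin else margin
            logit_diff = (student_logits[i] - student_logits[j]) / temperature
            # Zero loss once the student separates the pair by the margin.
            total += max(0.0, pair_margin - logit_diff)
            num_pairs += 1
    return total / num_pairs if num_pairs else 0.0


# The student reverses the teacher's order, so the pair is penalized.
loss = pairwise_hinge_distill_loss([0.0, 1.0], [1.0, 0.0], ["q1", "q1"])
```

The double loop is quadratic in the batch size, which is fine for a sketch; a real implementation would vectorize the pair construction.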

diff --git a/docs/source/models/bst.md b/docs/source/models/bst.md
@@ -158,8 +158,8 @@ model_config: {
     group_name: 'sequence'
     feature_names: "cate_id"
     feature_names: "brand"
-    feature_names: "tag_brand_list"
     feature_names: "tag_category_list"
+    feature_names: "tag_brand_list"
     wide_deep: DEEP
   }
   backbone {
@@ -219,6 +219,7 @@ model_config: {
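 - feature_groups: feature groups
   - Contains two feature_groups: the dense group and the sparse group
   - wide_deep: the BST model uses only deep features, so set all of them to DEEP
+  - For how to configure the feature_group for sequence components, see the [reference documentation](../component/sequence.md)
 - backbone: the backbone network built from components; see the [reference documentation](../component/backbone.md)
   - blocks: a directed acyclic graph (DAG) made up of multiple `component blocks`; the framework executes the code associated with each `component block` in the DAG's topological order, building a subgraph of the TF Graph
   - name/inputs: each `block` has a unique name and one or more inputs and outputs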
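
The topological execution of `blocks` mentioned above can be illustrated with a short sketch that is not part of this commit. It uses Python's standard `graphlib` module (Python 3.9+); the block names and the `execution_order` helper are hypothetical, and the framework's own scheduler may work differently.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_order(block_inputs):
    """Return block names so that every block comes after all of its inputs.

    `block_inputs` maps a block name to the names of the blocks it reads.
    """
    return list(TopologicalSorter(block_inputs).static_order())


# Hypothetical backbone: feature-group inputs are omitted for brevity,
# so only block-to-block dependencies appear here.
blocks = {
    "seq_input": [],
    "bst": ["seq_input"],
    "mlp": ["bst"],
}
order = execution_order(blocks)
```

If the block definitions contain a cycle, `static_order()` raises `graphlib.CycleError`, which matches the requirement that the blocks form a DAG.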

diff --git a/docs/source/models/cl4srec.md b/docs/source/models/cl4srec.md
@@ -157,6 +157,7 @@ model_config: {
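   - use_package_input: when the input of a `package` is dynamic, set this input placeholder to indicate that the input of the current `block` is specified when the `package` is called
   - keras_layer: loads the custom or built-in keras layer specified by `class_name` and executes its logic; see the [reference documentation](../component/backbone.md#keraslayer)
   - SeqAugment: the sequence data augmentation component; for parameters see the [reference documentation](../component/component.md#id5)
+    - SeqAugmentOps: set `class_name` to `SeqAugmentOps` to use the custom-OP version of the sequence data augmentation component, which performs better
   - AuxiliaryLoss: the component that computes the auxiliary task loss; for parameters see the [reference documentation](../component/component.md#id7)
   - concat_blocks: the output nodes of the DAG are defined by the `concat_blocks` option; if `concat_blocks` is not configured, the framework automatically concatenates all leaf nodes of the DAG and outputs the result.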
 - model_params:

diff --git a/docs/source/models/din.md b/docs/source/models/din.md
@@ -133,8 +133,8 @@ model_config: {
     group_name: 'sequence'
     feature_names: "cate_id"
     feature_names: "brand"
-    feature_names: "tag_brand_list"
     feature_names: "tag_category_list"
+    feature_names: "tag_brand_list"
     wide_deep: DEEP
   }
   backbone {
@@ -192,6 +192,7 @@ model_config: {
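 - feature_groups: feature groups
   - Contains two feature_groups: the dense group and the sparse group
   - wide_deep: the DIN model uses only deep features, so set all of them to DEEP
+  - For how to configure the feature_group for sequence components, see the [reference documentation](../component/sequence.md)
 - backbone: the backbone network built from components; see the [reference documentation](../component/backbone.md)
   - blocks: a directed acyclic graph (DAG) made up of multiple `component blocks`; the framework executes the code associated with each `component block` in the DAG's topological order, building a subgraph of the TF Graph
   - name/inputs: each `block` has a unique name and one or more inputs and outputs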

diff --git a/docs/source/models/loss.md b/docs/source/models/loss.md
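@@ -10,16 +10,21 @@ EasyRec supports two ways to configure loss functions: 1) using a single loss function; 2
 | L2_LOSS                                    | Squared loss |
 | SIGMOID_L2_LOSS                            | Squared loss computed on the output of the sigmoid function |
 | CROSS_ENTROPY_LOSS                         | Log loss (negative log-likelihood) |
+| BINARY_CROSS_ENTROPY_LOSS                  | BCE loss, used only in knowledge distillation |
+| KL_DIVERGENCE_LOSS                         | KL divergence loss, used only in knowledge distillation |
 | CIRCLE_LOSS                                | Specific to the CoMetricLearningI2I model |
 | MULTI_SIMILARITY_LOSS                      | Specific to the CoMetricLearningI2I model |
 | SOFTMAX_CROSS_ENTROPY_WITH_NEGATIVE_MINING | Multi-class softmax_cross_entropy with automatic negative sampling, used in binary classification tasks |
 | BINARY_FOCAL_LOSS                          | Focal loss that supports hard example mining and class balancing |
 | PAIR_WISE_LOSS                             | Rank loss that optimizes global AUC |
 | PAIRWISE_FOCAL_LOSS                        | Pair-level focal loss; supports custom pair grouping |
 | PAIRWISE_LOGISTIC_LOSS                     | Pair-level logistic loss; supports custom pair grouping |
+| PAIRWISE_HINGE_LOSS                        | Pair-level hinge loss; supports custom pair grouping |
 | JRC_LOSS                                   | Binary classification + listwise ranking loss |
 | F1_REWEIGHTED_LOSS                         | Loss that adjusts the relative weight of recall and precision in binary classification; effective against positive/negative sample imbalance |
 | ORDER_CALIBRATE_LOSS                       | Auxiliary loss that calibrates predictions using the dependencies between targets; see the [AITM](aitm.md) model |
+| LISTWISE_RANK_LOSS                         | Listwise ranking loss |
+| LISTWISE_DISTILL_LOSS                      | Loss for distilling the ranking of a given list; similar to the listwise rank loss |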
 
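 - Note on SOFTMAX_CROSS_ENTROPY_WITH_NEGATIVE_MINING
   - Supports parameter configuration that upgrades it to the [support vector guided softmax loss](https://128.84.21.199/abs/1812.11317),
@@ -99,6 +104,16 @@ EasyRec supports two ways to configure loss functions: 1) using a single loss function; 2
   - margin: subtracted from the logit difference of a pair before the loss is computed, i.e. the logit difference between the positive and negative samples must exceed margin; defaults to 0
   - temperature: temperature coefficient; logits are divided by this value before the loss is computed; defaults to 1.0
 
+- Parameters of PAIRWISE_HINGE_LOSS
+
+  - session_name: name of the field used to group pairs, e.g. user_id
+  - temperature: temperature coefficient; logits are divided by this value before the loss is computed; defaults to 1.0
+  - margin: when the logit difference of a pair exceeds this value, the loss of the current sample is 0; defaults to 1.0
+  - ohem_ratio: proportion of hard examples; only this share of the hardest examples contributes to the loss; defaults to 1.0
+  - label_is_logits: bool, whether the label is the teacher model's output logits; defaults to true
+  - use_label_margin: bool, whether to use the difference between the labels of the input pair as the margin; when true, the `margin` parameter has no effect; defaults to true
+  - use_exponent: bool, whether to apply a pairwise exponential transform to the model's output; defaults to false
+
 Note: all of the PAIRWISE\_\*\_LOSS functions above build positive/negative sample pairs within a mini-batch, with the goal of making the logit difference of each pair as large as possible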
 
 - Parameters of BINARY_FOCAL_LOSS
@@ -115,6 +130,13 @@ EasyRec supports two ways to configure loss functions: 1) using a single loss function; 2
   - Reference paper: 《 [Joint Optimization of Ranking and Calibration with Contextualized Hybrid Model](https://arxiv.org/pdf/2208.06164.pdf) 》
   - Usage example: [dbmtl_with_jrc_loss.config](https://github.com/alibaba/EasyRec/blob/master/samples/model_config/dbmtl_on_taobao_with_multi_loss.config)
 
+- Parameters of LISTWISE_RANK_LOSS
+
+  - temperature: temperature coefficient; logits are divided by this value before the loss is computed; defaults to 1.0
+  - session_name: name of the field used to group lists, e.g. user_id
+  - label_is_logits: bool, whether the label is the teacher model's output logits; defaults to false
+  - scale_logits: bool, whether to linearly scale the model's logits; defaults to false
+
 Complete example of a ranking model that uses multiple loss functions at once:
 [cmbf_with_multi_loss.config](https://github.com/alibaba/EasyRec/blob/master/samples/model_config/cmbf_with_multi_loss.config)
 
@@ -159,5 +181,6 @@ EasyRec supports two ways to configure loss functions: 1) using a single loss function; 2
 ### Reference papers:
 
 - 《 Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics 》
-- 《 [Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning](https://arxiv.org/abs/2111.10603) 》
+- [Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning](https://arxiv.org/abs/2111.10603)
 - [AITM: Modeling the Sequential Dependence among Audience Multi-step Conversions with Multi-task Learning in Targeted Display Advertising](https://arxiv.org/pdf/2105.08489.pdf)
+- [Pairwise Ranking Distillation for Deep Face Recognition](https://ceur-ws.org/Vol-2744/paper30.pdf)
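
To illustrate how the `temperature`, `session_name`, and `label_is_logits` parameters of LISTWISE_RANK_LOSS could interact, here is a minimal pure-Python sketch that is not part of this commit. It assumes a ListNet-style softmax cross-entropy between the teacher's and the student's score distributions within each session, with teacher labels treated as logits. The function names are hypothetical, `scale_logits` is not modeled, and EasyRec's actual loss may differ.

```python
import math


def _softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_distill_loss(student_logits, teacher_logits, sessions,
                          temperature=1.0):
    """Mean per-session cross-entropy between teacher and student softmaxes."""
    # Group row indices by session so each list is scored on its own.
    groups = {}
    for index, session in enumerate(sessions):
        groups.setdefault(session, []).append(index)
    losses = []
    for indices in groups.values():
        # Both sides are divided by the temperature before the softmax.
        target = _softmax([teacher_logits[i] / temperature for i in indices])
        pred = _softmax([student_logits[i] / temperature for i in indices])
        losses.append(-sum(t * math.log(p) for t, p in zip(target, pred)))
    return sum(losses) / len(losses) if losses else 0.0


agree = listwise_distill_loss([2.0, 0.0], [2.0, 0.0], ["q1", "q1"])
reverse = listwise_distill_loss([0.0, 2.0], [2.0, 0.0], ["q1", "q1"])
```

In this sketch, `agree` is smaller than `reverse` because cross-entropy is minimized when the student's within-session distribution matches the teacher's.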