PyTorch Frame 0.2.0

@yiweny yiweny released this 15 Dec 23:20
· 136 commits to master since this release
3f1a695

We are excited to announce the second release of PyTorch Frame 🐶

PyTorch Frame 0.2.0 is the culmination of work from many contributors inside and outside Kumo, who have worked on features and bug fixes for a total of over 120 commits since torch-frame==0.1.0.

PyTorch Frame is featured in the Relational Deep Learning paper and used as the encoding layer for PyG.

Kumo is also hiring interns to work on cool deep learning projects. If you are interested, feel free to apply through this link.

If you have any questions or would like to contribute to PyTorch Frame, feel free to ask in our Slack channel.

Highlights

Support for multicategorical, timestamp, text_tokenized, and embedding stypes

We have added support for four more semantic types. The new stypes provide more flexibility in encoding raw data. To understand how to specify different semantic types for your data, take a look at the tutorial. We also added many new StypeEncoder classes for the new semantic types.
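To make the role of a semantic type concrete, here is a toy, self-contained sketch (not the torch_frame API) of how an stype can determine the way a raw cell value is turned into model-ready features. The stype names mirror the ones above, but the routing function itself is hypothetical:

```python
from datetime import datetime

# Hypothetical column-to-stype mapping, in the spirit of the tutorial.
col_to_stype = {
    "genres": "multicategorical",   # e.g. "Action|Comedy"
    "released": "timestamp",        # e.g. "2023-12-15"
    "title": "text_tokenized",      # raw text, tokenized for fine-tuning
    "poster_vec": "embedding",      # pre-computed embedding per row
}

def encode_cell(value, stype):
    """Toy per-stype preprocessing (illustration only)."""
    if stype == "multicategorical":
        return value.split("|")                # list of categories per cell
    if stype == "timestamp":
        dt = datetime.fromisoformat(value)
        return [dt.year, dt.month, dt.day]     # simplified date components
    if stype == "text_tokenized":
        return value.lower().split()           # stand-in for a real tokenizer
    if stype == "embedding":
        return list(value)                     # already a dense vector
    raise ValueError(f"unhandled stype: {stype}")

print(encode_cell("Action|Comedy", "multicategorical"))  # ['Action', 'Comedy']
```

In torch_frame itself, this per-column routing is what the stype declaration and the corresponding StypeEncoder handle for you.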

Integration with Large Language Models

We now support two types of integration with LLMs: embedding and fine-tuning.

You can use embeddings generated by any LLM with PyTorch Frame: either feed pre-computed embeddings directly as raw data of the embedding stype, or provide text as raw data of the text_embedded stype and specify a text embedder for each column. Here is an example of using PyTorch Frame with text embeddings generated by OpenAI, Cohere, VoyageAI, and HuggingFace transformers.
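The per-column embedder idea can be sketched without any LLM at all. Below, a deterministic hash-based stub stands in for a real embedding model, and a plain dict keyed by column name stands in for torch_frame's per-column configuration; both names (`hash_embed`, `col_to_text_embedder`) are hypothetical:

```python
import hashlib
import math

def hash_embed(texts, dim=8):
    """Map each string to a fixed-size unit vector (stand-in for an LLM embedder)."""
    out = []
    for t in texts:
        digest = hashlib.sha256(t.encode()).digest()
        vec = [b / 255.0 for b in digest[:dim]]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        out.append([v / norm for v in vec])
    return out

# One embedder per text column: different columns may use different
# models or dimensions, mirroring the col_to_* configuration style.
col_to_text_embedder = {
    "title": lambda texts: hash_embed(texts, dim=8),
    "description": lambda texts: hash_embed(texts, dim=16),
}

emb = col_to_text_embedder["title"](["PyTorch Frame"])
print(len(emb[0]))  # 8
```

In practice you would replace the stub with calls to an actual embedding API or a local transformer model.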

text_tokenized enables users to fine-tune Large Language Models on text columns, along with other types of raw tabular data, on any downstream task. In this example, we fine-tune distilbert-base-uncased both in full and with LoRA.
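The key difference from text_embedded is that text_tokenized stores token-id sequences rather than frozen vectors, so gradients can flow into the language model during training. A minimal sketch of that storage format, using a hypothetical whitespace tokenizer and vocabulary in place of a real one such as distilbert-base-uncased's:

```python
def build_vocab(texts):
    """Assign an integer id to every whitespace token (toy vocabulary)."""
    vocab = {"[UNK]": 0}
    for t in texts:
        for tok in t.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def tokenize_column(texts, vocab):
    """What a text_tokenized column conceptually holds: ids per cell."""
    return [[vocab.get(tok, 0) for tok in t.lower().split()] for t in texts]

texts = ["great movie", "great plot twist"]
vocab = build_vocab(texts)
ids = tokenize_column(texts, vocab)
print(ids)  # [[1, 2], [1, 3, 4]]
```

These id sequences are what a downstream language model consumes, which is what makes end-to-end fine-tuning possible.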

More Benchmarks

We added more benchmark results to the benchmark section. LightGBM is now included in the list of GBDTs that we compare against the deep learning models, and we also ran initial experiments with various LLMs.

Breaking Changes

  • text_tokenized_cfg and text_embedder_cfg are renamed to col_to_text_tokenized_cfg and col_to_text_embedder_cfg, respectively (#257). This allows users to specify different embedders and tokenizers for different text columns.
  • Trompt now outputs 2-dimensional embeddings in forward.
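To illustrate the shape of the renamed argument, here is a sketch using plain dicts as hypothetical stand-ins for torch_frame's config objects. The point of the col_to_* form is that each text column gets its own config:

```python
# Before 0.2.0: a single config applied to all text columns.
text_embedder_cfg = {"text_embedder": "embedder_a", "batch_size": 32}

# Since 0.2.0: configs are keyed by column name, so different columns
# can use different embedders (or tokenizers) and batch sizes.
col_to_text_embedder_cfg = {
    "title": {"text_embedder": "embedder_a", "batch_size": 32},
    "description": {"text_embedder": "embedder_b", "batch_size": 8},
}

print(sorted(col_to_text_embedder_cfg))  # ['description', 'title']
```

The same before/after pattern applies to text_tokenized_cfg versus col_to_text_tokenized_cfg.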

Features

  • We now support the following new encoders: LinearEmbeddingEncoder for the embedding stype, TimestampEncoder for the timestamp stype, and MultiCategoricalEmbeddingEncoder for the multicategorical stype.

  • LightGBM has been added to the GBDTs module.

  • Auto-inference of stypes from raw DataFrame columns is supported through the infer_df_stype function. However, the correctness of the inference is not guaranteed, and we suggest you double-check the results.
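A toy heuristic in the spirit of stype inference might look like the following. The logic is hypothetical and operates on a plain dict of columns instead of a pandas DataFrame; it only illustrates why such inference can guess wrong (e.g. integer codes versus true numericals) and should be double-checked:

```python
def infer_stype(values):
    """Guess an stype from the Python types and cardinality of a column's values."""
    if all(isinstance(v, (int, float)) for v in values):
        # Few distinct integers often indicate a categorical code, not a quantity.
        if all(isinstance(v, int) for v in values) and len(set(values)) <= 10:
            return "categorical"
        return "numerical"
    if all(isinstance(v, list) for v in values):
        # Lists of strings look multicategorical; lists of floats look like embeddings.
        return "multicategorical" if isinstance(values[0][0], str) else "embedding"
    return "text_embedded"  # fall back to treating strings as text

toy_df = {
    "age": [23.5, 41.0, 37.2],
    "segment": [1, 2, 1],
    "tags": [["a", "b"], ["b"], ["c"]],
    "bio": ["likes ml", "writes docs", "ships code"],
}
print({col: infer_stype(vals) for col, vals in toy_df.items()})
```

The real infer_df_stype inspects actual pandas dtypes and value distributions, but the failure mode is the same: heuristics can mislabel ambiguous columns, hence the suggestion to verify.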

Bugfixes

We fixed the in_channels calculation of ResNet (#220) and improved the overall user experience on handling dirty data (#171, #234, #264).

Full Changelog

Full Changelog: 0.1.0...0.2.0