PyTorch Frame 0.2.0
We are excited to announce the second release of PyTorch Frame!
PyTorch Frame 0.2.0 is the culmination of work from many contributors inside and outside Kumo, who have worked on features and bug fixes for a total of over 120 commits since `torch-frame==0.1.0`.
PyTorch Frame is featured in the Relational Deep Learning paper and used as the encoding layer for PyG.
Kumo is also hiring interns to work on cool deep learning projects. If you are interested, feel free to apply through this link.
If you have any questions or would like to contribute to PyTorch Frame, feel free to ask in our Slack channel.
Highlights
Support for `multicategorical`, `timestamp`, `text_tokenized` and `embedding` stypes
We have added support for four more semantic types (stypes). The new stypes allow more flexibility in encoding raw data. To understand how to specify different semantic types for your data, take a look at the tutorial. We also added many new `StypeEncoder`s for the new semantic types.
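As a rough illustration, here is how the new stypes might be declared when constructing a dataset. This is a minimal sketch with a hypothetical DataFrame and column names; `text_tokenized` columns additionally need a tokenizer config, sketched further below.

```python
import pandas as pd
import torch_frame
from torch_frame.data import Dataset

# Hypothetical raw table covering several of the new stypes.
df = pd.DataFrame({
    "tags": [["action", "comedy"], ["drama"], ["action", "drama"]],  # multicategorical
    "created_at": ["2023-01-31", "2023-02-15", "2023-03-01"],        # timestamp
    "doc_vector": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],              # embedding
    "label": [0, 1, 0],
})

dataset = Dataset(
    df,
    col_to_stype={
        "tags": torch_frame.multicategorical,
        "created_at": torch_frame.timestamp,
        "doc_vector": torch_frame.embedding,
        "label": torch_frame.categorical,
    },
    target_col="label",
)
dataset.materialize()
```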
Integration with Large Language Models
We now support two types of integration with LLMs: embedding and fine-tuning.
You can use any embeddings generated by LLMs with PyTorch Frame, either by directly feeding the embeddings as raw data of the `embedding` stype, or by using text as raw data of the `text_embedded` stype and specifying a `text_embedder` for each column. Here is an example of how you can use PyTorch Frame with text embeddings generated by OpenAI, Cohere, VoyageAI and HuggingFace transformers.
`text_tokenized` enables users to fine-tune large language models on text columns, alongside other types of raw tabular data, on any downstream task. In this example, we fine-tune both the full distilbert-base-uncased model and a LoRA-adapted version.
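For reference, a sketch of the tokenization side, assuming a HuggingFace tokenizer and reusing the DataFrame from the previous sketch. The return format of the tokenizer callable (one dict of tensors per sentence) is an assumption here; consult the linked example for the exact plumbing.

```python
from typing import Dict, List

import torch
import torch_frame
from torch_frame.config.text_tokenizer import TextTokenizerConfig
from torch_frame.data import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def text_tokenizer(sentences: List[str]) -> List[Dict[str, torch.Tensor]]:
    # Tokenize a batch of sentences and return one dict of
    # input_ids / attention_mask per sentence (assumed format).
    out = tokenizer(sentences, truncation=True, padding=True, return_tensors="pt")
    return [{k: v[i] for k, v in out.items()} for i in range(len(sentences))]

dataset = Dataset(
    df,  # the DataFrame with a "review" text column from above
    col_to_stype={
        "review": torch_frame.text_tokenized,
        "label": torch_frame.categorical,
    },
    target_col="label",
    col_to_text_tokenized_cfg={
        "review": TextTokenizerConfig(text_tokenizer=text_tokenizer, batch_size=256),
    },
)
```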
More Benchmarks
We added more benchmark results in the benchmark section. `LightGBM` is included in the list of GBDTs that we compare with the deep learning models, and we ran initial experiments on various LLMs as well. A usage sketch of the `LightGBM` baseline follows.
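This is a minimal sketch, assuming `tf_train`, `tf_val` and `tf_test` are `TensorFrame` splits from a materialized dataset.

```python
from torch_frame import TaskType
from torch_frame.gbdt import LightGBM

gbdt = LightGBM(task_type=TaskType.BINARY_CLASSIFICATION)
# Hyperparameter search on the validation split, then a final fit.
gbdt.tune(tf_train=tf_train, tf_val=tf_val, num_trials=20)
pred = gbdt.predict(tf_test=tf_test)
```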
Breaking Changes
- `text_tokenized_cfg` and `text_embedder_cfg` are renamed to `col_to_text_tokenized_cfg` and `col_to_text_embedder_cfg`, respectively (#257). This allows users to specify different embedders or tokenizers for different text columns, as sketched after this list.
- `Trompt` now outputs 2-dimensional embeddings in `forward`.
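For instance, per-column embedder configs might look like the following sketch; both embedders and the column names are hypothetical stand-ins.

```python
import torch
from torch_frame.config.text_embedder import TextEmbedderConfig

# Hypothetical embedders: swap in real models or API calls.
def cheap_embedder(sentences):
    return torch.randn(len(sentences), 300)

def llm_embedder(sentences):
    return torch.randn(len(sentences), 1536)

col_to_text_embedder_cfg = {
    "title": TextEmbedderConfig(text_embedder=cheap_embedder, batch_size=64),
    "description": TextEmbedderConfig(text_embedder=llm_embedder, batch_size=8),
}
```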
Features
- We now support the following new encoders: `LinearEmbeddingEncoder` for the `embedding` stype, `TimestampEncoder` for the `timestamp` stype and `MultiCategoricalEmbeddingEncoder` for the `multicategorical` stype (see the sketch after this list).
- `LightGBM` is added to the GBDTs module.
- Auto-inference of stypes from raw DataFrame columns is supported through the `infer_df_stype` function. However, the correctness of the inference is not guaranteed, and we suggest that you double-check the result.
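A minimal sketch of stype auto-inference and of wiring the new encoders into a model via `stype_encoder_dict`; the model choice and hyperparameters are illustrative, and `df`/`dataset` refer to a materialized dataset as in the earlier sketches.

```python
import torch_frame
from torch_frame.nn import (
    EmbeddingEncoder,
    LinearEmbeddingEncoder,
    MultiCategoricalEmbeddingEncoder,
    TimestampEncoder,
)
from torch_frame.nn.models import FTTransformer
from torch_frame.utils import infer_df_stype

# Guess stypes from the raw columns; double-check before trusting it.
col_to_stype = infer_df_stype(df)

# Map each stype to one of the new encoders.
stype_encoder_dict = {
    torch_frame.categorical: EmbeddingEncoder(),
    torch_frame.multicategorical: MultiCategoricalEmbeddingEncoder(),
    torch_frame.timestamp: TimestampEncoder(),
    torch_frame.embedding: LinearEmbeddingEncoder(),
}

model = FTTransformer(
    channels=32,
    out_channels=1,
    num_layers=2,
    col_stats=dataset.col_stats,
    col_names_dict=dataset.tensor_frame.col_names_dict,
    stype_encoder_dict=stype_encoder_dict,
)
```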
Bugfixes
We fixed the `in_channels` calculation of `ResNet` (#220) and improved the overall user experience when handling dirty data (#171, #234, #264).
Full Changelog
Full Changelog: 0.1.0...0.2.0