---
marp: true
title: F1 doesn't matter (and other stories from the industry)
description: IN5550 2023 Guest lecture
theme: lead
paginate: true
size: 58140
---
Murhaf Fares (he/him) & Emanuele Lapponi (he/him)
Machine intelligence @Fremtind
- LTG PhD class of 2019
- Worked on NLP most of our professional lives
- Mostly together 🤗
- One of Norway's leading insurance companies, owned by SB1 and DNB
- Data is at the core of most insurance processes
~/talks/f1-doesnt-matter main ✔ 0m
▶ cat slides.md | grep "^# \*\*"
# **Quick NLP hacks can have a big impact**
# **Visualize all the things**
# **On customer satisfaction and feedback**
# **A simple model is enough**
# **Make the most of small data**
# **F1 doesn't matter**
# **Budget UI is better than no UI**
# **Model management is hard**
# **The world beyond NLP and deep learning**
# **Being cool always beats raw technical prowess**
Task: Given a "tell us what happened" description, determine whether the travel destination is in Norway, Europe, or the rest of the world
What worked:
- Some off-the-shelf preprocessing with spaCy 💝
- Fuzzy matching against a knowledge base of places (see the sketch below)
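A minimal sketch of such a pipeline, assuming rapidfuzz for the fuzzy matching; the spaCy pipeline name, the toy knowledge base, and the threshold are illustrative, not the production setup:

```python
import spacy
from rapidfuzz import fuzz, process

nlp = spacy.load("nb_core_news_sm")  # Norwegian pipeline (illustrative choice)

# Toy knowledge base: place name -> region
PLACES = {"Trysil": "Norway", "Gran Canaria": "Europe", "Bangkok": "Rest of world"}

def destination_region(description: str, threshold: int = 90):
    # Pull out location-like entities (the label set depends on the pipeline)
    doc = nlp(description)
    candidates = [ent.text for ent in doc.ents if ent.label_ in {"GPE", "LOC", "GPE_LOC"}]
    # Fuzzy-match each candidate against the knowledge base
    for cand in candidates:
        match = process.extractOne(cand, PLACES.keys(), scorer=fuzz.WRatio)
        if match and match[1] >= threshold:
            return PLACES[match[0]]
    return None  # no confident match
```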
🗺️ Seeing is believing
- We work with a bunch of different datasets
- Some neural representations might be better than others
- Some might be cheaper
- What's a representation, anyway? It's important to "build trust" across teams and customers: seeing is believing
```python
import altair as alt
import numpy as np
from umap import UMAP

# df: DataFrame with an "embeddings" column (one vector per row) and a "text" column
two_d = UMAP().fit_transform(np.vstack(df["embeddings"]))
df["x"], df["y"] = two_d[:, 0], two_d[:, 1]

# Interactive scatter plot: hover over a point to read the underlying text
alt.Chart(df).mark_circle(size=60).encode(
    x="x",
    y="y",
    tooltip=["text"],
).properties(width=1000, height=500).interactive()
```
Customer surveys are an important tool for improving processes in most product companies. At Fremtind, customers have the opportunity to give us written feedback at different steps of their journey: as they purchase new insurance, update/upgrade/review their coverage, and, perhaps most interestingly for our business, after a claim has been approved or rejected (some 50K a year).
"Vanilla" sentiment analysis is not necessarily useful for analyzing and extracting value from these written messages: most messages come pre-equipped with a 🎲 or a 👍/👎.
These user-generated scoring systems are useful for taking the general temperature of customer satisfaction, but fall short of providing a ranking of messages relevant to different business problems.
There are three key drivers of customer satisfaction (CS) in insurance: (1) being paid, (2) being paid quickly, (3) clarity and respect in the claim process.
For (1), there is just not much we can improve: the damage is either covered or it isn't. For (2), well, we know. We either hire more temps in dire times, or some people are just going to have to be patient.
For (3), we decided that absolutely all customers should be met with clear language and a high level of empathy and understanding.
The degree to which a given message is about unclear language, or about the customer being met with impatience and/or antagonism, thus becomes an interesting dimension.
In this project, we score customer feedback and other interactions from a variety of sources by the degree to which they pertain to different (relevant) aspects of communication. This enables our (hundreds of) colleagues to read more relevant messages and to use text scores as quantities for their KPIs.
😎 Or: how I learned to stop worrying and just throw some pretrained embeddings at a dense layer
- Each of the aforementioned dimensions translates to a simple text classification problem
- Supervised learning: a training set of pairs (text, relevant/irrelevant)
- We train a simple feedforward neural network on input representations from sentence transformers (e.g., BERT, Universal Sentence Encoder, etc.)
- Shoutout to https://huggingface.co/NbAiLab/nb-sbert-base! 🙌
- Find extra (textual) features, such as the URL of the page on which the customer wrote their feedback
- Preprocess the URL and embed it alongside the message (see the sketch below)
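A sketch of this setup, using the sentence-transformers library and scikit-learn's MLPClassifier as the feedforward network; `train_texts`, `train_urls`, and `train_labels` are hypothetical:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neural_network import MLPClassifier

encoder = SentenceTransformer("NbAiLab/nb-sbert-base")

def featurize(texts, urls):
    # Embed the message text and the lightly preprocessed URL, then concatenate
    text_emb = encoder.encode(list(texts))
    url_emb = encoder.encode([u.replace("/", " ").replace("-", " ") for u in urls])
    return np.hstack([text_emb, url_emb])

# Hypothetical training data: messages, page URLs, and 0/1 relevance labels
X = featurize(train_texts, train_urls)
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300).fit(X, train_labels)
```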
- We have different models for different teams
- Each team defines its own categories, but those overlap
- This can be exploited by simply initializing the model weights in model B using weights from model A (see the sketch after this list)
- Text preprocessing is a must regardless of how good your embeddings are
- For example, anonymization: no customer data should leak into training data
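A minimal PyTorch sketch of that warm start; the architecture and names are illustrative:

```python
import torch
import torch.nn as nn

class FeedbackHead(nn.Module):
    """Dense head over precomputed sentence embeddings (illustrative)."""
    def __init__(self, dim: int = 768, n_labels: int = 2):
        super().__init__()
        self.hidden = nn.Linear(dim, 256)
        self.out = nn.Linear(256, n_labels)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))

model_a = FeedbackHead(n_labels=2)  # team A's trained model
model_b = FeedbackHead(n_labels=3)  # team B's categories differ
# Warm start: reuse A's hidden layer; B's output layer stays freshly initialized
model_b.hidden.load_state_dict(model_a.hidden.state_dict())
```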
→ Don't sleep on embed/reduce/scatter!
- REST API in a microservice
- Scheduled tasks: fetch the data every day/week/month, score it and write it back
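For the REST API, a minimal FastAPI sketch (the real service is more involved; `model_predict` is a hypothetical wrapper around the classifier):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Feedback(BaseModel):
    text: str

@app.post("/score")
def score(feedback: Feedback) -> dict:
    # model_predict: hypothetical wrapper returning a score per dimension
    return {"unclarity": model_predict(feedback.text)}
```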
🤬 Or: nobody is going to like you if you constantly ask people to annotate data for you
... but we have no data
Instead of asking people to annotate more data, can we pick their brains about how they would annotate the data and use that expertise somehow?
- (Re)use an existing sentiment dataset to create new datasets for clarity and impudence
- Translate their expertise to labelling functions to produce so-called silver data
$(X, \widetilde{Y})$
```python
# Labeling functions: return a label (0/1) for a row x, or -1 to abstain
def lf_short_message_positive(x):
    if len(x.doc) < 5 and x.sentiment_score < 0.4:
        return 0
    return -1

def lf_contains_one_strong_and_negative(x):
    if x.sentiment_score >= 0.5 and contains_word(x.doc, ASSETS["UNCLARITY"]["strong"]):
        return 1
    return -1

def lf_six_and_positive(x):
    # Uses the 🎲 score the customer gave alongside the message
    if x.sentiment_score < 0.5 and x[dice_col_unc] == 6:
        return 0
    return -1
```
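These functions can then be aggregated into the silver labels $(X, \widetilde{Y})$. A sketch assuming Snorkel (the -1 = abstain convention matches its API, but the library choice is an assumption); `train_df` is hypothetical:

```python
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

# Wrap the plain functions above as Snorkel labeling functions
lfs = [labeling_function()(f) for f in (
    lf_short_message_positive,
    lf_contains_one_strong_and_negative,
    lf_six_and_positive,
)]

# Apply them to a (hypothetical) DataFrame and fit the generative label model
L_train = PandasLFApplier(lfs=lfs).apply(df=train_df)
label_model = LabelModel(cardinality=2)
label_model.fit(L_train, n_epochs=500)
train_df["silver_label"] = label_model.predict(L_train)  # the silver Y
```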
💖 Or: does it spark joy?
- Of course F1 and other metrics do matter, but they are not the only thing we consider
- What is a good model, anyway?
- What is good enough when we are limited by time and resources?
- Short feedback messages tend to be positive, whereas longer ones fall more on the negative side
- A recurrent neural network could learn that
- Would the model output still be interesting? Depends on how it will be used
- Simplicity and efficiency vs. improving F1 by 0.1 points
- No point in, e.g., spending a lot of resources and time on fine-tuning BERT if vanilla works well enough
- Hyperparameter tuning is a waste of time if you don't like what comes out of the model
- Precision and/or recall, at thresholds! (see the sketch after this list)
- Sometimes a bad model is better than no model at all
- Can we tweak the system implementation so that false positives don't matter as much?
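A sketch of picking an operating threshold from the precision/recall curve with scikit-learn; `y_true` and `y_score` are hypothetical gold labels and model scores:

```python
from sklearn.metrics import precision_recall_curve

# y_true: 0/1 gold labels, y_score: model probabilities (both hypothetical)
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Lowest threshold that still yields, say, 90% precision: readers downstream
# mostly see relevant messages, and false positives matter less
ok = precision[:-1] >= 0.9
chosen = thresholds[ok][0] if ok.any() else None
```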
🎨 Or: unleash your inner frontend dev
🔥 Hot take
https://streamlit.io/
→ cmd-tab to Cura
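A budget UI really can be this small. A sketch, where `score_message` is a hypothetical wrapper around the models above:

```python
import streamlit as st

st.title("Feedback explorer")
message = st.text_area("Paste a customer message")
if message:
    # score_message: hypothetical, returns {dimension: score}
    for dimension, value in score_message(message).items():
        st.metric(dimension, f"{value:.2f}")
```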
🎨 Don't do it (yourself)
Long story short: you gotta keep track of models, data, model versions, val/test splits, evaluation plots, comments, and more.
We use https://github.com/mlflow/mlflow/
→ cmd-tab to ml-flow
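A sketch of what a tracked run might look like; experiment, parameter, and metric names are illustrative, and `clf` / `precision_at_t` stand in for a trained model and its evaluation result:

```python
import mlflow
import mlflow.sklearn

mlflow.set_experiment("feedback-unclarity")  # illustrative name
with mlflow.start_run():
    mlflow.log_params({"encoder": "NbAiLab/nb-sbert-base", "hidden_size": 256})
    mlflow.log_metric("precision_at_threshold", precision_at_t)
    mlflow.log_artifact("pr_curve.png")      # evaluation plots, notes, ...
    mlflow.sklearn.log_model(clf, "model")   # model + version tied to this run
```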
🚪 Tabular data, for example
- A system that produces a score reflecting 'problematic' customer profiles
- Trained on, e.g., fraud
- Working with tabular data, gradient boosting and old-school feature engineering is satisfying ✨
- A viper's nest of potential ethical pitfalls; explainable AI is a MUST (see the sketch below the links)
→ Check out:
https://medium.com/fremtind/forsikring-og-muffens-fa2e8cfbca5b
https://github.com/odaon/muffins-ai-motor
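A sketch of that combo, assuming XGBoost for the gradient boosting and SHAP for the explanations (one common pairing, not necessarily the exact stack); `X_train`, `y_train`, and `X_test` are hypothetical:

```python
import shap
import xgboost as xgb

# Hypothetical tabular data with engineered features
model = xgb.XGBClassifier().fit(X_train, y_train)

# SHAP attributes each prediction to features: which ones pushed
# this customer's score up or down?
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```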
Some call it MLOps
🤝 Or: be a bud
- Dropping model names like you are swapping Pokémon cards
- Gatekeeping and arrogance
- Toxic code reviews
- Now that every LinkedIn influencer is a ChatGPT expert, it's easy to forget that NLP is hard.