Using BigQuery ML to categorize email inquiries

BigQuery ML

Problem description

We want to categorize incoming emails automatically, either to generate automatic responses or to reduce the manual work needed to handle them.

The Data

We have extracted and anonymized actual inquiries received by Entur's customer service center (Kundesenter), together with the category assigned to each inquiry.

The model

To create (and train) the model we simply run:

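-- Train an AutoML classifier that predicts the category (label) from the email text.
CREATE OR REPLACE MODEL
  `entur-analytics-rtd.hackathon_2023_q1_lag4.auto_ml_all_top4`
OPTIONS (
  model_type = 'AUTOML_CLASSIFIER',
  input_label_cols = ['label'],
  budget_hours = 1
) AS
SELECT
  -- Lowercase the email, split it into words, and build unigrams and bigrams
  -- (the same n-gram range as in the prediction query below).
  ML.NGRAMS(REGEXP_EXTRACT_ALL(LOWER(Email), '[a-zæøå]+'), [1, 2]) AS words,
  operation AS label
FROM
  `entur-analytics-rtd.hackathon_2023_q1_lag4.emails_top4`
WHERE
  operation IS NOT NULL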

where input_label_cols specifies the column(s) we want to predict and budget_hours caps the training time at one hour. This query creates a BigQuery ML model that we can query to obtain predictions on new data. We could, for instance, predict the category of the email "Jeg ønsker å avbestille min billett fra Oslo til Bergen" ("I want to cancel my ticket from Oslo to Bergen") as follows:

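-- Note: this queries logreg_train_top4, a different model name than the one created above.
SELECT *
FROM ML.PREDICT(
  MODEL `entur-analytics-rtd.hackathon_2023_q1_lag4.logreg_train_top4`,
  (
    -- Build the features the same way as for training: lowercase, tokenize, n-grams.
    SELECT ML.NGRAMS(REGEXP_EXTRACT_ALL(LOWER("Jeg ønsker å avbestille min billett fra Oslo til Bergen"), '[a-zæøå]+'), [1, 2]) AS words
  )
)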

which returns the model's predicted probability for each category. In this case the model correctly predicts the category "avbestilling/refusjon" (cancellation/refund), with a probability of 40%.
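The raw ML.PREDICT output holds the per-category probabilities in a nested array, which can be awkward to read. The query below is a minimal sketch of one way to flatten it into one row per category, sorted by probability. It assumes the label column is named label, so that the output columns are predicted_label and predicted_label_probs, each array element carrying a label and a prob field. The aliases p, category and probability are illustrative names, not part of the original setup.

```sql
-- Sketch: one row per category, highest predicted probability first.
-- Assumes ML.PREDICT emits `predicted_label_probs` as an array of
-- STRUCT<label, prob>, which follows from the label column being named `label`.
SELECT
  p.label AS category,
  ROUND(p.prob, 3) AS probability
FROM
  ML.PREDICT(
    MODEL `entur-analytics-rtd.hackathon_2023_q1_lag4.logreg_train_top4`,
    (
      -- Same feature construction as in the prediction query above.
      SELECT
        ML.NGRAMS(
          REGEXP_EXTRACT_ALL(
            LOWER('Jeg ønsker å avbestille min billett fra Oslo til Bergen'),
            '[a-zæøå]+'
          ),
          [1, 2]
        ) AS words
    )
  ),
  UNNEST(predicted_label_probs) AS p
ORDER BY
  probability DESC
```

A flattened result like this makes it easier to apply a threshold, for example sending an automatic response only when the top probability is high enough and routing the remaining emails to manual handling.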