Text modeling

This article builds on the concepts and techniques covered in other articles on this site. The example goes beyond the descriptive analysis in the Text Mining article: it shows how to pre-process and then model text data. It also expands on ML Pipelines by providing a more “real life” scenario of how and why to use pipelines.

Data

This article uses text data from the modeldata package. The Fine foods example data contains reviews of fine foods from Amazon. Loading small_fine_foods creates two data frames: a training set (training_data) and a test set (testing_data). Each consists of a product code, the text of the review, and the score. The score has two values: “great” and “other”.

library(modeldata)
library(dplyr) # provides %>%, select() and glimpse() used in this article

data("small_fine_foods")

training_data %>% 
  head(1) %>% 
  as.list()
#> $product
#> [1] "B000J0LSBG"
#> 
#> $review
#> [1] "this stuff is  not stuffing  its  not good at all  save your money"
#> 
#> $score
#> [1] other
#> Levels: great other

We start a local Spark session, and then copy both data sets to the new session.

library(sparklyr)

sc <- spark_connect(master = "local", version = "3.3")

sff_training_data <- copy_to(sc, training_data)

sff_testing_data <- copy_to(sc, testing_data)

Text transformers

Split into words (tokenizer)

We will split each review into individual words, or tokens. The ft_tokenizer() function returns an in-line list containing the individual words.

sff_training_data %>% 
  ft_tokenizer(
    input_col = "review",
    output_col = "word_list"
  ) %>% 
  select(3:4)
#> # Source: spark<?> [?? x 2]
#>    score word_list   
#>    <chr> <list>      
#>  1 other <list [17]> 
#>  2 great <list [100]>
#>  3 great <list [106]>
#>  4 great <list [36]> 
#>  5 great <list [18]> 
#>  6 great <list [30]> 
#>  7 other <list [87]> 
#>  8 great <list [54]> 
#>  9 great <list [59]> 
#> 10 great <list [44]> 
#> # … with more rows
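To make the token counts above concrete, here is a minimal base R sketch of the same idea, applied to the first training review shown earlier. It assumes the tokenizer lowercases the text and splits on every single whitespace character, so two spaces in a row produce an empty token. It imitates the behavior and is not Spark's implementation.

```r
# First review from training_data, copied from the output shown earlier
review <- "this stuff is  not stuffing  its  not good at all  save your money"

# Lowercase, then split on each single whitespace character.
# Consecutive spaces yield an empty string "" as a token.
tokens <- strsplit(tolower(review), "\\s")[[1]]

length(tokens)
#> [1] 17

# Number of empty tokens created by the double spaces
sum(tokens == "")
#> [1] 4
```

Under this assumption the review yields 17 tokens, the same size as the first word_list entry above.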

Clean-up words (stop words)

Some words are very common in text, such as “the”, “and”, and “or”. These are called “stop words”. Most often, stop words are not useful in analysis and modeling, so they are usually removed. That is exactly what ft_stop_words_remover() does. In addition to English, Spark has lists of stop words for several other languages. In the resulting table, notice that the number of words in wo_stop_words is lower than in word_list.

sff_training_data %>% 
  ft_tokenizer(
    input_col = "review",
    output_col = "word_list"
  ) %>% 
  ft_stop_words_remover(
    input_col = "word_list", 
    output_col = "wo_stop_words"
    ) %>% 
  select(3:5) 
#> # Source: spark<?> [?? x 3]
#>    score word_list    wo_stop_words
#>    <chr> <list>       <list>       
#>  1 other <list [17]>  <list [9]>   
#>  2 great <list [100]> <list [61]>  
#>  3 great <list [106]> <list [67]>  
#>  4 great <list [36]>  <list [20]>  
#>  5 great <list [18]>  <list [9]>   
#>  6 great <list [30]>  <list [17]>  
#>  7 other <list [87]>  <list [58]>  
#>  8 great <list [54]>  <list [33]>  
#>  9 great <list [59]>  <list [36]>  
#> 10 great <list [44]>  <list [24]>  
#> # … with more rows
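Conceptually, stop word removal is a filter over the token list. The base R sketch below shows that filter on the first training review. The stop_words vector is a hypothetical, deliberately tiny list made up for this illustration; it is not the list Spark uses.

```r
# Tokenize the first training review (lowercase, split on single whitespace)
review <- "this stuff is  not stuffing  its  not good at all  save your money"
tokens <- strsplit(tolower(review), "\\s")[[1]]

# Hypothetical stop word list, for illustration only
stop_words <- c("this", "is", "not", "its", "at", "all", "your")

# Keep only the tokens that are not in the stop word list
wo_stop_words <- tokens[!tokens %in% stop_words]

length(tokens)
#> [1] 17
length(wo_stop_words)
#> [1] 9
```

Empty tokens are not stop words, so they pass through the filter unchanged.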

Index words (hash)

Text hashing maps a sequence of words, or “terms”, to their frequencies. The number of terms that are mapped can be controlled using the num_features argument in ft_hashing_tf(). Because we are eventually going to use a logistic regression model, we need to override the frequencies, replacing their original value with 1. This is accomplished by setting the binary argument to TRUE.

sff_training_data %>%
  ft_tokenizer(
    input_col = "review",
    output_col = "word_list"
  ) %>% 
  ft_stop_words_remover(
    input_col = "word_list", 
    output_col = "wo_stop_words"
    ) %>% 
  ft_hashing_tf(
    input_col = "wo_stop_words", 
    output_col = "hashed_features", 
    binary = TRUE, 
    num_features = 1024
    ) %>%
  select(3:6) 
#> # Source: spark<?> [?? x 4]
#>    score word_list    wo_stop_words hashed_features
#>    <chr> <list>       <list>        <list>         
#>  1 other <list [17]>  <list [9]>    <dbl [1,024]>  
#>  2 great <list [100]> <list [61]>   <dbl [1,024]>  
#>  3 great <list [106]> <list [67]>   <dbl [1,024]>  
#>  4 great <list [36]>  <list [20]>   <dbl [1,024]>  
#>  5 great <list [18]>  <list [9]>    <dbl [1,024]>  
#>  6 great <list [30]>  <list [17]>   <dbl [1,024]>  
#>  7 other <list [87]>  <list [58]>   <dbl [1,024]>  
#>  8 great <list [54]>  <list [33]>   <dbl [1,024]>  
#>  9 great <list [59]>  <list [36]>   <dbl [1,024]>  
#> 10 great <list [44]>  <list [24]>   <dbl [1,024]>  
#> # … with more rows
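The hashing idea can be sketched in a few lines of base R. Each term is mapped by a hash function to a position in a fixed-length vector, and that position holds either a count or, in binary mode, a 1. The toy_hash() and hash_terms() functions below are hypothetical helpers written for this illustration; the hash is a simple rolling hash and not the one Spark uses, so the positions will differ from those in hashed_features.

```r
# Toy hash: maps a term to a position between 1 and num_features
toy_hash <- function(term, num_features) {
  h <- 0
  for (code in utf8ToInt(term)) {
    h <- (h * 31 + code) %% num_features
  }
  h + 1 # R vectors are 1-indexed
}

# Build a fixed-length feature vector from a character vector of terms
hash_terms <- function(terms, num_features = 1024, binary = TRUE) {
  out <- numeric(num_features)
  for (term in terms) {
    idx <- toy_hash(term, num_features)
    # binary = TRUE records presence only; otherwise accumulate a count
    out[idx] <- if (binary) 1 else out[idx] + 1
  }
  out
}

terms <- c("best", "tasting", "gummy", "fruits", "best")

count_vec  <- hash_terms(terms, num_features = 16, binary = FALSE)
binary_vec <- hash_terms(terms, num_features = 16, binary = TRUE)

sum(count_vec) # total number of terms, repeats included
#> [1] 5
max(binary_vec) # binary mode never exceeds 1
#> [1] 1
```

The output length always equals num_features, regardless of how many terms the review has. Different terms can land on the same position, which is why a larger num_features reduces collisions.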

Normalize results

Finally, we normalize the hashed column using ft_normalizer().

sff_training_data %>% 
  ft_tokenizer(
    input_col = "review",
    output_col = "word_list"
  ) %>% 
  ft_stop_words_remover(
    input_col = "word_list", 
    output_col = "wo_stop_words"
    ) %>% 
  ft_hashing_tf(
    input_col = "wo_stop_words", 
    output_col = "hashed_features", 
    binary = TRUE, 
    num_features = 1024
    ) %>%
  ft_normalizer(
    input_col = "hashed_features", 
    output_col = "normal_features"
    ) %>% 
  select(3:7) 
#> # Source: spark<?> [?? x 5]
#>    score word_list    wo_stop_words hashed_features normal_features
#>    <chr> <list>       <list>        <list>          <list>         
#>  1 other <list [17]>  <list [9]>    <dbl [1,024]>   <dbl [1,024]>  
#>  2 great <list [100]> <list [61]>   <dbl [1,024]>   <dbl [1,024]>  
#>  3 great <list [106]> <list [67]>   <dbl [1,024]>   <dbl [1,024]>  
#>  4 great <list [36]>  <list [20]>   <dbl [1,024]>   <dbl [1,024]>  
#>  5 great <list [18]>  <list [9]>    <dbl [1,024]>   <dbl [1,024]>  
#>  6 great <list [30]>  <list [17]>   <dbl [1,024]>   <dbl [1,024]>  
#>  7 other <list [87]>  <list [58]>   <dbl [1,024]>   <dbl [1,024]>  
#>  8 great <list [54]>  <list [33]>   <dbl [1,024]>   <dbl [1,024]>  
#>  9 great <list [59]>  <list [36]>   <dbl [1,024]>   <dbl [1,024]>  
#> 10 great <list [44]>  <list [24]>   <dbl [1,024]>   <dbl [1,024]>  
#> # … with more rows
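As a rough illustration of what normalizing a feature vector means, the base R sketch below scales a vector to unit length. It assumes the L2 norm, which corresponds to the default p = 2 argument of ft_normalizer(). The l2_normalize() helper and the short input vector are made up for this example.

```r
# Scale a numeric vector so that its L2 (Euclidean) norm equals 1
l2_normalize <- function(x) {
  norm <- sqrt(sum(x^2))
  if (norm == 0) {
    return(x) # an all-zero vector is left unchanged
  }
  x / norm
}

# A short stand-in for one row of binary hashed features
hashed <- c(0, 1, 0, 1, 1, 1)

normal <- l2_normalize(hashed)
normal
#> [1] 0.0 0.5 0.0 0.5 0.5 0.5

sum(normal^2)
#> [1] 1
```

After normalization, reviews with many distinct terms and reviews with few terms produce vectors of the same length, so review size alone has less influence on the model.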

Important concept

ft_hashing_tf() outputs the index and frequency of each term. This is similar to how “dummy variables” are created for each discrete value of a categorical variable. It means that for modeling we need only one “column”, hashed_features. However, we will use normal_features for the model, because it is derived from hashed_features.

Prepare the model with an ML Pipeline

The same set of complex transformations is needed for both modeling and predictions. Without a pipeline, this means duplicating the code for both. This is not ideal when developing, because any change in the transformations has to be copied to both sets of code. This makes a compelling argument for using ML Pipelines.

We can initialize a pipeline using ml_pipeline(), and then pass the exact same steps used in the previous section. We then append the model formula via ft_r_formula(), followed by the model function, in this case ml_logistic_regression().

sff_pipeline <- ml_pipeline(sc) %>% 
  ft_tokenizer(
    input_col = "review",
    output_col = "word_list"
  ) %>% 
  ft_stop_words_remover(
    input_col = "word_list", 
    output_col = "wo_stop_words"
    ) %>% 
  ft_hashing_tf(
    input_col = "wo_stop_words", 
    output_col = "hashed_features", 
    binary = TRUE, 
    num_features = 1024
    ) %>%
  ft_normalizer(
    input_col = "hashed_features", 
    output_col = "normal_features"
    ) %>% 
  ft_r_formula(score ~ normal_features) %>% 
  ml_logistic_regression()  

sff_pipeline
#> Pipeline (Estimator) with 6 stages
#> <pipeline__87caaa39_2fa9_4708_a1e1_20ab570c8917> 
#>   Stages 
#>   |--1 Tokenizer (Transformer)
#>   |    <tokenizer__e3cf3ba6_f7e9_4a05_a41d_11963d70fd6c> 
#>   |     (Parameters -- Column Names)
#>   |      input_col: review
#>   |      output_col: word_list
#>   |--2 StopWordsRemover (Transformer)
#>   |    <stop_words_remover__3fc0bf48_9fa0_441a_9bb3_5a19ec72be0f> 
#>   |     (Parameters -- Column Names)
#>   |      input_col: word_list
#>   |      output_col: wo_stop_words
#>   |--3 HashingTF (Transformer)
#>   |    <hashing_tf__3fa3d087_39e8_4668_9921_28150a53412c> 
#>   |     (Parameters -- Column Names)
#>   |      input_col: wo_stop_words
#>   |      output_col: hashed_features
#>   |--4 Normalizer (Transformer)
#>   |    <normalizer__6d4d9c1c_7488_4a4d_8d42_d9830de4ee2f> 
#>   |     (Parameters -- Column Names)
#>   |      input_col: hashed_features
#>   |      output_col: normal_features
#>   |--5 RFormula (Estimator)
#>   |    <r_formula__4ae7b190_ce59_4d5f_b75b_cbd623e1a790> 
#>   |     (Parameters -- Column Names)
#>   |      features_col: features
#>   |      label_col: label
#>   |     (Parameters)
#>   |      force_index_label: FALSE
#>   |      formula: score ~ normal_features
#>   |      handle_invalid: error
#>   |      stringIndexerOrderType: frequencyDesc
#>   |--6 LogisticRegression (Estimator)
#>   |    <logistic_regression__46c6e5fb_7c70_44f2_a366_f0a7f94801e1> 
#>   |     (Parameters -- Column Names)
#>   |      features_col: features
#>   |      label_col: label
#>   |      prediction_col: prediction
#>   |      probability_col: probability
#>   |      raw_prediction_col: rawPrediction
#>   |     (Parameters)
#>   |      aggregation_depth: 2
#>   |      elastic_net_param: 0
#>   |      family: auto
#>   |      fit_intercept: TRUE
#>   |      max_iter: 100
#>   |      maxBlockSizeInMB: 0
#>   |      reg_param: 0
#>   |      standardization: TRUE
#>   |      threshold: 0.5
#>   |      tol: 1e-06

Fit and predict

sff_pipeline is an ML Pipeline, which is essentially a set of steps to take. It can be thought of as a recipe. In order to actually process the model, we use ml_fit(). This executes all of the transformations, and then fits the model. In other words, ml_fit() runs all of the steps in the pipeline. The output is an ML Pipeline Model.

sff_pipeline_model <- ml_fit(sff_pipeline, sff_training_data)

sff_pipeline_model is more than just a “fitted” model. It also contains all of the pre-processing steps, so any new data passed through it goes through the same transformations before the predictions run. To execute the pipeline model against the test data, we use ml_transform().

sff_test_predictions <- sff_pipeline_model %>% 
  ml_transform(sff_testing_data) 

glimpse(sff_test_predictions)
#> Rows: ??
#> Columns: 12
#> Database: spark_connection
#> $ product         <chr> "B005GXFP60", "B000G7V394", "B004WJAULO", "B003D4MBOS"…
#> $ review          <chr> "These are the best tasting gummy fruits I have ever e…
#> $ score           <chr> "great", "great", "other", "other", "great", "other", …
#> $ word_list       <list> ["these", "are", "the", "best", "tasting", "gummy", "…
#> $ wo_stop_words   <list> ["best", "tasting", "gummy", "fruits", "ever", "eaten…
#> $ hashed_features <list> <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ normal_features <list> <0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.000000…
#> $ features        <list> <0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.000000…
#> $ label           <dbl> 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, …
#> $ rawPrediction   <list> <8.570594, -8.570594>, <-0.1648486, 0.1648486>, <-1.9…
#> $ probability     <list> <0.9998104359, 0.0001895641>, <0.4588809, 0.5411191>,…
#> $ prediction      <dbl> 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, …
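In the output above, label 0 lines up with “great” and label 1 with “other”. One simple way to compare predictions with labels is a confusion table. The base R sketch below uses only the 18 label and prediction values visible in the glimpse() output, typed in by hand, so it illustrates the technique and says little about the full test set.

```r
# First 18 values of the label and prediction columns, copied from above
label      <- c(0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0)
prediction <- c(0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0)

# Rows are the actual labels, columns are the predicted labels
confusion <- table(label = label, prediction = prediction)
confusion

# Share of these 18 rows where the prediction matches the label
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

The diagonal of the table holds the correct predictions, and the off-diagonal cells hold the two kinds of errors.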

Using ml_metrics_binary(), we can see how well the model performed.

ml_metrics_binary(sff_test_predictions)
#> # A tibble: 2 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.706
#> 2 pr_auc  binary         0.567
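For intuition about the roc_auc metric, it can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The base R sketch below computes it with the rank-based formula on a hypothetical four-row data set. The roc_auc() helper and the data are made up for this illustration, and which class counts as positive is an assumption of the example.

```r
# ROC AUC via the rank-sum (Mann-Whitney) formula.
# label: 1 for the positive class, 0 for the negative class
# score: predicted probability of the positive class
roc_auc <- function(label, score) {
  ranks <- rank(score) # ties receive average ranks
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(ranks[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical toy data
label <- c(1, 1, 0, 0)
score <- c(0.9, 0.4, 0.6, 0.1)

roc_auc(label, score)
#> [1] 0.75
```

Of the four positive and negative pairs in the toy data, three are ordered correctly, which gives 0.75. A value of 0.5 corresponds to random ordering and 1 to a perfect ordering.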

Tune the model (optional)

The performance of the model may be acceptable, but there could be a desire to improve it. Hyperparameter tuning can be applied to figure out whether there are better function arguments to use. A big advantage of using an ML Pipeline for the initial model is that we can use the exact same pipeline code to perform the tuning. The Grid Search Tuning article shows how to do this.