Text mining with Spark & sparklyr

This article focuses on a set of functions that can be used for text mining with Spark and sparklyr. The main goal is to illustrate how to perform most of the data preparation and analysis with commands that will run inside the Spark cluster, as opposed to locally in R. Because of that, the amount of data used will be small.

Data source

For this example, there are two files that will be analyzed. They are both the full works of Sir Arthur Conan Doyle and Mark Twain. The files were downloaded from the Gutenberg Project site via the gutenbergr package. Intentionally, no data cleanup was done to the files prior to this analysis. See the appendix below to see how the data was downloaded and prepared.

readLines("arthur_doyle.txt", 10)
 [1] "cover"                         ""                             
 [3] ""                              ""                             
 [5] "The Return of Sherlock Holmes" ""                             
 [7] ""                              ""                             
 [9] "by Sir Arthur Conan Doyle"     ""                             

Data Import

Connect to Spark

An additional goal of this article is to encourage the reader to try it out, so a simple Spark local mode session is used.


sc <- spark_connect(master = "local")


The spark_read_text() is a new function which works like readLines() but for sparklyr. It comes in handy when non-structured data, such as lines in a book, is what is available for analysis.

# Imports Mark Twain's file

twain_path <- paste0("file:///", here::here(), "/mark_twain.txt")
twain <-  spark_read_text(sc, "twain", twain_path)
# Imports Sir Arthur Conan Doyle's file
doyle_path <- paste0("file:///", here::here(), "/arthur_doyle.txt")
doyle <-  spark_read_text(sc, "doyle", doyle_path)

Data transformation

The objective is to end up with a tidy table inside Spark with one row per word used. The steps will be:

  1. The needed data transformations apply to the data from both authors. The data sets will be appended to one another

  2. Punctuation will be removed

  3. The words inside each line will be separated, or tokenized

  4. For a cleaner analysis, stop words will be removed

  5. To tidy the data, each word in a line will become its own row

  6. The results will be saved to Spark memory


  • sdf_bind_rows() appends the doyle Spark Dataframe to the twain Spark Dataframe. This function can be used in lieu of a dplyr::bind_rows() wrapper function. For this exercise, the column author is added to differentiate between the two bodies of work.
all_words <- doyle %>%
  mutate(author = "doyle") %>%
    twain %>%
      mutate(author = "twain")
  }) %>%
  filter(nchar(line) > 0)


  • The Hive UDF, regexp_replace, is used as a sort of gsub() that works inside Spark. In this case it is used to remove punctuation. The usual [:punct:] regular expression did not work well during development, so a custom list is provided. For more information, see the Hive Functions section in the dplyr page.
all_words <- all_words %>%
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))


  • ft_tokenizer() uses the Spark API to separate each word. It creates a new list column with the results.
all_words <- all_words %>%
      input_col = "line",
      output_col = "word_list"

head(all_words, 4)
# Source: spark<?> [?? x 3]
  line                          author word_list 
  <chr>                         <chr>  <list>    
1 cover                         doyle  <list [1]>
2 The Return of Sherlock Holmes doyle  <list [5]>
3 by Sir Arthur Conan Doyle     doyle  <list [5]>
4 Contents                      doyle  <list [1]>


  • ft_stop_words_remover() is a new function that, as its name suggests, takes care of removing stop words from the previous transformation. It expects a list column, so it is important to sequence it correctly after a ft_tokenizer() command. In the sample results, notice that the new wo_stop_words column contains less items than word_list.
all_words <- all_words %>%
    input_col = "word_list",
    output_col = "wo_stop_words"

head(all_words, 4)
# Source: spark<?> [?? x 4]
  line                          author word_list  wo_stop_words
  <chr>                         <chr>  <list>     <list>       
1 cover                         doyle  <list [1]> <list [1]>   
2 The Return of Sherlock Holmes doyle  <list [5]> <list [3]>   
3 by Sir Arthur Conan Doyle     doyle  <list [5]> <list [4]>   
4 Contents                      doyle  <list [1]> <list [1]>   


  • The Hive UDF explode performs the job of unnesting the tokens into their own row. Some further filtering and field selection is done to reduce the size of the dataset.
all_words <- all_words %>%
  mutate(word = explode(wo_stop_words)) %>%
  select(word, author) %>%
  filter(nchar(word) > 2)

head(all_words, 4)
# Source: spark<?> [?? x 2]
  word     author
  <chr>    <chr> 
1 cover    doyle 
2 return   doyle 
3 sherlock doyle 
4 holmes   doyle 


  • compute() will operate this transformation and cache the results in Spark memory. It is a good idea to pass a name to compute() to make it easier to identify it inside the Spark environment. In this case the name will be all_words
all_words <- all_words %>%

Full code

This is what the code would look like on an actual analysis:

all_words <- doyle %>%
  mutate(author = "doyle") %>%
    twain %>%
      mutate(author = "twain")
  }) %>%
  filter(nchar(line) > 0) %>%
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " ")) %>%
    input_col = "line",
    output_col = "word_list"
  ) %>%
    input_col = "word_list",
    output_col = "wo_stop_words"
  ) %>%
  mutate(word = explode(wo_stop_words)) %>%
  select(word, author) %>%
  filter(nchar(word) > 2) %>%

Data Analysis

Words used the most

word_count <- all_words %>%
  count(author, word) %>% 

# Source: spark<?> [?? x 3]
   author word              n
   <chr>  <chr>         <dbl>
 1 doyle  empty           398
 2 doyle  students        109
 3 doyle  golden          303
 4 doyle  abbey           164
 5 doyle  grange           18
 6 doyle  year            866
 7 doyle  world          1520
 8 doyle  circumstances   284
 9 doyle  particulars      49
10 doyle  crime           357
# … with more rows

Words used by Doyle and not Twain

doyle_unique <- filter(word_count, author == "doyle") %>%
    filter(word_count, author == "twain"), 
    by = "word"
    ) %>%

doyle_unique %>% 
# Source:     spark<?> [?? x 3]
# Ordered by: -n
   author word          n
   <chr>  <chr>     <dbl>
 1 doyle  nigel       972
 2 doyle  alleyne     500
 3 doyle  ezra        421
 4 doyle  maude       337
 5 doyle  aylward     336
 6 doyle  lestrade    311
 7 doyle  catinat     301
 8 doyle  sharkey     281
 9 doyle  summerlee   248
10 doyle  congo       211
# … with more rows
doyle_unique %>%
  arrange(-n) %>%
  head(100) %>%
  collect() %>%
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")

Twain and Sherlock

The word cloud highlighted something interesting. The word “lestrade” is listed as one of the words used by Doyle but not Twain. Lestrade is the last name of a major character in the Sherlock Holmes books. It makes sense that the word “sherlock” appears considerably more times than “lestrade” in Doyle’s books, so why is Sherlock not in the word cloud? Did Mark Twain use the word “sherlock” in his writings?

all_words %>%
    author == "twain",
    word == "sherlock"
    ) %>%
# Source: spark<?> [?? x 1]
1    16

The all_words table contains 16 instances of the word sherlock in the words used by Twain in his works. The instr Hive UDF is used to extract the lines that contain that word in the twain table. This Hive function works can be used instead of base::grep() or stringr::str_detect(). To account for any word capitalization, the lower command will be used in mutate() to make all words in the full text lower cap.

instr() & lower()

Most of these lines are in a short story by Mark Twain called A Double Barrelled Detective Story. As per the Wikipedia page about this story, this is a satire by Twain on the mystery novel genre, published in 1902.

twain %>%
  mutate(line = lower(line)) %>%
  filter(instr(line, "sherlock") > 0) %>%
 [1] "late sherlock holmes, and yet discernible by a member of a race charged" 
 [2] "sherlock holmes."                                                        
 [3] "“uncle sherlock! the mean luck of it!--that he should come just"         
 [4] "another trouble presented itself. “uncle sherlock 'll be wanting to talk"
 [5] "flint buckner's cabin in the frosty gloom. they were sherlock holmes and"
 [6] "“uncle sherlock's got some work to do, gentlemen, that 'll keep him till"
 [7] "“by george, he's just a duke, boys! three cheers for sherlock holmes,"   
 [8] "he brought sherlock holmes to the billiard-room, which was jammed with"  
 [9] "of interest was there--sherlock holmes. the miners stood silent and"     
[10] "the room; the chair was on it; sherlock holmes, stately, imposing,"      
[11] "“you have hunted me around the world, sherlock holmes, yet god is my"    
[12] "“if it's only sherlock holmes that's troubling you, you needn't worry"   
[13] "they sighed; then one said: “we must bring sherlock holmes. he can be"   
[14] "i had small desire that sherlock holmes should hang for my deeds, as you"
[15] "“my name is sherlock holmes, and i have not been doing anything.”"       
[16] "late sherlock holmes, and yet discernible by a member of a race charged" 


gutenbergr package

This is an example of how the data for this article was pulled from the Gutenberg site:


gutenberg_works()  %>%
  filter(author == "Twain, Mark") %>%
  pull(gutenberg_id) %>%
  gutenberg_download(mirror = "http://mirrors.xmission.com/gutenberg/") %>%
  pull(text) %>%