Watson Studio Local is now part of IBM Cloud Pak for Data. Learn more about Cloud Pak for Data.

Summary

This code pattern demonstrates how data scientists can leverage remote Spark clusters and compute environments to train and deploy a spam filter model. The model is built using natural language processing and machine learning algorithms and is used to classify whether a given text message is spam or not.

Description

This code pattern demonstrates how data scientists can leverage remote Spark clusters and compute environments from Hortonworks Data Platform (HDP) to train and deploy a spam filter model using Watson Studio Local.

A spam filter is a classification model built using natural language processing and machine learning algorithms. The model is trained on an SMS spam collection dataset to classify whether a given text message is spam or ham (not spam).
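
For concreteness, the SMS Spam Collection dataset ships as plain text with one tab-separated record per line: a label (ham or spam) followed by the raw message. A minimal sketch of loading it with pandas, assuming the file is stored as a project asset named SMSSpamCollection:

    import pandas as pd

    # The dataset is tab-separated: "<label>\t<message text>"; the file name
    # and location are assumptions, so adjust them to where the asset lives.
    df = pd.read_csv("SMSSpamCollection", sep="\t", names=["label", "text"])

    print(df["label"].value_counts())  # class balance: ham vs. spam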

This code pattern provides multiple examples to tackle this problem, utilizing both local (Watson Studio Local) and remote (HDP cluster) resources.

After completing this code pattern, you’ll understand how to:

Load data into Spark DataFrames and use Spark’s machine learning library (MLlib) to develop, train, and deploy the spam filter model (sketched after this list).
Load the data into pandas DataFrames and use the scikit-learn machine learning library to develop, train, and deploy the spam filter model (sketched after this list).
Use the sparkmagics library to connect to the remote Spark service on the HDP cluster via the Hadoop Integration service (sketched after this list).
Use the sparkmagics library to push the Python virtual environment containing the scikit-learn library to the remote HDP cluster via the Hadoop Integration service.
Package the spam filter model as a Python egg and distribute the egg to the remote HDP cluster via the Hadoop Integration service (sketched after this list).
Run the spam filter model (both the PySpark and scikit-learn versions) on the remote HDP cluster, utilizing the remote Spark context and the remote Python virtual environment, all from within IBM Watson Studio Local.
Save the spam filter model on the remote HDP cluster, import it back into Watson Studio Local, then batch score and evaluate the model (sketched after this list).
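
A minimal sketch of the MLlib approach, assuming the tab-separated dataset from above; the file path, column names, stages, and choice of Naive Bayes are illustrative, not the pattern's exact notebook code:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
    from pyspark.ml.classification import NaiveBayes

    spark = SparkSession.builder.appName("spam-filter").getOrCreate()

    # File path is a placeholder; the dataset is tab-separated label/text pairs.
    data = spark.read.csv("SMSSpamCollection", sep="\t").toDF("label", "text")

    pipeline = Pipeline(stages=[
        StringIndexer(inputCol="label", outputCol="labelIndex"),  # ham/spam -> 0/1
        Tokenizer(inputCol="text", outputCol="words"),            # split into tokens
        HashingTF(inputCol="words", outputCol="tf"),              # term frequencies
        IDF(inputCol="tf", outputCol="features"),                 # TF-IDF weighting
        NaiveBayes(labelCol="labelIndex", featuresCol="features"),
    ])

    train, test = data.randomSplit([0.8, 0.2], seed=42)
    model = pipeline.fit(train)
    predictions = model.transform(test)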
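
The scikit-learn variant follows the same shape on a pandas DataFrame; a minimal sketch, reusing the df loaded in the earlier pandas example:

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import classification_report

    # df has "label" and "text" columns, as loaded above.
    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42)

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))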
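
Connecting a notebook to the remote Spark service is typically a matter of a few sparkmagic commands against the Livy endpoint exposed by the Hadoop Integration service; the URL and session name below are placeholders:

    # Load the sparkmagic extension in the notebook (IPython line magic).
    %load_ext sparkmagic.magics

    # Create a remote PySpark session against the Livy endpoint
    # (URL and session name are placeholders).
    %spark add -s spam_session -l python -u https://<hadoop-integration-host>/livy

    # Cells prefixed with %%spark now execute in the remote Spark context.

The same mechanism underlies pushing the Python virtual environment to the cluster, so that the scikit-learn version of the model can run remotely.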
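
Packaging the model code as an egg is plain setuptools; the package name and layout below are assumptions:

    # setup.py for the model package (name and layout are illustrative).
    from setuptools import setup, find_packages

    setup(name="spamfilter", version="0.1", packages=find_packages())

After building with python setup.py bdist_egg, the egg can be registered with the remote Spark context so the executors can import it:

    # sc is the remote SparkContext; the egg file name is illustrative.
    sc.addPyFile("dist/spamfilter-0.1-py3.6.egg")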
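
Saving the trained MLlib pipeline to HDFS on the cluster and reloading it for batch scoring might look like this; the HDFS path and the new_messages_df DataFrame are placeholders:

    from pyspark.ml import PipelineModel

    # Persist the fitted pipeline to HDFS on the remote cluster.
    model.save("hdfs:///user/<user>/models/spam_filter")

    # Reload the model and batch score a DataFrame of new messages
    # (new_messages_df is a hypothetical DataFrame with a "text" column).
    reloaded = PipelineModel.load("hdfs:///user/<user>/models/spam_filter")
    scored = reloaded.transform(new_messages_df)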


Flow

The spam collection data set is loaded into Watson Studio Local as an asset.
The user interacts with the Jupyter notebooks by running them in Watson Studio Local.
Watson Studio Local can either use the resources available locally or utilize HDP cluster resources by connecting to Apache Livy, which is part of the Hadoop Integration service.
Livy connects with the HDP cluster to run Apache Spark or access HDFS files (a sketch of the underlying exchange follows this list).
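
Under the covers, Livy exposes a REST API for creating Spark sessions and running statements; sparkmagic drives this for you, but the raw exchange looks roughly like the following (the endpoint is a placeholder):

    import json
    import requests

    livy = "http://<livy-host>:8998"  # placeholder endpoint
    headers = {"Content-Type": "application/json"}

    # Ask Livy to start a PySpark session on the cluster.
    r = requests.post(livy + "/sessions",
                      data=json.dumps({"kind": "pyspark"}), headers=headers)
    session_url = livy + r.headers["Location"]  # e.g. .../sessions/0

    # Submit a statement to run in the remote Spark context.
    requests.post(session_url + "/statements",
                  data=json.dumps({"code": "sc.version"}), headers=headers)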

Instructions

Get the detailed instructions in the README file. These steps will show you how to:

Clone the repo.
Create a project in IBM Watson Studio Local.
Create project assets.
Commit changes to the Watson Studio Local Master Repository.
Run the notebooks listed in each example.
