An Easier Way to Analyze Big Data


Big Data Analytics is impressive. It gets you a high-paying job. Your friends are in awe of you. There is good reason for this: not everyone can do data analytics, because you need to learn how to code, and coding takes a long time to learn. Consider the two heavyweight tools used for data analytics, Python and R. Both are powerful, but each carries a steep learning curve.
But what if there is a new and better way?
Now there is! Have you heard of RapidMiner Go? It is a web-based data analytics solution that helps you analyze big data in a DIY way. You follow a few wizard-guided steps. This is perfect for you if you don’t want to learn coding. Drag and drop, or as we say, "pindot pindot lang!" (just click, click!)
Follow me as we go through how to use RapidMiner Go. First, we open our browser and visit: http://go.rapidminer.com/. We create an account and log in. This is the main menu screen:
We then click on “Build a new predictive model”. This takes us to the next step.
For our tour, let us use their sample data. I open the sample data dropdown and select the only dataset: Churn Data.
From the name of the dataset, I assume this is sample data that shows the various plans that customers took, and what happened to them later on. Did they leave the company (Churn=YES)? Or did they opt to stay (Churn=NO)? The screenshot is shown below:
Sample Dataset: Churn Database
From there, the screen asks which column we want to predict. I specified the Churn column. This is what we call the ‘label’: it is what the analytics software will try to PREDICT.
Choose: YES label
I’d like the predictive model to focus on the YES values of the Churn field. This will help create a model that predicts which of our customers are MORE likely to take their business elsewhere. Armed with a prediction of which customers are likely to leave, business owners can then launch loyalty programs that aim to persuade these clients to stay.
In the next step, I ignore “Define Gains and Costs” for now. To keep this simple, I press NEXT to take us to the next step, which is to select the INPUTs that the predictive models will use.
Looking at the table above, I would want to choose “CustServ Calls”. The system has already flagged this input column, and it has also pre-selected (checked) several other columns. As a newbie, I used to select ALL the inputs. But as you will learn with experience, MORE is not better. I’m going to unselect the other pre-selected columns whose correlations are less than 1%. Creating a predictive model with fewer inputs speeds up the computation process. I usually also exclude columns that have a lot of missing values.
The choices I made look like this:
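For readers curious what this kind of input screening looks like in code, here is a minimal sketch in plain Python. It uses the Pearson correlation as a stand-in, since I am not claiming this is how RapidMiner Go computes its own correlation figure. The function names, the 50% missing-value cutoff, and the toy rows are all made up for illustration; only the 1% correlation cutoff comes from the step above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    if n == 0:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)

def select_inputs(columns, label, min_corr=0.01, max_missing=0.5):
    """Keep columns whose |correlation| with the label is at least min_corr
    and whose share of missing values (None) is at most max_missing."""
    kept = []
    for name, values in columns.items():
        missing = sum(v is None for v in values) / len(values)
        if missing > max_missing:
            continue  # too many gaps, skip this column
        pairs = [(v, y) for v, y in zip(values, label) if v is not None]
        corr = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if abs(corr) >= min_corr:
            kept.append(name)
    return kept

# Made-up toy rows, NOT the real Churn dataset (1 = churned, 0 = stayed).
churn = [0, 0, 0, 1, 1, 0]
columns = {
    "CustServ Calls": [1, 2, 0, 7, 8, 3],
    "Constant Column": [5, 5, 5, 5, 5, 5],
    "Mostly Missing": [None, None, None, None, 2.0, 1.0],
}
print(select_inputs(columns, churn))
```

On the toy rows, only "CustServ Calls" survives: the constant column has no correlation with churn, and the third column is missing in more than half of its rows.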

Click on NEXT to get us to the next step: selecting the models. For the uninitiated, the models are the various algorithms you can run to create the predictions. Which ones you use depends on the circumstances, the types of data, and what you want to achieve. For now, I’m going to run ALL the models that are presented, then look at the results and decide based on them.

Click on .

I enabled the "Explain predictions" option.

RapidMiner Go will then proceed with its processing. As it runs, it computes the predictive strength of each of the algorithms. Generally speaking, the higher the accuracy, the better. But I caution that, besides the Accuracy metric, you also need to consider Classification Error (lower is better), Precision (higher is better), Area Under the Curve (closer to 1 is better), and model building time.
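If you want to see where some of these numbers come from, here is a small sketch in plain Python that computes accuracy, classification error, and precision from a list of actual and predicted labels. The function name and the sample labels are made up for illustration, and Area Under the Curve is left out because it needs prediction scores rather than plain YES/NO labels.

```python
def classification_metrics(actual, predicted, positive="yes"):
    """Compute accuracy, classification error, and precision for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    tn = sum(a != positive and p != positive for a, p in pairs)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    return {
        "accuracy": accuracy,
        "classification_error": 1 - accuracy,
        # Of everyone we predicted would churn, how many really did?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
    }

# Made-up labels for eight customers, not taken from the Churn dataset.
actual    = ["yes", "yes", "no", "no", "no",  "no", "yes", "no"]
predicted = ["yes", "no",  "no", "no", "yes", "no", "yes", "no"]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

Notice that accuracy and classification error always add up to 1, so they tell the same story; precision is the one that tells you how trustworthy a YES prediction is.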

All other things being equal, I would select the model with the shortest build time.

Initial processing window
The results are out!
Decision Tree and Random Forest have the highest accuracy
Both Decision Tree and Random Forest reached the highest accuracy of 96.85%. To get more details on the two, we click on the Decision Tree icon on the left side:
What does the Decision Tree model look like? (Scroll down the page.)
Decision Tree Model
What this says is that if your customer made more than 6.5 customer service calls (in practice, 7 or more), then they are likely to churn (leave you)! The model is 96.85% accurate overall. As a call center manager, I would suggest putting a rule or trigger in place here: any client who calls more than 4 or 5 times should be escalated to a supervisor, or a promo package should be kept ready to help dissuade this obviously dissatisfied client from leaving you.
Random Forest arrived at the same results. But as the comparison shows, the Decision Tree is faster to build (13.95 seconds vs 100.52 seconds), so I would work with the Decision Tree on this one.
Now let's inspect the results in detail. Click on the ‘simulator icon’ on the left. Select Decision Trees, and drag the CustServ Calls pointer to the right. This simulates a client making an increasing number of calls to the company.
You will notice that as you drag the pointer past the 6.5 mark, the prediction changes from NO to YES. Dragging the other input columns doesn’t change the prediction nearly as much as CustServ Calls.
Once we are satisfied with the model, we can deploy it on our website. I can also export it to my desktop RapidMiner.
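The tie-break I just described (highest accuracy first, then shortest build time) can be written down in a few lines of Python. This is only a sketch of my own decision rule, not something RapidMiner Go does for you; the accuracy and timing figures are the ones reported above, and the dictionary layout is made up for illustration.

```python
# Results as reported in this walkthrough.
results = [
    {"model": "Decision Tree", "accuracy": 96.85, "build_seconds": 13.95},
    {"model": "Random Forest", "accuracy": 96.85, "build_seconds": 100.52},
]

# Sort key: highest accuracy first (hence the minus sign),
# then the shortest build time as the tie-breaker.
best = min(results, key=lambda r: (-r["accuracy"], r["build_seconds"]))
print(best["model"])
```

With equal accuracy, the build time decides, so the Decision Tree wins.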
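To make the simulator behaviour concrete, here is a tiny Python sketch that mimics dragging the pointer from 0 to 10 calls. It assumes the model boils down to the single "more than 6.5 calls" rule quoted above; the real tree may contain more splits than that, and the function names are made up for illustration.

```python
def predict_churn(cust_serv_calls, threshold=6.5):
    """Single-rule stand-in for the tree: churn if calls exceed the threshold."""
    return "YES" if cust_serv_calls > threshold else "NO"

def first_flip(max_calls=10):
    """Return the lowest whole number of calls at which the prediction is YES."""
    for calls in range(max_calls + 1):
        if predict_churn(calls) == "YES":
            return calls
    return None

# Sweep the number of calls, like dragging the pointer to the right.
for calls in range(0, 11):
    print(calls, predict_churn(calls))

print("Prediction flips at", first_flip(), "calls")
```

Since nobody makes half a call, the flip happens at the seventh call, which is why escalating at 4 or 5 calls gives your team a head start.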
Export the Decision Tree Model
The exported Process is then loaded into my Desktop RapidMiner and looks like this:
To deploy it for “Production” we select “APPLY MODEL” and in the dropdown
Once the model is deployed as in the case above, your website just sends an HTTP POST request, with a body like the one shown below, to https://go.rapidminer.com/am/api/deployments/2f09d128-2501-43c9-952d-b7d22b75ba85
{ "data": [ { "Int'l Plan": "No Plan", "Day Charge": 45.07, "Day Mins": 265.1, "Intl Mins": 10, "CustServ Calls": 4 } ] }
It returns the prediction result. Cool, right?
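If your website runs on Python, the request could be assembled roughly like this, using only the standard library. Treat it as a sketch under assumptions: the URL and payload are the ones from this walkthrough, but I am assuming a plain JSON body with a Content-Type header, and I have left out any authentication your deployment may require. The code only builds the request; the line that would actually send it is commented out, and the shape of the response is not shown here because it was not covered above.

```python
import json
import urllib.request

# Deployment URL from this walkthrough; yours will have a different ID.
URL = ("https://go.rapidminer.com/am/api/deployments/"
       "2f09d128-2501-43c9-952d-b7d22b75ba85")

# One customer to score, using the same fields as the example payload.
payload = {
    "data": [
        {
            "Int'l Plan": "No Plan",
            "Day Charge": 45.07,
            "Day Mins": 265.1,
            "Intl Mins": 10,
            "CustServ Calls": 4,
        }
    ]
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    # Assumption: the service accepts JSON; add auth headers if yours needs them.
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(request.get_method(), request.full_url)

# To actually send it, uncomment the lines below:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```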
Now go and try it out yourself. Let me know how it goes for you! Drop me your comments and suggestions. Or better yet, join us at our Facebook Group https://www.facebook.com/groups/108560036239757