A Walk-through of RapidMiner

Handcrafted by yours truly, Hariharan.

Hariharan M
18 min read · Oct 16, 2018

Please find the blog on Medium; it will feel much more natural to read it there.

https://medium.com/@vshshv3/a-walk-through-the-rapid-miner-921dfaf53722

In this simple blog post, I'll walk through my experience with RapidMiner, a tool that is currently the buzz in the Big Data industry. Everybody uses it because it offers rapid prototyping and a user-friendly integration of various data mining techniques. You simply drag and drop operators onto the canvas and run through the phases of data mining. As simple as that.

The tasks I'll cover for this assignment will predominantly come from Machine Learning, my passionate choice of study :)

Introduction to RapidMiner

RapidMiner is a data science platform for quickly analyzing data, meant largely for non-programmers and researchers. You have a load of data, and you have an idea in your mind. You easily create processes, import data into them, run them, and out comes a prediction model. RM supports porting ML models to web apps (Flask or Node.js), Android, iOS, etc., which is why it unifies the entire spectrum of the Big Data analytics lifecycle.

One place for everything. Courtesy: Rapid Miner Website

The best part is that it brings the scattered tasks of data mining and analysis into one place. It is a single environment to load data (from anywhere: Hadoop, cloud storage, RDBMS, NoSQL, PDF, etc.), pre-process and prepare data using standard industrial methods (group items by categories, spawn new child tables, join tables, interpolate missing data, and so on), train models (deep learning as well as Random Forests, XGBoost and other gradient boosted trees), run clustering or prune outliers, and even visualize the outputs.

Finally, you can easily deploy these models on the cloud or in a production environment. You just need to create user interfaces to collect real-time data and run it against the trained model to serve a task. I specialize in Android :), and RM's Android wrapper was easier to grasp than even some of the deep learning frameworks' (like Keras or TensorFlow).

How good is RM compared to other, more established Big Data tools, like those from the Apache Foundation?

We'll answer this question after the demo. You'll then see what makes RM stand out.

Let's quickly head over to rapidminer.com and install it. I opted for the 30-day trial.

The Features of RapidMiner: An Extensive Demo on the Titanic Dataset

I'll discuss the best you can expect out of RM. It'll be lengthy; it took me some time, but it was really worthwhile.

1. The Best Thing about RM: Turbo Prep

We all know that we spend too much time on data preparation and not as much time as we want on building cool machine learning models. Everybody in the analytics industry is slowed down by clunky data preparation tools, wasting nearly 80% of their time just prepping data, effort that could fruitfully go into modelling and analysis.

This is where the Auto Model and Turbo Prep come in.

RapidMiner Turbo Prep is a dedicated data preparation environment where you simply drag and drop data and transformations through an interactive interface.

I'll explore and demonstrate some of the possibilities of this cool feature, as well as show how well it integrates with RapidMiner Auto Model. These two are the most striking features that set RM apart from other Big Data technologies.

a. Load the data into RM and Inspect

Welcome to RM Studio on my PC. Studio is the IDE (Integrated Development Environment) where all the groundwork happens.

The RM Studio

Let's import some data. We can use RM's pre-installed repository to quickly load the Titanic dataset, which contains a comprehensive list of passenger information from the ill-fated Titanic liner.

Note: We'll import data into Turbo Prep for easy pre-processing and then create models by piping the cleansed data out of Turbo Prep.

How to do it: Turbo Prep → Load Data → Titanic

Import Titanic

This is how RM gives us a peek at some of the quality indicators. The histogram and summary statistics are shown at the top of each feature (column). We can use these to determine whether a feature is suitable for ML or not.

Titanic Dataset

Example: the Cabin column has too many missing values to even begin with, so it is not useful for ML or modelling.

b. Transforming Data

The first step is to look at the data in aggregate, and a pivot table does exactly that.

Pivot Table

The pivot table is used to quickly combine features and group records by some set of categories. There is a row group-by and a column grouping. Within these groups we can place a numerical attribute and aggregate it with any aggregation operation: sum, average, count, percentile, and so on.
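For readers who prefer code, here is a rough pandas sketch of the same pivot idea. The column names (Pclass, Sex, Survived) follow the common Kaggle-style Titanic layout and the file name titanic.csv is hypothetical, so treat this as an illustration rather than RM's exact behaviour.

```python
# A rough pandas sketch of the pivot: group rows by two categorical columns
# and aggregate a numeric one. Column names are assumptions, not RM's labels.
import pandas as pd

df = pd.read_csv("titanic.csv")  # hypothetical local copy of the dataset

pivot = pd.pivot_table(
    df,
    index="Pclass",             # row group-by
    columns="Sex",              # column grouping
    values="Survived",          # numeric attribute to aggregate
    aggfunc=["count", "mean"],  # count of records and survival rate per group
)
print(pivot)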

Visual analysis of the Deck.

From my table above, I see that 144 females from First Class survived, and nearly 493 males from Third Class survived the iceberg collision. A closer look at the grouping strategy reveals a second dimension to Survived: I've grouped by the percentage survived out of the entire count in that category. By that measure, Second Class females were the least saved.

This is contrary to what was shown in the film. The film suggested that the men had to make room for the women and that the lifeboats took in mostly women, yet in these counts the males dominate.

A lot of other analysis can come from this grouping step alone. Quite easy, right?

This new chart tells a story: the average passenger fare in Third Class is just 12.135 pounds for people who didn't survive, but the cost of a First Class life is 117 pounds!!! :) :)

c. Filter Out

The pivot table helped us quickly determine what we should focus on: passenger class, fare, the number of siblings, parents or children aboard, sex, and the most important target variable: Survived or not. With this picture in mind, let's delve further.

As far as machine learning goes, there is no room for nulls, so we need to filter them out first. I'll filter the '?' (null) records out of the table using the FILTER transform.

Select the ‘Cabin’ column, right click, and select ‘Transformations’, then ‘Filter’. Immediately, there will be a preview of the data, so you know whether the results are correct. All the statistics and quality measurements are updated as well.

Condition for filtering. Keep records != ‘?’
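In pandas terms, this filter step amounts to a boolean selection. A minimal sketch, assuming the dataset uses '?' as its missing-value marker and Kaggle-style column names:

```python
# Keep only rows where the Cabin column is not the '?' placeholder.
# Column name and the '?' marker mirror the text above; adjust to your file.
import pandas as pd

df = pd.read_csv("titanic.csv", dtype=str)   # hypothetical local copy
df = df[df["Cabin"] != "?"]                  # keep records != '?'
df = df.dropna(subset=["Cabin"])             # also drop genuine NaNs, just in case
print(len(df), "rows left after filtering")
```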

d. Sort

From the variety of data transformation controls, we can pick any we need. For instance, sort by age, ascending.

Sort records by age.

We may choose to sample, say, only 100 records out of these 295 rows. Again, just click on the SAMPLE tab and choose 100. You get a randomly shuffled sample of 100 records!! Just like that…
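The SORT and SAMPLE tabs map to two one-liners in pandas; a hedged sketch, with the same assumed column names:

```python
# Sort by Age ascending, then draw a reproducible random sample of 100 rows.
import pandas as pd

df = pd.read_csv("titanic.csv")              # hypothetical local copy
df = df.sort_values("Age", ascending=True)   # sort by age, ascending
sample = df.sample(n=100, random_state=42)   # randomly shuffled sample of 100
print(sample[["Age", "Sex", "Survived"]].head())
```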

e. Statistics

Just hover over the fields to see what each one has to say.

Mean passenger fare = 81 pounds

Side note: later on, when we cleanse the data, we will come across the correspondence between the ticket number and the family names. Every member of a family gets the same ticket number, so it is enough to keep just one of these two fields. Maybe the ML model for predicting survival learns this fact: 'If a group of persons share the same ticket number, they belong to one family, and if one of them died, then probably all of them died.'
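You can check this ticket-equals-family observation yourself with a quick group-by; a pandas sketch, assuming Kaggle-style column names:

```python
# Group by ticket number and see how many passengers (and survivors) share it.
import pandas as pd

df = pd.read_csv("titanic.csv")              # hypothetical local copy
families = (
    df.groupby("Ticket")
      .agg(members=("Name", "count"), survivors=("Survived", "sum"))
      .query("members > 1")                   # tickets shared by several passengers
      .sort_values("members", ascending=False)
)
print(families.head(10))
```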

f. Writing Formulas

We see that the number of siblings or spouses and the number of parents or children aboard are each used as attributes. Why not merge the two into a single 'number of relatives' field? It is just the element-wise sum of the two.

Compactness in ML can improve running times, which is especially useful for real-time analytics.

Generate a new column for the total number of relatives per person from these two: select the two columns, right-click, and choose the formula option.

Formula: a[i] = b[i] + c[i] for all i

The purple column is the newly generated field, and the green ones are the ingredients for it. The rest of the fields fall into place as shown.

After committing these changes to the table,

We'll go further by removing the two source columns; keeping them would only add to the junk.
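The formula and the clean-up together are two lines of pandas; a sketch with assumed column names (SibSp for siblings/spouses, Parch for parents/children):

```python
# New "Relatives" field as the element-wise sum, then drop the source columns.
import pandas as pd

df = pd.read_csv("titanic.csv")                   # hypothetical local copy
df["Relatives"] = df["SibSp"] + df["Parch"]       # a[i] = b[i] + c[i] for all i
df = df.drop(columns=["SibSp", "Parch"])          # the two source columns are junk now
print(df[["Relatives"]].describe())
```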

After all this hefty processing,

New table

g. The Charts

The human eye is perceptive: a picture can easily reveal things that a number can't. Turbo Prep has built-in support for swiftly making plots on the go.

Total num of relatives Vs Ticket no
Plot settings in the Charts panel

It doesn't take much to see the story this chart tells. Let's not worry ourselves too much with the survived-or-died question :). On the bright, happy side of things, most of the people who boarded were travelling alone, and of the rest, most families were of two or at most three people.
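If you want to reproduce a similar chart outside RM, a small matplotlib sketch (again with assumed Kaggle-style column names) could look like this:

```python
# A hedged sketch of the "relatives per passenger" chart.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("titanic.csv")                 # hypothetical local copy
df["Relatives"] = df["SibSp"] + df["Parch"]     # total relatives aboard

plt.scatter(range(len(df)), df["Relatives"], s=10)
plt.xlabel("Passenger index (ticket order)")
plt.ylabel("Total number of relatives aboard")
plt.title("Most passengers travelled alone or in groups of 2-3")
plt.show()
```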

This contradicts the film, which shows First Class families of five or six people each. :)

h. Cleansing

We can't just pass all of this junk to our ML model. We'll choose the features that best aid the survived/died prediction and drop the rest.

For instance, we can drop the cabin feature. If the entire ship is about to sink, then obviously no matter what your cabin inside the ship is, you still sink :) :)

Load the transformed data into the Cleanser
Here are the controls

Every phase in this step has a very brief description to help first-timers, and this indeed helped me!

Phase 1: Select the target variable to predict. In our case, Survived.

Target var

Phase 2: to improve the quality of our survived-or-not prediction, we can drop all the index-like fields. A name in England (family name, first name and title) is almost unique, so it serves as an index variable for this table; every person's name is a key!! I'll drop it, since it doesn't help in saying whether he or she died. Next, the ticket number groups the members of a family. We may choose to keep it or drop it. Since RM flags it as a weak feature, I take it that the model won't perform better with it in the mix, so, going by the AI's advice, I drop it as well.

Drop features

Phase 3: do you want to change the data types of the remaining fields? No.

We keep the same data types in each column

Phase 4: we perform standard dimensionality reduction and data normalization, but only on the numerical attributes. We leave the choice to RM; it used Principal Component Analysis here.

PCA and norm
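As a rough idea of what this phase does under the hood, here is a scikit-learn sketch of normalization followed by PCA on the numeric columns (column names assumed; RM's exact pipeline may differ):

```python
# Standardize the numeric columns, then project them onto principal components.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("titanic.csv")                        # hypothetical local copy
numeric = df[["Age", "Fare", "SibSp", "Parch"]].dropna()

scaled = StandardScaler().fit_transform(numeric)       # normalization
pca = PCA(n_components=2)                              # keep the top 2 components
components = pca.fit_transform(scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)
```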

Phase 5: commit the entire operation and save the cleansed data into the local repository or to disk.

After cleansing,

i. Other powerful features for handling multiple tables

To illustrate another powerful feature of Turbo Prep: it has a Merge/Join feature where you can merge columns or join two tables.

Example: a customer table and a transaction table. It works just like a conventional RDBMS join of two tables on a foreign key. I won't need this here, since I've already pruned my data accordingly.
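Conceptually, this is just a relational join. A minimal pandas sketch with a made-up customer table and transaction table joined on a foreign key:

```python
# Join two small, made-up tables on a foreign key, like an RDBMS inner join.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Asha", "Bala", "Chitra"]})
transactions = pd.DataFrame({"customer_id": [1, 1, 3],
                             "amount": [250.0, 90.5, 40.0]})

merged = customers.merge(transactions, on="customer_id", how="inner")
print(merged)
```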

Still, I'll do a quick demo of it, even though it isn't needed for the ML tasks ahead. I just won't commit the changes to my already prepared dataset, which is cleansed and set aside for ML.

Load another pre-processed dataset for this task. There is a Titanic training dataset in the repository that I'll use just for demo purposes.

A Titanic Training demo dataset

Click on the Merge Tab.

Choose to merge this with the Titanic dataset. The degree of coherence shows how relevant this dataset is to the one it is being merged with; 100% means the two are very much alike.

And after specifying the join keys, we merge them.

Note that there are no unique keys on the other side to join with, because both are cleansed datasets for ML and have been stripped of their indexes.

Anyway, this feature is pretty powerful for other SQL-related or data-visualization tasks.

That ends Feature 1: Turbo Prep.

We’ll continue working with our transformed Data. Thank you Turbo Prep for all the cool help.

Process and History Tabs: A quick Look.

Here's the data in its final form. But how did we get here?

Processes: every object on the RM canvas represents a transformation or an operation on the data. The sequence of operations we performed throughout this exercise is in fact implemented as a graph, a chain of changes piped one after the other. Internally, the pipeline runs on RM's process engine as follows:

The process flow

If it weren't for Turbo Prep's easy interface, we would be doing this on our own right now, linking every transformation node by node.

That saved us a lot of work.

Process chart continued.

We can also take a look at the history of operations in Turbo Prep's controls.

2. The Next Best Thing about RM: Auto Model

The RapidMiner Auto Model automates machine learning and accelerates Data Science. This interactive feature makes RM more accessible to new users and more powerful for expert Data Scientists.

Now that we have our data, it takes just a few more clicks to create deep learning models or ensembles. Yes, the world of AI is just clicks away for an absolute beginner, and it is nevertheless a useful tool for even advanced users.

As a deep learning enthusiast, I found it very useful!

AutoModel delivers new capabilities by transforming data to generate actionable insights, without compromising on transparency or control.

Here’s another pipeline of tasks:-

a. Select Task

Decide what type of problem you would like to solve.

  • Predict: This will build a machine learning model which predicts the values of this target based on the values of the other predictors.
  • Clusters: Groups data into clusters.
  • Outliers: This finds unusual points in your data. The goal here is to find individual data points that are far away from all other data points, possibly because of errors in data collection.

Our task is prediction. Remember? Whether a given person survived or not.

Predict

This will be a Classification task, as the target is categorical.

Click on predict.

b. Prepare Target: What is the class of highest interest?

Well, we want to know whether the person came out alive; that is the positive class for this classification task, not the other way round (the person died, didn't he?).

c. Select the input columns that go into the ML model.

As our entire dataset was pruned and prepared especially for ML tasks, we can choose any of the columns here.

Note: the principal components are part of the classification task; they've absorbed the total number of relatives and the other numeric fields.

Attributes shown by RM

Help from RM: Status

The colored status bubble provides a quality indicator for a data column.

  • Red: A red bubble indicates a column of poor quality, which in most cases you should remove from the data set.
  • Yellow: A yellow bubble indicates a column which has either a very low (<0.01) or a very high (>0.5) correlation with the target column.

Quality Bars

The color of the status bubble is based on the following quality measures, displayed as bars alongside each attribute: Correlation, ID-ness (the degree to which the attribute resembles an ID), Stability (how constant the column is), and Missing (the fraction of missing values).
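RM defines these measures internally; the snippet below is only my rough approximation of the four quality bars for a single column, assuming a pandas DataFrame with Kaggle-style column names:

```python
# Home-grown versions of the four quality indicators (my guesses at the idea,
# not RM's exact definitions).
import pandas as pd

def column_quality(df: pd.DataFrame, col: str, target: str) -> dict:
    s = df[col]
    return {
        "correlation": s.corr(df[target]) if pd.api.types.is_numeric_dtype(s) else None,
        "id_ness": s.nunique() / len(s),                       # 1.0 = every value unique, like an ID
        "stability": s.value_counts(normalize=True).iloc[0],   # share of the most common value
        "missing": s.isna().mean(),                            # fraction of missing entries
    }

df = pd.read_csv("titanic.csv")   # hypothetical local copy
print(column_quality(df, "Fare", "Survived"))
```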

d. Select Model Types

Choose from the available predictor models. The models shown are matched to our data's properties and distribution; for example, RM understands that our task is classification and only suggests classifiers, not regressors.

ML models

I choose to run this prediction on all of the available models (a rough scikit-learn sketch of this comparison follows the list):

  • Naive Bayes: a simple and fast probabilistic classifier based on Bayes theorem
  • Decision Tree: finds simple tree-like models which are easy to understand
  • Gradient Boosted Trees (XGBoost): powerful but complex model using ensembles of decision trees
  • Generalized Linear Model (GLM): generalization of linear regression models
  • Logistic Regression: a widely used statistical method for binary classification
  • Deep Learning: multi-level neural network for learning non-linear relationships

This is the most awesome thing I've ever come across. All of my favorite models, all in one place!!

Results: Classification

This is the final step of Auto Model, where you can inspect the generated models together with other results.

Here's a comparison of the various models used for prediction. It shows their accuracies and runtimes, and also compares their ROC curves, all together on a single chart:

ROC Comparison

With regard to ROC, we just need to know that the closer a curve is to the top-left corner, the better the model. ROC curves are only shown for two-class problems.
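For reference, this is roughly how one of those ROC curves could be drawn by hand with scikit-learn, reusing a fitted classifier and the held-out split from the comparison sketch above:

```python
# Plot an ROC curve for a binary classifier (assumes `model`, `X_test`,
# `y_test` from the earlier model-comparison sketch).
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

probs = model.predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")    # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```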

Models

All other sections in the results menu are reserved for the models. Each model gets a section of its own and in general provides the entries below.

For example, for the Naive Bayes classifier, the model looks like this:

Simulator: provides an easy-to-use, real-time interface to change the inputs to a model and view the output.

We can set our own example inputs and run them through the model to generate the output class.

Give in your own test data and see.

It shows predictions, confidences, and explanations for those inputs.

Performance results: lists the model's prediction accuracy and other performance criteria.

This is the classification report with the confusion matrix, precision and recall for the Naive Bayes model. The higher the precision and recall, the better the classifier.
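The equivalent report in scikit-learn is one call away, again reusing a fitted model and the test split from the earlier sketch:

```python
# Confusion matrix plus precision/recall report (assumes `model`, `X_test`,
# `y_test` from the earlier model-comparison sketch).
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["Died", "Survived"]))
```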

Here’s a plot view of the performance.

The optimal parameters are also reported for each model. The gradient boosted model trained 100 trees, and performance was best at a depth of 7.

Of all the models simulated, the gradient boosted trees performed the best, though they also took the longest to train while searching for optimal hyper-parameters.

There are a lot of other cool things we can do with these simulator outputs; Auto Model made it all look effortless. But let's continue with the task at hand: RapidMiner!!!

Feature 3 of RM: Cool Process View

The Process view captures, graphically, everything we have done in the Auto Model UI. This largely helps translate the entire process to production and forms a base for further work or product releases in industry.

Feature 4 of RM: Real-time scoring

Predict at scale, with very low latency, and deliver actionable intelligence in real time to the decision maker (the user) or machine: predicting how your customers behave, when your industrial parts will break, calculating the risks associated with an action, and so on.

Feature 5: SPEED>>>

Parallelization of Loop, Optimize Parameters, FP-Growth, Join and other built-in operators makes it lightning fast to handle complex data imports and perform feature engineering steps with a few clicks.

Feature 6: Advanced ML capabilities and algorithms

The RM team keeps adding extensions and support libraries that expand its data science capabilities to cover more use cases than ever. They are even catering to new domains of ML research: text mining, deep learning, and integration with R, Python, Weka and more.

Feature 7: The Collaboration with many Data Service Providers — MapR, Informatica, Microsoft HDInsight, AWS, Google cloud connector (Splunk, Hive connectors as extensions)

Users can easily connect to data, no matter where it lives, with out-of-the-box connectors to many third-party applications, social media sources, databases and more.

Courtesy: Rapid Miner Blog

Feature 8: The X factor — Stability

The platform is extremely user-friendly. The RM team is constantly fixing bugs and adding speed and power to the operator toolbox. In this new release, they have expanded the coverage and visibility of Studio's powerful 'Quick Fix' feature, which offers you a solution for the most common errors.

Gone are the days when users debugged code themselves: you write an erroneous process and leave the rest to RM.

Feature 9: Exploiting the Hadoop Distributed File System and the cluster's application manager to run parallel code on any node in the cluster

RapidMiner Radoop is another extension that expands coverage of Big Data use cases.

There is support for anomaly detection, windowing and discretization. These operators enable you to identify fraud, detect abnormal consumer or machine behavior in your data on Hadoop, reduce overfitting of machine learning models to improve their predictive performance, do time series forecasting, and more.

Also, the new intuitive UI for configuring the settings and variables of your Hadoop cluster makes it easier to test each section of the MapReduce job independently.

Feature 10: The BEST Part of RM

The help. When you are lost, built-in tutorials are everywhere in the software. They explain things even if you don't know what a gradient boosted ensemble is, and they are pitched at an absolute-beginner level.

You needn’t even go to Google after installing RM for data science.

Problems with Rapid Miner

After all the good talk, here I am to discuss some of the issues with this tool. The point I'm trying to make is that, ultimately, if a tool fulfils its role and earns credibility for its user experience, then the job is done. That was the motto of their CEO: to capture user satisfaction.

RM was the top-rated Big Data tool on G2, with a 4.5-star rating.

Courtesy: G2 website

There was recently some buzz on the internet about RM being slow at loading even a few MB of data.

https://it.toolbox.com/question/rapidminer-issues-importing-csv-091510

Come on: when you can store your data in Hadoop and call it from anywhere, with live streaming and the whole world of cluster and cloud computing out there, trying to load a multi-megabyte CSV on 32-bit Windows is too primitive a setup to judge by.

The real problem with RM is that the built-in algorithms are designed to run with pre-optimized settings. An expert user may want to customize a model further, perhaps fine-tuning its parameters for research purposes, and the level of transparency RM offers right now is not great for that kind of model tuning.

Conclusion

Here is a quick summary of all that I did in this extensive walk-through of RapidMiner.

  1. We loaded data into Turbo Prep, processed it using pivot tables and filters, and cleansed it for ML tasks.
  2. We fit various ML models on the data, simulated different classifiers and compared them; the gradient boosted trees performed best of all.
  3. Finally, all along we explored RM's features and why it stands out among other Big Data tools.

Overall, I conclude that with newer features like Turbo Prep and Auto Model, released only in RM Studio 9, the RM team is making the technology and technical expertise accessible to everyone. The one-year academic license is an added benefit for students, who often learn better visually. Further work on visual Hadoop tooling is under way, and RM keeps evolving day by day, becoming faster, stronger and better.
