**Creating, Validating and Pruning a Decision Tree in R**

In this blog post we will discuss:

1. How to create a decision tree for the admission data.

2. Use rattle to plot the tree.

3. Validation of decision tree using the ‘Complexity Parameter’ and cross validated error.

4. Prune the tree on the basis of these parameters to create an optimal decision tree.

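The four steps above can be sketched with `rpart` and `prune`. This is a minimal illustration, not the post's actual script: it uses the `kyphosis` data bundled with `rpart` as a stand-in for the admission data, since that dataset is not reproduced here.

```r
library(rpart)

# Fit a classification tree on the kyphosis data (a stand-in for the
# admission data discussed in the post).
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

# Inspect the complexity-parameter table: CP, number of splits, and the
# cross-validated error (xerror) for each candidate subtree.
printcp(fit)

# Prune at the CP that minimises the cross-validated error.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

If the `rattle` package is installed, `rattle::fancyRpartPlot(pruned)` draws the pruned tree, as mentioned in step 2.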


**7 Questions Every Data Scientist Should Be Answering for Business**

1. What problem are we trying to solve?

2. Does the approach make sense?

3. Does the answer make sense?

4. Is it a finding or a mistake?

5. Does the analysis address the original intent?

6. Is the story complete?

7. Where would we head next?


**Analyse TB data using network analysis**

In a very interesting publication from Jose A. Dianes on tuberculosis (TB) cases per country (http://…/data-science-with-…), dimension reduction was achieved using Principal Component Analysis (PCA) and Cluster Analysis. By showing that the first principal component corresponded mainly to the mean number of TB cases, and the second mainly to the change over the time span studied, it became clear that the first two PCA components have a real physical meaning. This is not always the case: PCA constructs an orthogonal basis from linear combinations of the original measurements, with the eigenvectors ordered by descending eigenvalue. Moreover, this method may not work well with data containing different types of variables. The scripts in this article are written in R.
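The PCA construction described above can be sketched with base R's `prcomp`. The data below are synthetic (the real per-country TB counts and their URL are elided in the post), but the mechanics are the same: components come back ordered by descending variance explained.

```r
# Synthetic country-by-year matrix of TB case counts, standing in for
# the real data used in the cited publication.
set.seed(42)
years <- 1990:1999
cases <- matrix(rpois(20 * length(years), lambda = 50),
                nrow = 20,
                dimnames = list(paste0("country", 1:20), years))

# prcomp builds the orthogonal basis of linear combinations of the
# original measurements; the standard deviations (square roots of the
# eigenvalues) are returned in descending order.
pca <- prcomp(cases, center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance per component
```

Plotting the scores of the first two components (`pca$x[, 1:2]`) is how one would check whether they carry an interpretable meaning, as they did in the cited analysis.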

**The Unreasonable Effectiveness of Random Forests**

• Random Forests require almost no input preparation.

• Random Forests perform implicit feature selection and provide a pretty good indicator of feature importance.

• Random Forests are very quick to train.

• Random Forests are pretty tough to beat.

• It’s really hard to build a bad Random Forest!

• Versatility.

• Simplicity.

• Lots of excellent, free, and open-source implementations.

• Random Forests can be easily grown in parallel.

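Two of the points above, minimal input preparation and implicit feature ranking, can be shown in a few lines with the `randomForest` package. This is a generic sketch on the built-in `iris` data, not code from the post.

```r
library(randomForest)

# Fit a forest with no feature engineering: raw columns go straight in.
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 200,
                   importance = TRUE)

# The implicit feature ranking: mean decrease in accuracy and in the
# Gini index, per predictor.
importance(rf)
```

Parallel growth, the last bullet, follows from the trees being independent: packages such as `ranger` exploit this, or forests grown on separate workers can be merged with `randomForest::combine`.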

**Deming and Passing Bablok Regression in R**

In this post we will discuss how to perform Passing Bablok and Deming regression in R. Those who work in clinical chemistry know that these two approaches are required by the journals in the field. The idiosyncratic affection for these two forms of regression appears to be historical, but this is unlikely to change in my lifetime, hence the need to cover it here. Along the way, we shall touch on the ways in which Deming and Passing Bablok regression differ from ordinary least squares (OLS) and from one another.
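One way to fit both models in R is the `mcr` package (an assumption on my part; the post does not name a package). A minimal sketch on simulated paired measurements from two hypothetical assays:

```r
library(mcr)

# Simulated method-comparison data: assay y reads ~5% higher than x.
set.seed(7)
x <- runif(40, 1, 10)
y <- 1.05 * x + rnorm(40, sd = 0.2)

# Deming regression assumes measurement error in both x and y;
# Passing-Bablok is the rank-based, non-parametric alternative.
dem  <- mcreg(x, y, method.reg = "Deming")
paba <- mcreg(x, y, method.reg = "PaBa")

getCoefficients(dem)   # intercept and slope with confidence intervals
getCoefficients(paba)
```

Unlike OLS, both methods treat the two assays symmetrically, which is why they are preferred for method-comparison studies.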