10 tools and platforms for data preparation

1. Paxata
2. Alteryx
3. Lavastorm
4. SAP Lumira
5. Platfora
6. Teradata Loom
7. Datawatch
8. Datameer
9. Tamr
10. RapidMiner Studio

A Python implementation of LightFM, a hybrid recommendation algorithm.

The LightFM model incorporates both item and user metadata into the traditional matrix factorization algorithm. It represents each user and item as the sum of the latent representations of their features, thus allowing recommendations to generalise to new items (via item features) and to new users (via user features). The model can be trained using four methods:
• logistic loss: useful when both positive (1) and negative (-1) interactions are present.
• BPR: Bayesian Personalised Ranking [1] pairwise loss. Maximises the prediction difference between a positive example and a randomly chosen negative example. Useful when only positive interactions are present and optimising ROC AUC is desired.
• WARP: Weighted Approximate-Rank Pairwise [2] loss. Maximises the rank of positive examples by repeatedly sampling negative examples until a rank-violating one is found. Useful when only positive interactions are present and optimising the top of the recommendation list (precision@k) is desired.
• k-OS WARP: k-th order statistic loss [3]. A modification of WARP that uses the k-th positive example for any given user as a basis for pairwise updates.
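
To make the sum-of-feature-representations idea concrete, here is a minimal pure-Python sketch. It does not use the LightFM library: the `MODEL` dictionary, the feature names, and the embedding values are all made up for illustration, and `bpr_loss` shows only the per-pair BPR objective rather than a full training loop. The point is that an item with no embedding of its own can still be scored through its metadata features.

```python
import math

def represent(features, embeddings, biases):
    """Sum the embeddings and the biases of the given feature names."""
    dim = len(next(iter(embeddings.values())))
    vector = [0.0] * dim
    bias = 0.0
    for name in features:
        vector = [v + e for v, e in zip(vector, embeddings[name])]
        bias += biases[name]
    return vector, bias

def score(user_features, item_features, model):
    """Dot product of the summed representations plus both bias terms."""
    q, b_user = represent(user_features, model["user_emb"], model["user_bias"])
    p, b_item = represent(item_features, model["item_emb"], model["item_bias"])
    return sum(a * b for a, b in zip(q, p)) + b_user + b_item

def bpr_loss(positive_score, negative_score):
    """Per-pair BPR objective: -log(sigmoid(positive - negative))."""
    diff = positive_score - negative_score
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# Hypothetical, hand-picked parameters (a trained model would learn these).
MODEL = {
    "user_emb": {"user:alice": [0.1, 0.2], "age:30s": [0.3, -0.1]},
    "user_bias": {"user:alice": 0.0, "age:30s": 0.1},
    "item_emb": {"item:42": [0.5, 0.5], "genre:jazz": [0.2, 0.4]},
    "item_bias": {"item:42": 0.2, "genre:jazz": -0.05},
}

alice = ["user:alice", "age:30s"]
known_item = ["item:42", "genre:jazz"]  # has its own indicator feature
new_item = ["genre:jazz"]               # cold start: metadata only

print("known item:", score(alice, known_item, MODEL))
print("new item:  ", score(alice, new_item, MODEL))
print("BPR loss:  ", bpr_loss(score(alice, known_item, MODEL),
                              score(alice, new_item, MODEL)))
```

The new item gets a usable score from its genre feature alone, which is how the model generalises to items (and, symmetrically, users) that had no interactions during training.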

Jupyter and conda for R

Jupyter, which grew out of the IPython project, is already widely adopted by data scientists, researchers, and analysts. Jupyter’s notebook user interface mixes executable code with narrative text, equations, interactive visualizations, and images, which helps team collaboration and advances reproducible research and training. Jupyter began with Python and now has kernels for 50 different languages; IRKernel is the native R kernel for Jupyter. Data scientists, researchers, and analysts use the conda package manager to install and organize project dependencies. With conda they can easily build and share metapackages, which are downloadable bundles of packages. Conda works on Linux, OS X, and Windows and is language agnostic, so we can use it with any programming language and with projects that depend on multiple languages. Let’s use conda and Jupyter to start a data science project in R.

Bayesian approach to compare hypotheses about human trails on the Web: HypTrails

This IPython notebook provides a basic tutorial on the HypTrails approach. It uses the Python implementations provided at https://…/HypTrails and https://…/PathTools. HypTrails is a Bayesian approach for comparing hypotheses about human trails on the Web. Fundamentally, HypTrails is based on a first-order Markov chain model. Hypotheses are expressed as beliefs about the parameters of the model. HypTrails then incorporates these hypotheses as elicited Dirichlet priors into a Bayesian inference process. The relative plausibility of the hypotheses is then determined by their marginal likelihoods, compared via Bayes factors.
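
The comparison described above can be sketched in a few lines of standard-library Python. This is not the HypTrails or PathTools code: `log_evidence` is the textbook Dirichlet-multinomial marginal likelihood for a first-order Markov chain, while `prior_from_hypothesis` and its concentration parameter `k` are a simplified, hypothetical stand-in for the prior elicitation step. The transition counts and the three hypotheses are toy values.

```python
import math

def log_evidence(counts, alpha):
    """Log marginal likelihood of transition counts under row-wise Dirichlet priors."""
    total = 0.0
    for n_row, a_row in zip(counts, alpha):
        # Normalising constant of the prior for this state.
        total += math.lgamma(sum(a_row)) - sum(math.lgamma(a) for a in a_row)
        # Normalising constant of the posterior (prior + observed counts).
        total += sum(math.lgamma(n + a) for n, a in zip(n_row, a_row))
        total -= math.lgamma(sum(n_row) + sum(a_row))
    return total

def prior_from_hypothesis(hypothesis, k):
    """Assumed scheme: uniform pseudo-count of 1 plus k times the believed probability."""
    return [[1.0 + k * h for h in row] for row in hypothesis]

# Toy trails over two states: users mostly switch state on each step.
COUNTS = [[1, 9],
          [9, 1]]

# Each hypothesis is a row-stochastic matrix of believed transition probabilities.
HYPOTHESES = {
    "switch": [[0.1, 0.9], [0.9, 0.1]],
    "stay": [[0.9, 0.1], [0.1, 0.9]],
    "uniform": [[0.5, 0.5], [0.5, 0.5]],
}

evidence = {
    name: log_evidence(COUNTS, prior_from_hypothesis(h, k=10))
    for name, h in HYPOTHESES.items()
}

for name, value in sorted(evidence.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} log evidence = {value:.3f}")

# The log Bayes factor between two hypotheses is the difference of log evidences.
print("log BF (switch vs stay) =", evidence["switch"] - evidence["stay"])
```

A larger log evidence means the hypothesis is more plausible given the observed trails; here the "switch" hypothesis, which matches how the counts were constructed, ranks first.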

New R Software/Methodology for Handling Missing Data

I’ve added some missing-data software to my regtools package on GitHub. In this post, I’ll give an overview of missing-data methodology, and explain what the software does. For details, see my JSM paper, jointly authored with my student Xiao (Max) Gu.