Semi-supervised learning frameworks for Python, which allow fitting scikit-learn classifiers to partially labeled data.
Both R and Python are excellent tools for manipulating data, and their integration is becoming increasingly important. The latest tool for data manipulation in R is dplyr, whilst Python relies on pandas. In this blog post I’ll show you the fundamental primitives for manipulating your dataframes with both libraries, highlighting their major advantages and disadvantages.
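As a rough illustration of what such "fundamental primitives" look like on the pandas side, here is a minimal sketch with the approximate dplyr equivalent noted in each comment. The toy dataframe and its column names (`species`, `length`, `width`) are made up for this example and do not come from the post.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "c"],
    "length": [1.0, 3.0, 2.0, 4.0, 5.0],
    "width": [0.5, 0.7, 0.2, 0.4, 0.9],
})

# Filter rows; roughly dplyr's filter(df, length > 1.5)
long_rows = df[df["length"] > 1.5]

# Select columns; roughly dplyr's select(df, species, length)
subset = df[["species", "length"]]

# Add a derived column; roughly dplyr's mutate(df, area = length * width)
with_area = df.assign(area=df["length"] * df["width"])

# Grouped summary sorted by the aggregate; roughly
# df %>% group_by(species) %>% summarise(mean_length = mean(length))
#    %>% arrange(desc(mean_length))
summary = (
    df.groupby("species", as_index=False)
      .agg(mean_length=("length", "mean"))
      .sort_values("mean_length", ascending=False)
)

print(summary)
```

Method chaining, as in the last statement, is probably the closest pandas idiom to dplyr's pipe operator.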
Probabilistic programming (PP) allows flexible specification of Bayesian statistical models in code. PyMC3 is a new, open-source PP framework with an intuitive and readable, yet powerful, syntax that is close to the natural syntax statisticians use to describe models. It features next-generation Markov chain Monte Carlo (MCMC) sampling algorithms such as the No-U-Turn Sampler (NUTS; Hoffman, 2014), a self-tuning variant of Hamiltonian Monte Carlo (HMC; Duane, 1987). This class of samplers works well on high-dimensional and complex posterior distributions and allows many complex models to be fit without specialized knowledge about fitting algorithms. HMC and NUTS take advantage of gradient information from the likelihood to achieve much faster convergence than traditional sampling methods, especially for larger models. NUTS also has several self-tuning strategies for adaptively setting the tunable parameters of Hamiltonian Monte Carlo, which means you usually don’t need specialized knowledge of how the algorithms work. PyMC3, Stan (Stan Development Team, 2014), and the LaplacesDemon package for R are currently the only PP packages to offer HMC.
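To make the role of gradient information concrete, here is a minimal sketch of plain HMC written in NumPy. It is not PyMC3's implementation and it is not NUTS: the step size and number of leapfrog steps are fixed by hand, which is exactly the tuning NUTS is meant to automate. The target (a 10-dimensional standard normal) and all function names are assumptions chosen for illustration.

```python
import numpy as np


def log_prob(q):
    """Log-density of a standard normal target, up to an additive constant."""
    return -0.5 * np.sum(q ** 2)


def grad_log_prob(q):
    """Gradient of log_prob; this is the information HMC exploits."""
    return -q


def hmc_step(q, rng, step_size=0.2, n_leapfrog=20):
    """One HMC transition: sample a momentum, simulate, then accept or reject."""
    p = rng.standard_normal(q.shape)
    current_h = -log_prob(q) + 0.5 * np.sum(p ** 2)

    # Leapfrog integration of the Hamiltonian dynamics.
    q_new = q.copy()
    p_new = p + 0.5 * step_size * grad_log_prob(q_new)  # initial half step
    for i in range(n_leapfrog):
        q_new = q_new + step_size * p_new  # full position step
        if i < n_leapfrog - 1:
            p_new = p_new + step_size * grad_log_prob(q_new)  # full momentum step
    p_new = p_new + 0.5 * step_size * grad_log_prob(q_new)  # final half step

    # Metropolis correction for the integrator's discretization error.
    proposed_h = -log_prob(q_new) + 0.5 * np.sum(p_new ** 2)
    if np.log(rng.uniform()) < current_h - proposed_h:
        return q_new, True
    return q, False


def sample(n_samples=2000, dim=10, seed=0):
    """Run a single chain and return the draws and the acceptance rate."""
    rng = np.random.default_rng(seed)
    q = np.zeros(dim)
    draws = np.empty((n_samples, dim))
    accepted = 0
    for i in range(n_samples):
        q, ok = hmc_step(q, rng)
        draws[i] = q
        accepted += ok
    return draws, accepted / n_samples


draws, accept_rate = sample()
print("acceptance rate:", accept_rate)
print("sample mean:", draws.mean(), "sample variance:", draws.var())
```

Because each proposal follows the gradient along a simulated trajectory, it can land far from the current point and still be accepted with high probability, unlike a random-walk proposal of the same size.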
With the improvements in version 1.4, Spark DataFrames could become the new Pandas, making ancestral RDDs look like bytecode. I use Pandas (and Scikit-learn) heavily for Kaggle competitions. Nobody has won a Kaggle challenge with Spark yet, but I’m convinced it will happen. That’s why it’s time to prepare for the future and start using it. Spark DataFrames are available in the pyspark.sql package (a strange, historical name: it’s no longer only about SQL!). I’m not a Spark specialist at all, but here are a few things I noticed on my first try. On my GitHub, you can find the IPython Notebook companion to this post.
Despite having shown various ways to overcome D3 cartographic envy, there are always more examples that can cause the green monster to rear its ugly head.
I am back with another article. Today we will see how to add a guided, step-by-step storyline to our D3 dashboard. We will be using another awesome open-source JavaScript library: Intro.js.
Our vision is to make Kaggle the home of data science: the place to learn, compete, collaborate, and share your work. In a step aimed at making that vision a reality, we have rolled out an exciting new feature called Scripts, which allows data scientists to share and run code on Kaggle. Scripts also makes it easy to fork and build off each other’s work, promoting collaboration within the community.