PythoneeR
After using a lot of R on analytics projects, believing it was the best language for data scientists, I recently had the chance to pick up Python. R does feel a bit cumbersome when dealing with interfaces to other languages or to the web, such as OAuth. That was my motivation: use Python to get text from the web and later process it in R, which I still felt was the best tool for the job.

Using the httr Package to Retrieve Data from APIs in R
For a project I’m working on, I needed to access residential electricity rates and associated coordinate information (lat/long) for locations in the US. After a little searching, I found that data.gov offers the rate information in two forms: a static list of approximate rates by region and an API that returns annual average utility rates ($/kWh) for residential, commercial, and industrial users. Most of my project work will take place in R, so I thought I’d see how well the API interacts with it. I came across the httr package, which, for my purposes, worked extremely well.
For this tutorial, we are only going to look at the GET() command in httr; you can view the full list of functions in the httr package here. GET() sends a request to the API with some request parameters and receives a response. For the Utility Rate API, the request parameters are api_key, address, lat, and lon; you can request an API key from the previous link. Format (json or xml) is technically a request parameter as well, and for this tutorial we will use json for all requests.
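The tutorial itself stays in R with httr, but in the spirit of the opening section’s Python-for-the-web theme, here is a sketch of assembling the same GET request in Python. The endpoint URL is an assumption (check the data.gov/NREL documentation for the current one); the parameter names (api_key, address, lat, lon, format) are the ones listed above, and DEMO_KEY is a placeholder.

```python
from urllib.parse import urlencode

# NOTE: this endpoint is an assumption -- confirm the current URL in the
# data.gov / NREL documentation for the Utility Rate API.
BASE_URL = "https://developer.nrel.gov/api/utility_rates/v3.json"

def build_utility_rate_url(api_key, lat=None, lon=None, address=None):
    """Assemble the GET URL from the request parameters the tutorial lists:
    api_key, address (or lat/lon), and format (fixed to json here)."""
    params = {"api_key": api_key, "format": "json"}
    if address is not None:
        params["address"] = address
    else:
        params["lat"] = lat
        params["lon"] = lon
    return BASE_URL + "?" + urlencode(params)

# The actual request is then a single call, e.g. with the requests library:
# requests.get(build_utility_rate_url("DEMO_KEY", lat=35.45, lon=-82.98))
url = build_utility_rate_url("DEMO_KEY", lat=35.45, lon=-82.98)
```

This mirrors what httr's GET() does under the hood: encode the query parameters onto the base URL and issue the HTTP request.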

Numerical Optimization: Understanding L-BFGS
Numerical optimization is at the core of much of machine learning. Once you’ve defined your model and have a dataset ready, estimating the parameters of your model typically boils down to minimizing some multivariate function \$f(x)\$, where the input \$x\$ is in some high-dimensional space and corresponds to model parameters. In this post, I’ll focus on the motivation for the L-BFGS algorithm for unconstrained function minimization, which is very popular for ML problems where ‘batch’ optimization makes sense. For larger problems, online methods based around stochastic gradient descent have gained popularity, since they require fewer passes over data to converge. In a later post, I might cover some of these techniques, including my personal favorite AdaDelta.
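The post linked above develops the full motivation; as a minimal sketch of the core idea (not the post’s own code), here is L-BFGS in plain NumPy: instead of storing a dense inverse Hessian, keep only the last m displacement/gradient-change pairs and reconstruct the search direction with the two-loop recursion, plus a simple Armijo backtracking line search.

```python
import numpy as np

def lbfgs(f, grad, x0, m=5, max_iter=100, tol=1e-6):
    """Minimal L-BFGS sketch: the last m (s, y) pairs stand in for the
    inverse Hessian via the two-loop recursion."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    s_hist, y_hist = [], []
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # Two-loop recursion: q ends up approximating H_k @ g.
        q = g.copy()
        alphas = []
        for s, y in zip(reversed(s_hist), reversed(y_hist)):
            a = s.dot(q) / y.dot(s)
            alphas.append(a)
            q -= a * y
        if s_hist:  # gamma_k scaling as the initial inverse-Hessian guess
            q *= s_hist[-1].dot(y_hist[-1]) / y_hist[-1].dot(y_hist[-1])
        for s, y, a in zip(s_hist, y_hist, reversed(alphas)):
            b = y.dot(q) / y.dot(s)
            q += (a - b) * s
        d = -q  # search direction
        # Armijo backtracking line search.
        t, fx, slope = 1.0, f(x), g.dot(d)
        for _ in range(30):
            if f(x + t * d) <= fx + 1e-4 * t * slope:
                break
            t *= 0.5
        x_new = x + t * d
        g_new = grad(x_new)
        s_hist.append(x_new - x)
        y_hist.append(g_new - g)
        if len(s_hist) > m:  # keep only the most recent m pairs
            s_hist.pop(0)
            y_hist.pop(0)
        x, g = x_new, g_new
    return x

# Usage sketch on a convex quadratic with minimum at (3, -1):
f = lambda x: (x[0] - 3) ** 2 + 10 * (x[1] + 1) ** 2
g = lambda x: np.array([2 * (x[0] - 3), 20 * (x[1] + 1)])
x_star = lbfgs(f, g, np.array([0.0, 0.0]))
```

A production implementation (e.g. scipy.optimize.minimize with method="L-BFGS-B") adds a Wolfe-condition line search and curvature safeguards; this sketch only shows where the memory savings come from.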

This post is based on a tutorial given in a machine learning course at the University of Bremen. It summarizes some recommendations on how to get started with machine learning on a new problem. This includes:
• ways of visualizing your data
• choosing a machine learning method suitable for the problem at hand
• identifying and dealing with over- and underfitting
• dealing with large (read: not very small) datasets
• pros and cons of different loss functions.
The post is based on “Advice for applying Machine Learning” from Andrew Ng. The purpose of this notebook is to illustrate the ideas in an interactive way. Some of the recommendations are debatable. Take them as suggestions, not as strict rules.
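The notebook itself isn’t reproduced here; as a small toy illustration of the over-/underfitting point (my own example, not from the post), fit polynomials of increasing degree to noisy samples of a sine curve. Training error always falls as model capacity grows, which is exactly why low training error alone tells you nothing about generalization.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)  # noisy samples

def train_rmse(degree):
    """Least-squares polynomial fit; return RMSE on the training points."""
    coeffs = np.polyfit(x, y, degree)
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Degree 1 underfits, degree 9 has capacity to start fitting the noise.
errs = {d: train_rmse(d) for d in (1, 3, 9)}
# Because the hypothesis spaces are nested, training error can only
# decrease with degree -- detecting overfitting requires a held-out set.
```

To actually diagnose overfitting, the notebook’s recommendation applies: compare these training errors against errors on a validation split and watch where the two curves diverge.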

Big Data: The 4 Layers Everyone Must Know
• Data sources layer
• Data storage layer
• Data processing/analysis layer
• Data output layer

10 Modern Statistical Concepts Discovered by Data Scientists
1. Clustering using tagging or indexation methods
2. Bucketization
3. Random number generation
4. Model-free confidence intervals
5. Variable / feature selection and data reduction
6. Hidden decision trees
7. Jackknife regression
8. Predictive power and other synthetic metrics
9. Identification of true signal in data subject to the curse of big data
10. New data visualization techniques
11. Better goodness-of-fit and yield metrics

Understanding the Now – The Role of Data in Adaptive Organizations
The ability to collect, store, and process large volumes of data doesn’t confer advantage by default. It’s still common to fixate on the wrong questions and fail to recover quickly when mistakes are made. To accelerate organizational learning with data, we need to think carefully about our objectives and have realistic expectations about what insights we can derive from measurement and analysis.

A Solution for Classification Rules Management Toward Actionable Analytics
The analytics community has long debated whether analytics is an art or a science. Analytics is more art than science in its ability to shape conditions that drive the business toward an action, based on confidence that the action will improve business performance. This ability to be actionable has recently been recognized as one of the most important aspects of analytics. The related concept of Prescriptive Analytics shares some claims with Actionable Analytics, but there are meaningful differences as well.