Axibase Time-Series Database (ATSD)
Axibase Time-Series Database (ATSD) is a next-generation database built for companies that need to extract value from the large volumes of time-series data generated by their IT and operational infrastructure. By supporting industry-standard protocols for ‘push’ and ‘pull’ data collection, ATSD enables customers to instrument their information assets, systems, and sensors for visibility, control, and operational intelligence. ATSD provides actionable insights that can be escalated as alerts or delivered via real-time infographics assembled into role-based dashboards for senior management, operations management, and support teams. To simplify deployment, ATSD integrates with existing monitoring tools and data sources, and supports incremental import of bulk data for quick time-to-value. ATSD is a complete, all-in-one solution with built-in metric storage, an analytical rule engine, alerting, forecasting, and visualization.
5 Painful Prediction Pitfalls
Predictive applications require applied data science and cannot live outside of their application domain. Watching out for these common prediction pitfalls will help you deliver predictive applications that are relevant to the business and help your customers make (or save) money:
1. Forecast what you can control.
2. Beware of false aggregations.
3. The absence of proof is not the proof of absence.
4. Consider the cost of uncertainty.
5. There are many wrong ways to calculate forecast quality.
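Pitfall 5 deserves a concrete illustration. The sketch below (in Python, with invented toy numbers) shows one common way a forecast-quality metric can mislead: mean absolute percentage error (MAPE) explodes when actual values are near zero, even though the same forecasts look reasonable under mean absolute error (MAE):

```python
# Toy illustration of pitfall 5: the numbers below are invented for demonstration.
actual   = [100.0, 0.5]
forecast = [ 90.0, 5.0]

# Mean absolute error: average magnitude of the miss, in the data's own units.
mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

# Mean absolute percentage error: divides each miss by the actual value,
# so a near-zero actual dominates the average.
mape = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual) * 100

print(mae)   # 7.25
print(mape)  # 455.0
```

The second forecast misses by only 4.5 units, yet it pushes MAPE above 450%, which is why the choice of quality metric has to match the data.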
dplyr Use Cases: Non-Interactive Mode
The current release of dplyr (v0.4.1) offers a lot more flexibility in how its important verbs can be used in non-interactive mode. In this post, I explore several possible use cases.
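The post itself is about dplyr, but the underlying problem of non-interactive use, writing data-manipulation helpers that accept column names as ordinary arguments rather than bare expressions typed at a console, can be sketched in Python with pandas (the function and column names here are invented for illustration):

```python
import pandas as pd

def top_n_by(df, col, n=3):
    # The column name is passed as a plain string, so this helper works
    # "non-interactively": callers can compute or loop over column names.
    return df.sort_values(col, ascending=False).head(n)

df = pd.DataFrame({"x": [1, 5, 3], "y": [9, 2, 7]})
print(top_n_by(df, "x", 2))
```

dplyr's interactive verbs instead capture bare column names from the expression you type, which is convenient at the console but needs special handling inside functions.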
Let’s build open source tensor libraries for data science
Tensor methods for machine learning are fast, accurate, and scalable, but we’ll need well-developed libraries. Data scientists frequently find themselves dealing with high-dimensional feature spaces. As an example, text mining usually involves vocabularies comprising 10,000+ distinct words. Many analytic problems involve linear algebra, particularly 2D matrix factorization techniques, for which several open source implementations are available. Anyone working on implementing machine learning algorithms ends up needing a good library for matrix analysis and operations. But why stop at 2D representations?
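The 2D matrix-factorization baseline the paragraph refers to can be sketched in a few lines of NumPy: a truncated SVD gives the best low-rank approximation of a matrix, which is the kind of operation that tensor libraries would generalize beyond two dimensions. The toy term-document matrix below is random, for illustration only:

```python
import numpy as np

# Toy "term-document" matrix (rows = documents, columns = vocabulary terms);
# random data stands in for real counts, purely for illustration.
rng = np.random.default_rng(0)
A = rng.random((6, 8))

# Rank-k factorization via truncated SVD: A ~ U_k @ diag(s_k) @ Vt_k.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius-norm reconstruction error of the rank-2 approximation.
print(np.linalg.norm(A - A_k))
```

Tensor methods ask the analogous question for 3D-and-higher arrays (e.g., word co-occurrence triples), where good open source library support is still scarce.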
Part 3b: EDA with ggplot2
In Part 3a, I introduced the plotting system ggplot2. I discussed its concept and syntax in some detail, and then created a few general plots using the weather data set we have been working with in this series of tutorials. My goal was to show that, in ggplot2, a nice-looking and interesting graph can usually be created with just a few lines of code. Given the positive feedback that post received, I believe I had some success with what I had in mind. Who knows, maybe some readers will decide to give ggplot2 a try with their own data?

It is often said that a picture is worth a thousand words, yet I often see highly experienced R users who, after performing really complex data analyses, plot the results in a way that falls short of expectations, given the quality of their analytical work. In Part 3b, we will continue to rely on visualisations to explore the data, but now with the goal of tackling the following question: are there any good predictors in our data for the occurrence of rain on a given day? The numeric variable representing the amount of rain will be our dependent variable, also known as the response or outcome variable, and all the remaining ones will be potential independent variables, also known as predictors or explanatory variables.
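A quick first pass at the response-versus-predictor question is to check how each candidate predictor correlates with the response. The sketch below uses Python with pandas and a tiny invented weather table; the column names and values are assumptions for illustration, not the tutorial's actual data set:

```python
import pandas as pd

# Hypothetical weather data; column names and values are invented.
weather = pd.DataFrame({
    "rain":     [0.0, 5.2, 0.0, 12.4, 0.3],   # response variable (mm of rain)
    "humidity": [40, 85, 35, 95, 60],         # candidate predictor (%)
    "pressure": [1022, 1008, 1025, 1001, 1015],  # candidate predictor (hPa)
})

# Correlation of each candidate predictor with the response.
print(weather.corr()["rain"].drop("rain"))
```

In this toy data, rain correlates positively with humidity and negatively with pressure; visual exploration then confirms (or refutes) what the raw correlations suggest.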
The Hitchhiker’s Guide to the Hadleyverse
The ‘Hadleyverse’ is the collection of R packages developed by Hadley Wickham, including tools for data manipulation, plotting, package development, and more. But be patient; we will see them in detail later. The ‘Hadleyverse’ concept implies that by loading and working with those packages you can not only extend the features of base R, but even change the way you code and the strategy you follow to analyse your data (for example, both the plyr and dplyr packages follow the ‘split-apply-combine’ strategy).
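The ‘split-apply-combine’ strategy that plyr and dplyr embody is language-agnostic: split the data into groups, apply a computation to each group, then combine the results. A minimal sketch in plain Python (with an invented toy data set) makes the three steps explicit:

```python
from itertools import groupby
from statistics import mean

# Toy (key, value) records; invented for illustration.
records = [("a", 2), ("b", 5), ("a", 4), ("b", 1)]

# Split: sort so groupby sees each key's records together, then group by key.
records.sort(key=lambda r: r[0])

# Apply + combine: average each group's values and collect into one dict.
result = {key: mean(v for _, v in grp)
          for key, grp in groupby(records, key=lambda r: r[0])}
print(result)
```

In dplyr the same idea reads as a `group_by()` followed by `summarise()`; the point is that the packages encode a strategy for analysis, not just extra functions.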
The Grammar of Data Science
Python and R are popular programming languages used by data scientists. Until recently, I exclusively used Python for exploratory data analysis, relying on Pandas and Seaborn for data manipulation and visualization. However, after seeing my colleagues do some amazing work in R with dplyr and ggplot2, I decided to take the plunge and learn how the other side lives. I found that I could more easily translate my ideas into code and beautiful visualizations with R than with Python. In this post, I will elaborate on my experience switching teams by comparing and contrasting R and Python solutions to some simple data exploration exercises.
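For readers on the Python side of the comparison, the kind of dplyr pipeline the post contrasts has a reasonably close analogue in a pandas method chain. The data frame and column names below are invented for illustration; the commented dplyr line sketches the R counterpart:

```python
import pandas as pd

# Invented toy data set for illustration.
df = pd.DataFrame({"species": ["a", "a", "b"], "mass": [1.2, 1.4, 3.1]})

# A pandas method chain roughly analogous to the dplyr pipeline:
#   df %>% filter(mass > 1) %>% group_by(species) %>% summarise(avg = mean(mass))
summary = (df[df["mass"] > 1]
             .groupby("species", as_index=False)["mass"]
             .mean()
             .rename(columns={"mass": "avg"}))
print(summary)
```

Both styles read top-to-bottom as a sequence of verbs, which is much of what makes the ‘grammar’ comparison between the two ecosystems interesting.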