Which features should you use to create a predictive model? This is a difficult question that may require deep knowledge of the problem domain. It is possible to automatically select those features in your data that are most useful or most relevant for the problem you are working on. This is a process called feature selection. In this post you will discover feature selection, the types of methods that you can use and a handy checklist that you can follow the next time that you need to select features for a machine learning model.
Most popular programming languages have one thing in common: they are all open source. Open source is a decentralised development model based on community participation. Community members contribute to the development of the programming language, and these contributions are publicly available to anyone. Community participation is the prime reason for continuous development and innovation in open source languages like R, C++, C#, Java, PHP, Python, Ruby, etc. For data science, R is one of the most popular languages. The main reason for its popularity is the continuous contribution and support from R practitioners in the data science community, much of it in the form of packages. These packages form the backbone of the R programming language. While a lot of tutorials are shared on solving problems using R, open source development gets less attention. For me, creating a package and giving it back to the community meant a great deal. It was my way of starting to give back, and I know this is just the start. To help expand the community further, I decided to write an article about the process of package creation and how to contribute a package to the open source R community. Along the way, we are going to create a package and contribute it to the open source community.
Financial time series analysis and forecasting have a history of remarkable contributions. It is therefore quite hard for a beginner to get oriented and capitalize on the scientific literature, as it requires a solid understanding of basic statistics, a detailed study of the foundations of time series analysis tools and knowledge of the specific statistical models used for financial products. Further, the financial time series ecosystem is not one of the easiest you may encounter. Trends are typically transitory because they are driven by underlying random processes, and non-stationarity, heteroscedasticity, structural breaks and outliers are rather common. All of this has driven the adoption of sophisticated models and simulation techniques, which require good understanding and expertise to take advantage of.
There are many situations where we find that our code runs too slowly and the reason is not apparent. In such situations the Python cProfile module comes in very handy. The module lets us see how much time individual steps in our code take, as well as the number of times certain functions are called. In the following paragraphs, I will explore its capabilities a bit more.
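As a minimal sketch of what this looks like in practice, the snippet below profiles a small function and prints the collected statistics. The functions `slow_square_sum`, `square` and `profile_call` are hypothetical names made up for this illustration; only `cProfile.Profile` and `pstats.Stats` come from the standard library.

```python
import cProfile
import io
import pstats


def square(x):
    """Tiny helper, called once per loop iteration."""
    return x * x


def slow_square_sum(n):
    """Hypothetical workload: sum of squares using the helper above."""
    total = 0
    for i in range(n):
        total += square(i)
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    # Collect the report in a string instead of writing to stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


result, report = profile_call(slow_square_sum, 10_000)
print(report)
```

In the report, the `ncalls` column shows how many times each function was called, while `tottime` and `cumtime` show the time spent in the function itself and in the function plus everything it calls.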
Linear Regression and ANOVA are usually understood as separate concepts. The truth is that they are closely related: ANOVA is a particular case of Linear Regression. Even worse, it’s quite common for students to memorize equations and tests instead of trying to understand the Linear Algebra and Statistics concepts that can keep them away from misleading results, but that is material for another entry. Most textbooks present econometric concepts and algebraic steps, and emphasise the relationship between Ordinary Least Squares, Maximum Likelihood and other methods to obtain estimates in Linear Regression. Here I present a combination of a little algebra and R commands to try to clarify some concepts.
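To make the claim concrete, here is a small sketch that computes the one-way ANOVA F statistic twice: once from the classical between-group and within-group sums of squares, and once from an ordinary least squares fit on an intercept plus group dummy variables. It is written in Python with NumPy rather than R, and the data are made-up numbers chosen only for illustration.

```python
import numpy as np

# Made-up illustrative data: one response measured in three groups.
groups = {
    "a": np.array([4.1, 5.0, 4.6, 5.3]),
    "b": np.array([6.2, 5.8, 6.9, 6.4]),
    "c": np.array([5.1, 4.7, 5.6, 5.2]),
}

y = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])
n, k = len(y), len(groups)

# Classical one-way ANOVA: between-group vs within-group sums of squares.
grand_mean = y.mean()
ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups.values())
ss_within = sum(((v - v.mean()) ** 2).sum() for v in groups.values())
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# The same test as a regression: intercept plus dummies for groups b and c
# (group a is the reference level).
X = np.column_stack([
    np.ones(n),
    (labels == "b").astype(float),
    (labels == "c").astype(float),
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_resid = ((y - X @ beta) ** 2).sum()
ss_total = ((y - grand_mean) ** 2).sum()
f_regression = ((ss_total - ss_resid) / (k - 1)) / (ss_resid / (n - k))

print(f_anova, f_regression)
```

The two F statistics agree because the fitted values of the dummy-variable regression are exactly the group means: the intercept estimates the mean of the reference group, and each dummy coefficient estimates the difference between its group mean and that reference.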
Now I’ll show another example that continues the last example from Part 1, and then I’ll move on to something involving more variables.