Comparing the tokenizer vocabularies of state-of-the-art Transformers (BERT, GPT-2, RoBERTa, XLM). If you have been using word embeddings like Word2vec or GloVe, adapting to the new contextualised embeddings like BERT can be difficult. In this story, we will investigate one of the differences: subword tokens. The inspiration for this story was a similar post that explored the multilingual BERT vocabulary.
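Subword tokenization is easiest to see in action. The toy tokenizer below mimics BERT's WordPiece-style greedy longest-match splitting with a hand-made vocabulary; the vocabulary and words are purely illustrative, not BERT's real vocabulary, and GPT-2/RoBERTa use byte-level BPE with different marker conventions:

```python
# Toy greedy longest-match (WordPiece-style) subword tokenizer.
# Vocabulary is illustrative, not BERT's real 30k-entry vocabulary.
def wordpiece_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker, as in BERT
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until a vocabulary entry matches
        if piece is None:
            return ["[UNK]"]  # word cannot be assembled from vocabulary pieces
        tokens.append(piece)
        start = end
    return tokens

vocab = {"token", "##izer", "##s", "embed", "##ding"}
print(wordpiece_tokenize("tokenizers", vocab))  # ['token', '##izer', '##s']
print(wordpiece_tokenize("embedding", vocab))   # ['embed', '##ding']
```

The `##` prefix marks a piece that continues a word, which is exactly the convention you see when inspecting BERT's vocabulary file; GPT-2's BPE instead marks tokens that *start* a word with a leading space symbol.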
As business moves into a new decade, the 2020s, enterprises continue to look for the leg up that will push them above the competition. For years now, artificial intelligence has been one of the critical technologies that can help promote businesses to the next level: becoming smarter, more efficient, and ultimately more profitable. But despite the entry of the term 'artificial intelligence' into our modern lexicon, it is not actually as commonly used as one might think, even among advanced industries. A survey for the O'Reilly 'AI Adoption in the Enterprise' report found that just under 75 percent of respondents said their business was either still evaluating AI or not yet using it at all, leaving only about a quarter of respondents, spanning industries such as financial services, healthcare, telecommunications, and electronics and technology, with fully fledged and operational artificial intelligence systems.
'There is general recognition of a reproducibility crisis in science right now. I would venture to argue that a huge part of that does come from the use of machine learning techniques in science.' – Genevera Allen, Professor of Statistics and Electrical Engineering at Rice University. The use of machine learning is becoming increasingly prevalent in the scientific process, replacing traditional statistical methods. What are the ramifications of this for the scientific community and the pursuit of knowledge? Some have argued that the black-box approach of machine learning techniques is responsible for a crisis of reproducibility in scientific research. After all, is something really scientific if it is not reproducible?
High-performing teams hold the key to the successful performance of any company. Whether you have thousands of employees or just five, high-performing teams are a must for optimal business performance. Successful analytics initiatives are no exception: they, too, depend on high-performing teams. However, most data analytics teams today are a shadow of the old MIS/BI team structure and typically report into the CFO function. These teams are organized around specific IT skills, often a combination of ETL (Extract-Transform-Load) developers who build and maintain data marts and data warehouses, business analysts who capture the needs of business users for operational and BI reports, and report builders who run queries and build reports. Moreover, because the Finance function that most of these teams report to is typically a cost-controlling and regulatory function, it tends to be averse to innovation and change. What is needed in the new environment, where every company is a data company, is an operating model that focuses on innovation, scale, and value creation. Such an environment will, in turn, empower knowledge workers in the business with the right tools and technologies to achieve their objectives.
'Taxes and random forests again?' Thomas, my colleague here at STATWORX, raised his eyebrows when I told him about this post. 'Yes!' I replied, 'but not because I love taxes so much (who does?). This time it's about tuning!' Let's rewind for a moment: in the previous post, we looked at how we can combine econometric techniques like differencing with machine learning (ML) algorithms like random forests to predict a time series with a high degree of accuracy. If you missed it, I encourage you to check it out here. The data is now also available as a CSV file on our STATWORX GitHub. Since we covered quite a lot of ground in the last post, there wasn't much room for other topics. That is where this post comes in. Today, we take a look at how we can tune the hyperparameters of a random forest when dealing with time series data.
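The post's actual tuning code isn't reproduced here, but the key constraint when tuning on time series is that validation folds must respect temporal order: a model should never be validated on observations that precede its training data. A minimal sketch (function name and parameters are illustrative) of the expanding-window split scheme a hyperparameter search could iterate over:

```python
# Sketch of expanding-window cross-validation splits for time-series tuning.
# Each fold trains on everything up to a cutoff and validates on the block
# that follows, so the model never "sees the future" during validation.
def expanding_window_splits(n_obs, n_folds, test_size):
    splits = []
    for fold in range(n_folds):
        test_end = n_obs - fold * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            break  # not enough history left for another fold
        splits.append((list(range(0, test_start)),        # training indices
                       list(range(test_start, test_end))))  # validation indices
    return splits[::-1]  # earliest fold first

# With 12 observations, 3 folds, and 2-step validation blocks:
for train_idx, test_idx in expanding_window_splits(n_obs=12, n_folds=3, test_size=2):
    print(len(train_idx), test_idx)
```

Each candidate hyperparameter combination would be scored by averaging the validation error across these folds, then the best combination refit on the full history.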
How a seemingly straightforward operation in NumPy turns into a nightmare with TensorFlow.
In this blog post, I will discuss an alternative to using GANs. More precisely, I would suggest that readers first read up on GANs in order to understand and compare the strategy explained below. Let us get started.
In a recent quest to 'get my sh*t together' I began by identifying cornerstone goals – the big, widely impacting components of existence that could effectively make or break the rest of my world. When listing these goals I immediately wrote out 'be more effective at work'. This has been one of my greatest struggles (and potentially the area for improvement with the greatest potential reward). My lack of professional effectiveness manifested itself in a myriad of ways: though my actual work product was above average, it came at the expense of working late nights and weekends, grinding myself to exhaustion during the workday, and finding myself constantly in a state of near-frenzy that felt at odds with the culture around me. The way I was working wasn't… working.
My cousin is a senior in high school, and when I saw her recently I inquired, ‘How are your classes going?’ She rolled her teenage eyes and expressed her displeasure at all the vocabulary she had to learn for her statistics class. ‘Ordinal, nominal, … who cares?’ she lamented. Well, Anna, I’ll tell you. Data scientists care very much about ordinal and nominal data. For those not in a high school statistics class, let’s review: categorical data fall into two groups – ordinal and nominal. Ordinal categorical data are non-numerical pieces of information with implied order – for example, survey responses on a scale from very dissatisfied to very satisfied. And nominal categorical data are non-numerical pieces of information without any inherent order – for example, colors or states. The distinction between ordinal and nominal data was extremely important in building my most recent linear regression model predicting housing prices. The way I prepared my features for modeling depended largely on the data type. Below is a subset of four of the features from my model – first floor square footage, type of dwelling, kitchen quality, and neighborhood.
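As a small illustration (the column names follow the features mentioned above, but the Po-to-Ex quality scale is an assumed Ames-style ordering, and the exact encodings in the original model may differ): ordinal data can be mapped to integers that preserve the order, while nominal data gets one-hot indicator columns that impose no order at all.

```python
# Ordinal feature: map ordered categories to integers that preserve the order.
# (Scale assumed here: Poor < Fair < Typical/Average < Good < Excellent.)
ordinal_scale = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

rows = [
    {"kitchen_qual": "Gd", "neighborhood": "OldTown"},
    {"kitchen_qual": "TA", "neighborhood": "NridgHt"},
]

# Nominal feature: one-hot encode, one indicator column per category.
neighborhoods = sorted({r["neighborhood"] for r in rows})
encoded = []
for r in rows:
    features = {"kitchen_qual": ordinal_scale[r["kitchen_qual"]]}
    for n in neighborhoods:
        features[f"neighborhood_{n}"] = int(r["neighborhood"] == n)
    encoded.append(features)

print(encoded[0])
# {'kitchen_qual': 3, 'neighborhood_NridgHt': 0, 'neighborhood_OldTown': 1}
```

One-hot encoding matters for nominal data because assigning arbitrary integers to neighborhoods would invent an ordering the data does not have, and a linear regression would happily fit a slope to it.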
'Time series models are used to forecast future events based on previous events that have been observed (and data collected) at regular time intervals.' We will take a small forecasting problem and work through it to the end, learning time series forecasting along the way.
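As a toy illustration of that definition – forecasting future values from regularly observed past values – here is a simple moving-average forecast (chosen purely for brevity; it stands in for whatever model the walkthrough actually uses):

```python
# Toy multi-step forecast: predict each next value as the mean of the last
# `window` observations, feeding each forecast back in as pseudo-history.
def moving_average_forecast(series, window, horizon):
    history = list(series)
    forecasts = []
    for _ in range(horizon):
        pred = sum(history[-window:]) / window
        forecasts.append(pred)
        history.append(pred)  # recursive forecasting: use prediction as input
    return forecasts

print(moving_average_forecast([10, 12, 14, 16], window=2, horizon=2))
# [15.0, 15.5]
```

Real forecasting models (ARIMA, exponential smoothing, ML regressors on lagged features) follow the same contract: past observations in, future estimates out.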
I invented a new type of layer for neural networks: it gives the network the ability to critically assess the reliability of its own features, enabling more informed decision making. This invention is a spin-off of a private research project of mine. Preliminary results showed that using this layer led to a minor improvement on the MNIST dataset. However, the improvement was too small to know for sure whether it is useful. Since this project does not go in the same direction as my primary research, I am publishing it here. Maybe someone else wants to have a look at it. Even if it turns out that the improvement on MNIST is a fluke, there are a large number of future refinements that could make the algorithm more effective. Also, if you have any practical problems or benchmark test cases lying around, do give it a try and report whether it worked. It's just one line of code to swap a normal linear layer for a self-assessment layer.
Storytelling is one of the important ways humans communicate with one another. To tell stories about data, we use data visualizations, or more simply put – graphs. There are some terrible graphs out there, and we’ve all seen them. But there are even more graphs that aren’t quite terrible, but are bad nonetheless. Scrolling through my Facebook feed the other day, I came across an article from the South Florida Sun Sentinel about the politics of guns. The leading data visualization caught my eye – but not in a good way. The colors were bothersome, things were too far apart to make visual comparisons, and the style of graph was less than ideal.
When I first started in the data science field, I heard about these mythical models called Random Forests. I couldn’t wait to learn more. If that sounds like you, read on.