Dialog is a domain-specific language for creating works of interactive fiction. It is heavily inspired by Inform 7 (Graham Nelson et al. 2006) and Prolog (Alain Colmerauer et al. 1972). An optimizing compiler, dialogc, translates high-level Dialog code into Z-code, a platform-independent runtime format originally created by Infocom in 1979.

Machine Learning: Dimensionality Reduction via Linear Discriminant Analysis

A machine learning algorithm (such as classification, clustering or regression) uses a training dataset to determine weight factors that can be applied to unseen data for predictive purposes. Before implementing a machine learning algorithm, it is necessary to select only the relevant features in the training dataset. The process of transforming a dataset so as to retain only the features relevant for training is called dimensionality reduction. Dimensionality reduction is important for three main reasons:
• Prevents Overfitting: A high-dimensional dataset with too many features can lead to overfitting (the model captures both real and random effects).
• Simplicity: An over-complex model with too many features can be hard to interpret, especially when the features are correlated with each other.
• Computational Efficiency: A model trained on a lower-dimensional dataset is computationally efficient (executing the algorithm requires less computation time).
Dimensionality reduction therefore plays a crucial role in data preprocessing.
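As a minimal sketch of the technique named in the title, scikit-learn's LinearDiscriminantAnalysis can project a labelled dataset onto a few class-discriminating axes. The Iris dataset here is just a stand-in example, not data from the article:

```python
# A minimal sketch of supervised dimensionality reduction with LDA,
# using scikit-learn's LinearDiscriminantAnalysis on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)      # 150 samples, 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)    # project onto 2 discriminant axes

print(X.shape)          # (150, 4)
print(X_reduced.shape)  # (150, 2)
```

Note that LDA is supervised (it uses the labels y) and can produce at most n_classes − 1 components, so two axes is the maximum for the three Iris classes.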

Natural Language Interface to DataTable

You have to write SQL queries to retrieve data from a relational database, and sometimes those queries get complex. Wouldn't it be amazing if you could instead retrieve data from a database by asking a chatbot in simple English? That's what this tutorial is all about.
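As a toy illustration of the idea (this is not the tutorial's actual chatbot, and the table names are made up), even simple pattern matching can turn a restricted form of English into SQL:

```python
# A toy English-to-SQL translator based on pattern matching.
# Real natural language interfaces use far more robust parsing.
import re

def english_to_sql(question: str) -> str:
    """Translate 'show me all <table>' / 'how many <table>' into SQL."""
    q = question.lower().strip().rstrip("?")
    m = re.match(r"show me all (\w+)", q)
    if m:
        return f"SELECT * FROM {m.group(1)};"
    m = re.match(r"how many (\w+)", q)
    if m:
        return f"SELECT COUNT(*) FROM {m.group(1)};"
    raise ValueError("Sorry, I don't understand that question yet.")

print(english_to_sql("Show me all customers"))  # SELECT * FROM customers;
print(english_to_sql("How many orders?"))       # SELECT COUNT(*) FROM orders;
```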

Data Literacy: Using the Socratic Method

How can organizations and individuals promote Data Literacy? Data literacy is all about critical thinking, so the time-tested method of Socratic questioning can stimulate high-level engagement with data.

Designing Tools and Activities for Data Literacy Learners

Data-centric thinking is rapidly becoming vital to the way we work, communicate and understand in the 21st century. This has led to a proliferation of tools that help novices operate on data to clean, process, aggregate, and visualize it. Unfortunately, these tools have been designed to support users rather than learners who are trying to develop strong data literacy. This paper outlines a basic definition of data literacy and uses it to analyze the tools in this space. Based on this analysis, we propose a set of pedagogical design principles to guide the development of tools and activities that help learners build data literacy. We outline a rationale for these tools to be strongly focused, well guided, very inviting, and highly expandable. Based on these principles, we offer an example of a tool and accompanying activity that we created. Reviewing the tool as a case study, we outline design decisions that align it with our pedagogy. Discussing the activity that we led in academic classroom settings with undergraduate and graduate students, we show how the sketches students created while using the tool reflect their adeptness with key data literacy skills based on our definition. With these early results in mind, we suggest that to better support the growing number of people learning to read and speak with data, tool designers and educators must design from the start with these strong pedagogical principles in mind.

A Data and Analytics Leader’s Guide to Data Literacy

Imagine an organization where the marketing department speaks French, the product designers speak German, the analytics team speaks Spanish and no one speaks a second language. Even if the organization was designed with digital in mind, communicating business value and why specific technologies matter would be impossible. That’s essentially how a data-driven business functions when there is no data literacy. If no one outside the department understands what is being said, it doesn’t matter if data and analytics offers immense business value and is a required component of digital business.

Being right matters : model-compliant events in predictive processing

While prediction errors (PE) have been established to drive learning through adaptation of internal models, the role of model-compliant events in predictive processing is less clear. Checkpoints (CP) were recently introduced as points in time where expected sensory input resolves ambiguity regarding the validity of the internal model. Conceivably, these events serve as on-line reference points for model evaluation, particularly in uncertain contexts. Evidence from fMRI has shown functional similarities of CP and PE to be independent of event-related surprise, raising the important question of how these event classes relate to one another. Consequently, the aim of the present study was to characterise the functional relationship of checkpoints and prediction errors in a serial pattern detection task using electroencephalography (EEG). Specifically, we first hypothesised a joint P3b component of both event classes to index recourse to the internal model (compared to non-informative standards, STD). Second, we assumed the mismatch signal of PE to be reflected in an N400 component when compared to CP. Event-related findings supported these hypotheses. We suggest that while model adaptation is instigated by prediction errors, checkpoints are similarly used for model evaluation. Intriguingly, behavioural subgroup analyses showed that the exploitation of potentially informative reference points may depend on initial cue learning: strict reliance on cue-based predictions may result in less attentive processing of these reference points, thus impeding upregulation of response gain that would prompt flexible model adaptation. Overall, the present results highlight the role of checkpoints as model-compliant, informative reference points and stimulate important research questions about their processing as a function of learning and uncertainty.

The Divergence Index: A new polarization measure for ordinal categorical variables

In the statistical literature, many indicators are known for measuring the degree of polarization in ordinal data. Typically, the widely used measures of distributional variability are defined as a function of a reference point that, in some sense, can be considered representative of the entire population. Such a function indicates how much all the values differ from the point considered 'typical'. Of all measures of variability, the variance is a well-known example that uses the mean as a reference point. However, mean-based measures depend on the scale applied to the categories (Allison & Foster, 2004) and are highly sensitive to outliers. An alternative approach is to compare the distribution of an ordinal variable with that of maximum dispersion, the two-point extreme distribution (i.e. a distribution in which half of the population is concentrated in the lowest category and half in the top category). Using this procedure, three measures of variation for ordinal categorical data have been suggested: the Linear Order of Variation, LOV (Berry & Mielke, 1992), the Index of Ordinal Variation, IOV (Leik, 1966), and the COV (Kvalseth, Coefficients of variations for nominal and ordinal catego…). All these indices are based on the cumulative relative frequency distribution (CDF), since this contains all the distributional information of any ordinal variable (Blair & Lacy, 1996). Consequently, none of these measures relies on assumptions about distances between categories.
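To make the CDF-based idea concrete, here is a hedged sketch of one such measure (this is a generic normalized form in the spirit of these indices, not necessarily any of the exact published LOV/IOV/COV formulas). Each cumulative proportion F_i contributes F_i(1 − F_i), which is maximal when F_i = 0.5, i.e. exactly at the two-point extreme distribution:

```python
# A hedged sketch of a CDF-based ordinal variation measure. It uses no
# assumptions about distances between categories and peaks at the
# two-point extreme distribution, where every cumulative F_i equals 0.5.

def ordinal_variation(freqs):
    """freqs: relative frequencies over k ordered categories (sum to 1)."""
    k = len(freqs)
    # cumulative proportions F_1 .. F_{k-1} (F_k = 1 carries no information)
    F, cum = [], 0.0
    for f in freqs[:-1]:
        cum += f
        F.append(cum)
    # each term F_i * (1 - F_i) is at most 0.25, reached when F_i = 0.5;
    # normalize so the two-point extreme distribution scores exactly 1.0
    return sum(Fi * (1 - Fi) for Fi in F) / (0.25 * (k - 1))

print(ordinal_variation([0.5, 0.0, 0.0, 0.5]))  # 1.0 (maximum polarization)
print(ordinal_variation([0.0, 1.0, 0.0, 0.0]))  # 0.0 (no dispersion)
```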

Disposable Technology: A Concept Whose Time Has Come

Imagine…imagine that you have been challenged to play Steph Curry, the greatest 3-point shooter in the history of the National Basketball Association, in a game of 1-on-1. Yeah, a pretty predictable outcome for 99.9999999% of us. But now imagine that Steph Curry has to wear a suit of knight's armor as part of that 1-on-1 game. The added weight, the obstructed vision, and the lack of flexibility, agility and mobility would probably allow even an average basketball player to beat him. Welcome to today's technology architecture challenge!

Deep Knowledge: Next Step After Deep Learning

Data science has been around since mankind first did experiments and recorded data. It is only since the advent of big and heterogeneous data that the term 'Data Science' was coined. With such a long and varied history, the field should benefit from the great diversity of perspectives brought by practitioners from different fields. My own path started with signal analysis. I was building high-speed interferometric photon counting systems, where my 'data science' was dominated by signal-to-noise and information encoding. The key aspect of this was that data science was applied to extend or modify our knowledge (understanding) of the physical system. Later my data science efforts focused on stochastic dynamical systems. While the techniques and tools employed were different from those used in signal analysis, the objective remained the same: to extend or modify our knowledge of a system.

How to Increase the Impact of Your Machine Learning Model

Typically, industry machine learning projects aren't based on a fixed, preexisting reference dataset like MNIST. A lot of effort goes into procuring and cleaning training data. As these tasks are highly project-specific and can't be generalized, they are rarely talked about and receive little attention. The same is true of the post-modelling steps: How do you bring your model into production? How will the model outputs create actual business value? And by the way, shouldn't you have been thinking about these questions beforehand? While model serving workflows are somewhat transferable, monetization strategies are usually specific and not made public. With these considerations, we can paint a more accurate picture.

Five Command Line Tools for Data Science

One of the most frustrating aspects of data science can be the constant switching between different tools whilst working. You can be editing some code in a Jupyter Notebook, installing a new tool on the command line, and editing a function in an IDE, all whilst working on the same task. Sometimes it is nice to find ways of doing more things in the same piece of software. In the following post, I am going to list some of the best tools I have found for doing data science on the command line. It turns out that far more tasks can be completed via simple terminal commands than I first thought, and I wanted to share some of those here.
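As a small taste of the idea (the file and column names here are made up for illustration), classic Unix tools alone can already do a surprising amount of data work:

```shell
# Build a toy CSV, then answer two questions about it without leaving
# the terminal -- no notebook or IDE required.
printf 'city,temp\nParis,21\nOslo,14\nParis,23\n' > weather.csv

# count data rows (excluding the header)
tail -n +2 weather.csv | wc -l

# mean of the temp column with awk
tail -n +2 weather.csv | awk -F, '{ sum += $2; n++ } END { print sum / n }'
```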

Genetic Artificial Neural Networks

Artificial Neural Networks are inspired by the nature of our brain. Similarly, Genetic Algorithms are inspired by the nature of evolution. In this article I propose a new type of neural network to assist in training: Genetic Neural Networks. These neural networks carry properties such as fitness, and use a Genetic Algorithm to train randomly generated weights. Genetic optimization occurs prior to any form of backpropagation, to give any type of gradient descent a better starting point. This project can be found on my GitHub, with an explanation in the snippets below.
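To sketch the core idea (this is a hedged toy example, not the author's actual GitHub code): a genetic algorithm can evolve the weights of a tiny one-weight "network" before any gradient descent runs, so that backpropagation starts from a fit individual rather than from random noise:

```python
# Hedged sketch: evolve a good starting weight for y = w * x by selection
# and mutation, before handing it off to gradient descent.
import random

random.seed(0)
DATA = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]  # toy targets: y = 2 * x

def fitness(w):
    """Negative squared error of the one-weight model y = w * x."""
    return -sum((w * x - y) ** 2 for x, y in DATA)

def evolve(generations=50, pop_size=20):
    pop = [random.uniform(-5, 5) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]                          # selection
        children = [w + random.gauss(0, 0.1) for w in survivors]  # mutation
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(round(best, 1))  # close to 2.0: a good starting point for backprop
```

A real Genetic Neural Network would evolve whole weight matrices and then continue with backpropagation from the fittest individual; the selection-and-mutation loop is the same.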

Inferring New Relationships using the Probabilistic Soft Logic

The main purpose of data is to deliver useful information that can be manipulated to make important decisions. In the early days of computing, such useful information could be queried from direct data such as that stored in databases. If the information queried for was not available in those data stores, then the system would not be able to respond to the users' queries. At present, however, if we consider the massive amount of data on the web and the exponentially growing number of web users, it is hard to predict and store exactly what these users will be querying for. Simply stated, we cannot manually hard-code the responses to each and every expected question from the users. So what is the solution? The best answer is to extract knowledge from data, then refine and recompose it into a knowledge graph, which we can use to answer queries. However, knowledge extraction has proven to be a non-trivial problem. Hence, we use a statistical relational learning methodology that considers past experiences and learns new relationships or similarities between facts in order to extract such knowledge. Probabilistic Soft Logic (PSL) is one such statistical relational learning framework that is used for building knowledge graphs.

Probabilistic Soft Logic (PSL)

Probabilistic soft logic (PSL) is a machine learning framework for developing probabilistic models. PSL models are easy to use and fast. You can define models using a straightforward logical syntax and solve them with fast convex optimization. PSL has produced state-of-the-art results in many areas spanning natural language processing, social-network analysis, knowledge graphs, recommender systems, and computational biology. The PSL framework is available as an Apache-licensed, open source project on GitHub with an active user group for support.
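To give a flavour of that logical syntax, a hypothetical rule set for link prediction in a social network might look like the fragment below (the predicate names are made up for this example; check the PSL documentation for the exact rule-file grammar). The number before each colon is the rule's weight, and `^2` requests a squared hinge-loss potential:

```
10: Friends(A, B) & Votes(A, P) -> Votes(B, P) ^2
1:  !Votes(A, P) ^2
```

The first rule says that friends tend to vote alike; the second is a weak prior that most people do not vote for most parties. Inference then finds soft truth values in [0, 1] that best satisfy the weighted rules via convex optimization.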

You Are Having a Relationship With a Chatbot!

Mindful strategies to recognize a chatbot, and to stop it from getting too close. In the age of chatbot proliferation, we often wonder who we are talking to. When we chat with customer service representatives online, we expect the first-line representatives to be chatbots. We tell them about our problems, and chatbots help us find a solution. The 'problem and solution' nature of customer service presents a classic problem well suited to artificial intelligence. Taking this one step further, what if you are a twenty-something career-driven person who has literally no time during your packed working day to check in with your significant other? You've worked hard to earn your significant other's love. You want to juggle your relationship with your career. But there's just no time.

Distributed Deep Learning Pipelines with PySpark and Keras

In this notebook I use the PySpark, Keras, and Elephas Python libraries to build an end-to-end deep learning pipeline that runs on Spark. Spark is an open-source distributed analytics engine that can process large amounts of data with tremendous speed. PySpark is simply the Python API for Spark, which lets you use an easy programming language like Python while leveraging the power of Apache Spark.
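As a conceptual sketch of what such a pipeline does under the hood (this is plain Python, not actual PySpark or Elephas code): data-parallel training partitions the data, lets each worker compute a gradient on its own shard, and has the driver average the results and update the shared model, which is the pattern Elephas implements with Spark workers and a Keras model:

```python
# Conceptual sketch of data-parallel gradient averaging, with a one-weight
# model y = w * x standing in for a real neural network.
DATA = [(x, 3.0 * x) for x in range(1, 9)]  # toy targets: y = 3 * x

def gradient(w, shard):
    """d/dw of the mean squared error on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_distributed(w=0.0, workers=4, steps=100, lr=0.01):
    shards = [DATA[i::workers] for i in range(workers)]  # partition the data
    for _ in range(steps):
        grads = [gradient(w, s) for s in shards]  # each "worker" computes one
        w -= lr * sum(grads) / workers            # driver averages and updates
    return w

print(round(train_distributed(), 2))  # converges toward 3.0
```

In the real pipeline, Spark handles the partitioning and scheduling, and the gradients come from Keras models running on the executors.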