**Why Isn’t Everything Normally Distributed?**

Adult heights follow a Gaussian, a.k.a. normal, distribution. The usual explanation is that many factors go into determining one’s height, and the net effect of many separate causes is approximately normal because of the central limit theorem. If that’s the case, why aren’t more phenomena normally distributed? Someone asked me this morning specifically about phenotypes with many genetic inputs. The central limit theorem says that the sum of many independent, additive effects is approximately normally distributed. Genes are more digital than analog, and do not produce independent, additive effects. For example, the effects of dominant and recessive genes act more like max and min than addition. Genes do not appear independently – if you have some genes, you’re more likely to have certain other genes – nor do they act independently – some genes determine how other genes are expressed. Height is influenced by environmental effects, such as nutrition, as well as by genetic effects, and these environmental effects may be more additive or independent than the genetic ones. Incidentally, if effects are independent but multiplicative rather than additive, the result may be approximately log-normal rather than normal.
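The contrast between adding, maximizing and multiplying independent effects can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the uniform effect sizes, the sample counts and the helper names `skewness` and `simulate` are all made up for this example and are not taken from the article. It uses sample skewness as a rough symmetry check, since a normal distribution has skewness near zero.

```python
import math
import random
import statistics


def skewness(xs):
    """Sample skewness: roughly 0 for a symmetric, normal-looking sample."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def simulate(n_samples=5000, n_effects=50, seed=0):
    """Combine independent effects three ways and report the skewness of each."""
    rng = random.Random(seed)
    sums, maxes, products = [], [], []
    for _ in range(n_samples):
        # Assumed effect sizes: independent draws from Uniform(0.5, 1.5).
        effects = [rng.uniform(0.5, 1.5) for _ in range(n_effects)]
        sums.append(sum(effects))            # additive effects
        maxes.append(max(effects))           # "dominant" effect wins
        products.append(math.prod(effects))  # multiplicative effects
    return {
        "sum": skewness(sums),
        "max": skewness(maxes),
        "product": skewness(products),
        "log(product)": skewness([math.log(p) for p in products]),
    }


results = simulate()
for name, value in results.items():
    print(f"{name:>12}: skewness = {value:+.2f}")
```

With these settings the sum should come out close to symmetric, the max clearly skewed, and the product heavily right-skewed until its logarithm is taken, which is what a log-normal shape would suggest.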

**A Speed Comparison Between Flexible Linear Regression Alternatives in R**

Everybody loves speed comparisons! Is R faster than Python? Is dplyr faster than data.table? Is STAN faster than JAGS? It has been said that speed comparisons are utterly meaningless, and in general I agree, especially when you are comparing apples and oranges, which is what I’m going to do here. I’m going to compare a couple of alternatives to lm() that can be used to run linear regressions in R but that are more general than lm(). One reason for doing this was to see how much performance you’d lose if you used one of these tools to run a linear regression (even if you could have used lm()). But as speed comparisons are utterly meaningless, my main reason for blogging about this is just to highlight a couple of tools you can use when you have grown out of lm(). The speed comparison was just to lure you in. Let’s run!

**Hash Table Performance in R: Part I**

A hash table, or associative array, is a well-known key-value data structure. In R there is no direct equivalent, but you do have some options. You can use a vector of any type, a list, or an environment. But as you’ll see, with all of these options performance is compromised in some way. In the average case a lookup for a key should perform in constant time, or O(1), while in the worst case it will perform in O(n) time, n being the number of elements in the hash table. For the tests below, we’ll implement a hash table with a few R data structures and make some comparisons. We’ll create hash tables with only unique keys and then perform a search for every key in the table.
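The shape of that test (unique keys, then a search for every key) can be sketched outside R as well. The example below is written in Python rather than R, so it says nothing about how R's vectors, lists or environments actually perform; it only contrasts an O(n) linear scan over key-value pairs with an average-case O(1) hashed lookup. The helper names `make_keys`, `linear_lookup` and `time_all_lookups`, and the table size of 2000, are assumptions made for this illustration.

```python
import time


def make_keys(n):
    """Generate n unique string keys."""
    return [f"key{i}" for i in range(n)]


def linear_lookup(pairs, key):
    """O(n) lookup: scan the (key, value) pairs until the key matches."""
    for k, v in pairs:
        if k == key:
            return v
    raise KeyError(key)


def time_all_lookups(lookup, keys):
    """Look up every key once; return the values and the elapsed seconds."""
    start = time.perf_counter()
    values = [lookup(k) for k in keys]
    return values, time.perf_counter() - start


keys = make_keys(2000)
pairs = [(k, i) for i, k in enumerate(keys)]  # unhashed key-value store
table = dict(pairs)                           # hashed key-value store

scan_values, scan_seconds = time_all_lookups(lambda k: linear_lookup(pairs, k), keys)
hash_values, hash_seconds = time_all_lookups(table.__getitem__, keys)

print(f"linear scan: {scan_seconds:.4f} s")
print(f"hash table:  {hash_seconds:.4f} s")
```

Both stores return the same values; only the time to search for all keys should differ, and the gap should widen as the number of keys grows.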

**Correlation and R-Squared for Big Data**

With big data, one sometimes has to compute correlations involving thousands of buckets of paired observations or time series. For instance, a data bucket might correspond to a node in a decision tree, a customer segment, or a subset of observations having the

**Hierarchy of Language**

Geller’s Differential Linguistics and his patents are studied and shown to be a development of Wittgenstein’s Tractatus Logico-Philosophicus and Moore’s Truth and Falsity. The only hierarchy of language – a text’s context and the multiplicity of subtexts surrounding it – is thoroughly examined; the nature of a text’s predicative phrases, clauses, sentences and paragraphs is clarified. A novel approach to Knowledge is proposed: Knowledge is Nothing, a singularity; there is the Nothingness of silence, non-predicative definition and text. Practical proofs of Differential Linguistics as the first ever verified Humanitarian theory are demonstrated. Geller suggests that only one hierarchy exists in language: a text’s context and the multiplicity of subtexts surrounding it, which are always composed of sets of paragraphs, sentences, clauses and predicative definitions.

**R: Optimal Binning for Scoring Modeling**

Binning is the term used in scoring modeling for what is also known in Machine Learning as Discretization, the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.
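To make the idea of discretization concrete, here is a minimal sketch that cuts a continuous characteristic into equal-frequency bins and reports the event rate of a binary variable in each bin. It is plain quantile binning, not the optimal binning procedure the article refers to, and everything in it is assumed for illustration: the helper names `quantile_cuts`, `assign_bin` and `event_rate_by_bin`, and the made-up `ages` and `defaulted` data.

```python
import bisect


def quantile_cuts(values, n_bins):
    """Return up to n_bins - 1 cut points that split values into equal-frequency bins."""
    ordered = sorted(values)
    cuts = {ordered[i * len(ordered) // n_bins] for i in range(1, n_bins)}
    return sorted(cuts)


def assign_bin(value, cuts):
    """Index of the bin containing value; bin i covers [cuts[i-1], cuts[i])."""
    return bisect.bisect_right(cuts, value)


def event_rate_by_bin(values, targets, cuts):
    """Share of observations with target == 1 in each bin (None for an empty bin)."""
    counts = [0] * (len(cuts) + 1)
    events = [0] * (len(cuts) + 1)
    for value, target in zip(values, targets):
        b = assign_bin(value, cuts)
        counts[b] += 1
        events[b] += target
    return [e / c if c else None for e, c in zip(events, counts)]


# Hypothetical data: applicant age and a binary default flag.
ages = [21, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
defaulted = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

cuts = quantile_cuts(ages, n_bins=3)
rates = event_rate_by_bin(ages, defaulted, cuts)
print("cut points:", cuts)
print("event rate per bin:", rates)
```

Each bin would then play the role of one attribute of the characteristic, with its event rate summarizing the relationship to the binary variable.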