Election 2016: The Battle of the Statisticians...

10 posts

Macrobius

More from this fellow: http://statisticalideas.blogspot.com/2016/10/sea-of-faulty-polls.html

His argument is correct (and saying much the same as Taleb with a different vocabulary). He is basically arguing that either the 'house bias' effect should be included in the variance (error bars) -- and if it were properly so included it would raise Trump's chances from 15% [as Nate Silver says] to 30%, which is not a small number or a certain loss -- or else we are not in Mediocristan at all with these polls. The model error is *much* greater than a theoretical normal distribution allows.

A bit of background: Statisticians (and Econometricians) know that 'not everything is a Gaussian Bell Curve' like they teach in your first stats course. A much more plausible model is a linear model (GLM), and that comes in two flavours: 'Random Effects' and 'Fixed Effects'. The point of the discussion below is that a Random Effects model is not plausible. There are formal tests for this, and the effects cannot be drawn from the same distribution. This leaves the Fixed Effects GLM as a possibility -- there is a huge partisan bias to the polls -- but then the model uncertainty inherent in an FE GLM is not being quoted by the media. This guy guesses a factor-of-two understatement in the error bars shown, which is quite plausible given these data.
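To see why that factor of two matters, here's a back-of-the-envelope sketch under a simple normal model. The margin and quoted standard error below are made-up numbers chosen to land near the 15%/30% figures, not anyone's actual poll data:

```python
from math import erf, sqrt

def win_prob(margin, se):
    """P(true margin < 0) under a normal model: the trailing
    candidate wins if the realised margin crosses zero."""
    z = -margin / se
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical numbers: a ~3.1-point Clinton lead with a quoted SE
# of ~3 points, chosen to reproduce the 15% -> 30% move.
margin, quoted_se = 3.1, 3.0
print(win_prob(margin, quoted_se))        # ~0.15 with the quoted error bars
print(win_prob(margin, 2 * quoted_se))    # ~0.30 once house bias doubles the SE
```

Doubling the error bars roughly doubles the tail probability here -- that's the whole argument in one line.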

Another thing my Statistician's eye catches is that the second graph does not show a proper bias / variance tradeoff. It is not unexpected that the polls are biased (not to a Statistician it isn't). Nor is it surprising that the data are 'heteroskedastic' (variance gets larger with the bias effect). What is surprising is (1) the heteroskedasticity is not funnel shaped but *hour glass shaped* and (2) there is no evidence that the bias is buying anything in terms of reduced variance, which is a legitimate tradeoff. These are two red flags for any competent analyst.

It looks like partisan bias is driving *more* variance, not less. Translation for the layman: these polls are pure bullshit put out in the interest of partisans from both sides. Trump's odds (in the popular vote) are a coin toss, maybe better. Nate Silver is, as Taleb points out, full of it, because he is not doing maths anymore. What a correct analysis along the lines of simulating the electoral vote would look like, no one can say, because no one has published one yet (that we've seen, anyway). Silver's stuff is Garbage In, Garbage Out.

https://en.wikipedia.org/wiki/Heteroscedasticity
https://en.wikipedia.org/wiki/Bias–variance_tradeoff

Anyway, standard long excerpt to spare you the click :D

Rest at the link.
Macrobius
Augusto Pinochet
Early Voting Points to a Tight Election

The election has already begun, voting has started, and guess what: the polls which predicted a Hillary landslide are dead wrong. I can say one thing for certain: Hillary will not win in a landslide. And I think there's a 50% chance she won't even win. The early voting results paint a mixed, competitive picture of both good news and bad news.

First, the good news. Trump is up in Iowa without any adjustments, and Republicans are doing better in early voting there than they were in 2012. Georgia is not a battleground state, with Trump up 3 in the RCP average without any adjustments. Maine CD2 is probably going to go to Trump, but it's unlikely to matter.

The bad news is out west. Trump is matching Romney's results in Nevada, and Colorado is simply a lost cause. Trump will win Arizona -- Romney won it by 9.03%, and Trump is doing well enough in early voting that he's on track to win it, but he's doing worse than Romney. Also, Virginia is probably not really a battleground state anymore. It's gone.

But the real potential good news -- and the way Trump can win the election -- is a turnout boom in Ohio, Pennsylvania, and possibly Michigan. This is where the data are exciting and going to decide the election.
Broseph
This is pretty much how I see it, but I think election day turnout will win NV for Trump. He's winning every "must win" state. He just needs NV and one other and it's a done deal. Remember that polls had Clinton beating Sanders by 21% but Sanders won by 1%. Right now, polls show her about 5% ahead of Trump, FWIW.
Macrobius
http://www.thephora.net/forum/showthread.php?t=112542

I don't have time to explain this right now, but this is important.

http://statisticalideas.blogspot.com/2016/10/pollsters-gone-wild.html

He's going down the right path here by looking at the bias/variance relation (error analysis) of the polls -- this is what real data analysts who know what they are doing think about. He also does a bootstrap, which is the right thing to do in a classical (90s-ish) analysis, and builds a 0/1 classifier (which is the aughties way to do it).
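For anyone wondering what 'does a bootstrap' means concretely, here's a minimal stdlib-only sketch. The poll margins are invented for illustration, not his data:

```python
import random

def bootstrap_win_prob(margins, n_boot=10_000, seed=0):
    """Resample the observed poll margins with replacement and ask
    how often the resampled mean favours Trump (margin > 0)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_boot):
        sample = [rng.choice(margins) for _ in margins]
        if sum(sample) / len(sample) > 0:
            wins += 1
    return wins / n_boot

# Hypothetical national margins (Trump minus Clinton, in points):
polls = [-5, -3, -4, 1, -6, 2, -1, -4, -2, 0]
print(bootstrap_win_prob(polls))
```

The resampling distribution gives you an empirical error bar on the average margin without ever assuming a Gaussian.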

Bias and variance is lecture 9 of this mid-aughties set of course notes:

http://web.engr.oregonstate.edu/~tgd/classes/534/

The 'classical' part of his analysis shows Trump at 20% probability of winning the *popular* vote (it is too bad he doesn't repeat Silver's analysis of the electoral college; then we would all know).

Key finding is that the Liberal polls have a >=0.1% bias for some reason.

Now what all this means I will try to address later. I would caution the mathematical people here not to jump to rash conclusions such as 'we should just throw out the Liberal polls'. Downweighting information *all the way to zero* is seldom the optimal course, even if it is a rough heuristic. We can do better.

There are ways (*cough* Kalman filter *cough*) to recover signal from noisy data without selling yourself short.
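Here's a one-step scalar Kalman update as a sketch of that idea -- all numbers invented. The point is that a biased poll gets its variance inflated and then still contributes *something*, rather than being thrown out entirely:

```python
def kalman_update(prior_mean, prior_var, obs, obs_var):
    """One scalar Kalman step: blend a prior estimate with a noisy
    observation, weighting each by its precision (1/variance)."""
    k = prior_var / (prior_var + obs_var)   # Kalman gain
    mean = prior_mean + k * (obs - prior_mean)
    var = (1 - k) * prior_var
    return mean, var

# Hypothetical: current estimate of the margin is -2 +/- 2 points.
# A house-biased poll reads -6; instead of discarding it, inflate
# its variance (here 4x the prior's) so it only moves us a little.
mean, var = kalman_update(-2.0, 4.0, -6.0, 16.0)
print(mean, var)   # -2.8 3.2
```

With the poll's variance set four times the prior's, the gain is only 0.2, so the -6 reading drags the estimate from -2.0 to -2.8 instead of all the way over -- downweighted, not zeroed out.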

I wish he'd gone further and applied the (post-2000) analysis of Pedro Domingos (who is also the author of The Master Algorithm, which I've been recommending lately).

http://homes.cs.washington.edu/~pedrod/papers/mlc00a.pdf

Anyone who wants to get at the truth of these polls will read that paper *very* carefully and maybe apply what they learn.

Bias Variance analysis is *absolutely* the way to go -- but we need to think in terms of 'learning a binary classifier' for states, given the polls as training data. Think of '1' as votes for Trump, and '0' as anything else.

Here's a starter:

- a Bernoulli trial is a single 1-or-0 outcome (like a coin flip); predicting the electoral college means calling a whole sequence of them.

- bad predictions have a 'loss' function which is brutal -- if you get a state wrong, add "1" to your loss function, otherwise "0".

- now find a learning algo that beats everyone else's loss function, even Nate Silver's, and predict the election for us :D

This is easier than it sounds -- the amazing thing is that no one has actually done it *correctly*. They are all off chasing last year's delusions.
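A minimal sketch of that loss function, with invented state calls (these are not real predictions, just an illustration of the scoring rule):

```python
def zero_one_loss(predicted, actual):
    """0/1 loss over states: +1 for every state called wrong."""
    return sum(1 for s in predicted if predicted[s] != actual[s])

# Toy example with made-up calls; 1 = Trump, 0 = anything else.
predicted = {"OH": 1, "PA": 0, "NV": 0, "IA": 1, "CO": 0}
actual    = {"OH": 1, "PA": 1, "NV": 0, "IA": 1, "CO": 0}
print(zero_one_loss(predicted, actual))   # 1 -- only PA was miscalled
```

Whoever minimises this over all the Bernoulli trials has the best election model, full stop.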
Macrobius

*predicting the election means getting 54 or so Bernoulli trials correctly predicted.

Macrobius

Anyone wanting to do real stats should pay attention to Andrew Gelman (main popularizer of MCMC in the 90s, and hence the starter of the Bayesian conquest before Machine Learning). Atari-era, PC-based stats. Great stuff.

http://www.stat.columbia.edu/~gelman/research/unpublished/swing_voters.pdf - 'The Mythical Swing Voter'

No evidence sharks swing elections http://andrewgelman.com/2016/10/29/no-evidence-shark-attacks-swing-elections/

Mr P can solve problems with survey weighting http://andrewgelman.com/2016/10/12/31398/ (So why don't the Pollsters bother? [**])

and of course he has a book, Red State/Blue State.

[**] I divide statisticians into 3 classes -- C class statisticians have no qualifications at all. 90% of the people you meet in industry have an MBA or at best have taken one course in their STEM careers. All journalists, psychology practitioners, many economists who aren't econometricians, and some popularisers (Paulos) are C class. B class statisticians might have a Ph.D., but probably couldn't get an academic job at a good university, with prejudice. They tend to use outdated methods and not quite understand what they are doing. This makes them a bit dangerous. They are found in polling organizations, and less often in industry except as consultants. In the land of the commercial C class, the B class statistician is king. A class statisticians (and econometricians) are simply that -- competent, current, likely to have an academic post and the respect of their peers. Taleb and Gelman are popular examples. I'm not sure Nate Silver is A class -- he seems a little bit quackish to me, but he could be A class and dumbing down his explanations. He gets a lot of flak from the real A class sort.

My categories are not hard to follow. People use this classification for doctors all the time. C class is 'persons with no qualification playing at being doctor'. B class are quacks. A class are real doctors.[/PDF]

Macrobius

err could a mod fix the above poast? It ended up syntactically malformed somehow.

Broseph
it was the
Code:
[PDF]
tag.

And thank you for this thread. I am reading.
Macrobius

So let's get this done.

Analysis has to start with 'error analysis'. Why? Because modern science started sometime in the 19th century, in German laboratories. It spread to England and America, mostly by person-to-person contact (apprenticeship). German science was bombed into oblivion in 1945, but some German scientists migrated to the US, and we had science here too by then. And this is what you do. You start designing experiments with error analysis.

There are three kinds of error -- systematic error, random error (instrumental measurement error) or noise, and statistical error (sampling error, variance).

For example, if you want to measure how long a stick is, you measure it several times with a micrometer or some other calibrated setup. You will get a distribution with a central tendency (estimated by the mean) and some measure of variation around that. For lots of measurements, that distribution will tend to a Gaussian (bell curve) -- if you live in Mediocristan, and sticks being measured as to length do.
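A quick simulation of the stick experiment (all numbers invented). Each reading is the true length plus a sum of small independent errors, so the readings pile up in a bell curve around the truth:

```python
import random, statistics

rng = random.Random(42)
true_length = 100.0          # mm, the stick's "ground truth"

# 1000 micrometer readings, each perturbed by 12 small independent
# errors; by the central limit theorem their sum is near-Gaussian.
readings = [true_length + sum(rng.uniform(-0.05, 0.05) for _ in range(12))
            for _ in range(1000)]

print(round(statistics.mean(readings), 2))    # central tendency, ~100.0
print(round(statistics.stdev(readings), 3))   # spread of the measurements
```

The mean recovers the true length; the standard deviation is your honest error bar.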

So, no matter what you measure you end up:

Loss function = bias (systematic error) + variance (sampling error) + noise (instrumental measurement error).

The overall loss rate in an election is a killer -- you end up miscalling a state, and your prediction of electoral votes is off.

Salil Mehta gives the classical version of this formula (for a mean square error loss function -- which we can only use in Mediocristan):

prediction error = bias^2 + variance of errors + irreducible errors
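You can check Mehta's decomposition numerically. This sketch invents a biased, noisy estimator (all parameters made up) and confirms that the empirical mean square error comes out at bias^2 + variance + noise:

```python
import random

rng = random.Random(0)
N = 200_000
truth, bias, est_sd, noise_sd = 5.0, 0.5, 1.0, 0.3

sq_errors = []
for _ in range(N):
    y = truth + rng.gauss(0, noise_sd)           # ground truth + measurement noise
    y_hat = truth + bias + rng.gauss(0, est_sd)  # biased, noisy estimator
    sq_errors.append((y_hat - y) ** 2)

mse = sum(sq_errors) / N
print(round(mse, 2))                                  # empirical prediction error
print(round(bias**2 + est_sd**2 + noise_sd**2, 2))    # 0.25 + 1.00 + 0.09 = 1.34
```

The two numbers agree to simulation error, which is the whole content of the formula.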

what does 'irreducible errors' mean? Every measurement has what is called 'ground truth' -- the stuff you don't question but just accept. You have to start somewhere or face an infinite regress. Ground truth for an election has to be the tally of votes you count. Sure, it is possible to make innocent or not-so-innocent errors in counting (fraud). Voting machines can be programmed to flip votes, and illegal voters can be less than ideal.

Let's borrow a concept from 'evidence based medicine'. When we are considering a diagnostic test, there is a 'golden diagnostic test' -- the best available procedure for making an objective diagnosis. We say some other test has failed only if we have a Verifiable Operational Procedure for knowing the right answer (and the right answer is our golden diagnostic procedure).

So here is how we deal with fraud in elections: we use court-ordered recounts and *certify* the results. The certified election result is our first, last, and only defence against the scum of the universe. It is our golden data set. Let's remember this when we start whinging about exit polls next week.

So what is noise? Once we know the operational procedure for ground truth, we can make verifiable claims of error -- noise.

Noise is anything that causes the certified outcome of the election to differ from what the voter intended:

y = f(x) + epsilon

x is the point we are sampling (a voter). f is the function that puts a *label* on the voter. Labels are what the court says is the truth after judging all the evidence; that is ground truth. y is what the voter intended to do.

So, if there is fraud, or a miscount, or an evil Maxwell's demon in the voting machine, it causes epsilon -- a difference between y (what the voter intended to do, and likely says they did to an exit pollster) and f(x) (what the court labels the outcome of the attempted vote).

So, irreducible error means *noise*, and noise (measurement error) includes miscounting and fraud. It doesn't count things we don't like about reality not being ideal. We might wish that x, an illegal immigrant, would not be allowed to vote by showing just an ID. But we live in a universe where x voted, the court-ordered f(x) is a label for Clinton, and y is how x intended to vote. There is no noise there.

In many discussions, noise is presumed to be zero (we assume undetected fraud and miscounts approved by the courts do not in fact occur). This is likely a reasonable assumption.

Now, for Pedro Domingos's general decomposition, we need some concepts (applied to a set of data such as a poll) from which we are going to try to learn f and predict the election given x* -- the people who actually voted. We want y* = f(x*) -- the *best prediction* of the outcome is y*.

In addition, there is the 'main prediction' -- which depends on the loss function. In classical statistics, we use the loss function of Mediocristan (mean square error) and the 'main prediction' is the mean. In fat-tail territory, we use mean absolute deviation (MAD) and the 'main prediction' is the median. In 1/0 loss function land (our current analysis) we use the *mode*. Thus, in Illinois, the main prediction is the label ym, 'Clinton'. In Texas, the main prediction ym is currently 'Trump'. That is because either 1s or 0s are the mode -- whoever has the most voats.
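A tiny illustration of 'the main prediction depends on the loss function', using an invented seven-voter poll:

```python
import statistics

# Toy "poll" of seven voters in one state: 1 = Trump, 0 = anything else.
votes = [1, 1, 0, 1, 0, 1, 1]

# The main prediction depends on the loss function you charge yourself:
mean_pred   = statistics.mean(votes)    # minimises mean square error
median_pred = statistics.median(votes)  # minimises mean absolute deviation
mode_pred   = statistics.mode(votes)    # minimises 0/1 loss -- our case

print(mean_pred, median_pred, mode_pred)
```

For 0/1 data the median and mode coincide, but the mean does not -- which is exactly why quoting poll *averages* and then scoring a *call* are different games.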

Bias, for Domingos, is L(ym, y*) -- the difference between the best prediction (election outcome) and the main prediction (who got the most voats in the poll). If the polls don't predict the election, they are biased. The measure of bias is the average loss function. (Note that in Mehta's classical treatment, Bias is called Bias-squared. That is an artifact of the metric, and we call Bias^2 the Bias term. There is no square in the 1/0 version of the formula.)

Variance, similarly, is the loss between y and ym (the poll data and the main prediction). It is decomposed into two sorts -- the average variance on the biased points (where the bias is 1), and the average variance on the unbiased points (where the bias is 0).

So, setting Noise to zero,

L = <B> + Vu - Vb

(the prediction error rate will be the average Bias, in the definition of Domingos, + Vu - Vb where those terms are, respectively, the average variance for the unbiased case, and the average variance for the biased case)
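And here is the whole 0/1 decomposition checked end to end on an invented table of state predictions -- per-state bias and variance, then L = <B> + Vu - Vb exactly:

```python
from collections import Counter

# Predictions for 4 states across 5 bootstrap training sets (one row
# per state), plus the noise-free ground-truth label y* for each
# state. All made up for illustration.
preds = [
    [1, 1, 1, 0, 1],   # state A
    [0, 0, 1, 0, 0],   # state B
    [1, 0, 0, 0, 1],   # state C
    [1, 1, 1, 1, 1],   # state D
]
truth = [1, 1, 0, 1]

n_states, n_sets = len(preds), len(preds[0])

# Average 0/1 loss over every (state, training set) pair.
avg_loss = sum(p != y for row, y in zip(preds, truth)
               for p in row) / (n_states * n_sets)

mean_bias = vu = vb = 0.0
for row, y_star in zip(preds, truth):
    ym = Counter(row).most_common(1)[0][0]          # main prediction = mode
    bias = float(ym != y_star)                      # 0/1 bias for this state
    variance = sum(p != ym for p in row) / n_sets   # P(prediction != mode)
    mean_bias += bias / n_states
    vu += (1 - bias) * variance / n_states          # variance on unbiased states
    vb += bias * variance / n_states                # variance on biased states

print(avg_loss, mean_bias + vu - vb)   # the two numbers agree exactly
```

Note the sign: on a biased state, variance *helps* (it sometimes flips you onto the right answer), which is why Vb enters with a minus. That is the non-classical part of the 1/0 formula.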