Sunday, June 28, 2020

Understanding Statistics

Understanding statistics goes beyond understanding the mathematical aspects.

We are bombarded with statistics every day.  Everyone wants to get a point across, financially, politically, medically, or whatever - and is willing to use statistics to prove it.  At first, statistics seem like an irrefutable argument - the ultimate appeal to authority, expertism run wild.  How can you argue with the numbers?

Well, as it turns out, a number of ways.   Believe it or not, some statistics are entirely made-up.  Yes, act shocked, people lie to get their point across.   But even if they aren't lying, the raw data they may be using to get their point across may be flawed.

Telephone surveys, for example, are problematic, in that they only survey people who have telephones and who answer them.  Back in 1948, many people in rural areas didn't have phone service, so the newspapers called the election for Dewey, as their urban readers, when surveyed by phone, all claimed to be supporting that New York Governor.
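To see how this kind of coverage problem plays out, here is a minimal sketch in Python.  Every number in it is made up for illustration (the group names, the phone ownership rates, the levels of support), and it is not a reconstruction of any real poll.  It just compares what all voters think with what the voters you can reach by phone think.

```python
# Hypothetical electorate: every number here is made up for illustration.
groups = {
    # name: (share of all voters, share who have a phone, share backing candidate A)
    "urban": (0.50, 0.80, 0.60),
    "rural": (0.50, 0.20, 0.35),
}

def true_support(groups):
    """Support for candidate A among all voters."""
    return sum(share * backing for share, _, backing in groups.values())

def phone_poll_support(groups):
    """Support for candidate A among only the voters a phone poll can reach."""
    reachable = sum(share * phone for share, phone, _ in groups.values())
    backing_a = sum(share * phone * backing for share, phone, backing in groups.values())
    return backing_a / reachable

actual = true_support(groups)
polled = phone_poll_support(groups)
print(f"actual support: {actual:.1%}, phone poll says: {polled:.1%}")
```

With these invented numbers, the candidate has less than half the vote, but the phone poll shows him winning comfortably, simply because the people with phones are not the same as the people who vote.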

And of course, that is what they claimed.  Surveys are the worst sort of statistics of all, not only in the way they are accumulated, but in how people self-select by answering them.  The idiot who fills out a ten-page survey on his car-buying experience has filtered himself out as the kind of chump who answers ten-page surveys.  And people lie in their answers - again with the lying!  We lie to ourselves and we lie on surveys.  "How many drinks do you have a day, 0-1, 1-2, 2-3, or 3 or more?"  Oh, 1-2 surely.

So, the raw data, and the way it is collected, is the first area where statistics can be deeply flawed.  How the data is processed and displayed is a second.  For example, in the "gun violence" chart I discussed earlier, not only was the data apparently faked, but the criterion used - "gun deaths among wealthy countries" - made the US appear to be an outlier.  But when you factor in all countries, we come out behind Mexico and Montenegro, among others.  This is not to say gun violence is not a problem in the US (and apparently in Montenegro), but false comparisons don't make your point - they discredit your argument.

Similarly, how charts are prepared can make trends look more alarming than they are.  A chart that shows a stock price skyrocketing often doesn't show the scale (or shows it only in tiny numbers).  It is only when you realize the Y-axis begins at $20 and has a spread of ten cents, and the X-axis covers only the last ten minutes, that you realize the "big spike in share price!" is nothing more than statistical noise.
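The arithmetic behind the trick is simple enough to sketch.  The function name and the prices below are hypothetical, chosen to match a Y-axis that starts at $20 and spans ten cents.  It compares how much of the chart's height a price move fills with how big the move actually is.

```python
def visual_vs_actual(start, end, axis_min, axis_max):
    """Return (fraction of chart height the move covers, actual fractional change)."""
    visual = (end - start) / (axis_max - axis_min)
    actual = (end - start) / start
    return visual, actual

# Hypothetical prices on a Y-axis running from $20.00 to $20.10.
visual, actual = visual_vs_actual(start=20.02, end=20.08, axis_min=20.00, axis_max=20.10)
print(f"fills {visual:.0%} of the chart, actual change {actual:.2%}")

# The same six-cent move on a Y-axis that runs from $0 to $25.
honest, _ = visual_vs_actual(start=20.02, end=20.08, axis_min=0.0, axis_max=25.0)
print(f"on a zero-based axis it fills {honest:.2%} of the chart")
```

A six-cent move fills about 60 percent of the first chart, yet the price changed by only about 0.3 percent.  On an axis that starts at zero, the same move is a barely visible wiggle.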

Noise is something that most people don't understand.  If you are an Electrical Engineer, you know about it - the background static that appears in any signal, if from nothing else, from echoes of the Big Bang, or solar flares, or whatever.  There will always be minor random variations in any signal, and it is easy to confuse these with meaning.  A share price goes up or down a fraction of a percent - it may mean nothing, or it may be the start of a trend.  You don't really know until you accumulate more data, and that takes time.  Trading on some transient condition can result in disaster.
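One rough way to tell noise from something worth a second look is to compare the latest move with the usual day-to-day jitter.  The sketch below does that with a standard deviation.  The price changes are hypothetical, the function name is my own, and real share prices don't follow a tidy bell curve, so treat this as a rule of thumb and not a trading signal.

```python
import statistics

def move_in_sigmas(past_changes, latest_change):
    """How many standard deviations the latest change sits from the usual change."""
    mean = statistics.mean(past_changes)
    sigma = statistics.stdev(past_changes)
    return (latest_change - mean) / sigma

# Hypothetical daily percent changes for a share price.
past_changes = [0.2, -0.3, 0.1, -0.1, 0.4, -0.2, 0.0, 0.3, -0.4, 0.0]

small = move_in_sigmas(past_changes, 0.3)   # an ordinary day
large = move_in_sigmas(past_changes, 2.5)   # far outside the usual range
print(f"0.3% move: {small:.1f} sigma, 2.5% move: {large:.1f} sigma")
```

A move of a sigma or so is the kind of thing that happens all the time.  A move many sigmas out is unusual, but even then the number only tells you that something is different, not what it means.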

But there are other ways to skew the presentation to make your point.  I noted before that not putting data in per capita terms is a way of making it seem more alarming.  The US has more deaths due to Corona Virus than anyone else!  But per capita (at least at the time) the rates were far lower than in Italy or Spain.  Sort of an insult to the dead in those countries, no?  The media loves to publish charts showing the cumulative deaths or infections from the virus - charts which by their very design can never go down, and will always show numbers going up, up, up.
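Both effects are easy to show with a few lines of Python.  The countries and counts below are invented for illustration and are not real figures for any place.  The first part converts totals to deaths per 100,000 people, and the second shows that a cumulative total keeps climbing even while the daily count falls.

```python
from itertools import accumulate

# Hypothetical countries: (total deaths, population).  Made-up numbers.
countries = {
    "Bigland": (100_000, 300_000_000),
    "Smallia": (30_000, 50_000_000),
}

def deaths_per_100k(deaths, population):
    """Deaths per 100,000 residents."""
    return deaths / population * 100_000

rates = {name: deaths_per_100k(d, p) for name, (d, p) in countries.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f} deaths per 100,000")

# Hypothetical daily deaths that are steadily falling.
daily = [500, 400, 300, 200, 100]
cumulative = list(accumulate(daily))
print("daily:     ", daily)
print("cumulative:", cumulative)
```

Bigland has more than three times the deaths of Smallia, but a much lower rate per person.  And the cumulative line goes up every single day, even though the daily count dropped by 80 percent.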

The chart above is the "Our World in Data" Corona Virus death chart.  I like to use this chart to see how things are going, as it seems these folks have no dog in the fight.  But they can only chart the data that they get from sources.  And it turns out, each country or even State reports these statistics differently.  So even though it is comforting to look at this chart and see the deaths going down, we really have no way of knowing whether this is true or not.  And the recent "spike" in the chart illustrates why - several States (and I can only guess which ones) have upped their body count recently to include "probable" deaths, which is an interesting concept.  So the data spikes for one day, but that spike doesn't represent a sudden flood of bodies in the morgue.  Rather, it represents "probable" deaths from over a period of time, estimated by guesswork and all added to a single day's total.
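Here is a small sketch of what such a backlog does to a chart.  The series is hypothetical, the function is my own, and spreading the backlog evenly over the days it covers is an assumption, since the real dates of the "probable" deaths are usually not known.

```python
def spread_backlog(reported, dump_day, backlog, days_covered):
    """Remove a one-day backlog and spread it evenly over the days it covers.

    Spreading evenly is an assumption; the true dates are usually unknown.
    """
    start = dump_day - days_covered + 1
    if start < 0:
        raise ValueError("backlog covers more days than the series contains")
    adjusted = list(reported)
    adjusted[dump_day] -= backlog
    for day in range(start, dump_day + 1):
        adjusted[day] += backlog / days_covered
    return adjusted

# Hypothetical series: deaths falling by 10 a day, plus 500 "probable"
# deaths from the previous ten days, all added to the count on day 9.
real = [200 - 10 * day for day in range(10)]
reported = list(real)
reported[9] += 500

adjusted = spread_backlog(reported, dump_day=9, backlog=500, days_covered=10)
print("reported:", reported)
print("adjusted:", adjusted)
```

The reported series ends with an alarming one-day jump, while the adjusted series shows the same total number of deaths still trending steadily down.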

The infection rate data has the same problem, and yes, President Trump is right in that if you test more, you will find more infections (and I am sure the real infection rate could be far, far higher than we presently report).  But the media dismisses this out of hand, while at the same time saying the same thing - that so many people are undiagnosed.  They dismiss what Trump has to say, as they want to do a story about infections "spiking" because of millennials romping on the beach in Florida, and anything that disagrees with that narrative is dismissed.
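One way to separate "more testing" from "more infection" is to look at the share of tests that come back positive, not just the raw case count.  The weekly figures below are hypothetical.  Since people who get tested are not a random sample of the population, this is a rough guide at best.

```python
def positivity(confirmed, tests):
    """Share of tests that came back positive."""
    return confirmed / tests

# Hypothetical weeks: (tests run, confirmed cases).  Made-up numbers.
weeks = {
    "week 1": (10_000, 500),
    "week 2": (20_000, 1_000),   # twice the tests, twice the cases
    "week 3": (20_000, 2_000),   # same tests as week 2, twice the cases
}

rates = {name: positivity(cases, tests) for name, (tests, cases) in weeks.items()}
for name, rate in rates.items():
    print(f"{name}: {weeks[name][1]} cases, {rate:.0%} positive")
```

Cases double from week 1 to week 2, but so do tests, and the positive share doesn't budge.  From week 2 to week 3, cases double with no extra testing, and that is the kind of increase that more testing alone can't explain.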

It is sad, but both sides of the political spectrum are making hay from this nonsense.  Florida, for example, is trying to force New Yorkers to quarantine for 14 days if they enter the State (is that legal?).  Not to be outdone, in a Red State-Blue State tit-for-tat, Governor "mob boss" Cuomo is now requiring residents from Florida to quarantine for 14 days upon entering New York.  I'm glad both sides have put aside petty politics in this time of crisis.

Like I said before, good data is hard to come by.  We like to believe the government is expert in accumulating and processing this data, but the reality is it is an inexact process, and we need to take these numbers and their methodology with a grain of salt.