Thursday, December 31, 2015

Big Data, Bad Data

More than ever, it seems, data, statistics, surveys, polls, ratings, and reviews are flung in our face.  Are any of these good indicia of what reality is?

In my last few postings, I was ranting (and yes, I rant) about shitty surveys, statistics, and polls.  And if you came to this blog looking for ruminations on finances, you might wonder why I talk about this.

Well, as I have noted before, it is not hard to succeed in America, if you act rationally in an irrational world.    If you can perceive reality as it is, rather than as you'd like it to be, you can make rational choices in life and get ahead.

On the other hand, if you live in a fantasy-world, you will continue to make poor choices in life (in every sense of the word) and end up miserable.

The problem is, how do we go about perceiving reality?   And compounding this problem is that there are a whole lot of forces in the world that want to skew your perception of reality.

Who am I talking about?   Merchants.  Marketers.  Religions.  Politicians.  Con Artists.   The usual suspects.

They want to distort your vision of reality so they can get at your pocketbook and get you to do something against your own self-interest.

And surveys, polls, statistics, studies, online reviews, awards, credentialism, and whatnot are used to persuade you of things that would be plainly untrue to the average six-year-old.   But they get away with it - a lot.   We are lied to on a regular basis.

The three problems with surveys, statistics, and polls are that an awful lot of bad data is collected, it is processed poorly, and then improper conclusions are drawn from it.

For example, as I noted in my last posting, some grad students at Cornell set out to prove that people order more food when they have a fat waitress.   Right off the bat you see the problem - they have a conclusion they want to reach, and set out to find data to prove it.   The data sample size was pathetically small, drawn from a small region of the country (not nationwide), and based on visual observation, not on actual measurement.   The data is poor.

The processing of the data, I can find no fault with, as it appears they used standard statistical processes - although the correlation does seem a little weak.

And the conclusions - well, they are foregone, as they set out to prove the conclusion, which tainted the whole study.

This is, of course, typical of most bad data out there.  And historically, most data has been bad.   Consider surveys and polls.   A.C. Nielsen has provided "ratings" for decades on television shows (and radio shows) based largely on self-reported data in the form of "diaries" kept by a self-selected number of users.   Today they have improved somewhat by using an electronic "Nielsen box" - but even those are problematic.

I was once asked to be a "Nielsen family" and it was interesting.   They sent me a diary and I was supposed to fill it out.   The initial selection of course, was apparently random.  Or was it?   It was random based on an address database they had no doubt purchased.   So right off the bat, people not on an address database (people who move often, such as college students) are not selected.

Second, I had to respond to the request.   A lot of folks would toss the whole thing in the garbage.   You had to desperately want to be a Nielsen family to be selected.   So there is a second level of filtering.   Busy people - people with good incomes - don't have time to fill out "diaries" every day.

The third level really filtered people out - and it filtered me out.   I was told to keep a "diary" for a couple of weeks, and then they would evaluate my diary and see if I qualified.   Since I was working three jobs and going to night school, I didn't fill out the diary regularly, so I was filtered out.   The fact that I didn't have time to watch much television probably didn't help matters any - they wanted people who watched a lot.  An empty diary doesn't tell them much, does it?

Yet, for years, the television networks lived and died by this data, which, in my mind, is highly suspect, if only because the raw data is skewed.   And the Nielsen people know this, which is why they are always looking for better ways to track viewership and listenership.

The mother of all poor polling incidents.   You would have thought we would have learned from that!

Polls work the same way.   We live and die by poll data, it seems.  But in the last few Presidential elections, it seems the polls have been off.   Close elections were called as landslides by the polls.   What the polls called as close elections, were landslides.   And different polls often contradicted one another.

Yet some people - half-seriously - think we should dispense with elections and just use poll data to elect politicians. 

The problem with poll data is the same as the problem with surveys - the raw data.   There is a huge filtering effect in polling in that only people who want to be polled will answer polls.   Also, only people who can be reached will answer polls.

In Truman's era, it was said that a lot of rural voters, who didn't have telephones, voted for Truman.   Since the polls were based on telephone calls, a huge portion of his support was under-reported.   So the pollsters and political hacks thought for sure that Truman's time as President was up.   They were very wrong.
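
To see how much damage this kind of filtering can do, here is a minimal sketch in Python.  Every number in it is made up for illustration (the group names, the share of voters, who has a phone, and who they support are all assumptions, not historical data).  It simply compares what all voters think with what a phone-only poll would hear.

```python
# Hypothetical electorate: every figure below is invented for illustration.
# group: (share of all voters, share reachable by phone, support for candidate A)
groups = {
    "urban, has phone": (0.60, 0.90, 0.45),
    "rural, no phone": (0.40, 0.10, 0.65),
}


def true_support(groups):
    """Support for candidate A among ALL voters."""
    return sum(share * support for share, _, support in groups.values())


def polled_support(groups):
    """Support for candidate A among only the voters a phone poll can reach."""
    reached = sum(share * reach for share, reach, _ in groups.values())
    favorable = sum(
        share * reach * support for share, reach, support in groups.values()
    )
    return favorable / reached


print(f"Actual support: {true_support(groups):.1%}")
print(f"Phone poll says: {polled_support(groups):.1%}")
```

With these made-up numbers, the candidate actually has 53.0% support, but the phone poll reports about 46.4%.  Nobody lied and nobody botched the math.  The poll simply never heard from most of the people without telephones.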

Today we have a different problem.  Well, two different problems.   First, people rely on cell phones.   And most people use caller ID to screen their calls.   So if you call people to poll them, they may let the call "bounce" to voice-mail and you don't get their opinion.   You particularly miss young people this way.  Only old folks have "land lines" anymore, and are willing to answer calls from strangers and think that their opinion matters.

The second problem is that we are over-saturated with surveys and polls, and many of these today are just scams used as come-ons to steal money from people.   So when you call up with a "survey" most people - particularly young people - will assume you are trying to scam them.   We have reached survey saturation.

Paper surveys have the same problem.   You can't reach young people who are on the move, and younger people are not going to fill out a 10-page survey as they are too busy.    Even online surveys go on for a frustratingly long number of pages - usually to extract demographic data from you - and many folks just get tired and quit - or just refuse to answer surveys.

The problem with all surveys, of course is that we lie a lot - to ourselves and others.   For example, based on self-reported survey data, 70% of all people claim they do not carry a balance on their credit cards.   But the credit card companies, who can tell you the real data with a push of a button on a computer, report that 70% actually do.   We lie to ourselves - and survey-takers.   Someone carrying a balance will say they don't because they ordinarily don't or so they think, not realizing they have been paying revolving interest for over a year.   We lie about our income.  We lie about our gas mileage.  We lie about our weight.  Men lie about their penis size.  We lie like rugs.  Self-reported data is the most suspect of all, rendering most surveys almost entirely worthless.

There is a lot to parse in this simple advertising phrase.

Credentialism is another aspect of surveys and data that is highly suspect.   The above ad for Trident was beaten into our heads on the television from the 1960's onward.   But almost every word of this advertising phrase is suspect and can be picked apart.   First of all is the phrase "dentists surveyed" - how were these dentists selected?  Randomly?  The ones who responded to a mailed survey?  Telephone calls?  What?   What questions were asked?  Was it "do you recommend sugarless gum for patients?" or was it "do you recommend Trident sugarless gum for patients?"   There is a difference.

The second part of the phrase is telling, too - "for their patients who chew gum."   In other words, these dentists may be saying "oh, don't chew gum, that stuff is awful for your teeth!  But if you insist on doing it, at least use a sugarless gum!"  - this is hardly a ringing endorsement.

Then there is the conclusion - tacked onto the logo above - that "sugarless gum is good for your teeth."  I am not sure this is a valid conclusion based on the statement below it.   The 4 out of 5 dentists are only saying if you have to chew gum, use a sugarless one.  Are they saying you should chew gum as it is good for your teeth?  I don't think so.

But of course, the under-the-radar thing here is credentialism: dentists are saying this, and as we all know, dentists are geniuses.   Actually, not.  There are a lot of idiot dentists and even doctors out there (and certainly lawyers).   We have a neurosurgeon running for President who thinks the pyramids in Egypt were built by Jews to store grain.  What little respect I had for doctors (and Presidential candidates) is out the window at this point.

When I was a kid, I used to think that Doctors and Lawyers were really smart people - and some of them indeed are.   But after becoming a lawyer, I realized how easy it is to get a law degree and pass the bar exam (no really).  Moreover, since I get the journals of both the Virginia and Georgia bars, I read every month the staggering number of people who are disciplined or disbarred for utterly stupid things.  There are a whole lot of idiots out there with law degrees, and medical ones.

Yet, in any debate, a losing side will resort to credentialism to prop up a shitty argument.  If their argument doesn't make sense on its face they will come up with a list of important people who they say agree with it.   They don't answer the counter-argument or say why their argument is correct.  They merely prop it up with names of important people, as if this were enough.

Related to this is the "awards" phenomenon.   Suze Orman lists herself as an "internationally acclaimed" financial adviser, which is an interesting credential.    Someone from Canada once said something nice about me, so I guess I am "Internationally Acclaimed" as well.

A lot of awards are based on surveys - such as the J.D. Power awards.  But since these awards are paid for by the companies they are awarded to, you have to be skeptical.  Often a new category will be created just so that some manufacturer will win.  And often these end up sounding pretty dumb - "Award for best mid-sized American-made SUV from a company with the initials G and M in its name" - that sort of thing.

Worse yet are awards or accolades given on entirely subjective evaluations.   Car magazines, such as Road and Track and Car and Driver (the same company, basically) will do a "mini-van shootout!" article and then award the first prize to the company that bought the most advertising pages.   They do this by weighting the "analysis" with subjective criteria.  "The Honda Odyssey topped in almost every category, and would have been the winner, but since 70% of our evaluation is based on ashtray location, the Chevy Astro van wins again!"

Motor Trend, of course, is the worst of these, as its "Car of the Year" award seems, since 1970, either to have picked a total turd of a car or to have acted as a curse on the car company in question (with Volkswagen being cursed this year, after winning last year).  Yet people rely on this data (and that of Consumer Reports, which is also famously inaccurate - calling the Tesla the "best car ever made" one year, and "unacceptable" the next) to make serious financial car-buying decisions.  Bad data leads to bad choices.

Even "hard" data can be soft, in many instances.  For example, data on crime rates, which would seem to be pretty easy to acquire, based on crime reports taken by the Police, can be inaccurate as different police organizations classify crimes in different manners.   Worse yet, some police departments are under political pressure to report fewer crimes, which makes it look like crime rates are going down when they may be going up in that jurisdiction.   The FBI has tried to quantify "hate crimes" and "mass shootings" but has had trouble, as not every police department keeps similar records or quantifies these things in the same manner.

Statistics, even if they are based on "good" data, can be misleading, if the conclusions drawn are inaccurate or simply not justified.   For example, it is widely touted that the USA incarcerates more people than any other Western nation - indeed perhaps more than any nation, other than China.  And while "people of color" make up about 30 percent of the United States' population, they account for 60 percent of those imprisoned. 

Now, if you are a racist, you may use this data to argue that non-whites are morally inferior and more likely to commit crimes.   However, that is a conclusory statement based on correlation, not causation.

If you are a Lefty, you might use this same statistic to say that the USA is fundamentally flawed and that the justice system is "racist".   There may be a nugget of truth to that, when you consider the same judge in Texas who sentenced the "affluenza" teen to 10 years of probation for killing four people, sentenced a black teen to 10 years in jail for a very similar crime.

On the other hand, you might note that poverty rates among blacks are higher, as a percentage, than among whites, and that impoverished people tend to commit crimes more often, get caught more often, get prosecuted more often, get convicted more often, and get longer jail sentences.

We have made a lot of progress in the "war on poverty" it would seem.

But you might also note there are more white "poor" people in the USA, simply because whites outnumber non-whites by more than 2 to 1.  So if the poverty argument is to make any sense, there would be more white people in jail than blacks.  Also, note from the chart above, that poverty rates have declined among minorities, since the 1960's.   I am also not aware whether black incarceration rates have gone up or down accordingly in recent years.

This chart shows an interesting trend.   Our "get tough on crime" and "war on drugs" has increased our prison population considerably.

So what is the correct answer?   I think there is more than one.   I don't think non-whites are morally inferior to whites.  However, in many minority communities, a culture of criminality - celebrated in movies, music, and our culture - has been created in recent decades.   While there are Black gangs and Hispanic gangs (e.g., Crips and Bloods) there are few nationally-known white gangs of similar stature.

And the reason for this is that white poverty tends to be rural poverty, and thus the gang mentality and the culture of criminality that is prevalent in cities tends not to take hold in the country.   Perhaps there is a survey that validates this.   Perhaps no one has bothered to ask.    Do poor white kids who grow up in the inner city end up in jail more often than poor white kids who live in a trailer in the country and go to the evangelical church?   I suspect so.

(That is one reason I say, as a personal choice, to get the fuck out of the ghetto any way you can.  If you live in such an impoverished, inner-city neighborhood where criminality and criminals are celebrated, it is only a matter of time before you are a victim of a criminal, you become a criminal, or one of your children becomes a criminal.   Why people stay in shitty neighborhoods is beyond me.  And no, it is often not a matter of what you can "afford" but choosing to have consumer goods over a good address.)

So that is one aspect.   Racism is indeed another, but maybe not as strong as the first.    Black offenders are given longer sentences and prosecuted more often than white ones.  While often this is because the white offender has a better lawyer, it may also be true that there is an institutional bias.   But I don't think that explains it entirely, as the prosecution rate of minorities in cities where the arresting officer, the prosecutor, and even the judge are likely to be minorities as well, seems to be just as high.

And the third answer is, as shown in the chart above, the increased incarceration rates in the US.   The "three strikes and you're out" law, which was part of our "get tough on crime" program, put people into jail who in the past would have gone through the revolving door of prison again and again.   You may be too young to remember this, but in the late 1970's crime rates soared (again, see the chart above) and places like New York City were basically unsafe.  Times Square was a place where you got mugged.  The New York Subway was like a graffiti-covered urinal - and a good place to get mugged as well.

And another part of this was the "war on drugs" which was ramped up as the crack epidemic (and now the meth epidemic) swept the nation.   And yes, a lot of drugs were dealt in the inner cities (and drug dealing was easier to detect and prosecute in the inner cities).   It will be interesting to see how legalization of marijuana will affect these incarceration rates, over time.

But the main thing is, what conclusion you draw from the data depends a lot upon your preconceived notions.   If you are a racist, you view this as proof of your beliefs.  If you are a leftist, you view this as proof that the USA is a horrible racist place.

If you are a realist, on the other hand, maybe you see the conclusion as more complex and nuanced and based on a number of inter-related and unrelated factors.  Conclusions and data are two different things, and we can (and do) routinely draw the wrong conclusions from data, because we have a preconceived notion that we want to validate.   We all do this.

And speaking of incarceration rates, while we lead the world in locking people up (other than China, apparently), our crime rates have been dropping for decades.   Are the two related?   Are crime rates dropping because we are locking up the bad guys?  Or is it because we are not counting crimes as much as we used to (for political reasons)?  It is an interesting question, and I don't have a real answer, other than that whether or not someone should be released from jail should depend on what crime they were convicted of, not some overall statistics.  People are not statistics, and should not be treated as such.

Now, the online world has - or had - great promise for pollsters, marketers, and survey-takers.   Since the actual sites you visit and what you click on can be counted precisely, one would think that online data would be so much more reliable than self-reported survey data.  But a funny thing happens - online data can be skewed as well - as much as, if not more than, traditional data.

Consider online surveys.  Almost every site you visit wants you to take a survey to tell them how you feel about the site.   Some of these, as I have noted, are a come-on to sell you magazine subscriptions or other services.   Others are worthless as they rely on the person wanting to complete the survey.   And there are two, perhaps three reasons why people complete surveys online.

First, the smallest group, are the people who like the site and want to compliment the company for putting up a good site.   Since most people expect a website to work properly this is a very small group of people.   Very few "satisfied customers" will complete the survey.

Second, are the lonely-hearts.  This is also a small group - and shrinking rapidly as our country ages and people naive enough to answer surveys die off.   These are the folks dumb enough to think their opinion really matters, so they dutifully fill out surveys, thinking this is a sacred right, like voting.

Third is the overwhelmingly largest group - the pissed-off people.  These are the folks who got fucked by the company in question and want to take the survey to give it all 1-star or less.   They are mad, and the survey acts as a punching-bag for their aggression.

So the voluntary survey ends up being a huge filter, and what you end up with is a few positives and mostly negatives.  I am not sure this data is helpful to anyone.
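
Here is a minimal sketch in Python of how that filter works.  All of the figures are invented for illustration (the share of customers at each star level and the odds that each group bothers to answer are assumptions, not data from any real site).  It compares the average rating customers would give if everyone answered with the average among those who actually respond.

```python
# Hypothetical customer base: every figure below is invented for illustration.
# stars: (share of all customers, probability they bother to answer the survey)
customers = {
    5: (0.30, 0.02),  # happy customers rarely respond
    4: (0.40, 0.01),
    3: (0.20, 0.01),
    2: (0.05, 0.10),
    1: (0.05, 0.40),  # pissed-off customers respond in droves
}


def average_rating(weights):
    """Weighted average star rating; weights maps stars -> relative weight."""
    total = sum(weights.values())
    return sum(stars * weight for stars, weight in weights.items()) / total


# What the ratings would look like if every customer answered.
everyone = {stars: share for stars, (share, _) in customers.items()}

# What the survey actually sees: each group weighted by its response rate.
respondents = {stars: share * prob for stars, (share, prob) in customers.items()}

print(f"Average if everyone answered: {average_rating(everyone):.2f}")
print(f"Average among respondents: {average_rating(respondents):.2f}")
```

With these made-up numbers, the customer base as a whole would rate the company about 3.85 stars, but the survey comes back at about 2.22, and more than half the responses are one-star.  The company did not get worse.  The survey just measured who was angry enough to fill it out.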

Online reviews work the same way, whether it is TripAdvisor, Yelp!, Amazon, eBay, or even a review on Netflix.  Everyone, it seems, gives five stars or one star (or the new "5/7" rating - it's a meme).  And one reason people do this is to skew the rankings.

Restaurants will hire shills, or ask friends and family to put up positive reviews of a place (and they are not hard to spot, either) with all five stars.   People who maybe had a three-star experience (which used to mean "acceptable") will give one-star, just to bring the ratings down a notch.

Nowhere is this more true than on Netflix, where it seems that everyone gives one or five stars.   Due to "star creep" we have reached a point where anything less than a perfect score is deemed suspect.  Car dealers will beg you to give five stars on the survey the company (or J.D. Power) sends you, as they need as many positives as possible to counteract the pissed-off customers they've accumulated.  The star system is wholly broken and doesn't work.

The same is true for eBay.   If someone has a feedback score of less than 99%, you may be courting trouble.   People live and die for perfect feedback.   You can't afford to have even one pissed-off customer anymore, it seems, unless you are an airline, in which case you just piss everybody off.  After all, what are they going to do?  Go to another airline?  Ha ha.

Click-counting is also problematic and suspect in the age of click-bait.   Advertisers pay websites based on the number of people who view a page with their ad on it, and pay even more if customers "click through" to their own website.   To generate income, website owners have resorted to "click-bait" which usually means controversial titles to what are mundane articles.   Usually these are the "You'll never believe...." type or the "10 best" or "10 worst" kind of "list" articles.   And we all click on them, too!

But whether advertisers are really getting their money's worth is a good question.   I think many people are immune to sidebar ads these days.  And with pop-up blockers and ad blockers, well, a lot of ads are just not getting seen.   

So, to get around that, "sponsored content" was created.   These are ads disguised as content.   And it gets worse.   Now they put up postings or respond in comment sections with innocent-sounding comments that just happen to mention a product by name.   Almost all of these are ads.   But I am getting away from the main point of this posting.

We rely on data to fill in our worldview - our reality.   And if the data is bad, well, our worldview is as unreal as some simulation in The Matrix.  We end up living in an alternative reality.

Take this young couple in San Bernardino - or young Muslims in the UK who go off to join ISIS.   They are living in prosperous Western countries, often have good jobs (or access to social services) and are far better off than their friends and relatives "back home".   In the case of the San Bernardino couple, he had a good government job that had great benefits and a guaranteed pension.   A nice house, a huge SUV, and apparently enough disposable income to buy a cache of weapons and ammunition.   Oh, and a baby as well.   Everything you need in this world to be happy, they had.

But thanks to online radicalization (the mother of all bad data) they decided they were unhappy.  And they also decided it would be a good idea for Islam to kill a bunch of random people.   Of course, whatever their goals were, it backfired.  These sorts of incidents only serve to convince people that Muslims in general are dangerous.

What is the reality they missed?  Well, for starters, the real issue isn't Islam, but Oil, or more precisely, power and who is wielding it.   Religion has been used, over the ages, to control people and as a pretext to get people to act against their own self-interests.  The whole Sunni/Shiite rift (or Catholic/Protestant rift) is less about religious doctrine than who is in charge.   Yet all four religions have managed to convince people that there are deep doctrinal divisions that are worth killing over.

Of course, it helps if the people you are trying to manipulate are a little crazy to start with (or a whole lot crazy).   And what is "crazy" anyway?  It is just an inability to differentiate reality from fantasy.  And it usually sets in around adolescence, when young people fail to make the transition from child to adult, because reality seems too scary - so they retreat into a fantasy world.   And of course, there are a host of people out there, willing to sell you an "alternative reality" on a moment's notice.

Or take the people obsessed by Donald Trump.  Doesn't matter if they are for or against him, they are obsessed.   And they harp that he's leading in the polls and could be the next President.

But as a Next President, the Donald has done remarkably little in the way of trying to get elected.  No real campaign.  No fundraising.  No "get out the vote" drives.  No real campaign organizations in either Iowa or New Hampshire.   In Iowa the other day, he told people to "be sure to vote" - failing to realize it is a caucus state.   By March of 2016, his campaign will wither only because it never existed in the first place.   Polls are wrong - bad data.   And no one seems to even want to address this.

So, how do you accurately perceive reality in a world that looks like a funhouse mirror?  It ain't easy, I'll tell you that.

The first step is to read and read in-depth.   And by this, I don't mean to immerse yourself in a fantasy world of books, particularly odious conspiracy theory types or political tomes of the right or left.  You are not going to perceive reality any better by reading Mother Jones or the National Review.   What I mean is to read more than just headlines or click-bait titles, or whatever it is that is in the "feed" on Facebook.   Read source documents in their entirety.  Often what a document actually says is entirely the opposite of what someone claims it means.

Second, be skeptical.   If someone says "X is true" think about whether this proposition is indeed true and what possible alternatives make more sense.  What underlying assumptions are being made to support the proposition?   In any argument, challenge the premise.   And that is akin to looking at the raw data and how it was acquired to see if it really supports the proposition, or in fact, is a shaky foundation.

Third, use your experiences.  Painful experiences are the most profound.   If you get ripped-off, it is a painful lesson.  But if you learn from it then it is a lesson well-learned.   Sadly, a lot of people get ripped-off and then go back to the well again and again.   Ads for MLM schemes say, "Ripped off by an MLM scheme?  Try this one, it actually works!"    Or how about, "Burned by a PayDay Loan?  Call Consumer Financial!  We have a PayDay Loan to pay off your PayDay Loan!"

People who don't even learn from their own personal experiences are destined for trouble.   And there may be a lot of psychological defense mechanisms that cause this.   First, we don't want to admit to error.  We didn't make a mistake by getting a PayDay Loan!  It was the fault of the PayDay Loan place!   Second, we like to "bury our mistakes" in our minds, rather than admit to them.  So when we do something dumb, we put it out of our minds, rather than learn from it.

Personal experience is the strongest form of reinforcement, as I noted in my Three kinds of learning.   But you can also learn from the mistakes of others - provided they are willing to own up to them - and also project - based on your experiences and common sense (logic) - how some unfamiliar scenario might work out, even if you have no hard data to start with.

What is clear is that there is a lot of bad data out there and it is very hard to perceive reality as it is.  If it was easy to perceive reality, then a lot more people would be financially successful, there would be less strife and war in the world, and the world would be a better place.   On the other hand, since so few people perceive reality as it is, or perceive it very well, even if you do a half-assed job of getting it right, you will be head and shoulders above the majority of people on the planet.

It ain't easy.  Then again, it ain't impossible, either.

P.S. - and bear in mind that you are not a statistic.   Regardless of what your friends or family or socioeconomic group is doing, you don't have to make the same choices.