The ‘Fake News’ problem and its NLP ‘solutions’

One of the saddest realities of the internet age is that the early optimistic maxim of technologists, that “the world wide web will bring unlimited access to knowledge and learning to all”, turned out to be horribly naive. The internet is really pretty great, but the ‘consumer internet’, serving the public such things as Facebook, Twitter, Reddit, Instagram, Tumblr, etc., is no better at producing an educated people than the radio stations, magazines, newspapers, and television channels that came before it. ‘Fake News’ is possibly the malignant apotheosis of the internet’s hyper-share-ability and openness. It came to a head in the United States’ recent disaster election period. Supposedly paedophilic pizza shops were shot up, Barack Obama was repeatedly accused of being foreign-born, and a horrid man was elected to power by millions of dupes. Given that this problem was predominantly created by tech companies full of programmers, programmers found themselves feverish to take it on, fighting bad algorithms with good algorithms.

The only thing that stops a bad algorithm with a gun, is a good algorithm with a gun

The focus on bad algorithms created a myopia amongst the various people popping up with supposed solutions on Reddit, Hacker News, and GitHub. TensorFlow was, and still is, all the rage, so a common approach was to grab the nearest Natural Language Processing tool, usually a text classifier, throw a bunch of news articles at it, and offer the resulting model as if it did anything to fix the problem at all.

Here’s one approach that takes a bunch of New York Times and The Guardian articles as “real news” (Chomsky raises an eyebrow) and a Kaggle ‘fake news’ dataset as the ‘fake’ stuff, and feeds it all into tf–idf. All you’ve got here for your trouble is a model that associates “Trump”, “#MAGA”, and “MUSLIMS!” with fakeness, and stories containing phrases like “requisite parsimony” and “For as little as $1, you can support The Guardian” with truth.
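To make that failure mode concrete, here is a minimal sketch of the idea in plain Python. Everything in it is an assumption for illustration: the four-sentence corpus is made up, and the nearest-centroid scoring is a stand-in for whatever classifier a given project actually trained on top of its tf–idf features. It only shows that this kind of model keys on vocabulary, not on whether a claim is true.

```python
import math
from collections import Counter

# Toy, made-up corpus (an assumption for illustration, not the real datasets).
DOCS = [
    ("real", "the committee reviewed the budget report with requisite parsimony"),
    ("real", "support the guardian for as little as one dollar said the report"),
    ("fake", "trump maga rally shocking truth the media hides"),
    ("fake", "shocking maga secret the elites hide from trump voters"),
]


def tokenize(text):
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(text) for _, text in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def class_centroids(docs, vectors):
    """Average the tf-idf vectors of the documents sharing each label."""
    sums, sizes = {}, Counter()
    for (label, _), vec in zip(docs, vectors):
        sizes[label] += 1
        bucket = sums.setdefault(label, Counter())
        for term, weight in vec.items():
            bucket[term] += weight
    return {
        label: {term: weight / sizes[label] for term, weight in bucket.items()}
        for label, bucket in sums.items()
    }


def classify(text, centroids):
    """Pick the label whose centroid best overlaps the text's words."""
    counts = Counter(tokenize(text))
    scores = {
        label: sum(centroid.get(term, 0.0) * n for term, n in counts.items())
        for label, centroid in centroids.items()
    }
    return max(scores, key=scores.get)


CENTROIDS = class_centroids(DOCS, tfidf_vectors(DOCS))

# A mundane, plausible sentence gets flagged because of its vocabulary...
print(classify("trump spoke at a maga rally", CENTROIDS))
# ...while an absurd claim in broadsheet vocabulary sails through.
print(classify("the budget report says the moon is made of cheese", CENTROIDS))
```

The first sentence is labelled ‘fake’ and the second ‘real’, purely because of which words they share with each pile of training text. A bigger corpus and a fancier classifier change the scale, not the nature, of what is being learned.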

Here’s another one that does basically the same thing. This approach to determining the truth or falsity of online content is folly. Basically any language modelling solution that attempts to say whether some single statement is true or false, let alone a whole article, is a non-starter. We can do work in the areas of “stance detection” and natural language inference, but these things are nowhere near a solution to ‘fake news’. When I worked at Zendesk on their Virtual Customer Assistant, Answer Bot, we had such a hard time getting a model to reliably figure out which article in a group would help a customer cancel their shipping order that I couldn’t imagine placing our trust in that same technology to literally arbitrate on what is true.

What if you could actually get it to work though?

What should be a further knock to a coder’s enthusiasm to break out TensorFlow and a Kaggle dataset is the research suggesting that identifying fake news and tagging it might not actually work. Once fake news is ‘in the system’, identifying it as fake doesn’t do nearly as much as we would like to reduce its believability and share-ability. To the extent that people don’t care about truth and reality, any mythical fake-news detector will be powerless to correct their errors.

Now it can be pointed out that a newsfeed could go a lot further and, rather than merely tagging, actually remove fake news or prevent its posting. This would create a much healthier information ecosystem on that platform, but it would really only have dodged the core problem rather than solved it. The core problem is that even the best humans can only track reality in some places most of the time. The best are ‘smart in spots’, as Warren Buffett puts it. This is not an ideal situation, as no single person (even a highly knowledgeable and intelligent one) can be relied on to discern what is true and what is false in general; they may only do so adequately in specific areas. That’s the plight of the brilliant. As for the worst or the merely average, on nearly every issue concerning humans and society they will find themselves lost on the way to the truth and, thanks to the Dunning–Kruger effect, unaware of how lost they are.

What actually might work

What’s basically happened in the last 10 years is that we’ve replaced the traditional media institutions with new ones, and these new ones have a significantly different information dissemination architecture. When our media system was supply-driven rather than demand-driven, information and content were passed down to the public via the ‘4th Estate’. These journalists and their tycoon owners had a set of ‘journalistic standards’, and though these were of course regularly unsatisfactory, at the end of the day you simply couldn’t get a job at The New York Times by writing stories about how Hillary Clinton was a secret child molester who operated out of a Washington pizza shop basement. If the Facebook wall is to become the dominant news-media space instead of television, then we’ll just have to replace the information dissemination architecture we’ve lost and maybe, if we’re smart enough, with a better one.

I’d agree with Frederic Filloux’s proposal that fixing online news is mostly about managing reputation. The internet opened the floodgates, allowing anyone and everyone to participate. While that’s lovely and democratic, it’s not how we’ve wanted our information systems to actually function. Where it’s too hard to attack the ‘is it true or false’ question directly (basically always), we fall back to numerous proxy signals for information quality. Take what Barack Obama says here about how it is bad that on Facebook the words of a scientist and a guy in a basement look the same, and are subject to the same sharing rules. It’s this kind of thing that we have as part of our current information dissemination system, and it’s pretty stupid. In Filloux’s piece above, he details a reputation management system that is broadly the same as one that I would recommend, though I’d go a bit further. It is clear now that it’s not just the news media sites that must have their reputation assessed, but also the users of platforms like Facebook, Twitter, and Reddit. Users drive content sharing, so we should naturally care who is sharing and not just what is being shared. Just as importantly, an increasingly common pattern on social media is that users won’t even read articles before jumping straight into the comments section. Thus everything they’re consuming, which is still of course news related, is generated by other users. We should also start developing systems to manage user reputation, certainly beyond the capabilities of, say, Reddit’s karma system.
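As a rough illustration of caring about who is sharing and not just what is being shared, here is a toy sketch. It is purely hypothetical: the function name, the 0-to-1 reputation scores, and the multiply-by-the-average rule are my own assumptions, not Filloux’s proposal or any platform’s actual ranking. How those reputation scores would be earned is the genuinely hard part and is not addressed here.

```python
from statistics import mean


def share_weight(source_reputation, sharer_reputations):
    """Hypothetical feed weight for one shared item.

    source_reputation: assumed 0..1 score for the publishing site.
    sharer_reputations: assumed 0..1 scores for the users who shared it.
    """
    if not sharer_reputations:
        return 0.0  # nobody has shared it, so it gets no weight in the feed
    return source_reputation * mean(sharer_reputations)


# Made-up numbers: a reputable outlet shared by reputable users...
scientist_item = share_weight(0.9, [0.8, 0.6])
# ...versus an unknown site pushed by low-reputation accounts.
basement_item = share_weight(0.2, [0.1, 0.3])

print(scientist_item > basement_item)
```

Under a rule like this the scientist and the guy in the basement no longer look the same to the feed, which is the whole point. The arithmetic is trivial; the institutional work of assigning reputation is not.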

Though I’ve criticised NLP-based solutions here, NLP is a fantastic tool and not the source of the trouble. The problem overall, I think, is a combination of AI-hype-based enthusiasm and a lack of respect for the problem space. The latter is a failing that even Deep Learning researchers at top AI labs have shown, and been lambasted for by Yoav Goldberg, and I think it will persist as long as software keeps entering every area of human life while the programmers behind that software remain focused on bits and bytes. A journalism school graduate would not be so myopic in their approach to the ‘fake news’ problem, just as a linguist wouldn’t so easily fall foul of Yoav Goldberg.
