Tackling Big Data: What it Takes to Raise an Algorithm

As evident in daily news headlines, disinformation—whether home-grown or influenced by foreign actors—can undermine democratic institutions, destabilize societies, and challenge science. One of the most promising ways to detect disinformation is machine learning, a form of artificial intelligence (AI).

Machine learning has the computational power to automatically ingest and review terabytes of data in real time and surface millions of conversations on major social media platforms and online media outlets. Dynamic data feeds and complex social network graphs allow analysts to research and predict who is spreading disinformation to whom, where, and how at an unprecedented scale.

Yet, for the average program manager working on governance and stabilization, AI is still a foreign language. Technology is complex. Most IT firms talk in scrum methodology, work in sprints, write in code, manage by output, and require precision. Most development programs are non-linear, operate in constantly fluctuating environments, and have more idealistic and longer-term goals like combating corruption, reducing climate change, creating jobs, or securing peace.

Nonetheless, these two cultures need to learn each other’s language if activities to quickly detect, address, and prevent disinformation are to succeed.

Learning the Language

To test this assumption, we brought together a few program managers with a small team of student data scientists and senior engineers to build a machine learning system ourselves. While we have excellent partners that provide this service, our goal was to learn the language of AI directly so we could better engage on more complex queries in non-English contexts.

We developed, trained, tested, and refined a data scraper and machine learning algorithm to collect and classify the millions of public social media posts discussing the COVID-19 vaccines before their approval. The viral spread of COVID-19 disinformation provided an ample data set on how online information can be weaponized to influence attitudes and behavior. We then launched a multitiered database to provide live data to a custom website that showed the ever-changing facets of the vaccine conversation, from trending topics to top media sources and influencers. We analyzed existing tools that classified information based on sentiment and disinformation. We became investigative journalists, hackers, and fact checkers. We created an information index that benchmarked information legitimacy against a few non-partisan academic organizations working on the vaccine.

In short, we developed an in-house software application that uses AI to capture millions of data points around public social media conversations—think big data—and machine learning to predict distortion. We called it SOLIS, short for social listening.
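For program managers curious what the machine learning piece looks like in practice, here is a minimal sketch of the kind of text classifier at the core of such a system. The sample posts, labels, and scikit-learn model choice are illustrative assumptions, not the actual SOLIS implementation.

```python
# Minimal sketch of a text classifier for flagging likely disinformation.
# Illustrative only: the sample data, labels, and model choice are assumptions,
# not the SOLIS codebase.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labeled training examples (in practice, thousands of posts reviewed by analysts).
posts = [
    "Phase 3 trial results show the vaccine is 95% effective.",
    "The vaccine alters your DNA and was rushed without any testing.",
    "Regulators published the full safety data for public review.",
    "Doctors are hiding the real side effects from the public.",
]
labels = ["legitimate", "disinformation", "legitimate", "disinformation"]

# TF-IDF text features feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(posts, labels)

# Score a new, unseen post.
print(model.predict(["Secret documents prove the trials were faked."]))
```

The real work is less in the model than in the labeled examples: analysts, fact checkers, and linguists supply the judgments the algorithm learns to imitate.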

What We Found

When we applied SOLIS, it showed the volume of disinformation activity against key news milestones, such as the announcement of trial results or political events. With this data, we dove deeper into what fomented disinformation: What content was driving the spikes? Who were the influencers? Where on the legitimacy spectrum was their content? What was their scope of influence? What were key groups talking about as additional news unfolded?
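As a rough illustration of that first step, the sketch below aggregates classified posts by day and checks the volume on milestone dates using pandas. The column names, sample dates, and counts are assumptions for demonstration, not SOLIS output.

```python
# Illustrative sketch (not the SOLIS codebase): comparing daily disinformation
# volume against key news milestones. Data and dates are placeholders.
import pandas as pd

# Posts already classified by the model; one row per post.
posts = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-11-09", "2020-11-09", "2020-11-10", "2020-12-11"]),
    "label": ["disinformation", "legitimate", "disinformation", "disinformation"],
})

# Daily count of posts flagged as disinformation.
daily = (posts[posts["label"] == "disinformation"]
         .set_index("timestamp")
         .resample("D")
         .size())

# Milestone dates to compare spikes against (illustrative).
milestones = {"2020-11-09": "Trial results announced",
              "2020-12-11": "Emergency use authorization"}

for date, event in milestones.items():
    count = daily.get(pd.Timestamp(date), 0)
    print(f"{date}: {event} - {count} flagged posts")
```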

We found some striking patterns. For one, there was a spike in legitimate tweets based on official statements from government sources, when we had anticipated an uptick in disinformation. We could see how, for example, the company Pfizer “pre-bunked” information, providing evidence and examples that got ahead of disinformation. Later vaccine rollouts, however, showed weaker results as the disinformation ecosystem became better at pushing back on facts. Not surprisingly, politics were at play, undermining vaccine science, promoting herd immunity as an alternative to injections, discrediting trial results, amplifying side effects, and connecting doubters with other conspiracy theorists.

We learned that some platforms were greater purveyors of disinformation than others, and more vitriolic. YouTube had the most dangerous levels of disinformation, followed by Twitter, then Reddit. However, Reddit communities hosted the most informed debate and engaged directly with scientific experts. Knowing this can lend greater focus to positive communications campaigns.

Lessons Learned

Despite the value of these insights, the process of developing SOLIS was not a straightforward one. We learned quickly that an algorithm—a set of precise instructions for performing a computation or solving a problem—is only as smart as the humans who program it. It needs lots of nurturing before it can walk on its own.

As such, we had to find a common language to achieve our combined AI and human objectives. Some additional lessons learned include:

  1. Get the data query right. Quality data came only after we refined our search to capture slang terms for the virus and to exclude terms that returned general content about the virus not specific to vaccines. For any program manager using AI, review and refine the query with a few sample tests to avoid headaches in the long run (see the sketch after this list).
  2. Lead with social science. Technology companies can only do so much in a vacuum. They require the right mix of people—social scientists, linguists, and development practitioners—with a shared understanding of AI/machine learning, social media analysis, and the entire ecosystem in which digital misinformation and disinformation operate.
  3. Invest in the infrastructure. To sustain large data sets, invest in database design and cloud storage capacity; program users often underestimate the needs in between.
  4. Use data to tell a story. Standard graphs alone do not turn data into meaningful insight. To understand and facilitate transformation around social narratives, know your audience, understand the user experience, and use data to tell a story rather than simply display it.
  5. Find the outliers. Seek out new influencers: people whom audiences already trust have a strong impact on how information spreads. Identifying popular, non-political players as behavior change advocates or topic influencers can continually impact millions of followers.
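To make the first lesson concrete, here is a minimal sketch of what refining a query can look like: a broad search on virus terms versus a refined search that requires vaccine-specific terms, including slang. The term lists and sample posts are illustrative assumptions, not the production SOLIS query.

```python
# Illustrative sketch of refining a social listening query. The term lists below
# are assumptions for demonstration, not the production SOLIS search.
import re

# Vaccine-specific terms, including slang, that the refined query requires.
VACCINE_TERMS = ["vaccine", "vaccination", "vax", "jab", "pfizer", "moderna"]
# General virus terms the broad query also matched, which pulled in off-topic posts.
VIRUS_TERMS = ["covid", "coronavirus", "rona"]

def broad_query(post: str) -> bool:
    """First attempt: any mention of the virus or vaccine -- too much general chatter."""
    text = post.lower()
    return any(t in text for t in VIRUS_TERMS + VACCINE_TERMS)

def refined_query(post: str) -> bool:
    """Refined: keep a post only if it contains a vaccine-specific term."""
    text = post.lower()
    return any(re.search(rf"\b{t}\b", text) for t in VACCINE_TERMS)

posts = [
    "Just got my first jab at the pharmacy!",
    "Another lockdown announced because of rising covid cases.",
    "Pfizer released new trial data today.",
]
print([p for p in posts if broad_query(p)])    # all three posts match
print([p for p in posts if refined_query(p)])  # only the vaccine-related posts
```

A few sample runs like this, reviewed with the analysts who will use the data, catch most query problems before millions of irrelevant posts have been collected.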

Start Small, Think Big, Resource Diversely

When digital and development experts come together to speak a common language, they can foresee emerging conflicts, identify populations vulnerable to disinformation, and predict the next info-demic. They can partner with institutions and communities to resist disinformation and diminish its influence over time.


Kelly Garman and Tamara Babiuk are part of the in-house and consultant team that built SOLIS. Kelly develops customer relationship management systems at Dexis. Tamara leads disinformation programming at Dexis.

Photo by Artur Widak / NurPhoto via AFP