A Computer Says What? Experimenting with Synthetic Text Generation


Recently, a group of machine learning researchers trained and documented a state-of-the-art natural language generation system that performed so well on many language modeling benchmarks that it was called the ‘deepfakes’ of text generation.

If you aren't familiar, deepfakes is a deep learning-based approach to creating believable photos and videos using an amalgamation of media, and it has caused quite a bit of soul-searching in the news industry over the past two years. OpenAI’s GPT-2 technology opened up a new debate related to text: When is Technology Too Dangerous to Release to the Public? OpenAI explained its technology and its reasoning and, fortunately for AI and machine learning researchers, provided a smaller trained model for others to experiment with, along with a technical paper describing the approach.

Their model uses over a billion parameters trained on over 8 million Web pages. The objective is simply to use that model to predict what the next word in a sentence will be, given all of the previous words within some corpus of text. According to them, the project uses 10x more parameters and 10x more data than previous language models. For tasks like reading comprehension, machine translation, question answering, and summarization, the results are a cut above anything that had been trained to date. These sorts of tasks typically require task-specific training, but they were able to achieve state-of-the-art results with only generalized training on raw, unstructured text.
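To make the next-word objective concrete, here is a minimal sketch of the same idea at toy scale. It is not how GPT-2 works internally: it conditions on only the single previous word (a bigram count model) rather than on all previous words, and the three-headline corpus and the function names are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_counts(corpus):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Turn the raw follow-counts for `prev` into a probability distribution."""
    followers = counts.get(prev.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # word never seen as a predecessor
    return {word: n / total for word, n in followers.items()}


# Toy corpus, invented for illustration only.
corpus = [
    "cadbury launches new chocolate bar",
    "cadbury launches milkshake powder",
    "cadbury cuts jobs",
]
counts = train_bigram_counts(corpus)
probs = next_word_probs(counts, "cadbury")
print(probs)
```

A large language model replaces the count table with a neural network and the one-word context with a long window of preceding text, but the output is the same kind of object: a probability for each candidate next word.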

One of the most interesting capabilities of this model is to generate synthetic text based on some sample of input.  In practice, a human provides a short sentence (or three) as a sample, and the system generates a lengthy continuation of the prose, in a manner believable enough that a human could have written it. 

At Bitvore, we couldn’t wait to get our hands on this in order to play around with it.  We’ve seen almost 200 million news articles flow through our AI-analysis pipeline, so basically we’ve seen nearly everything. Our main interest in looking at this technology is threefold:

 

1. Training

One of the problems with our system is that for some infrequent concepts, we don’t have a large enough sample dataset to fully train and validate our AI models.  We’ve done task-specific training in order to make sure the product of our machine learning is robust and accurate enough to use.  This sometimes requires generating additional training text and materials to guarantee enough records as input.  Using a synthetic text generation tool, we could simply give a few random samples and increase the number of records for both training and validation.  The benefit would be faster training, more accurate models, and less manual identification of input data.

 

2. Robonews

Robonews is the automatic generation of text according to a template. It is far more common than most people realize. While the technique has mostly been applied to pump-and-dump stock scams or online healthcare spam, it's starting to catch on more widely for other news content. To date, robonews has usually been formulaic, plugging textual parameters into mostly deterministic text generation tools. We have been able to deploy structural and pattern-based tests to identify and/or block most of this type of content. As synthesized news articles improve in quality and become more prevalent, we’ll need to look at more sophisticated models to fingerprint robo-generated idiosyncrasies in the text, much as is already being done to detect deepfakes in images.

 

3. Lack of Content

Bitvore uses AI models to analyze tens of millions of pieces of news content. We have a concept of precision news, which is the most relevant and impactful news for any given company or municipality. Surprisingly enough, the long tail of companies and municipalities, i.e. the ones that aren’t constantly in the news, don’t have enough information about them on a week-by-week or month-by-month basis. Despite the paucity of information, there is a real need for human-digestible news stories to help augment the information about these less-covered entities. Instead of getting “firemen saving cats out of trees” or “company picnics,” we are interested in using small, sub-story tidbits of information to augment the content in an automated way.

We wanted to put this last task to the test.  After checking out the git project, installing the requirements, and downloading the model, GPT-2-117M, we were ready to roll.  The authors warn that some of the worst case behaviors are not well understood, the training text contains multiple factual errors and biases, and any synthesized text generated should be clearly marked.  Like us, they are very interested in understanding malicious uses of the tool and the corresponding ability to auto-detect the synthesized text. 

For this experiment, we picked a company at random from our company list: Cadbury. Running GPT-2 requires some user-supplied text called a “model prompt,” so we supplied it with the last 15 Cadbury headlines. To run the experiment, we tinkered with several of the model's parameters. There are flags the system uses, such as how deterministic the word choices are and how variable the text is, but we won’t get into those details here, other than to say we changed them from run to run. The titles of the articles used to seed the model are listed:

  • Cadbury chocolate maker Mondelez's sales miss estimates
  • Cadbury Creme Egg Hacks UK's Biggest Brands for Real-Time 'Easter Eg | LBBOnline
  • Cadbury cuts 50 jobs from Hobart factory
  • Cadbury extends Boost portfolio with Peanut Protein variant
  • Cadbury Is Launching Vegan Bournville Buttons | Vegan Insight
  • Cadbury launch Easter egg version of this chocolate bar
  • Cadbury launches Choc'Shake milkshake powder
  • Cadbury launches darkish salted caramel chocolate bars and chocoholics are loving them
  • Cadbury Launches Free Secret Santa Postal Service
  • Cadbury loses purple wrapper legal battle - paving way for rivals to use iconic colour
  • Cadbury Nigeria Returns To Profitability Approves N305. 51m Dividend
  • Cadbury owner Mondelez stockpiles products prepares for hard Brexit - the Times
  • Cadbury rolls out 'first ever' Bournville sharing bag
  • Cadbury Unveils A Creme Egg Block Of Chocolate
  • Cadbury's warns Dairy Milk provides could DRY UP as farmers rising chocolate are 'dying out'
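For readers curious about what "how deterministic the word choices are" means in practice, the sketch below shows two commonly used sampling controls, temperature and top-k. The post does not name the flags we changed, so treating them as `temperature` and `top_k` is an assumption, and the scores are hypothetical numbers rather than real model output.

```python
import math
import random


def sample_next_word(scores, temperature=1.0, top_k=0, rng=random):
    """Sample one word from raw scores (logits) using temperature and top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring candidates; top_k=0 means keep all.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    # Dividing by temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = [score / temperature for _, score in ranked]
    # Softmax weights, subtracting the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    words = [word for word, _ in ranked]
    return rng.choices(words, weights=weights, k=1)[0]


# Hypothetical scores for the word following "Cadbury launches".
scores = {"new": 2.0, "vegan": 1.5, "free": 1.0, "peanut": 0.2}
rng = random.Random(0)
greedy = sample_next_word(scores, top_k=1, rng=rng)  # always the top word
varied = [sample_next_word(scores, temperature=1.5, rng=rng) for _ in range(5)]
print(greedy, varied)
```

With `top_k=1` the choice is fully deterministic, and settings that restrictive tend to push a generator into loops. A higher temperature or a larger `top_k` trades that predictability for variety.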

 

Result 1

 

The company said it would cut its sales by about 30 percent in the next five years.

The company said it would cut its sales in the U.S. by about 30 percent in the next five years, and it plans to cut its worldwide sales by about 25 percent.

The company said it will cut its global sales by about 25 percent in the next five years.

The company said it will reduce its global sales by about 25 percent in the next five years.

The company said it will reduce its global sales by about 25 percent in the next five years.

 

The first thing to notice is that we had set the repetition/variability parameter wrong. It reminds us of the movie The Shining: “All work and no play makes Jack a dull boy.” Secondly, no company on the planet would release an announcement that it intends to cut its sales in the U.S. by 30% or globally by 25%. That's not how business works. We did see some announcements in our data about cutting a global sales workforce by 5%, or sales declining 20% last quarter, but nothing related to predictions of reducing sales five years out. While the text is related to the financial and economic health of the company, it’s unlikely this story would be able to slip one past our goalie, so to speak. At worst this would be tagged as junk; at best, because of the repetitiveness and forward-looking speculation, it would be tagged as non-precision news.

 

Result 2

 

The first of the four egg hacks that will be available in the UK this year is a new feature that makes it easier to make homemade egg shells. The new feature, called 'Eggshell-Egg-Egg', is available on the website of the company which makes the egg shells. It is a very simple process that requires only a few clicks and the egg will be ready when it is placed in a baking dish. The egg shell is then placed in a baking dish and the process is repeated until the egg is completely cooked. The eggshell-egg-egg-egg-egg-egg-egg-egg-egg-egg-egg-egg-egg-egg-egg-egg-egg-egg-egg-egg-egg

 

This also provides some amusement. At Bitvore we have a signal for new product launches and product updates. If there were a company with high relevance in the synthesized text, it’s possible this would be flagged as a new product or a new feature of a product. The naming of the new product/feature is particularly humorous: Eggshell-Egg-Egg. It certainly is memorable, and there are worse product names out there.

 

Result 3

 

, but the company says it will continue to make its own products. 

The company's chief executive, John Tynan, said he was pleased with the decision. 

"It's been a very difficult decision, but we're very happy with the decision," he said. 

"It's a great move for the company and for the city of Hobart." 

The decision comes after a series of attacks on the company by people who say they were misled by its products. 

The company says it is working with the police and the public to investigate. 

Topics: business-economics-and-finance, government-and-politics, hobart-2160 

The Times reports that the Trump administration is considering a law that would allow a federal judge to order a person convicted of a crime to pay $50,000 in restitution to the victim. 

 

This one is curious.  Recently in the news, there was a story about a person attacking a politician by launching a raw egg at him.  While not company news, there are enough phrases that there might be some semblance of understanding what the synthesized text is trying to say.  The AI seems worried people are going to attack politicians with Cadbury chocolate eggs, so it's reporting the government is considering a law to pay restitution for being chocolate-smeared.  Bitvore would have outright junked this article.

 

Result 4

 

The company has also launched a new line of peanut protein products, which include a peanut protein stand with peanut butter, peanut oil, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter, peanut butter,

 

This is another example where we set the repetition parameters incorrectly, though with some humorous results. It reminds us of the famous military boot camp “Bubba Gump Shrimp Company” scene in Forrest Gump where he lists all the different ways you can process and sell shrimp. Maybe the company is the Bubba Gump peanut butter company? Perhaps they should cross-train with the (NSFW/language) Full Metal Jacket Jelly Donut squad.

 

This may have been flagged for a new product launch, though the repetition would have triggered a rejection in Bitvore. 
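A repetition check of the kind mentioned above can be quite simple. The following is a minimal sketch, not Bitvore's actual rejection logic: it measures what fraction of word trigrams repeat an earlier trigram, and the function names and the 0.3 threshold are arbitrary choices made for illustration.

```python
import re
from collections import Counter


def repeated_ngram_ratio(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    # Splitting on non-alphanumerics also breaks up "egg-egg-egg" style runs.
    words = re.findall(r"[a-z0-9']+", text.lower())
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def looks_degenerate(text, n=3, threshold=0.3):
    """Flag text whose repeated n-gram ratio exceeds a (hypothetical) threshold."""
    return repeated_ngram_ratio(text, n) > threshold


looped = "peanut butter, " * 10
headline = "Cadbury cuts 50 jobs from Hobart factory"
print(looks_degenerate(looped), looks_degenerate(headline))
```

A check like this would catch the looping text in Results 1, 2, and 4, but it says nothing about fluent text that is simply untrue, which is the harder problem.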

 

This brings us to the final result, which we'll post below with commentary. Bitvore provides precision news. One of the criteria for precision is that whatever event happens, aka a 'signal' in Bitvore parlance, is financially relevant to a specific entity, aka a company. If a signal and a company don’t have a high enough score, the chances that the news record will be tagged as a precision news item drop. Broad industry news sometimes fails to meet that criterion. Industry and market reports sometimes list dozens of companies or more. This waters down the relevance of the news, and the relevance of any detected signals, as applied to any one company.

Bitvore has a way to classify industry news as an alternative to precision news. Some of our customers like the fact that we can roll up news about an industry, or roll up industries or companies around a signal like cash raising or mergers/acquisitions, into content-wide reports. We’ve been looking at ways to auto-generate issue- and company-specific intersections of our data and report them in human-digestible form. Given that goal, here is the next result.

 

Result 5 (Synthesized Text, please do not quote w/o attribution or warning)

 

The world is seeing the growth of chocolate from one-time 'mocha-eating, sugar-coated chocolate' into what's described in marketing circles as a 'high density, high fat, high fat' chocolate.

 

Double ‘high-fat’?

 

But Dairy Milk has warned that chocolate won't keep up with the growth in consumption of other types of chocolate and that the consumption of 'dairy-based chocolate is expected to accelerate,' a daily study suggests. 

 

This is interesting, as there is some support that “dark chocolate,” aka non-dairy chocolate, is growing faster than dairy-based chocolates.

 

Researchers examined how the chocolate consumption of cows, calves and goats has changed in the last 12 months of this year.

 

The chocolate bunnies will be happy to learn that people will now eat chocolate cows, calves, and goats. Though will they eat the ears and heads first, and will they be hollow?

 

Dairy Milk: The growth of the global chocolate market is expected to accelerate by 15 per cent every decade to 2020 - according to an analysis published yesterday (Nov. 15). 

 

There aren't many decades between now and 2020, but the chocolate market and dairy market are tightly intertwined and grow concomitantly.

 

Chocolate: Recent studies have suggested that a shift towards more 'dairy-based' foods is taking place as more and more consumers of the brand, and the dairy industry, increasingly want to use it at home and, on a weekly basis. 

Chocolate consumption in China has more than doubled to more than 200 milligrams of dairy products per day. 

 

It’s difficult to find metrics that measure chocolate consumption in milligrams per day, but there is support for the claim that the chocolate market in China is growing.

 

The increase in cocoa consumption has helped boost chocolate demand across Europe, but, as with all chocolate products, the decline in chocolate consumption is in evidence. 

Dairy Milk: The recent reports have suggested that the current high demand in the South East Asia region will continue on, for at least three years to come. 

Chocolate: Last year, the global chocolate market grew by over 10 per cent annually - a decline of 12 per cent per year.

 

The market did grow in 2019, though nowhere near the numbers stated.  Previous years were as high as 12%+ per year. The intermixing of both growth and decline percentages seems contradictory.

 

Chocolate consumption in the USA has become more affluent and has become more popular in recent years, with over half of a million consumers from the USA seeking out the brand in their home. 

Chocolate consumption is forecast to have nearly doubled during the next six to eight years due to a rise of the global chocolate brand in terms of volume. 

This was followed by a doubling in chocolate consumption in Germany to over two tons per year in the last four years and the US to over 1,000 tons in the same time. 

 

Trying to find U.S. and Germany specific, per-capita consumption numbers is difficult.  We did find a few reports that support the over-specific descriptions within the same ballpark, though not entirely accurate.

 

Mountain Lion and its supporters have previously said that if they continue their pursuit of chocolate, it is a long shot but they continue to offer a wide range of quality products and can easily earn big returns. 

Earlier this week, a report revealed that Mountain Lion was taking orders for one of its new chocolate-based dairy chocolate brands and is "growing increasingly confident" about

 

The Mountain Lion comment was a mystery at first, but it seems it's actually a popular, chocolate snack.

 

The bottom line: this industry-news article might actually have passed our non-precision news analysis. Truthfully, having read a lot of industry news as examples, it’s just as good or believable as anything we’ve seen written on the chocolate manufacturing industry. Without careful analysis, common sense, or detailed fact-checking, we can see this synthesized text passing the “smell test” - at least for automated analysis. Aside from some out-of-bounds numerical predictions, humorous substitutions, and non-standard ways of describing metrics, there's definitely an increased degree of believability. OpenAI might have been appropriately cautious in not fully releasing their source code or models. A little tuning, optimization, and common sense could be enough to churn out fake-but-believable industry reports that can pass human review. Once that happens, we’ll have to come out with better software to help identify the robofakes in an escalating battle. As the old Cold War saying goes: they’ll have their nukes and we’ll have ours.


Greg Bolcer, CDO Bitvore


Greg is a serial entrepreneur who has founded three angel and VC-funded companies. He's been involved at an early stage or as an advisor to at least half a dozen more. Greg has a PhD and BS in Information and Computer Sciences from UC Irvine and a MS from USC.