During the first 15 days of 2019, we blacklisted 35,102,894 items from ingestion into our system. That puts us on pace to surpass last year's total of 840 million individual pieces of content blacklisted. An important part of our data removal process is filtering out irrelevant information, typically found in the 15 online content categories listed below.
At Bitvore, blacklisting means that any URL or Web page containing a blacklisted pattern in its Web address will not be ingested into the system or analyzed, because it doesn't provide value to our customers. We call this our first line of defense. Think of it this way: when you do a Web search, you may see the total results listed somewhere on the results page. Depending on how specific the search query is, you may be told that millions of results match, or none, but most likely somewhere in between. We take some number of top results from our various commercial and non-commercial feeds and process them through the system.
(As an interesting aside, if you do a search query that returns only one result, that result is called a "hapax legomenon," or a "hapax" for short. As another interesting aside, some of the search engine results that we purchase sometimes return very odd matches. Because some of the obligors and corporations we track seldom have any news, we encounter what we've dubbed the "Beyonce effect." When there are no results for a very specific query, some of the paid search APIs we use substitute in matching records of very little relevance; sometimes these are simply popular results based on other searches during the same time period. At one point, I spent half of a week trying to figure out why we were getting so many Beyonce articles into the system.)
Bitvore's blacklist currently clocks in at a lean, mean 9,800 items, some added by hand and some automatically generated. Originally, items were either website domains or domains plus content paths. As the content in Bitvore grew, we had to add more and more items to the blacklist, until we added support for pattern matching. The list got as high as 170,000 items before we re-implemented the system to be smarter about wildcards and patterns. We don't have to blacklist a whole site. Using content paths, we can partially blacklist the parts of a site that aren't interesting while still collecting from the parts that are. For instance, we may want the news section of a site, but not the job listings or gossip sections. Anything that can't be blacklisted by its Web address/URL can be "junked" later in the content analysis process too. Junking in Bitvore just means eliminating content from our analysis pipeline based on keywords, concepts, quality, topics, or other means.
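To make the idea concrete, here's a minimal sketch of how pattern-based blacklisting of domains and content paths might work. The patterns, helper name, and matching rules here are illustrative assumptions, not Bitvore's actual blacklist or implementation:

```python
import re
from urllib.parse import urlparse

# Illustrative patterns only: block a whole domain, block just one
# section of a site, or block a path fragment on any domain.
BLACKLIST_PATTERNS = [
    r"^spamsite\.example(/|$)",   # whole domain blocked
    r"^news\.example/jobs/",      # partial block: jobs section only
    r"^news\.example/gossip/",    # partial block: gossip section only
    r"/obituaries/",              # path pattern, any domain
]

_compiled = [re.compile(p) for p in BLACKLIST_PATTERNS]

def is_blacklisted(url: str) -> bool:
    """Return True if the URL's domain+path matches any blacklist pattern."""
    parts = urlparse(url)
    target = parts.netloc + parts.path
    return any(rx.search(target) for rx in _compiled)
```

Matching on the combined domain-plus-path string is what lets one short pattern replace thousands of individually listed URLs.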
Blacklists can be any number of things, each of which we can turn on or off in order to add content or do deep analysis. Below are some of the top sources for blacklisted content:
1) Non-English Language Sites - Because Bitvore and our customers (right now) are English-only, we automatically blacklist non-English sites from our data.
2) Old Data - Surprisingly, there are a lot of ways to identify old data without actually fetching and parsing the content. For instance, a Web address like something.com/2007/article123.html typically contains something posted in 2007, not 2019. Obviously, we have to be very careful that something.com/latest-company-results-best-since-2007 or something.com/article/12342007.html do not automatically match. For our customers, an article that somehow evades discovery by all of our services until it is more than two months old won't get included. For example, collection results from something.com/2018/nov/14/ won't get included in December 14, 2018's collection. We have a whole system for being smart about dates.
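A hedged sketch of the date-matching idea above (the regex and helper are illustrative, not our production date system): a four-digit year counts only when it appears as its own path segment, which is what keeps "best-since-2007" and article IDs like 12342007 from matching:

```python
import re

# Match a year (optionally followed by numeric month/day segments) only
# when it sits between path separators, e.g. /2007/ or /2018/11/14/.
_DATE_RX = re.compile(
    r"/(?P<year>19\d{2}|20\d{2})"            # year as its own segment...
    r"(?:/(?P<month>0?[1-9]|1[0-2])"         # ...optionally /month
    r"(?:/(?P<day>0?[1-9]|[12]\d|3[01]))?)?" # ...optionally /day
    r"(?:/|$)"                               # segment must end cleanly
)

def url_year(url: str):
    """Return the year embedded in the URL path, or None."""
    m = _DATE_RX.search(url)
    return int(m.group("year")) if m else None
```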
3) Opinions and Letters to the Editor - While we can collect these, we currently do not. Our customers strongly prefer timestamped, fact-based occurrences. Some letters to the editor serve as sentiment gauges for particular organizations or topics, but they are nowhere near as accurate as sentiment analysis on fact-based articles.
4) Job Listings - We don't collect job listings. We've done some metrics on the number of open job listings and their titles as a measure of a company's health, but the actual information in the listings isn't considered significant for what we do.
5) Non-Standard Domains, Error Pages, Bad Characters, Paywalls, DarkWebs, and Portals - For whatever reason, sometimes a page was moved, removed, paywalled, or mistyped. Depending on the Web software being used, certain patterns denote errors: a permission denied, payment required, or page not found response, or an /error/ pattern, are usually strong hints. Some Web pages misrepresent their character encoding too; often this is tied to hiding content or scripts within the page. Likewise, some of the ephemeral domains and darkweb jumping-off portals that try to hide through site obscurity often get picked up. Even though they shift to new domain names, we have patterns that can keep them out of the system.
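The error-page hints above can be sketched as a first-pass URL check. These patterns are common Web-software conventions rather than an exhaustive or official list:

```python
import re

# Common URL conventions that hint at an error, paywall, or missing page.
ERROR_HINTS = re.compile(
    r"/error/|/404(?:/|\.|$)|page-not-found|permission-denied|payment-required",
    re.IGNORECASE,
)

def looks_like_error_page(url: str) -> bool:
    """Cheap pre-fetch check: does the URL itself suggest an error page?"""
    return bool(ERROR_HINTS.search(url))
```

Because this runs on the URL alone, it saves a fetch-and-parse cycle for pages that would be junked anyway.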
6) Online Pharmacies and Malware - One of the things we encounter a lot is repetitive URL paths. Malware vendors are a persistent bunch, constantly changing and relaunching their content. We encounter thousands of compromised websites that have "hidden" pages tucked under their domains.
The owners of these websites in some cases don't even know that their site is infected. Other times, website owners sign up for some malicious service thinking they are getting more web traffic, when really they are just being used to spam the web. After each "campaign" is blocked by enough services, the malware vendors slightly change the wording in the URL to get around the filters. The total volume of Viagra spam pales in comparison to the number of web pages promoting online pharmacies. Which brings us to the next blacklist subject ...
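One hedged way to catch a "relaunched" campaign whose URL wording shifts slightly between waves is to tokenize the path and compare it to known campaign vocabularies with a similarity score. The sample campaigns and threshold below are invented for illustration; this is a sketch of the general technique, not our production filter:

```python
import re

# Hypothetical token sets from previously blocked spam campaigns.
KNOWN_CAMPAIGNS = [
    {"cheap", "pills", "online", "pharmacy"},
    {"buy", "meds", "discount", "pharmacy"},
]

def path_tokens(url_path: str) -> set:
    """Split a URL path into lowercase word tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", url_path.lower()) if t}

def matches_campaign(url_path: str, threshold: float = 0.5) -> bool:
    """Flag paths whose wording mostly overlaps a known campaign
    (Jaccard similarity), even if a few words were swapped."""
    toks = path_tokens(url_path)
    if not toks:
        return False
    return any(len(toks & c) / len(toks | c) >= threshold
               for c in KNOWN_CAMPAIGNS)
```

Exact-match blacklists break the moment one word changes; a similarity threshold survives the small rewordings each new wave introduces.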
7) Obituaries and Pornography - There's no shortage of sex and death online. While the death of a CEO for a public company might be interesting in some context, it's always well-covered by the corporate news rather than the obituaries. Porn sites likewise have pretty colorful, but standard patterns.
8) Weddings, Lifestyles, Engagements, and Bridal Announcements - While we are happy for the happy couples, we eliminate whole sections of various sites simply by their categorization. While we actually do track some larger bridal outlets and markets, valuable news about them is never in any of these sections.
9) Paid Promotional Topics, Political Promotions and Celebrity - Similar to the online pharmacies and malware, it's sometimes easy to figure out when someone is paying for a huge promotion of some topic, event, politician, or celebrity. You will get similar or exact URL wording across multiple domains that all show up at about the same time, day or week.
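The coordinated-promotion signal described above can be sketched as a simple aggregation: if the same URL wording (path slug) shows up across many unrelated domains in the same collection window, flag it. The domain-count threshold is an assumption for illustration:

```python
from collections import defaultdict
from urllib.parse import urlparse

def flag_coordinated_slugs(urls, min_domains: int = 3):
    """Return the set of path slugs that appear on at least
    `min_domains` distinct domains in this batch of URLs."""
    domains_by_slug = defaultdict(set)
    for url in urls:
        parts = urlparse(url)
        slug = parts.path.rstrip("/").rsplit("/", 1)[-1].lower()
        if slug:
            domains_by_slug[slug].add(parts.netloc)
    return {slug for slug, doms in domains_by_slug.items()
            if len(doms) >= min_domains}
```

In practice this kind of check would also be windowed by time, since the telltale sign is identical wording appearing everywhere at once.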
10) Home Improvement, Personal Health and Gardening - There is a lot of content online for these topics. As with Weddings/Lifestyles, we blacklist these types of patterns even though we track some of the companies involved. The real, fact-based news doesn't show up in these sections, which are common to a lot of sites.
11) Sports Scores - High school, college, junior high school, junior college, clubs, leagues, and even professional sports. Baseball, basketball, football, water polo, tennis, swimming, wrestling, lacrosse, golf, hockey, volleyball - we see it all. Results pages, rankings, tournaments, and seasons all are easy to spot with the right patterns.
12) Food and Dining - Restaurants and restaurant chain websites can be useful sources of corporate information using press releases or corporate announcements. However, articles about food and dining in the relevant sections of a news site are not.
13) Gambling, Fantasy Sports and Gaming - Sites dedicated to these topics are eliminated as uninteresting to our customers. Patterns of content paths also can identify some of the most "spammy" versions of these.
14) Coupons, Shopping and Cost Savings - There are a lot of companies trying to chase electronic commerce dollars across the web. Some of these are above board. Some are in the middle like "native content", aka content written about a product or service as if it was a legitimate news story. Most are nothing more than "look at me" spam trying to convert users to clicks or purchases. Many of these are discounts, coupons, secret shoppers, product subscriptions, or mystery boxes. Like other patterns, these are often embedded in the content paths of other sites.
15) Bulletin Boards, Discussion, Blogs, Podcasts, and Social Networks - This one is the most difficult. We actually do collect and use a lot of this content. For some sites, we have to strictly comply with terms of service, collection requirements, and privacy restrictions on a per-site basis. A lot of this content is not as fact-based as pure news, data, or web items, though some of our customers find value in the analysis we can do on it. Right now we only do text, but we've experimented with converting voice and video as a pre-process to our input.
All in all, we do a good job of keeping the bad content out and letting the good content in. We do this in an extremely automated manner. We're always looking for ways to improve it using various technical tricks and techniques. If the content patterns continue on-trend from last year, I think we're on track to blacklist a billion individual pieces of content sometime in late 2019.
While a billion pieces of data might seem small to some data scientists working on "big data" problems, it's just one piece of our puzzle and there are a lot of interesting things embedded in the big picture.