PageScan: A Deep Dive Into Securly’s URL Categorization Technology

Note: Lightspeed Systems’ VP of Marketing has distributed information about our URL categorization technology to thousands of districts across the US, including many of our customers, without allowing us to investigate their claims. In our opinion, this breaks security-industry norms for Responsible Disclosure, potentially breaks norms around advertising without reasonable substantiation, and calls into question the motive behind the approach and its relevance to actually helping student safety. Securly is choosing to keep this discussion focused on facts, technology and student safety. Please read our earlier response from Securly’s co-founder/CRO Bharath Madhusudan on this topic.

In this article, we provide a multi-step technical take on all the information being circulated. With facts and data, we proceed to distill truth from competitive marketing. The article is split into the following sections:

  • How does Securly’s URL categorization work?
  • Video: Our PageScan technology in action.
  • Securly’s take on why the information being circulated lacks scientific rigor, and reasonableness in advertising competitive claims.
  • Why did Lightspeed’s tests with PageScan fail?
  • How would Securly have fared if Lightspeed had used the PageScan cluster?
  • Examining the effectiveness of Securly’s PageScan technology against Alexa’s Adult Category of sites using a scientific experimentation framework.

How does Securly’s URL categorization work?

Securly uses a Machine Learning (AI) approach to build its URL categorization database. When Securly was founded 5 years ago, we started with a proprietary database built by crawling millions of websites, and we have continuously added (and deprecated) thousands of sites in this database every week.

As a result, the technology is a combination of a static database and a machine-learning (AI) component called PageScan that continually looks at sites our millions of students are visiting, and categorizes them in real-time to be added to the static database.

Our approach is the industry standard in the Enterprise Security space, and is definitely not an approach that we claim to have pioneered. Any attack on the validity of this approach is an attack on the entire URL/malware categorization industry. In their competitive study, Lightspeed talks about using a combination of “mature, comprehensive and accurate database; machine-learning AI; robot crawlers; and human review” – all of which are ideas used in combination by just about everyone in the web-filtering industry. This is a decade-old approach that merits no advertising.

Here are the sources of websites used by our PageScan technology:

  • Crowd-sourced URL scanning: When a Securly student goes to a website that is NOT in one of our denylists, the website is let through the first time only. That website then goes through PageScan, which dynamically categorizes the site as belonging to one of the following categories – pornography, gambling, games, and anonymous proxies. Securly’s 5M+ enrolled Chromebooks continuously feed information into PageScan’s database and improve coverage for all Securly-filtered devices, including MacBooks, iPads and PCs.
  • Search-engine crawling: Apart from PageScan, we also periodically crawl search engines for the top adult keywords kids could potentially be searching, and then update our databases with the top 50 sites from each of these keyword searches in case we missed any site.
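The crowd-sourced flow described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not Securly’s implementation: the class name `FirstVisitFilter`, the in-memory queue, and the `fake_categorize` stand-in for PageScan are all assumptions made for the example.

```python
from collections import deque


class FirstVisitFilter:
    """Hypothetical sketch: allow an unknown site once, block it after a scan flags it."""

    def __init__(self, denylist=None):
        self.denylist = set(denylist or [])  # static database of known-bad domains
        self.scan_queue = deque()            # unknown domains waiting for categorization
        self.seen = set()                    # domains already queued or scanned

    def check(self, domain):
        """Return 'block' or 'allow'; queue domains that have never been seen."""
        if domain in self.denylist:
            return "block"
        if domain not in self.seen:
            self.seen.add(domain)
            self.scan_queue.append(domain)
        return "allow"

    def process_queue(self, categorize):
        """Run a categorizer over queued domains and denylist anything it flags."""
        while self.scan_queue:
            domain = self.scan_queue.popleft()
            if categorize(domain) is not None:
                self.denylist.add(domain)


def fake_categorize(domain):
    # Stand-in for PageScan (assumption): flags any domain containing "casino".
    return "gambling" if "casino" in domain else None


site_filter = FirstVisitFilter(denylist={"known-bad.example"})
first_visit = site_filter.check("new-casino.example")   # let through the first time
site_filter.process_queue(fake_categorize)              # scan happens after the visit
second_visit = site_filter.check("new-casino.example")  # now in the denylist
print(first_visit, second_visit)
```

The point of the design is that the first request is never delayed by scanning; the cost is that one visit gets through before the site is categorized.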

PageScan internally is split into 3 components:

  • TextScan – The actual page crawling and scanning technology that visits websites, and uses statistical models to determine whether the text in the site’s titles, meta information, and pages describes an adult site.
  • ImageScan – The second level of scanning, used when TextScan is unable to decide on a Good or Bad score. For example, when a site has no text to scan, or is in a language TextScan can’t handle today. Multiple images are downloaded from the site, and scanned using proprietary algorithms that detect pornography in these images. This technology is also capable of detecting other categories like violence, drugs, etc. but we haven’t productized that capability yet. ImageScan can only mark a site as Bad, and if it can’t, we rely on the third level of scanning.
  • 3rd-Party-Scan – Best-in-class 3rd-party URL categorization engines, accessed through paid subscriptions to companies focused exclusively on URL categorization. This stage handles sites that TextScan could not decide on and ImageScan could not mark as Bad. These subscriptions are expensive enough that they are not our first line of defense against new sites; they come into the picture only when Securly is not able to decisively catch adult sites algorithmically first.

Here’s the approximate sequence of events inside PageScan in the form of a simple flowchart. Almost all web-filtering companies would have something comparable in place.

[Figure: PageScan flowchart]
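In code, the three-stage sequence could look something like the sketch below. The function and stage names, the `Verdict` enum, and the demo stage functions are assumptions for illustration only. The article does not say what happens when no stage flags a site, so the sketch simply reports that case as undecided.

```python
from enum import Enum


class Verdict(Enum):
    DIRTY = "dirty"
    CLEAN = "clean"
    UNDECIDED = "undecided"


def page_scan(domain, text_scan, image_scan, third_party_scan):
    """Run the stages in order and return (verdict, name of the deciding stage)."""
    verdict = text_scan(domain)
    if verdict is not Verdict.UNDECIDED:
        return verdict, "TextScan"
    # ImageScan can only mark a site dirty; it never marks one clean.
    if image_scan(domain):
        return Verdict.DIRTY, "ImageScan"
    # The paid third-party lookup runs last because it is the most expensive.
    if third_party_scan(domain):
        return Verdict.DIRTY, "3rd-Party-Scan"
    return Verdict.UNDECIDED, None  # assumption: no stage flagged the site


# Hypothetical stand-ins for the real stages, keyed on the domain name only.
def demo_text_scan(domain):
    if "adult" in domain:
        return Verdict.DIRTY
    if "school" in domain:
        return Verdict.CLEAN
    return Verdict.UNDECIDED


def demo_image_scan(domain):
    return "pics" in domain


def demo_third_party(domain):
    return "listed" in domain


def scan(domain):
    return page_scan(domain, demo_text_scan, demo_image_scan, demo_third_party)


for name in ["adult.example", "school.example", "pics.example", "listed.example"]:
    print(name, scan(name))
```

Ordering the stages from cheapest to most expensive means most sites never reach the paid lookup.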

Here is what PageScan activity looks like (summer shows less activity than normal):

[Figure: screenshot of PageScan activity]

Note: The time between when a student browses a website and when it gets added to PageScan is 2 seconds.

The video below shows PageScan catching sites we deliberately removed for a test account, and its process of automatically including and blocking new sites within seconds of the first visit by a student.

Lightspeed’s Competitive Study Lacked Statistical Rigor & Reasonable Basis

  1. It used thousands of sites from its own database to claim 100% coverage of these sites. This logical fallacy has been pointed out by the school admin community on many public forums.
  2. Lightspeed’s choice of websites skewed towards obscure, rare, and foreign-language websites. The vast majority of the sites Lightspeed pointed out to the community did not have an Alexa traffic ranking, and a shockingly high percentage of these were also foreign language websites. In other words, not only were these sites picked from Lightspeed’s own database, they belonged to a long tail representation of that database.

This and other biases and inconsistencies are clearly recognizable in Lightspeed’s advertised results.
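The first point can be made concrete with a toy simulation. Everything here is synthetic: the 10,000-site universe, the 6,000-site database, and the sample sizes are made-up numbers chosen only to show the effect, not figures from either vendor.

```python
import random


def coverage(database, test_sites):
    """Fraction of test sites that appear in the database."""
    return sum(site in database for site in test_sites) / len(test_sites)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic universe of sites and a hypothetical vendor database covering part of it.
universe = [f"site{i}.example" for i in range(10_000)]
vendor_db = set(rng.sample(universe, 6_000))

# Test set drawn from the vendor's own database vs. drawn from the whole universe.
own_sample = rng.sample(sorted(vendor_db), 500)
independent_sample = rng.sample(universe, 500)

own_coverage = coverage(vendor_db, own_sample)
independent_coverage = coverage(vendor_db, independent_sample)

print(f"tested on own database entries: {own_coverage:.0%}")
print(f"tested on an independent sample: {independent_coverage:.0%}")
```

A database always scores 100% on a test set sampled from itself, whatever its real coverage is, which is why the source of the test sites matters.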

Why did PageScan miss classifying some of these sites?

The answer is simple.

Lightspeed tested these URLs on our smaller US West cluster, where PageScan was deliberately disabled. The plan had been to rely on Securly’s US East cluster, which is comparatively larger with over 5 million students, as the pool for sourcing sites for PageScan to classify. We have since turned on PageScan for all of our clusters, mostly so that admins who run such tests themselves on the smaller clusters have no reason to doubt the effectiveness of the PageScan technology.

It must be noted that sites that were classified by PageScan were making it to our US West and all other databases. The fact that Lightspeed’s tests did not find them on the US West cluster further proves that the sites tested were statistically improbable to be visited even by a large group of 5M+ users over a period of many months. Having coverage for real sites that real students visit today matters, while coverage for hypothetical sites that hypothetical students could visit in the future doesn’t matter as much.

Responsible Disclosure would have allowed Securly to address such concerns ahead of a nationwide campaign to disseminate incomplete information.

How would Securly have fared if Lightspeed had used the PageScan cluster?

From various sources, we were able to determine the nature of the sites used by Lightspeed for their tests. Here’s an approximate breakdown of how PageScan treated each of the websites Lightspeed used for testing.

Keep in mind that this list is highly skewed towards sites that seemed to have been handpicked for being hard to detect (and from Lightspeed’s own database to guarantee coverage by them):

  • PageScan did extremely well on the sites that weren’t foreign language sites.
  • Sites that were missed by PageScan were almost exclusively foreign language sites. PageScan incorrectly marked them as Good because the pages had enough text to scan but contained none of the adult keywords it looks for. We will use a simple Language Detection mechanism in the future to fix this important hole in our sequence, so that such sites move on to the ImageScan & 3rd-Party-Scan stages.
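One simple way such a language gate could work is sketched below. This is an assumption about the general idea, not the mechanism Securly plans to ship: the stopword list, the 15% threshold, and the function names are all hypothetical, and a production system would more likely use a dedicated language-identification library.

```python
import re

# Tiny illustrative list of common English function words (assumption).
ENGLISH_STOPWORDS = frozenset(
    {"the", "and", "of", "to", "in", "is", "for", "with", "that", "on", "this", "you", "are", "a"}
)


def looks_english(text, min_ratio=0.15):
    """Heuristic: English text contains a high share of common function words."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters, Unicode-aware
    if not words:
        return False
    hits = sum(word in ENGLISH_STOPWORDS for word in words)
    return hits / len(words) >= min_ratio


def text_scan_with_language_gate(text, text_scan):
    """Defer to later stages instead of trusting a keyword scan on non-English text."""
    if not looks_english(text):
        return "undecided"  # passes the site on to ImageScan / 3rd-Party-Scan
    return text_scan(text)


def always_clean(text):
    # Stand-in for a keyword scan that finds no adult keywords.
    return "clean"


english_result = text_scan_with_language_gate(
    "This is the home page of the school and it is for you", always_clean
)
foreign_result = text_scan_with_language_gate(
    "Dies ist eine Seite ohne englische Worte", always_clean
)
print(english_result, foreign_result)
```

With the gate in place, a page with plenty of non-English text is no longer marked clean just because no English adult keywords were found.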

How effective is Securly’s PageScan overall today?

The previous section covers PageScan’s ability to handle Lightspeed’s handpicked sites. However, we wanted to see how our PageScan technology performed in a scientifically sound experiment. Here’s how we designed our experiment:

  • PageScan scans for a multitude of categories. For this experiment, we focused on scanning Pornography which has been the subject of the current advertising.
  • We used the Alexa top 500 Adult sites, an authoritative standard for site ranking that uses traffic volume as a metric, as the source of the sites in this experiment. The list is publicly verifiable.
  • We completely deleted our internal static denylist for this experiment, and relied entirely on PageScan. Under normal operations, Securly keeps an ever growing static denylist of sites to complement PageScan.

Result of our experiment:

Input to TextScan (PageScan’s 1st stage): 500
Domains marked dirty by TextScan: 270
Domains marked clean by TextScan: 46

Input to ImageScan (PageScan’s 2nd stage): 184
Domains marked dirty by ImageScan: 38
Domains marked clean by ImageScan: 0 (ImageScan never marks any site clean)

Input to 3rd Party Scan (PageScan’s 3rd stage): 146
Domains marked dirty by 3rd Party Scan: 112

Total domains detected by PageScan: 420 out of 500, or 84% effectiveness.

In other words, even if we flushed our entire static database today and relied exclusively on the AI-based PageScan technology, we would still be catching 84% of the Adult sites being visited by students today. This is a powerful statement. It shows that this approach can reach 84% coverage “overnight”, without an old static database, using a combination of AI & 3rd Party API subscriptions.

Total domains detected by PageScan + Static Denylist: 485 out of 500, or 97%

Total domains detected by PageScan + Static Denylist, after factoring in parked domains and ones that are now defunct: 498 out of 500, or 99.60%, missing a total of 2 sites out of 500.
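The stage-by-stage numbers above are internally consistent, which the short check below reproduces. The figures are taken directly from the results; only the variable names are ours.

```python
total = 500

# Stage 1: TextScan decides dirty or clean, the rest move on.
text_dirty, text_clean = 270, 46
to_image_scan = total - text_dirty - text_clean

# Stage 2: ImageScan can only mark dirty, so everything else moves on.
image_dirty = 38
to_third_party = to_image_scan - image_dirty

# Stage 3: third-party scan.
third_party_dirty = 112

pagescan_detected = text_dirty + image_dirty + third_party_dirty
pagescan_pct = 100 * pagescan_detected / total

# Adding the static denylist, then discounting dead/parked/non-pornographic domains.
with_denylist = 485
dead_or_parked = 13
with_denylist_pct = 100 * with_denylist / total
adjusted = with_denylist + dead_or_parked
adjusted_pct = 100 * adjusted / total

print(f"Input to ImageScan: {to_image_scan}")
print(f"Input to 3rd Party Scan: {to_third_party}")
print(f"PageScan alone: {pagescan_detected}/{total} = {pagescan_pct:.0f}%")
print(f"PageScan + denylist: {with_denylist}/{total} = {with_denylist_pct:.0f}%")
print(f"Adjusted for dead domains: {adjusted}/{total} = {adjusted_pct:.2f}%")
```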

The few sites that the combination of AI, our static denylist & 3rd party scanning collectively missed are listed below (NSFW!):

  • Text on the page not enough to classify as porn (only 2):
    4evermodels.com
    pinmodel.com
  • Dead/parked domain or not pornographic (13):
    acecumshots.cp
    sexytera.adult.directnic.com
    porn-amateur.net
    assess2die4.com
    balloonzone2000.com
    big-cock-sex.sexyteenz.cz
    Blacksex.eroticblacks.com
    adulttravelguide.com
    asianputang.com
    balloonbeauties.com
    barbeint.net
    wettingvideo.ml
    sylviasupersite.adult.directnic.com

The 2 sites we missed out of 500 were marked as clean by TextScan because it didn’t find enough text on the main page that was Adult in nature. We have to balance the sensitivity of TextScan against its False Alarms, as overblocking in schools impedes instruction. Along with the idea of deferring foreign language sites to ImageScan & 3rd-Party-Scan, we will explore ways to get better at handling sites with ambiguous or sparse descriptions on the homepage before marking them clean. There are many other ideas we will be working on to improve our coverage, including additional heuristics, better crowdsourcing, sentiment analysis, etc. But as this test shows, Securly’s effectiveness is at a point where our 2000+ customer districts do not consider our coverage a weakness at all.

We once again encourage the community to read the note from our co-founder/CRO Bharath Madhusudan here. Madhusudan discusses the mission & team behind this technology, and that is as powerful a story as is this technology we have built over the past 5 years.

Conclusion

The competitive claims made against our technology lacked scientific rigor & reasonable standards, had logical fallacies, broke responsible disclosure norms, and were based on an incomplete understanding of how Securly works internally. Furthermore, Lightspeed itself uses comparable approaches, and the entire web-filtering industry has done so for decades. With this technical report, we have attempted to explain how an industry-standard world-class system like ours works, and what kind of effectiveness to expect out of such a system.
