Answer
Applies to: GFI WebMonitor 2013 and 2015
WebGrade Classification – a unique and “best of breed” approach
GFI WebMonitor uses a combination of automated classifiers and human analysts to classify websites. Through this unique technology and approach, GFI WebMonitor classifies large volumes of URLs with high accuracy and covers more of the internet than competing services.
Web Analysts' classification
The WebGrade URL classification process uses Web Analysts, people who specialize in various languages and cultures, to classify websites. Each Web Analyst goes through a rigorous selection and training program before being qualified to classify websites and feed information into the Master Database. This flexible team can grow and shrink as necessary; analysts work remotely, using purpose-built tools, on sites fed to them on a priority basis from the central operations office.
Web Analysts’ classifications can be cross-checked for accuracy, and monitored for productivity, speed, and consistency. Although these classifications are published in the various GFI WebMonitor service feeds and databases, the more important use of this human-generated data is to train the automated machine learning classifiers.
Automated Classifiers
GFI WebMonitor uses a Maximum Entropy Discrimination (MED) content classification algorithm and a supporting URL processing system to spider and crawl the internet, retain content information, and classify websites based on their security attributes and content. Maximum Entropy Discrimination is a next-generation derivative of the Support Vector Machine algorithms that others have employed in the content classification space for some time. MED algorithms are good at dealing with noisy input data and at making fine-grained decisions on close calls, both of which are critical for website classification given the extremely noisy and inconsistent data on the web.

Individual models are built for each of the WebGrade categories. Models are trained using human-reviewed websites as positive and negative examples within each category. Every website is evaluated against all category models, and each model outputs a confidence score that is published if it is above a certain threshold value. A website may therefore belong to zero or many categories, depending on these confidence scores.
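The sketch below illustrates this per-category scoring and thresholding. It is a minimal illustration only: the category names, the 0-100 score scale, the threshold value, and the classify helper are assumptions for this example, not GFI WebMonitor's actual implementation or API.

```python
# Minimal sketch of per-category confidence scoring; the models, score
# scale, and threshold below are illustrative assumptions, not the actual
# WebGrade implementation.

PUBLISH_THRESHOLD = 70  # illustrative cut-off for publishing a category


def classify(page_text, models):
    """Score a page against every category model and keep confident matches.

    `models` maps a category name to a scoring function returning a
    confidence in the range 0-100 for that single category.
    """
    published = {}
    for category, model in models.items():
        confidence = model(page_text)        # each model scores independently
        if confidence >= PUBLISH_THRESHOLD:  # keep only confident matches
            published[category] = confidence
    return published                         # zero, one, or many categories


# Toy stand-ins for trained MED models (real models are trained on
# human-reviewed positive and negative examples per category).
models = {
    "Gambling": lambda text: 89 if "poker" in text else 5,
    "Sports": lambda text: 94 if "football" in text else 5,
}

print(classify("live football scores and poker odds", models))
# {'Gambling': 89, 'Sports': 94}
```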
It is important to note that confidence scores should not be construed to mean that a site is "more" in one category than another; they only indicate that, within a single category, one site may have a higher confidence than another site in that same category. For example:
Abc.com = Gambling at 89, Sports at 94
Xyz.com = Gambling at 95
So we can say with some certainty that Xyz.com is more likely a Gambling site than Abc.com. However, we cannot say that Abc.com is more likely a Sports site than a Gambling site.
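As a quick illustration of this rule, a minimal sketch using the hypothetical Abc.com and Xyz.com scores above follows; the helper function is an assumption for this example only.

```python
# Confidence scores from the hypothetical example above; comparisons are
# only valid within a single category, never across categories.
scores = {
    "Abc.com": {"Gambling": 89, "Sports": 94},
    "Xyz.com": {"Gambling": 95},
}


def more_likely(site_a, site_b, category):
    """Valid comparison: two different sites, one category."""
    return scores[site_a].get(category, 0) > scores[site_b].get(category, 0)


# Within the Gambling category, Xyz.com scores higher than Abc.com.
print(more_likely("Xyz.com", "Abc.com", "Gambling"))  # True

# Invalid reasoning (do not do this): comparing Abc.com's Sports score (94)
# with its Gambling score (89) says nothing about which category fits the
# site better, because each category's model is calibrated independently.
```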
When implementing policy for enterprise protection, blocking URLs at higher confidence scores means fewer false positives, but also lets some unwanted content through. For example, if an enterprise were to block URLs in the Gambling category only when the confidence is greater than 95, it could be highly certain that every blocked URL really is a Gambling site, and the false positive rate would be low.
However, many gambling sites with confidence scores below 95 could then be accessed by end users. The recommended internet usage policy is to block access to websites with confidence scores of 70 or above, which is the default setting.
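A minimal sketch of this policy trade-off follows; the helper function, the category set, and the example score of 80 are assumptions for illustration, while 70 is the default threshold mentioned above.

```python
# Illustrative policy check only; the helper and sample scores are
# assumptions, 70 is the default blocking threshold described above.

def should_block(site_scores, blocked_categories, threshold=70):
    """Block when any blocked category meets or exceeds the policy threshold."""
    return any(
        score >= threshold
        for category, score in site_scores.items()
        if category in blocked_categories
    )


blocked = {"Gambling"}

# A site scored Gambling=80 slips past a strict 95 threshold but is
# caught by the default 70 threshold.
print(should_block({"Gambling": 80}, blocked, threshold=95))  # False (allowed)
print(should_block({"Gambling": 80}, blocked, threshold=70))  # True (blocked)
```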
Comparing Automated and Human Classifiers
GFI Software has found that the error rates of human and automated classifiers are very close. Humans evaluating thousands of URLs over an eight-hour day can make mistakes due to fatigue or unfamiliarity with the topic covered by a particular website. Automated classifiers do not get fatigued, but given the breadth of content on the web, they can be given training data that leads to occasional incorrect categorizations.
A key observation is that these two types of classifiers are inaccurate in different ways and under different conditions, which means that they can be used to cross check each other in their areas of relative strength, each supporting the other. Through this differential feedback loop, MED accuracies approaching 98% can be achieved.
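The sketch below shows one way such a cross-check could work; the data structures and the idea of flagging disagreements for re-review are assumptions used to illustrate the feedback loop, not GFI's actual pipeline.

```python
# Illustrative cross-check between human and automated labels; disagreeing
# URLs would be re-reviewed and the corrected labels fed back into training.

def find_disagreements(human_labels, model_labels):
    """Return URLs whose human and automated category sets differ."""
    return [
        url
        for url in human_labels
        if url in model_labels and human_labels[url] != model_labels[url]
    ]


human = {"abc.com": {"Sports"}, "xyz.com": {"Gambling"}}
model = {"abc.com": {"Sports", "Gambling"}, "xyz.com": {"Gambling"}}

# abc.com is flagged: whichever classifier is wrong, the corrected label
# becomes new training data, which is how the feedback loop lifts accuracy.
print(find_disagreements(human, model))  # ['abc.com']
```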
URL Sources
New URLs are gathered through constant spidering of the internet, monitoring of GFI WebMonitor's customer traffic, and from select third-party sources.
Other Classification Steps
GFI WebMonitor has many other classification processes and steps that all feed into the hosted service, such as:
- Examining websites for security violations and vulnerabilities.
- Determining the IP addresses of websites and performing correlations between sites on the same or similar IPs.
- Comparing GFI WebMonitor's web-filtering results with freely available public lists of spam URLs, security URLs, phishing URLs, and proxy URLs (see the sketch after this list).
- Tracking website longevity, i.e. how long each site has been in existence.
- Adding ranking information from many popular services as well as our own internal frequency and access data based on GFI WebMonitor’s online lookup service usage.
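As an example of the public-list comparison mentioned above, a minimal sketch follows; the list file format, the sample data, and the "Phishing" category name are assumptions for illustration.

```python
# Illustrative comparison against a public phishing URL list; the list
# format, sample data, and "Phishing" category name are assumptions.

def load_public_list(path):
    """Load one URL per line from a plain-text public list."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def flag_missing_security_label(classified, phishing_urls):
    """Return URLs on the public list that lack a phishing classification."""
    return [
        url
        for url, categories in classified.items()
        if url.lower() in phishing_urls and "Phishing" not in categories
    ]


# In-memory example data instead of a downloaded list file:
public_list = {"badsite.example"}
classified = {"badsite.example": {"News"}, "ok.example": {"Sports"}}
print(flag_missing_security_label(classified, public_list))  # ['badsite.example']
```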
Languages
GFI WebMonitor's language coverage is comprehensive and worldwide. Our Web Analysts cover all the major languages across the globe, and most languages spoken in smaller countries are included in the data set as well. A key technological advantage is that the automated classification process is not tied to any single language: once a given language is incorporated into a particular machine learning model, that model can categorize websites in every language on which it has been trained.
This capability enables GFI WebMonitor to keep up with the fast paced growth of the internet across the globe.
Publication
GFI WebMonitor continually "publishes" information from the large repository of URL information held by the hosted service. During this step, the best possible classification for each website is determined based on many factors, including the following (a simplified selection sketch appears after the list):
- Classifier ID (human, automated process, etc.)
- Time of last classification
- History of past classifications
- Ranking of the web site in popularity
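The sketch below picks a "best" classification from several candidates; the record fields and the preference order (human review first, then recency, then confidence) are illustrative assumptions, not the actual publication logic, and factors such as site popularity are omitted for brevity.

```python
# Illustrative selection of a "best" classification; the fields and the
# preference order below are assumptions, and popularity ranking is omitted.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Classification:
    categories: frozenset
    classifier: str        # "human" or "automated"
    classified_at: datetime
    confidence: int        # 0-100


def pick_best(candidates):
    """Prefer human review, then the most recent, then the most confident."""
    return max(
        candidates,
        key=lambda c: (c.classifier == "human", c.classified_at, c.confidence),
    )


candidates = [
    Classification(frozenset({"Sports"}), "automated", datetime(2015, 3, 1), 92),
    Classification(frozenset({"Gambling"}), "human", datetime(2014, 11, 20), 100),
]
print(pick_best(candidates).categories)  # frozenset({'Gambling'}): human review wins
```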
Summary
GFI WebMonitor provides industry-leading web filtering protection, giving customers the highest web filtering service level available, lower uncategorized rates, and therefore a higher degree of enterprise productivity and protection. These best-of-breed URL categorization technologies, systems, and processes let GFI WebMonitor customers rest assured that they have high-quality technology delivered in a cost-effective solution.