The Fight Against Spam: Apple's Mail Filter

May 19, 2004 - 12:08 - Computing

Spam has become a supreme annoyance on the Internet. Everyone has to deal with it, just as everyone has to deal with telemarketers and mail-order catalogs in the real world.

However, assuming that we cannot get totally rid of it, spam can, to a large extent, be avoided by following a few simple rules. My goal in this series of three articles is not to provide you with the ultimate, fool-proof anti-spam strategy. Why? Because there isn't one, and I would be lying to you if I wrote that there was. What I will try to do is to list a few common-sense, easy-to-follow rules that should allow you to spend most of your time on the Web without having to worry.

In the first part of this series, we're going to focus on defining spam -- not an easy task, despite the appearances -- and see how you can start fighting against it. Once you have followed these steps, you will be just in time to read the following installments that focus on fine-tuning our strategy. They also feature an exclusive interview with Kim Silverman, principal research scientist and manager of spoken-language technologies at Apple, about Mail.app's junk-mail filtering capabilities.
Before We Start

As in my previous article, "A Security Primer for Mac OS X," let me remind you that your own needs may vary from what is listed here. This article is intended for home users and small businesses, but multinational companies or users who handle an unusual amount of mail every day will probably want to seek professional help and to rely on custom hardware and software solutions.
What Is Spam, Anyway?

When we say "spam" here on the Mac DevCenter, we rarely speak about the "tinned luncheon meat made largely from pork, developed in 1937."

The definition of spam varies greatly from user to user, which raises issues in the detection and reporting processes. However, mail that has been sent to multiple users without their prior consent is generally considered to be spam.

An email sent on your request or sent to you specifically -- by a relative, a coworker, or someone who wants to hurt you in some way -- is technically not spam, although it can be just as dangerous and bothersome.

Emails sent by viruses in order to propagate themselves are usually not considered spam, although they can have a similar effect -- and sometimes be even worse, since their attachments weigh a lot and eat precious bandwidth.

Notifications sent to you by an overzealous provider about network status, bounces, delivery failures, and viruses are not technically spam, either.

Bounced spam is trickier: a spammer may have impersonated you and you are just receiving emails that did not reach their destinations or were bounced back by users. While you are technically not directly spammed, it is important to react quickly since the situation can quickly become unbearable.
Is There Such a Thing as Legitimate Spam?

In a way, yes -- although it probably shouldn't be called spam in this case, but "bulk mail." Many, if not most, web sites will ask you whether or not you will allow their "partners" to send "information" and "promotional offers" to the email address you provide to them.

As soon as you give your consent and allow multiple companies to use your address, the advertisers sending you mail are not necessarily at fault -- unless you can prove that they are sending you mails that pose a threat to the normal operation of your network or computer.

Most countries have specific laws regarding the use of contact information by third-party companies, making it difficult to establish what is and is not legal in your area. As a general rule, however, you can expect a site to follow the rules of the country it is located in, and not yours, even if yours are stricter. That's why you should always have a look at where the company you are dealing with is located. A few countries ask foreign companies doing online business on their territory to follow local regulations, but unfortunately, the lack of a worldwide law enforcement system in such matters makes this almost impossible to enforce. Whether this is a bad or a good thing I don't know.

Usually, legal "spam" (notice the quotes) can be stopped: simply ask the company that sends it to you to stop, and it should work. If you do not want to receive the O'Reilly newsletter, contact O'Reilly: this will work much better than setting filters for it in your mail client.

Therefore, the absolute first step in any anti-spam strategy is to go through the list of your "spammers" and to ask yourself what can be stopped peacefully and legally. While this may not account for the largest part of the promotional mails you receive, it is guaranteed to make a difference. This step is often overlooked by users who receive so much spam that they can no longer take the time to ask themselves whether they signed up for it or not.

Here is our first anti-spam tip: never, ever allow a company to send your address to "partners." Why? Because you may not know who these partners are, and this will make tracking down the source of legal "spam" much more difficult, even if a serious company has a good chance of having selected serious partners. It is also a good idea to maintain a list of the newsletters you are subscribed to: write down their names, the companies' URLs, and the opt-out procedures that should have been clearly explained to you when you signed up. Such information is extremely useful and often hard to find after a few months!

Within the "legal spam" category falls another that is rarely talked about: all of the promotional emails and newsletters you signed up for but cannot stop, for some reason. Since you signed up, it's technically legal, but the fact that you cannot stop them once you don't want them any more makes them look frighteningly similar to spam. Some companies -- or at least their online marketing departments -- actually engage in such practices, so watch out before signing up!

A good place to look for such clues is Usenet. Luckily, you can browse most of the posts through services such as Google Groups that do not require any setup on your end. Google Groups contains the entire archive of Usenet discussion groups dating back to 1981. Of course, you will find very diverse -- even opposite -- opinions, slandering, and strong language in these groups too, so read with care.
Who May Receive Spam?

Anyone may receive spam. More precisely, any active user on the Internet who uses an email address and sends it to third parties.

Did you post your email address on your site or on a forum? Well, there are robots specifically designed to read millions of web pages, extract any email addresses they can find from them, and add those addresses to mailing lists. Some forum software packages actually create forums that are so complex that most robots get stuck and never get to actually read the addresses; WebX, for example, is supposed to be quite spam-resistant. You should, however, treat every forum equally and avoid posting your address without scrambling it.
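As a rough illustration of what "scrambling" can mean, here is a small Python sketch that spells an address out so that a person can still read it but it no longer looks like "name@host". The function name `scramble_address` and the " at " / " dot " wording are my own choices, not a standard. A robot can be taught any fixed scheme, so treat this as a speed bump rather than a guarantee.

```python
def scramble_address(address: str) -> str:
    """Return a spelled-out, human-readable version of an email address."""
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    # Spell out the separators so the result no longer matches the usual
    # "name@host.tld" pattern that simple harvesting robots search for.
    return f"{local.replace('.', ' dot ')} at {domain.replace('.', ' dot ')}"


# example.com is a reserved domain, used here only for illustration.
print(scramble_address("jane.doe@example.com"))
```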

Do you send mail to PC users? Well, they may receive viruses that will read their address books and, while sending you dozens of infected mails per day -- which are, if you remember what we said above, not "spam" -- will also subscribe you to lists and flood your inbox with messages.

A less common but equally frightening case: some people use anti-spam software that "fights back" by subscribing the apparent sender of a spam message to mailing lists, so that the sender receives more spam than he or she can handle. Unfortunately, since addresses are easily spoofed, these applications very often end up punishing the wrong person -- possibly you.

A little unsettling, isn't it? Luckily, there are ways around most of that, so don't panic. However, it's important to realize that even someone who leads a perfectly respectable online life and is cautious may receive spam.
Help! I'm Already Flooded with Junk Mail
Create a New Address

If you're already flooded with junk mail, the easiest, most effective way to get rid of it is to create a new email address. Indeed, spam can reach a point where deleting it and looking for legitimate correspondence in your inbox slows you and your work down.

It can also be dangerous, transforming your mailbox into a floodgate for malicious code. Imagine what can happen the next time that you check your mails from your work PC or on a friend's XP Home machine!

Of course, creating a new address alone won't help; you also need to understand at what point your address was revealed to spammers. Otherwise, you may well end up creating a new address every few weeks -- and this definitely isn't practical.

One of the biggest issues when creating mailboxes is letting your correspondents know about them. In fact, many users never do this because they fear that they are going to lose customers, friends, or other contacts they may have. This is a legitimate fear, but everyone moves and changes addresses in the real world, too. What can be managed in life should normally be manageable online!

Obviously, you cannot set your old address to send auto-reply mail containing your new address. Otherwise, you would simply send your new address to spammers even before all your legitimate correspondents have had the time to learn about it. Worse, should one of your correspondents have an auto-reply system too, your two mail servers could enter an auto-replying loop, filling your mailbox and preventing other legitimate users from receiving the new address notification.
Using Address Book to Solve Transition Issues

Chances are that the last time you moved, you had to send cards to everyone to make sure that they were aware of your new contact information. You can do the same online by using the Panther Address Book and its great Send Updates feature.

With a few clicks, the Send Updates feature automatically sends your new contact information to a group of people. A lot easier than doing things manually, isn't it? Of course, it sends your information as a vCard, ensuring cross-platform compatibility and consistency in what you send -- so you won't make a typo in your new email address on half of the cards you send, something that can happen when writing hundreds of notes in a few days.

To send the update, here are the steps to follow:

* Select your card in Address Book and make sure that it is up to date. Also, make sure that it is marked as "Your card." You will see "me" written on the picture you have set.
* Create a group: Open your Address Book and select the "Card and column" view by using the switch located on the top left of the window. Notice the "+" button at the bottom, on the far left. Click on it to create a new group and give the group a meaningful name, such as "New address mailing."
* Populate the group: Click on the "all" group and pick the cards you want to put into the other. To select multiple contiguous cards, hold down the shift key. To pick cards at random, hold down the Apple key. Once you have selected the cards you want, drag them over the new group icon and drop them. This will populate the group you have just created.
* Make sure that you use the right address: If your correspondents have multiple email addresses, you can use the "Edit distribution list" feature, available through the "Edit" menu, to select the addresses to which your note will be sent.
* Once you are all set: Use the "File" menu to choose the "Send Updates" menu item.
* In the window that appears: Select the group to which you want to send the note. In our example, this is the group we just created.
* Then, enter a title and a message: Try to make the title and message personal enough so that spam filters don't stop it and that your correspondents actually read it!
* Once you are ready, click on "Send": A few seconds later, you will hear the mail-sending sound from Mail.app.

Of course, while sending an update, make sure that you don't send it to a potential spammer -- in case you have companies in your address book -- or to PC users who collect spam-inducing viruses on their hard drives. You should also make sure that Mail is properly set up and doesn't display the addresses of all of the members of the group. Revealing the addresses of your correspondents can cause the (justified) ire of some of them -- and is also a great way to promote spam if one of them uses a virus-infected PC.

Here is a privacy-related tip: before sending out your card, drag it onto the desktop to export it and open the resulting vCard in TextEdit. You can do so safely since vCards are nothing more than text documents in disguise. This will reveal the actual contents of the card and help you make sure that it doesn't contain information that you don't want to share, such as an email address or a custom category.
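If you would rather not read the raw text by eye, the same check can be sketched in a few lines of Python. This is a minimal reader written for illustration, not a complete vCard parser: it only unfolds continuation lines and splits each line at the first colon. The sample card and the function name `list_vcard_fields` are made up for the example.

```python
def list_vcard_fields(vcard_text: str) -> list[tuple[str, str]]:
    """Return (property, value) pairs found in a vCard, in order."""
    # Unfold continuation lines: a line that starts with a space or a tab
    # continues the previous line.
    unfolded: list[str] = []
    for line in vcard_text.splitlines():
        if line[:1] in (" ", "\t") and unfolded:
            unfolded[-1] += line[1:]
        elif line.strip():
            unfolded.append(line)

    fields = []
    for line in unfolded:
        name, _, value = line.partition(":")
        # Drop parameters such as ";type=HOME" and any "item1." group prefix.
        prop = name.split(";")[0].split(".")[-1].upper()
        if prop not in ("BEGIN", "END", "VERSION"):
            fields.append((prop, value))
    return fields


# A made-up card; in practice, paste in the text of your exported file.
SAMPLE = """BEGIN:VCARD
VERSION:3.0
N:Doe;Jane;;;
FN:Jane Doe
EMAIL;type=INTERNET;type=HOME:jane@example.com
NOTE:Private note I may not
  want to share
END:VCARD
"""

for prop, value in list_vcard_fields(SAMPLE):
    print(f"{prop}: {value}")
```

Anything that shows up in this listing, such as the NOTE line here, is something your correspondents will receive.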

Address Book also has a very nifty feature called "Enable Private Me Card," accessible through the vCard preference pane. When turned on, this feature allows you not to share some of the contents of your vCard. This can be very handy if you want to create a "meta vCard" on which you have all your contact information, and pick on the fly what you want to share. It is, however, always a good idea to make sure that it is properly configured before sending the information out.

In the same pane, you will see a checkbox called "Export Notes in vCards." You can use this to add a comment to your own vCard that you will hand out. This can be a short bio or a note that explains your address change and apologizes for the inconvenience this may cause.

Carefully Choosing Your Email Provider

There are thousands of email providers out there, some free, some fee-based. However, as easy as opening an email account somewhere may seem, it is important to pick your provider carefully and to ask yourself not only what mailbox size they offer (you will rarely use more than a few MB and even the ones offering tons of space restrict attachment size, making this feature somewhat less attractive), but what features they provide and how they fight spam and viruses.

Of course, even the best provider cannot prevent all spam from reaching your inbox, but server-side filtering can make a huge difference. In my experience, Apple's very own .Mac mail is extremely resistant to spam. Also, the support teams do reply to your inquiries and are extremely helpful.

As a way to test whether your mail provider filters for viruses, you can send yourself an EICAR.COM test file. These files are not actual viruses but are used to trigger anti-virus systems and test them. To create an EICAR.COM file, enter the following string in a new TextEdit text-only document:

X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*

and save it. Test it with your anti-virus software and make sure that it triggers an alert. If it doesn't, make sure that you have created it properly. Then name the file EICAR.COM, attach it to an email and send it to yourself. Good email providers should stop the file in transit or provide you with a warning.

Figure 2: Virex and the EICAR.COM file

Of course, since this file actually triggers anti-virus systems, it is a bit like testing the smoke detectors of your local supermarket by smoking underneath them. It can cause unnecessary concern and be illegal in some areas, so, please, do check with your provider first whether this is permitted or not. As we said, emails generated by viruses are not technically spam, but they can be so devastating that checking whether you are protected against them right now cannot hurt.

It is generally a good idea to pick an email provider that is independent from your ISP. That way, if you need to switch ISPs for any reason, you do not need to change your email address.

Webmail, IMAP, and SSL are three features that no Mac user should be without, either. Make sure that they are available when you sign up. When a provider states that "SSH tunneling" is required for secure mail reading, this is both bad and good news. It means that they know something about security (a plus), but that checking your mail will likely involve Perl scripts and shell commands (a huge minus for most users).
Carefully Picking Your Email Address

When you sign up for an email service, you are usually encouraged to select a cool, easy-to-remember address. However, this is not always a good idea.

Indeed, spammers now use nifty robots that invent addresses by combining common user names with common domains. For example, if your name is "John Smith," you are guaranteed to be spammed if you pick "smith," "john," or "jsmith" as your email address. The same applies to nicknames like "Bill," "Geek," or "Superdude."
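To see why obvious names are so exposed, here is a toy Python sketch of the idea. I have no inside knowledge of spammers' actual tools. The function `guess_addresses`, the word lists, and the example.com / example.net domains are all invented for illustration.

```python
from itertools import product


def guess_addresses(names, domains):
    """Pair every candidate user name with every candidate domain."""
    return [f"{name}@{domain}" for name, domain in product(names, domains)]


common_names = ["john", "smith", "jsmith"]
common_domains = ["example.com", "example.net"]

guesses = guess_addresses(common_names, common_domains)
print(len(guesses), "addresses guessed from", len(common_names), "names")

# An address with unusual punctuation is simply not in the guessed list.
print("j_smith.04@example.com" in guesses)
```

The list grows as names times domains, so a few thousand common names and a handful of popular domains already cover a very large number of real mailboxes.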

That's why your IT manager at work may have assigned to you an address that contains strangely placed dots, dashes, or underscores. Sure, it may be a pain to type sometimes, but it can also be a lifesaver. Apply the same rules to your home email and the amount of spam you receive should decrease.

Of course, the chances of being spammed increase if your address is hosted on a commonly used domain such as Hotmail, Yahoo, or the like. Don't get me wrong, this does not mean that there is something wrong with these domains. They simply make a more tempting target since one is almost guaranteed to find a match for any name there.
Using Multiple Addresses

This may sound silly and expensive, but it is now a strategy that you should consider. Some tutorials advise you to create two different accounts; I would suggest using three.

That way, you can have one account to receive email from trusted people. In other words, any user that is technically minded enough not to submit this address to a spammer and to protect her computer against viruses and trojan horses. The trusted group can also include very important people for you -- your boss, your close relatives -- but do make sure that providing them with your address does not ruin all of your anti-spam efforts.

The second address will be for a semi-trusted group. In other words, the general public, your customers, and your extended family. You can expect to receive a certain amount of spam on this address and should exercise caution when checking it. Of course, this does not mean that the people you give the address to are "semi-trusted" as individuals, but simply means that this address will circulate a lot more around the Internet and could potentially be intercepted.

The third one will be your junk address, the one you will give to untrusted companies and people you don't know. Of course, you should still be prudent. The fact that you can throw this address away does not mean that you should knowingly allow spammers to use it. Why? Because it would make checking it a lot more difficult, and even potentially dangerous.
Additional Tricks to Create a Well-Protected Address

Now that you have created these addresses and paid your yearly subscriptions, they should be relatively safe and spam-free. However, if you use them heavily, there are additional ways to protect yourself.

The easiest precaution is to create a smoke screen and dissociate the address you give to people from your real one. This may look like a superfluous step, but it can be extremely effective. In fact, more and more people I know use this tactic every day.
Register Your Domain Name

We have seen that commonly used domain names are more frequently targeted. Why not create your own? Some registration services allow you to register your own domain for a low price.

Even if you do not host a web site, having your own domain will increase your chances of not receiving spam and will also make your email address look ultra-cool. Families, friends, or small businesses can create a common domain name and have separate addresses to share costs. Just make sure that you establish in advance who will be your postmaster.

Of course, you should make sure that the company that you deal with to create your domain name is a trusted one. Also, some countries may not allow you to register a domain or restrict the process: always ask your legal advisor before purchasing one. If such limitations do not exist where you live, please, do respect naming conventions: .com for commercial sites, .org for non-profit, etc. This will make things easier to remember for your correspondents. And, let's face it, it makes more sense.
Set Up Mail Forwarding

Now that you have set up your domain name, it is time to create inboxes associated with it. However, professional mail services and customized mail servers are not cheap.

Therefore, you can simply set up mail forwarding to your existing addresses. That way, you can give a professional-looking address to your correspondents and keep your "real" address to yourself. When they receive a reply from you, your correspondents will be able to find out what your real address is, but if you receive spam, you wouldn't reply anyway. If you're willing to go the extra mile, you can have a custom SMTP server set up for a few dollars a month. But at this point, it may be simpler to get a "professional" email account.

Forwarding in itself cannot protect you against spam. However, what makes this method interesting are the spam-filtering and anti-virus scanning systems provided by your forwarding company: the mail that you receive travels through two layers of scanning, the one set up by the forwarding company and the one set up by your actual email provider. Since a given piece of spam can slip past any single detection engine, having multiple layers that use different engines greatly improves your overall protection.
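To put rough numbers on the two-layer idea: if the layers make their mistakes independently (an optimistic assumption, since real filters often stumble on the same messages), the fraction of spam that slips through both is the product of the individual miss rates. The rates below are invented for illustration, and `combined_miss_rate` is just a name I picked.

```python
import math


def combined_miss_rate(*miss_rates: float) -> float:
    """Fraction of spam that gets past every layer, assuming independence."""
    return math.prod(miss_rates)


# Hypothetical figures: the forwarding service misses 10% of spam and the
# mailbox provider misses 20% of whatever reaches it.
both = combined_miss_rate(0.10, 0.20)
print(f"{both:.1%} of spam gets through both layers")
```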

One of the other advantages of this method is that it allows you to create disposable addresses extremely easily. Many forwarding services allow you to create a few addresses for a fixed price and to change them as often as you please.

With such a setup, you can create a bogus username such as "spam_from_strange_site.april_04," send it to a site you don't trust, and once you have the information you want, destroy it. This is much easier to do than opening a free mailbox somewhere, and has the advantage of not cluttering your provider's customer database with unused mailboxes that can ultimately raise a security concern -- if you forget about them and someone breaks into them to perform illegal actions, for example.
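If you create many throwaway addresses, a consistent naming scheme helps you remember where each one went. The sketch below builds names in the style of the example above. The helper `disposable_alias` and its exact format are my own convention, not something your forwarding service requires.

```python
import re
from datetime import date

MONTHS = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]


def disposable_alias(site: str, when: date) -> str:
    """Build a throwaway user name that records the site and the month."""
    # Keep only lowercase letters and digits, joined by underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", site.lower()).strip("_")
    return f"spam_from_{slug}.{MONTHS[when.month - 1]}_{when.year % 100:02d}"


print(disposable_alias("Strange Site", date(2004, 4, 12)))
```

When spam starts arriving at one of these addresses, the name itself tells you which site leaked it and roughly when.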

Of course, we are not talking about anonymity here, just protection from unwanted mails. When you register a domain name, you are normally required by law to give valid contact information.
Setting Up Your Email Client

Now that you have a perfectly well-chosen address, safely put behind a smoke screen that allows you to give various identities to various people without paying a cent, we need to see how you can protect yourself in the long run.

The easiest way to do that is to use a good email client and to set it up properly. Email clients are like browsers: they allow you to interface with an open world in which the best and the worst coexist, which makes them extremely important. They should provide a good balance between security features and flexibility.
Which Client?

Email clients are not created equal. However, nowadays, it's impossible to say that one client is "good" and that another should be avoided at all costs. Most of them have pros and cons and you will probably find one that best fits your needs.

In this article, however, we will have a look at Mail, the client that is built into Mac OS X. Why? Well, it is free, is capable of handling huge amounts of mail, is quite powerful under its user-friendly interface, and is perfectly integrated with iChat and Address Book. However, the main reason is that it features a state-of-the-art "Junk Mail filter," developed by the world-leading scientists that work on Mac OS X's language technologies -- which include the Speech technologies I discussed last month.

Even if you use another client, you will want to read the following paragraphs. The advice they give can be easily translated (for the most part, at least) and you may actually discover that the application you have always dreamed of is right at your fingertips.
Mail Tips

In a successful attempt to make it even easier to use for newcomers, the Mail development team has designed an interface that allows users to access emails directly. That's great, but for various reasons, heavy mail users will want to turn off some of these features.

The first feature to disable is "Display images and embedded objects in HTML messages." To do so, simply uncheck the corresponding checkbox in the "Viewing" preference pane.

Why? Because many spammers use HTML as a way to check whether or not your address is valid. When this option is turned on, your computer will download any image that the mail contains, in order to display it properly. That download alerts the spammer that the mail has indeed reached someone and that, therefore, the address is valid.
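To make the mechanism concrete, here is a small Python sketch that lists the images an HTML message would fetch from a remote server. It uses the standard library's `html.parser`. The sample message, the `ads.example.com` address, and the `id=jsmith-42` token are invented; real tracking links vary widely, so this only shows the principle.

```python
from html.parser import HTMLParser


class RemoteImageFinder(HTMLParser):
    """Collect the URLs of images an HTML message would load remotely."""

    def __init__(self):
        super().__init__()
        self.remote_images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        # Images embedded in the message itself (cid: links) are harmless;
        # only http(s) URLs cause a request to someone else's server.
        if src.lower().startswith(("http://", "https://")):
            self.remote_images.append(src)


message_html = (
    "<p>Great offer!</p>"
    '<img src="http://ads.example.com/pixel.gif?id=jsmith-42" width="1" height="1">'
    '<img src="cid:logo">'
)

finder = RemoteImageFinder()
finder.feed(message_html)
print(finder.remote_images)
```

The token at the end of the remote URL is the interesting part: if each recipient gets a different one, a single image request is enough to tell the sender which address is alive.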

Most legitimate mail does not use HTML or, at least, images; these are mostly used by companies who wish to send attractive advertisements and newsletters. If you receive legitimate HTML mail, Mail will display a button as soon as you open it, allowing you to load the images on the fly and view the message as the original author intended.

If the companies you deal with give you a choice, I would recommend that you choose to receive text-only emails. They weigh a lot less, won't clutter your mailbox, and won't take hours to download from your mailbox -- an especially good point if you are on the go, away from your broadband connection.

The second setting to alter can be found in the "Advanced" tab of your account preferences. The "Keep copies of messages for offline viewing" pop-up menu allows you to specify whether or not Mail will download attachments automatically. Unless you cannot do so for a specific reason, I would recommend that you download messages but omit the attachments. Why? This will make Mail faster and allow you to avoid downloading malicious attachments to your computer.

The final step to take is to prevent Mail from automatically loading the messages you receive. As long as you follow the steps above, you should be safe, but it cannot hurt to add a layer of security.

In order to do that, look closely at the line that separates the mail list from the viewer area: it has a small dot in the middle. Double-click on that dot so that the line moves to the bottom of the window. Do not drag the line, since this would resize the viewer instead of closing it, even if you make it really small. Now, you will need to double-click on the emails to open them, but you will also be able to delete junk mails without actually opening them.
Next Time

In part two, which will run this coming Tuesday, I'll drill deeper into Mail.app, especially examining the underpinnings of its junk mail filter. Be sure to stop by for a look.

In Part 1, I focused on laying the foundation for an anti-spam strategy and covering how to block most of your unwanted mail. In today's article of this three-part series, I'm going to fine-tune this strategy, plus take a closer look at Mail.app, so that you can more fully unleash its potential.
The Real Show Stopper: Mail's Junk Mail Filter

Created by the engineers who bring the Japanese input method and the Speech technologies to you, Mail's junk mail filters are outstanding. When trained for a sufficient period of time, the filters can reach 98%+ accuracy against spam and are surprisingly painless to use. In fact, this feature alone has convinced many users to switch to Mail.
How Does Junk Mail Work?

Author's note: Kim Silverman, principal research scientist and manager for the Spoken Language Technologies at Apple, helped as I prepared the following paragraphs. I appreciate the information he so kindly provided. Needless to say, if there are any inaccuracies, they are entirely mine.

Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules, no it doesn't look for keywords, and no, it doesn't use white magic. To truly understand what makes it so much better than the competition, we'll have to take a closer look at the recognition engine and the technologies it relies on to do its work. It may sound a bit complex at first, but things will begin to make sense as we work through the mechanics.

Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system, developed in the Apple labs to help users who managed thousands or millions of large documents find the one they were looking for easily. In order to do that, this technology had to allow users to perform a search by topic.

The traditional approach to this has been called "vector representation." Imagine a huge table in which each column is labeled by a word from the union of all the words in all the documents. Every row is labeled by a document. And every cell contains the number of times that word appears in that document.

Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. I know it sounds quite geeky but if you can visualize that, you're halfway there.
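If the table is hard to picture, here is a tiny version of it in Python. This is only a sketch of the general idea, not Apple's code: the three "documents" are invented, and words are split naively on spaces.

```python
from collections import Counter

# Three made-up documents, keyed by a short label.
documents = {
    "tips": "aunt emma sent cooking tips",
    "recipe": "cooking a turkey with aunt emma",
    "speech": "speech recognition on the mac",
}

# The column labels: every word that appears anywhere, in a fixed order.
vocabulary = sorted({word for text in documents.values() for word in text.split()})


def to_vector(text):
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


# One row per document: a list of n numbers, n being the vocabulary size.
vectors = {name: to_vector(text) for name, text in documents.items()}

print(vocabulary)
for name, vector in vectors.items():
    print(f"{name:>7}: {vector}")
```

Each printed row is one document seen as a point in a space with as many dimensions as there are words in the vocabulary.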

Here comes the interesting part. Since every document is a point, you can cluster them. Cluster analysis will find groups of points (sometimes called "clouds") in a graph that consists of multiple, unevenly spread points. It will then tell you how these clusters describe the overall spread of the points.

That's what we do with our files, and all the documents in a cluster tend to be about the same topic. The part of Mac OS X that does all that is called the "Apple data kit." It's an engine that specializes in vector representation and can be used to find documents, sort a corpus into topics, and yes, it even auto-discovers them. The Apple data kit allows the user to find the single document that best represents each topic. Best of all, it also produces a summary of a document. That's what allows the accompanying AppleScripts to write summaries of your reports (this is called Summarize, located in the Services menu for Mail.app).
The Joys and Pains of Vector Representation

The main advantage of vector representation is that this technology does not rely on word order to do its work -- you can have a look at our speech article to learn more about why this is important.

The representation looks very much like a "bag of words," since it is based on the total number of times a word appears in a document. Documents about the same topic will usually contain similar words.

Also, whereas statistical language models capture local patterns only to do their work, vector representation captures non-local patterns. So, a document that contains "Aunt Emma" and "cooking tips" at the beginning and the end of a page can well be in the same cluster as a text that talks precisely about "the time Aunt Emma sent you cooking tips."
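A quick way to convince yourself that word order does not matter is to compare two texts that use the same words in a different order. The sketch below uses cosine similarity, one common way of measuring how close two count vectors are. I am assuming it here for illustration, since the article does not say which measure Apple's engine uses.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count the words of a text, ignoring case and word order."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Return the cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


original = bag_of_words("aunt emma sent you cooking tips")
shuffled = bag_of_words("cooking tips you sent aunt emma")
unrelated = bag_of_words("speech recognition on the mac")

print(round(cosine_similarity(original, shuffled), 6))
print(round(cosine_similarity(original, unrelated), 6))
```

The first pair scores as essentially identical even though the sentences read differently, while the second pair shares no words at all and scores zero.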

However, as with every technology, the benefits come with a few drawbacks. First of all, since the dimensionality is huge, it is computationally expensive. Also, since most words do not occur in any particular document, there are lots of zeros in the numbers that represent them. In mathematical terms, the matrix is sparse. Do you feel lost? Imagine this: take the biggest issue you can find of the Mac Developer Journal and put it in your left hand, and put your favorite dictionary in your right hand. How many words in the dictionary can you find in the Journal? Not many.

These "details" explain why clustering doesn't always work so well.

Also, most counts are low, and therefore inaccurate since they can more easily contain sampling errors. Let's say, for example, that your Aunt Emma, in her cooking tips, talks about a "hippopotamus" (as in "For the turkey to be tasty, it should be quite large but obviously, you don't want a hippopotamus-sized one."). The fact that she used it once does not mean that she will use this word again in her cooking tips. This phenomenon is called "noise."

To address all these issues, and reliably recognize the topic of documents, we need to jump into Latent Semantic Analysis.
Latent Semantic Analysis to the Rescue

To make up for the shortcomings in vector representation, we use something magical called "singular value decomposition" (SVD). It reduces the dimensionality, gets rid of the sparseness, and statistically finds the regularities in the noise. In other words, it captures the underlying stable pattern in the data we have. In case you're wondering, this involves using regression lines, but that's another story.

If each document is a point in a X0,000-dimension space or so, we reduce its dimensionality into a small number of dimensions that capture the salient patterns and the majority of the variation in the corpus. Then, we can do the Latent Semantic Analysis. In this new space, each axis is a weighted combination of all the words: documents and words coexist in the same space.

As before, you can perform a bit of cluster analysis and find clusters of documents that each represent a topic. You now have in front of you a computational representation of semantics.

Because words are distributed in the same space as documents, you can find the words that are closer to the center of a document cluster. Those will be the words that characterize the meaning of the documents in that cluster, even if a document does not contain all those words.

So we can find words that describe a document without requiring that they be necessarily found in the document.
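The whole idea can be sketched on a toy corpus. The code below is an illustration under stated assumptions, not Apple's implementation: the corpus is invented, only two dimensions are kept, and the decomposition is computed with plain power iteration (the hypothetical `top_eigenpairs` helper) rather than a real SVD library.

```python
import math

def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_eigenpairs(symmetric, k, iterations=500):
    """Largest k eigenpairs of a small symmetric matrix (power iteration + deflation)."""
    m = [row[:] for row in symmetric]
    pairs = []
    for _ in range(k):
        v = normalize([float(i + 1) for i in range(len(m))])
        for _ in range(iterations):
            v = normalize(mat_vec(m, v))
        lam = sum(a * b for a, b in zip(v, mat_vec(m, v)))
        pairs.append((lam, v))
        # Remove the component just found so the next pass finds the next one.
        for i in range(len(m)):
            for j in range(len(m)):
                m[i][j] -= lam * v[i] * v[j]
    return pairs

# Rows are words, columns are documents 0-4 (0-2 about cooking, 3-4 about loans).
terms = ["cooking", "turkey", "recipe", "oven", "mortgage", "loan", "rate"]
counts = [
    [1, 0, 0, 0, 0],  # cooking
    [1, 1, 0, 0, 0],  # turkey
    [0, 1, 1, 0, 0],  # recipe
    [0, 0, 1, 0, 0],  # oven
    [0, 0, 0, 1, 0],  # mortgage
    [0, 0, 0, 1, 1],  # loan
    [0, 0, 0, 0, 1],  # rate
]
n_docs = len(counts[0])
columns = [[row[d] for row in counts] for d in range(n_docs)]

# Eigenvectors of (documents x documents) give the right singular vectors.
gram = [[sum(a * b for a, b in zip(columns[i], columns[j]))
         for j in range(n_docs)] for i in range(n_docs)]
pairs = top_eigenpairs(gram, k=2)

# Documents and words placed in the same two-dimensional space.
doc_coords = [[math.sqrt(lam) * vec[d] for lam, vec in pairs] for d in range(n_docs)]
term_coords = [[sum(c * x for c, x in zip(row, vec)) for _, vec in pairs]
               for row in counts]

raw_similarity = cosine(columns[0], columns[2])            # no shared words
reduced_similarity = cosine(doc_coords[0], doc_coords[2])  # same topic
cross_topic = cosine(doc_coords[0], doc_coords[3])

# Words closest to the center of the cooking cluster.
centroid = [sum(doc_coords[d][i] for d in range(3)) / 3 for i in range(2)]
ranked = sorted(terms, reverse=True,
                key=lambda t: cosine(term_coords[terms.index(t)], centroid))
print(raw_similarity, reduced_similarity, cross_topic, ranked[:4])
```

Documents 0 and 2 share no words at all, so their raw similarity is zero, yet in the reduced space they land in the same place because document 1 links their vocabularies. The word "oven" ends up next to the cooking cluster and therefore describes document 0, which never mentions it.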
Everywhere on Your Mac, for Your Pleasure
Even though Apple is not the only company working on such technologies, they do seem to be the only ones to have made it so accessible to end users and powerful at the same time. In fact, they do it so well that it is now at the center of many system components as we have seen, requiring them to continuously refine the calculations and develop the formal mathematical representations -- all for your benefit.
How Does This Apply to my Spam?

So, we've endured lots of math. But now, let's get back to our main topic and see how this math applies to your spam.

There are two traditional approaches to spam. The first looks for keywords in a message and flags any mail containing those words as spam. This has a major drawback. What if your Aunt Emma happens to mention to you as an aside, in a very important email about a family gathering supposed to take place in a few days, that your uncle had an opportunity to take Viagra? The mail will be flagged and deleted, causing you to miss the gathering -- or, if it were in the business environment, potential revenue.
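Such a keyword filter takes only a few lines, which is part of its appeal. This sketch uses a made-up blocklist and shows the false positive described above.

```python
import re

# Hypothetical blocklist; real filters ship with much longer lists.
BLOCKED_WORDS = {"viagra", "mortgage"}

def keyword_filter(text):
    """Flag a message as spam if it contains any blocked word."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & BLOCKED_WORDS)

aunt_emma = ("The family gathering is on Sunday at noon. "
             "By the way, your uncle finally tried Viagra.")
flagged = keyword_filter(aunt_emma)
print(flagged)
```

A single word is enough to condemn the whole message, no matter what the rest of it says.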

Of course, systems that rely on such keywords are continuously updated and refined. Nevertheless, they are never entirely satisfying, even when using sophisticated Bayesian filters that are essentially weighted keyword systems.
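To see why a Bayesian filter amounts to a weighted keyword system, consider this minimal naive Bayes sketch. The training messages are invented and far too few for real use, and the `spam_score` function is a hypothetical name. Every word contributes a positive or negative weight, and the sum decides.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

# Tiny invented training set.
spam_examples = ["buy viagra now", "cheap viagra offer", "cheap loan offer now"]
ham_examples = ["family gathering on sunday", "aunt emma cooking tips",
                "see you at the gathering"]

spam_counts = Counter(w for m in spam_examples for w in tokenize(m))
ham_counts = Counter(w for m in ham_examples for w in tokenize(m))
vocabulary_size = len(set(spam_counts) | set(ham_counts))
spam_total = sum(spam_counts.values())
ham_total = sum(ham_counts.values())

def word_weight(word):
    """Log-likelihood ratio of a word, with add-one smoothing."""
    p_spam = (spam_counts[word] + 1) / (spam_total + vocabulary_size)
    p_ham = (ham_counts[word] + 1) / (ham_total + vocabulary_size)
    return math.log(p_spam / p_ham)

def spam_score(text):
    """Positive means 'probably spam', negative means 'probably legitimate'."""
    prior = math.log(len(spam_examples) / len(ham_examples))
    return prior + sum(word_weight(w) for w in tokenize(text))

obvious_spam = spam_score("cheap viagra offer now")
aunt_emma = spam_score("family gathering sunday uncle took viagra")
print(obvious_spam, aunt_emma)
```

Here the words "family," "gathering," and "sunday" outweigh the single "viagra," so Aunt Emma's message gets through. The decision still rests on individual words rather than on what the message means.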

The other traditional approach is to look at the sender and not accept any message from any known junk-mail sender. However, this is even less likely to work since junk mailers keep changing their addresses. Some people have proposed that you only accept mail from senders in your address book, but for obvious reasons, this isn't realistic.

That's why latent semantic analysis is much better. It doesn't make binary decisions based on any single characteristic of a message. It analyzes the meaning of the words and acts accordingly.

And to make this work even better, you can add your own rules to Mail.app to shape its behavior.

Figure 1. Spam message flagged by Mail.
Why Make it Trainable Then?

A common question about the spam filter in Mail.app is why the Apple engineers decided to make it trainable. After all, if it truly understood the meaning of a mail, it would immediately see what's junk and what's not, right?

Well, not exactly. Let's imagine that you, like most Mac users, are constantly receiving spam about mortgage opportunities. Mail would naturally flag them as junk. But what if you were in the market for a house and had requested quotes from legitimate companies? This is when the ability to train Mail comes to the rescue. You may want to alter the rules while you shop for a mortgage.

Does it Work with Other Languages?

Mail is often criticized because the system it uses "only reads English." Nothing could be further from the truth. Mail does accurately flag messages in other languages. The corpus on which it is pre-trained includes mail in different languages, and it is just as trainable in German or Japanese as it is in English -- thanks to a few other cool Apple technologies regarding tokenization that go beyond the scope of this article.
This Sounds Complex, Should I Disable it on my iBook?

Don't worry. Even though Junk Mail relies on very complex technologies, it's very efficient and easy on the computer, even on slower G3 laptops.

This is a good example of expanding capability without sacrificing performance, by writing good code.
An Introduction to Using "Junk Mail"

As soon as you launch Mail, the Junk Mail filter is turned on in "training mode." As long as training mode is on, Mail will display all the messages you receive in your inbox, including the junk. However, potential spams will be marked with cute, paper-bag icons and will appear in a disgustingly distinctive brown color, making spotting the unwanted messages easy.

If you notice a message that is incorrectly flagged as junk, simply open it and click on the "Not junk" button located at the top of the message in the brown banner. If you notice a message that should be marked as spam but isn't, select it and use the "Message" menu to "Mark it as junk mail." Alternatively, you can place a "Junk" button in your toolbar; simply use the "View" menu to customize it.

As soon as you mark a mail as Junk or Not Junk, the junk mail filter will fine-tune its analysis, learning what you consider to be junk and what it should let go through to your inbox. This simple-looking learning capability is actually what makes Mail amazing and very different from its competitors.

For most people, Viagra ads are spam and gardening-related messages are updates from their grandparents. But what if your grandparents like to talk about Viagra and you are being spammed by a gardening service? While most other programs won't be able to adapt to your situation, Mail will, and effortlessly.
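The effect of training can be sketched with a toy filter that learns only from what its user marks. This is not how Mail is implemented (the article describes latent semantic analysis on a pre-trained corpus). The `TrainableFilter` class below is a hypothetical stand-in that compares raw word counts, just to show two users getting opposite verdicts on the same message.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

class TrainableFilter:
    """Keeps one word-count profile for junk and one for wanted mail."""

    def __init__(self):
        self.junk = Counter()
        self.wanted = Counter()

    def mark(self, text, is_junk):
        """Learn from the user's 'Junk' or 'Not Junk' decision."""
        (self.junk if is_junk else self.wanted).update(tokenize(text))

    def is_junk(self, text):
        words = Counter(tokenize(text))
        return cosine(words, self.junk) > cosine(words, self.wanted)

# A typical user: drug ads are junk, garden news is wanted.
typical = TrainableFilter()
typical.mark("cheap viagra pills online", is_junk=True)
typical.mark("the roses in the garden are blooming", is_junk=False)

# The unusual user from the example above.
unusual = TrainableFilter()
unusual.mark("your grandfather's viagra prescription arrived", is_junk=False)
unusual.mark("garden service discount lawn mowing offer", is_junk=True)

message = "viagra refill reminder from grandpa"
print(typical.is_junk(message), unusual.is_junk(message))
```

The same message is junk for one user and welcome for the other, purely because of what each of them marked.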

Once you're satisfied with the accuracy of its analysis, you can switch it to "automatic" mode.

Figure 2. Mail's junk preferences.

As soon as automatic mode is turned on, any mail flagged as junk mail will be moved to a special Junk mailbox. Of course, you are still responsible for what happens to this mail. Should it be deleted? Kept for archiving? We'll see in a minute how to fine-tune this behavior.

Turning automatic mode on is a big step since it may prevent you from reading legitimate mail, especially if you don't check the Junk mailbox or you choose to delete your junk mail immediately. Although the number of false positives is extremely low (or, in most cases, zero), you may want to add a signature to your mail or a note to your web site, stating that you use anti-spam filtering technologies. You can also ask that your potential correspondents resend emails if they do not receive answers in a certain timeframe.
Fine Tuning and Automating "Junk Mail"
In order to customize the filtering, use the "Mail" menu to open the Mail preferences and click on the "Junk Mail" button. Switching between "training" and "automatic" mode is as simple as selecting the corresponding radio button. As soon as you enter "automatic" you will see that Mail creates a new Junk mailbox with the same paper-bag icon. The following preferences are easily understandable. However, here are a few notes about what they can do:

* Preventing messages that come from senders in your Address Book from being flagged as junk is probably a good choice. However, in some cases, you may not want to leave this feature on. Let's imagine that your aunt has your address but stores it on a virus-infected PC that sends your mail to spammers. In that case, applying filtering rules to the emails she seems to send to you may be a good idea.
* The same applies to the Previous recipients. While this feature can usually be safely turned on, business users or users who deal with dozens of emails per day will probably want to have it off, to ensure maximum protection.
* The fact that a message is addressed using your full name is in no way a guarantee that it is legitimate. In fact, in my case, it is almost always a guarantee that it isn't. Everyone I know calls me "F.J.", and only spammers who got my name off a list use my real name.

The "Trust Junk Mail headers set by your Internet Service Provider" feature is great, but only as long as your provider uses standard junk-mail filtering options. Indeed, some ISPs use proprietary solutions that Mail doesn't know. If this is the case, you can create a special rule that scans the "Header" used by your provider to rate junk messages and decide whether it should be marked as junk or not -- a simple task that does not require any programming on your part.

Figure 3. Typical mail headers.
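In Mail you build such a header rule by pointing and clicking, but the logic behind it is simple enough to sketch. The example below uses Python's standard `email` module. The header name `X-Spam-Flag` and the value `YES` are assumptions (SpamAssassin-style filters use them, but your provider may use something else), and `provider_says_junk` is a hypothetical helper.

```python
from email import message_from_string

RAW_MESSAGE = """\
From: sender@example.com
To: you@example.com
Subject: Hello
X-Spam-Flag: YES

Body text.
"""

def provider_says_junk(raw_message, header="X-Spam-Flag", junk_value="YES"):
    """Return True if the provider's header marks the message as junk."""
    message = message_from_string(raw_message)
    value = message.get(header, "")
    return value.strip().upper() == junk_value

print(provider_says_junk(RAW_MESSAGE))
```

Check the full headers of a few messages you know to be junk to find out which header and value your own provider actually sets.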

However, when turning this feature on, you will want to take into account how reliable your mail provider's filters are. Indeed, some of them are known for setting up paranoid filters that block even legitimate mail, while others let everything go through. Some of them now allow users to customize filters, a great step forward. In most cases, server-side junk-mail filtering features can be accessed through the provider's webmail interface, so it's worth having a look if you haven't checked in for a while. You may actually find other nice features there. For example, the .Mac webmail allows you to set up a custom mail icon visible to all Mail.app users.

The "Advanced" button is extremely interesting. Do you remember the old days when Junk Mail was listed in the "Rules" category? Well, this button allows you to see junk-mail settings as a rule. For example, you could also set mail up to run an AppleScript when you receive mail. What about getting the headers of the message so that you can send them to your IT department? Or your email provider?

On a less ambitious scale, you can use this rule to mark junk mail as read automatically -- to avoid seeing the "unread messages" notifications while sorting through your legitimate mail. You can also play a specific sound as a reminder to have a look through your Junk mailbox from time to time or, let's be crazy, switch the mail color from brown to purple.
What Should I Do with Spam Once It's Flagged?

We've seen that Mail.app will put flagged mail into a special mailbox called "Junk." However, your messages will stay there unless you specifically tell Mail what to do with them.

In order to do so, check the "Special Mailboxes" tab of your various account preferences. It contains a popup menu that allows you to specify what should be done to this mailbox.

Usually, storing junk messages on the server is a bad idea since it will increase the chances that your server will be cluttered and that your mailbox will reach full capacity, effectively bouncing legitimate messages back.

Deleting Junk messages when "Quitting Mail" would be my setting of choice since you probably don't want to keep them on your hard drive for too long. However, you should remember to check this mailbox for false positives before quitting Mail. Otherwise, they go unnoticed and may be deleted without you ever seeing them.

It sounds silly, but I suggest you use this opportunity to make sure Trash (in Mail.app) is set up well and that deleting messages from your various accounts does not simply move them to another folder on the server.

Since the trash setting is applied evenly to all of your accounts, you can set up separate rules to manage them individually if need be. For example, you may want to delete junk mail from your Home account automatically -- since your friends probably won't be too mad if you miss one of their healthy cooking tips. But you should purge your business account every week, so that you have a chance to scan it and avoid missing a potential customer.
Next Time

I'll wrap up this series on Friday with a closer look at techniques for applying rules, address masking, and some general tips to confound spammers. See you then!
