Humans and/or robot scripts are using open-access (i.e., world-writable) web sites such as wikis, forums, and blog comments to publish their site URLs and increase their Google PageRank.
See the "link spamming" section of http://en.wikipedia.org/wiki/Spamdexing
About Blog Spamming, see http://en.wikipedia.org/wiki/Blog_spam
About Google PageRank: http://www.google.com/technology/
Study of what's going on (at the Ruby wiki, the Tcl wiki, etc.): Note that this is still very rudimentary. We should collect more data on the types of spam, other wikis, occurrence, patterns, etc. That will help us know how to react, I think.
Note 2: It seems to me that wiki spamming is very different from mail spamming. Let's not try to apply the same kind of solutions to both.
The solutions to apply here are probably twofold: either technical (e.g., change the engine code) or community-based: ask the wiki community to clean up faster than the spammers can spam, or even ask Google to do something about this (why not?). Details follow. See also a fairly complete listing of possible solutions at http://www.usemod.com/cgi-bin/mb.pl?WikiSpam
(ak: A third is, of course, the combination of technical and community solutions.)
There are many potential types of answers. Maybe a mix of solutions is a good start? I think the very first thing to have is a good wiki backup/revision system (which we have here), then maybe a good community reading Recent Changes every day, and then a blend of technical helpers and fight-back techniques... ;-). It's up to us to sort out the ones we want to apply to the Tcler's wiki. Please edit this page and add your comments.
(copied from http://wiki.chongqed.org//Manni )
Ok, so I did all this research just to be able to say that (just kidding :-), but here is what I think concerning wiki spamming: CM 30 Sep 04.
Based on this, I propose the following for the Tclers wiki:
involving only the Tcler who is trying out his/her new ideas... :-).
In conclusion: first, we completely control the wiki engine and can incorporate whatever new functionality is deemed necessary to reduce the spam. Second, even better, the wiki users are all fans of programming and scripting! So they can, for example, provide scripts outside the wiki domain that perform analyses, trigger alerts, etc. This is a very valuable asset, and it would be a pity not to at least try some technical solutions :-).
Comments.
When dealing with the pros and cons of proposed solutions, you can directly edit my text and add points and references. Please try to separate facts and exhaustive listings from opinions; that will help us keep the page easy to read and useful (I hope).
30-Sep-04 DaveG: Since the Wiki pages are generated each time, have the Wiki server return the NOINDEX meta tag for any pages that have been created or modified in the past, oh, 7 days. This will allow ample time for cleaning up the Wiki, and eventually the stable and vetted content will be indexed by the Googles of the net.
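For illustration, a minimal Tcl sketch of that check; lastModified and html are placeholder names, not actual wikit variables:

 # Hypothetical sketch: emit a robots noindex tag for recently edited pages.
 # $lastModified is assumed to hold the page's last-edit time in seconds.
 set sevenDays [expr {7 * 24 * 3600}]
 if {[clock seconds] - $lastModified < $sevenDays} {
     append html "<meta name=\"robots\" content=\"noindex\">\n"
 }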
SS30Sep2004: It's quite likely that some wiki spammers are using bots to spam, or will start soon, so why not implement a security code (an image that is hard for OCR software to analyze, showing a number the user must type into a text area) to make sure the editor is human? I did this at http://wiki.hping.org (if you want to check how it works, try to edit some page there), and in the last two days I have had no spamming problems (before this it was an everyday problem for me). I don't know whether this result is because my spammers are robots, or humans who don't want to deal with a security code, or are simply unable to read English (the security-code instructions).
If this doesn't work I'll start blacklisting IP addresses. An IP will be blacklisted every time a spammer operates (and of course not only the single IP address, but the whole network or something like that). If a good user wants to edit but can't because of the blacklist, there will be an automated procedure to follow in order to unblock a given IP address even if it's blacklisted, but the procedure will take 30 seconds or so: this way wiki spam costs the spammers more.
It's not perfect, and experimental, but wiki.hping.org reached a spam level where I needed to find a solution, because I'm the only one dealing with spam there. The Tclers wiki at least has a big community that regularly fixes the wiki.
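For illustration, a minimal sketch of such an IP check in Tcl; the example prefixes are arbitrary and a real list would live outside the code:

 # Hypothetical sketch: prefix-match the client address against blocked networks.
 proc ipBlocked? {ip} {
     # example prefixes only; a real deployment would load these from a file
     set blacklist {221.219. 83.217.}
     foreach prefix $blacklist {
         if {[string match ${prefix}* $ip]} { return 1 }
     }
     return 0
 }
 # usage: if {[ipBlocked? $::env(REMOTE_ADDR)]} { # show the unblock instructions }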
jcw - All good points, ideas, and suggestions IMO. But we should not allow this to take the direction spam discussions sometimes go: people spending ages debating ideas (good ones, I'm sure) while no real action is taken. There is a balance between just cleaning up the cruft and creating mechanisms which do it for us (or avoid it). My vote would be to choose one of these soon:
A refinement dkf mentioned on the chat is to enable rollback only for people who are registered (need not be mandatory).
There have been some objections to mandatory registration (and hence insisting on cookies). Are there other options which 1) we could agree on, 2) someone is willing to implement, and 3) don't need much further tweaking once adopted?
CM I'm sure an effective rollback mechanism, plus having the revision pages ignored by Googlebot (otherwise it's not useful, see above), would be a very good mechanism to have. I don't think people would have a problem with moderating its use (e.g., via registration/email/cookies), as it does not prevent everyone from writing on the wiki. If we really need to enforce something like a login for posting, then why not instead test whether the newly published text contains outside URLs? That first option (mandatory login) does restrict the wiki's usefulness, IMHO.
LES: I really like what I already have in Yahoogroups: a login system and automatic moderation of every member's first posting. The group I keep there uses this system and has been free from spam for more than a year, maybe two.
That system gave me an interesting experience. We were spammed for a couple of months, then never again. I mean, I have only had to approve legitimate messages since then, because there have been no ill attempts. Meanwhile, I see spam grow in other lists I subscribe to. That makes me think that spammers actually keep track of which groups are moderated and which are not. Put any protection mechanism in place here and they will soon look for prey elsewhere.
I imagined a "trust network" system. A few wikit contributors would be considered "trusted" from the very beginning. New posters would be considered "untrusted" by default. "Untrusted" posts would be signaled somehow in Recent Changes - say, an alert signal next to the entry - or even held for moderation, if you want. Any "trusted" contributor would be able to visit an administration page and change someone's status from "untrusted" to "trusted". That would require a login system and cookies, though.
jcw - Could we somehow use a graphic and then respond based on the coordinates of the click? (reminds me of those goofy dialog boxes where the dismiss button moves away as you try to click on it) It can't replace the Save button unfortunately, as that would not send back the edited text, but perhaps an extra page after the save could be used? I have no idea what the graphic should be, just wanted to pass on the basic idea...
LES - Interesting idea indeed. I know for sure that PHP can do that, but I have no idea how it works inside or how it could be implemented in wikit. But PHP is open source, you know. :-) One can look at PHP's source and see how they implement graphic coordinates. Maybe it's not even PHP-dependent. Visit this page [L1 ] and look for "coordinates". It seems very simple.
Lars H: The problem with graphic response thingies is that these make it impossible to edit Wiki pages from within a text editor (as I do right now). For some purposes, browser editing facilities suck.
jcw - Not sure: you edit, you save, as you do now, then comes up a page with a graphic? (Lars H: Comes up where? The HTTP POST action is carried out by the text editor! It certainly gets some HTML back, but it has no ability to display any graphics.) Or we could add whitelists for the regulars.
Just to take this a bit further: edits remain as is, but a changed page is flagged as unverified. A page comes up with a way for people to click on a spot which marks the page as being ok (refinement: a different spot each time). Unverified changes are revoked after a certain time (could be minutes/hours). Can be combined with other ideas. Note: this is still merely an idea: we can shoot it down, ignore it, improve it - time will tell.
DKF: This is too elaborate and likely to annoy regulars. Just do the simple thing by allowing verified users (for whom you - in theory at least - have an email address) easy access to the change tracking and reverting mechanism. If healing the wiki becomes less work than spamming it, the community will be able to hold its own against the spammers for a good while.
DRH: I suggest a whitelist of approved links. Any hyperlink not on the whitelist does not get <a> tags generated but instead appears as ordinary text. Registered users and/or moderators can visit a special page that shows all URLs in the wiki that are not on the whitelist. New URLs can be added to the whitelist with a single click from the moderator, or the wiki updates containing spam URLs can be removed with a single click.
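A rough sketch of the rendering side of that proposal; renderUrl and the example whitelist are made up for illustration:

 # Hypothetical sketch: only whitelisted hosts get an <a> tag,
 # everything else is emitted as plain text.
 proc renderUrl {url whitelist} {
     foreach host $whitelist {
         if {[string match "*://$host/*" $url] || [string match "*://$host" $url]} {
             return "<a href=\"$url\">$url</a>"
         }
     }
     return $url   ;# not approved: shown as ordinary text, no link
 }
 # usage: renderUrl http://www.tcl.tk/man/ {www.tcl.tk wiki.tcl-lang.org}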
A/AK: The most useful feature for spam fighting would be to undo all changes that were made from a given IP address. I've just noticed (and cleared) 5 spammed pages, and the spammer's IP address was the same for all of them.
jcw - Ah, good point. The same feature is in the inventor's wiki, at c2: http://c2.com/cgi/wiki/ - hm, yes, that might be doable from the most recent page changes wikit already saves for CVS history.
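A hedged sketch of what such a bulk revert might look like; the change-record format and the restorePage helper are assumptions, not actual wikit internals:

 # Hypothetical sketch: revert every recent change made from one address.
 # Each change record is assumed to be {pageId ipAddress previousRevision}.
 proc revertByIp {changes badIp} {
     set reverted {}
     foreach change $changes {
         foreach {pageId ip prevRev} $change break
         if {$ip eq $badIp} {
             restorePage $pageId $prevRev   ;# assumed helper, not a wikit call
             lappend reverted $pageId
         }
     }
     return $reverted
 }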
SS - just a note on coordinates: it's up to the browser to send the coordinates of <input name="foo" type="image" ...> as regular POST or GET variables foo_x and foo_y, so everything from a C CGI to a Tcl ncgi script to Tclhttpd will be able to use this.
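For what it's worth, a minimal sketch of reading those variables with tcllib's ncgi package (the field name foo is just the example above):

 # Sketch: image-button clicks arrive as foo_x / foo_y form variables.
 package require ncgi
 ncgi::parse
 set x [ncgi::value foo_x]
 set y [ncgi::value foo_y]
 if {$x ne "" && $y ne ""} {
     # the user clicked the <input name="foo" type="image"> button at ($x,$y)
 }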
Joe(at)chongqed.org - Going by the number of individual spammers, most are probably human, but the worst of the problem is the automated spammers. Just a few automated spammers can do far more damage than all the others combined. I don't think any wiki spammers are totally automated; more likely they kind of supervise the bots. Often, even with spammers that are automated (editing lots of pages in a short period), you can see that they try several different linking methods, since not all wikis use the same syntax. Once they get it worked out they seem to let the bot do its work.
I don't think it's that spammers don't care that you have a robots.txt or meta nofollow; the problem is they will probably not notice. Human spammers likely don't look, and robot spammers would have to be programmed to look. But more importantly, do we know for sure that nofollow doesn't still increase the link's PageRank? It says don't follow the link, but it doesn't say that link doesn't exist for ranking purposes. For that reason I have never suggested this solution before. If your page is still very visible in Google (i.e., you have a good PageRank), spammers aren't going to notice you have a nofollow, or they may be like me and think that it may not block the help to their PageRank. For them it doesn't hurt to spam anyway, on the chance it does help. I suspect that if this did work, it would be suggested on one of Google's pages as a way to lessen the effect of spam. Another similar idea I have seen thrown out is to add a noindex tag to all pages; that would hurt the wiki, since no one would be able to search it or find it in an engine anymore.
Another technical solution is to limit the number of pages a user can edit in a certain time period. A normal user shouldn't need to edit more than 1 or 2 pages in less than a minute, or edit a single page more than some number of times in 1 minute. I have seen this sometimes called edit throttling. I don't know if any wiki has implemented it yet.
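For illustration, a minimal throttling sketch, assuming per-IP timestamps kept in a global array (all names made up):

 # Hypothetical sketch of edit throttling: at most $maxEdits edits per IP
 # inside a sliding window of $window seconds.
 proc editAllowed? {ip {maxEdits 2} {window 60}} {
     global editLog
     set now [clock seconds]
     set recent {}
     if {[info exists editLog($ip)]} {
         foreach t $editLog($ip) {
             if {$now - $t < $window} { lappend recent $t }
         }
     }
     if {[llength $recent] >= $maxEdits} { return 0 }
     lappend recent $now
     set editLog($ip) $recent
     return 1
 }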
Spammers are starting to log in already. It's not common yet, but I have seen at least 3 or 4 different spammers do it (2 of those last week). Unless you require a password (which could drastically hurt the wiki community) it's not going to provide much protection in the near future, since just creating a login is no problem.
I don't think a URL whitelist is anti-wiki. It may be one of the best methods to save the wiki format. A similar method was used pretty effectively (though waiting for URLs to be approved is a pain) on POPFile's wiki until a spammer accidentally ran into a UseMod bug. See http://wiki.chongqed.org//SpamBlockLoop for a description of the problem. Back to proof that even automated spammers are watching (not counting that guy): after the URL blocking, POPFile was hit by a few spammers. They attempted to get around the block by entering their URLs in different ways and usually gave up within 3-10 tries.
I don't think DaveG's idea of returning a noindex on pages that have been edited within 7 days is a good idea. When Google sees a noindex on a page that is already indexed, it removes the page. That's Google's suggested method of removing a page from the index. It would prevent the spam from being indexed, but could leave major portions of your wiki out of Google. An active wiki will always have some pages that are edited rather frequently. Even if it's less frequent than every 7 days, the timing of Googlebot's visits could still leave pages out of the index.
Thanks for linking to us and giving such a good description of our methods and all the other good ideas.
MC GoogleBot identifies itself in the User-Agent header; instead of adding a NOINDEX meta tag to pages edited within the past 7 days, if the user-agent is GoogleBot, send back the last known good revision (if the last edit is within 7 days). No harm really, since we don't expect Google to be indexing the wiki in real time anyway, yet it still gives a reasonable window for people to clean up after vandalism.
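A sketch of that idea in Tcl; lastGoodRevision and currentRevision are assumed helpers, not existing wikit calls:

 # Hypothetical sketch: Googlebot gets the last vetted revision of any page
 # that was edited within the past 7 days; everyone else gets the live page.
 proc pageForAgent {pageId userAgent lastEdit} {
     set recent [expr {[clock seconds] - $lastEdit < 7 * 24 * 3600}]
     if {$recent && [string match -nocase "*googlebot*" $userAgent]} {
         return [lastGoodRevision $pageId]
     }
     return [currentRevision $pageId]
 }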
04oct04 jcw - It's encouraging to see how many people are trying to come up with good solutions. Some proposals (such as rejecting edits from sites in CN) are likely to only be moderately effective, and only for a short time. Some tighten edit access, which is at odds with wiki zen. Some introduce an approval mechanism, and require moderators. Some focus on quick revert, making undo's nearly effortless. Ward Cunningham, wiki's inventor, recently said that he has no good answer yet. Let's keep this going, I'm certain that the right approach will float to the top...
4thOct04 NEM - The quick undo options seem like the best solution (although you'd have to make sure you could undo the undos, just in case). I think the wiki has benefited immensely from ease of editing by anyone, so it would be bad to start requiring logins or such. One idea that occurred to me: the spamming that I have seen on this wiki has involved a very large number of links added to pages. Perhaps a simple limit on the number of external links that can be added in one edit?
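A quick sketch of how that limit could be checked on submission; the regexp and the limit of 5 are arbitrary choices:

 # Hypothetical sketch: reject edits that add more than $limit external links.
 proc tooManyNewLinks? {oldText newText {limit 5}} {
     set urlRE {https?://\S+}
     set before [regexp -all -- $urlRE $oldText]
     set after  [regexp -all -- $urlRE $newText]
     expr {$after - $before > $limit}
 }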
04-Oct-2004 DKF: I always saw undo being done by submitting the old version as a new version, and not by revision history pruning. Pruning just makes for abuse potential.
20041005 CMcC: There are more IP addresses for spammers to use than there are websites which employ spammers. Rather than whitelisting links, a regexp filter which blacklisted links would prevent anyone from anywhere creating a link to a known spammer's patron's website. Any edit returning a page containing a blacklisted link would fail.
This removes any economic benefit to wiki spamming. The regexps could even be shared (a special /spammers page?)
The task of creating a link blacklist would be as simple as pointing at a successful spam and blacklisting all hosts in all URLs added by that edit. An additional wrinkle would be to blacklist hosts appearing in any edit attempt which failed through blacklisting (although this could be open to abuse.)
The response is a communal one similar to page reversion, but instead of simply reverting a page, one reverts with prejudice, and the wiki develops an immune response to the toxin. The process of reducing specific hosts to regexps would be an administrative task with low frequency.
07-Oct-2004 CM: I have been studying wiki spamming a bit more, both here and elsewhere, and found this: some examples of spammed pages, times, IP addresses, domain names, keywords, spammers' "names" and habits. Hope it'll help.
Would it be possible to have the IP address in the "revision history" listing, in order to get more data about potential spammers?
Also, I haven't noticed any nofollow tags in the /tclrevs pages. If it is not mentioned in the global robots.txt file, we should modify the scripts to insert the tags each time a history page is generated; otherwise Google will follow those links and find the spammers' sites. Another nice way would be to redirect Googlebot, only when it tries to follow the "Revisions" link, to the actual page, so that only the current content would ever be seen by Google. Let's not try to redirect spammers' bots, it'd be a lost battle, but let's at least prevent spamming from actually working, even if they don't care. This, plus reporting to chongqed.org, will increase the chances that when somebody types their names they'll land on a "This Guy is a Spammer" page instead of their site!
Otherwise, thanks for all the comments and interest. Especially to Joe for his long comment with some new ideas and some facts letting us know that URL blocking with a whitelist might be working. Maybe we'll work out some good tools to fight spam... I like, for instance, the A/AK idea of having a way to suppress everything done by a specific IP address in one click! That would really help the cleaning dramatically!
Thanks to all for working on this issue. And let the spammers know that their links are removed quickly, spamming doesn't work here, and we will fight back! I do not intend to report previous spamming, but new incidents, yes, all of them. If they know that it's dangerous for their business, they will probably look somewhere else! :-)
Christophe! Thanks for your latest reports. Although your intention was to make us aware of ec51, I quickly noticed a couple of spam links to subdomains of freewebpages.org. I reported all of these to the administrators of freewebpages.org. We already have some spammers in our database that used their services, and all of these now give just a 404 error. So there is hope, and I guess it makes sense to report as much spam as possible. Of course, sometimes you may just be wasting your time, and surely I won't write reports to chinanet or other such spammy-as-hell providers. But sometimes spammers seem to be stupid enough to choose a good provider. -- Manni
20041007 ECS - An easy way to find all or the most recent pages changed from an IP address would be helpful. We could then examine changed pages and revert them as needed.
As people report problems here on the Wiki, be certain not to report the spammer URLs in a format that the wiki will turn into a link. That hopefully will also reduce the benefit of filling up the wiki.
11oct04 jcw - I just found out that my .htaccess blacklisting worked for wiki access, but not for raw edits via the cgi-bin/ URL (doh!). Fixed now, so several vandals should have less success from now on. Also, if the wiki is slow (as it was until a few moments ago): this is usually caused by spiders. In this case, someone was running through all the edit URLs (which are not in the cache and cause a CGI script launch). (Insert comment about universes and idiots here some day...)
22oct04 DPE - Fast Fourier Transform page spammed (for all of 13 minutes before I fixed it). 9 URLs for 'www paite dot net' and 6 for 'www wjmgy dot net'. The sites and keywords are both listed on http://chongqed.org
DKF (same day) - I wonder if it would be possible to check on edit submission whether a URL listed is on the chongqed list? I suppose it could be cached locally (TTL a few hours?) if the cost of doing the remote lookup every time is too high. (That spammer also hit the Starkit - How To's page. He came from 221.219.61.102 which is in China, of course.)
RHS That's a really slick idea, DKF. If it's not convenient to get a list of links off chongqed, it might be worth asking them to implement something to make it easier (RSS, SOAP, etc.).
Of course! It's always worth asking us. Just tell us what you need. But let's keep it simple. What we already have in store is this: http://blacklist.chongqed.org -- Manni
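Building on that, a hedged sketch of DKF's cached lookup against that URL; the response format (one pattern per line) is an assumption:

 # Hypothetical sketch: fetch the chongqed blacklist and cache it for a few hours.
 package require http
 set blacklistCache {}
 set blacklistFetched 0
 proc blacklistPatterns {} {
     global blacklistCache blacklistFetched
     set ttl [expr {4 * 3600}]
     if {[llength $blacklistCache] == 0 || [clock seconds] - $blacklistFetched > $ttl} {
         set tok [http::geturl http://blacklist.chongqed.org/]
         set blacklistCache [split [string trim [http::data $tok]] \n]
         http::cleanup $tok
         set blacklistFetched [clock seconds]
     }
     return $blacklistCache
 }
 # usage: reject the edit if any pattern in [blacklistPatterns] matches the new text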
27 Oct 2004 CM God, this is really good! Firstly, the idea of CMcC (20041005) was (IMHO) excellent: do not target IP addresses but the links themselves instead, as they do not vary as often and are really the goal of the spammer. In fact, when these are links to their customers' web sites, the customers will be really pissed off to see that they are blacklisted. I believe that this could little by little put an end to wiki spamming as we know it today. Secondly, I think that designing a central site for the blacklist and a flexible API/protocol does seem like a good idea, and I support the motion of collaborating with the chongqed people on this. I'm not sure we will have much to share, as maybe some spammers know some wikis, others target blogs, etc., and they might not necessarily be the same (?). People using Movable Type already have a mechanism like this, with a list of Perl-type regexps (see [L5 ]). It's interesting to study and take those URLs into account; however, it seems they are quite different from the ones found on wikis... maybe I'm wrong. On the other hand, I strongly believe that the same spammers, spamming for the same sites, always come back to the same wikis, and so the benefit of silently ignoring their edits rapidly becomes huge, even when a site is maintaining its own blacklist of URLs.
I started experimenting a little with my local wikit, and something as simple as adding three lines would have prevented the last spammed pages that were mentioned here (including the CMcC page, which I studied intensively :-). Here is the patch:
 *** modify.tcl~ Thu Jul 10 11:54:20 2003
 --- modify.tcl  Wed Oct 27 12:07:56 2004
 ***************
 *** 108,113 ****
 --- 108,116 ----
       # avoid creating a log entry and committing if nothing changed
       set text [string trimright $text]
       if {!$changed && $text == $page} return
 +     set black {shop263|haishun|7766888|asp169|fm360|genset-sh|sec66|xhhj|cndevi|sinostrategy|paite|wjmgy}
 +     if {[regexp "http://.*.cn" $text] == 1} return
 +     if {[regexp "http://www.($black)." $text] == 1} return
       # make sure it parses before deleting old references
       set newRefs [StreamToRefs [TextToStream $text] InfoProc]
Jean-Claude, could we try this?
Maybe I was a bit extreme with the first regular expression... but I did a Google search and there was no URL of this type as of today... :-)
01nov11 jcw - The above may be wider than you intended, you probably want "http://.*.cn " and "http://www.($black ).". Right now, spamming seems to have gone down due to another simple measure I introduced, so I'm tempted to leave it as is while that lasts. But I agree that with the above and an external blacklist we could tackle the next level of escalation when needed.
27/11/04 - new spam attack - from China. 7 pages like 'help' vandalised. Didn't get a copy of them before they were repaired. -- CMcC
Lars H: You can still get the info (e.g. for reporting the spammer on chongqed.org) from the page revisions.
30/11/04 - more Chinese spam, on wiki gripes page.
19jan2005 jcw - Would it be an idea to add "rel=nofollow" to all wiki-generated links? See [L6 ].
DKF: But surely we want the Wiki to be strengthened by normal external links? (Definitely don't add it to normal intra-Wiki links, of course.)
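A tiny sketch of that distinction in the link generator; externalLink and internalLink are made-up names, not the actual wikit routines:

 # Hypothetical sketch: external links carry rel="nofollow", internal ones don't.
 proc externalLink {url text} {
     return "<a href=\"$url\" rel=\"nofollow\">$text</a>"
 }
 proc internalLink {pageId text} {
     return "<a href=\"$pageId\">$text</a>"
 }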
I think getting Google et al. to blacklist all domains on blacklist.chongqed.org would be the most effective strategy I've heard. -MrElvey Not to mention blacklisting (i.e., not indexing) surbl.org's list of IPs! Anyone work at Google? Update this if you make a connection or send 'em feedback (e.g., how can we improve these results?)
RS 2005-01-27: Another attack, 9 pages, IP: 83.217.6.205
jcw - I wish there were a pattern. I've added a link on several pages to a prominent image describing the fact that this site uses rel=nofollow and that entries on this wiki no longer affect page rank. See The Tcler's Wiki for an example of how it looks and how to insert it elsewhere. Keep in mind that a [...] image link at the top of a wiki page becomes the page title, which is why I inserted a horizontal bar as well.
RLH I have seen on blogs recently that you have to add two numbers together before you can actually post. If that were rolled in, it would at least require a human agent to do the posting. Just a thought...
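A minimal sketch of such an arithmetic challenge (the expected answer would still have to be remembered or signed server-side between the edit form and the save):

 # Hypothetical sketch: show two small random numbers in the edit form and
 # require their sum on save.
 proc newChallenge {} {
     set a [expr {int(rand() * 10)}]
     set b [expr {int(rand() * 10)}]
     return [list $a $b [expr {$a + $b}]]
 }
 proc challengePassed? {expected answer} {
     expr {[string is integer -strict $answer] && $answer == $expected}
 }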
JSI - jcw, would you like to add the image to the edit screen? This way human spammers will see it regardless of which page they edit, and in my opinion the Wiki would suffer from adding the image to every page.
Yes. I just put this in for now as a stopgap measure; tweaking the wikit code for the edit screen sounds like a good idea. Will look into it in a spare moment. -jcw
JSI - I'd suggest the "footer solution": just add an ID to the H2 heading of the edit screen and add the image to the heading as a background via CSS. This way the change to the wikit codebase would be minimal.
Thanks for the idea. It ended up being even simpler: I did a local-mode edit of Wiki CGI settings and added the image. Voilà - trivial, once the proper approach floats to the top! -jcw
DPE The following page has been spammed and needs restoring to a previous version https://wiki.tcl-lang.org/9530
Lars H, 28 May 2005: We seem to suffer from a new wave of Wiki spamming. Characteristics so far:
Well... if it's just two pages and a single vandal - perhaps just leave it in after a few cleanup attempts - bits are cheap. -jcw