Duplicate Content and How to Deal with It

December 4th, 2007 | 17 comments

As bloggers we have a constant reminder of the Duplicate Content issue in the form of “splogs,” blogs that scrape our content, usually through our RSS feeds.

Lately my blog has been duplicated by a particularly annoying splog, which has a title that spans about 30 lines. This title is picked up as an incoming link inside my WordPress admin panel so it’s annoying just to see it.

What is Duplicate Content?

Duplicate Content is more or less blocks of content that are consistently the same on more than one URL. There are several ways in which duplicate content can be created:

1. Other sites that scrape your content.
2. Syndication, e.g. mass article submission.
3. Content Management Systems (CMSs) that structure their URLs in a way that places the same content on more than one URL, e.g. WordPress.
4. Large blocks of content that appear site wide on any given Web site.

Should I worry about “Splogs” or other sites “Scraping” my content?

In most cases, no. The way Google and most other reputable search engines rank pages prevents “splogs” and other scraper sites from harming the sites they steal content from. They do not penalize pages for being scraped. If they did, all you would have to do to get your competitors penalized is start scraping their content, and you’d be golden.

However, if a site that is more popular than yours scrapes your content, their version of your content could very well rank higher than yours. This is where the real problem of duplicate content lies. Fortunately, popular sites that scrape content are few and far between, so it’s generally not something to worry too much about. When it does happen, we have the option of reporting the scraped content to Google.

How to Prevent your own Content from Becoming Duplicated

Although, as I said, it’s generally not something to worry about, it’s always a good idea to keep search engines from crawling pages on your own site or blog that could be seen as duplicate content. If for no other reason, do it to control your own “link juice” so your “money pages” rank higher than your duplicate pages.

WordPress, for instance, delivers several different versions of the same content, including:

1. The index (homepage).
2. Category pages.
3. Archives.
4. Tagged pages (if you use tags).
5. Navigation pages (e.g. blog/page/2/).
6. www and non-www pages if one or the other is linked to and you haven’t set a standard.
7. yourblog.com/index.php and yourblog.com/

…and I could probably come up with more if I really dug into it!
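To see how items 6 and 7 boil down to a single address, here is a minimal Python sketch. It assumes you have picked the non-www form as your standard, which is only an example choice, and the function name canonicalize is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Map www/non-www and index.php variants of a URL to one form.

    Assumption: the non-www host is the chosen standard.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # An empty path and "/" point at the same page.
    path = parts.path or "/"
    # Treat ".../index.php" as the directory it lives in.
    if path.endswith("/index.php"):
        path = path[: -len("index.php")]
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


if __name__ == "__main__":
    variants = [
        "http://yourblog.com/",
        "http://www.yourblog.com/",
        "http://yourblog.com/index.php",
        "http://www.yourblog.com/index.php",
    ]
    # All four variants collapse into a single canonical URL.
    print({canonicalize(u) for u in variants})
```

Running every internal link on your blog through something like this is a quick way to spot which addresses are really the same page.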

Fortunately, the solution is extremely easy – robots.txt and/or the rel=”nofollow” attribute.

To block duplicates using robots.txt, just insert the following into your robots.txt. If you don’t have a robots.txt, paste the following into a blank Notepad document and save it as robots.txt in the root folder of your site on the server, e.g. ez-onlinemoney.com/robots.txt. Be sure to change page.php and directory/ to the actual page or directory name, and start every path with a forward slash. Also, if your blog is in a subdirectory, as this blog is, be sure to insert “/blog/” before the page or directory name.

User-agent: Googlebot
Disallow: /blog/page.php
Disallow: /blog/directory/
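Before uploading the file, you can check what the rules actually block. The sketch below uses Python's built-in urllib.robotparser module. It assumes the paths are written with a leading slash, as robots.txt expects, and it uses example.com as a placeholder domain; the helper name is_blocked is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules, assuming a blog that lives in /blog/ (placeholder paths).
RULES = """\
User-agent: Googlebot
Disallow: /blog/page.php
Disallow: /blog/directory/
"""


def is_blocked(url: str, agent: str = "Googlebot") -> bool:
    """Return True if RULES tell `agent` not to crawl `url`."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/blog/page.php",
        "http://example.com/blog/directory/old-post/",
        "http://example.com/blog/my-post/",
    ):
        print(url, "blocked" if is_blocked(url) else "allowed")
```

Keep in mind that these rules only name Googlebot, so other crawlers are not affected by them.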

To restrict duplicates using rel="nofollow" (not as effective as robots.txt, because the pages can still be linked to by other sites, but still somewhat effective), just insert rel="nofollow" into every link that points to a page you want restricted.
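If you have a lot of links to update, a small script can do the inserting for you. This is only a rough Python sketch: the path prefixes are hypothetical examples of duplicate views, the function name add_nofollow is made up, and the regular expression only handles simple links with double-quoted href values, so test it on a copy of your template first.

```python
import re

# Hypothetical path prefixes for duplicate views; adjust to your own blog.
DUPLICATE_PREFIXES = ("/blog/category/", "/blog/tag/", "/blog/page/")

# Matches simple <a ... href="..." ...> opening tags (double-quoted href only).
A_TAG = re.compile(r'<a\s+([^>]*?)href="([^"]*)"([^>]*)>', re.IGNORECASE)


def add_nofollow(html: str) -> str:
    """Add rel="nofollow" to links that point at duplicate views."""

    def fix(match: re.Match) -> str:
        before, href, after = match.groups()
        already_has_rel = "rel=" in (before + after).lower()
        if already_has_rel or not href.startswith(DUPLICATE_PREFIXES):
            return match.group(0)  # leave the link untouched
        return f'<a {before}href="{href}" rel="nofollow"{after}>'

    return A_TAG.sub(fix, html)


if __name__ == "__main__":
    print(add_nofollow('<a href="/blog/tag/seo/">SEO</a>'))
    print(add_nofollow('<a href="/blog/my-post/">My post</a>'))
```

Links to your real posts are left alone, so only the duplicate views lose their share of link juice.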

And if you want to be 100% sure all your dupes are restricted you can use both robots.txt and rel=”nofollow” as I have done with this blog.

17 comments

  1. Marco Richter
    4th December, 2007 at 6:21 pm 

    Thanks for this, Josh. Especially with WordPress you should always watch out for duplicate content being created without your knowledge. I wrote on my blog about this before.

  2. Chris @ ComicHacks.com
    5th December, 2007 at 4:22 am 

    Josh, great post!

    Dupe content is a subject that I’ve been very curious about and I was especially glad that you and Tim addressed this issue with your “Tim Gorman Interview”.

    I do have a question for you: My first article has been published at EzineArticles.com. Here it is – http://ezinearticles.com/?X-Force-1—Should-Marvel-Comics-Have-A-Kill-Squad?&id=845628

    Thing is, when I do a search for my article title with quotes in Google – http://www.google.com/search?hl=en&q=%22X-Force+%231+-+Should+Marvel+Comics+Have+A+Kill+Squad%3F%22&btnG=Search – NONE of those links go to my original article at EA!! Some of the links go to EA from the Google search, but none of them go to mine.

    As a matter of fact, there are other articles at EA taking the credit for my title and article!

    Do you have any idea why that is? Just curious if you might know.

    Thanks!

  3. Koozie (1 comments.)
    5th December, 2007 at 8:28 am 

    in wordpress let’s say you make a post
    your post is on the homepage and in the category so there are 2 versions of the post

    does that have a penalty?

  4. Jessica Wilkins
    5th December, 2007 at 1:01 pm 

    thanks for this. Duplicate content can be a confusing issue, so it’s helpful to have this advice on how to get around the problem!

  5. Make Money Online - Trent Brownrigg
    5th December, 2007 at 6:33 pm 

    Another good way to deal with duplicate content is to be sure to link to your own site a few times in the content. The a$$holes who scrape content will usually cut off the end of the post because they know that’s where you will probably put your signature and links to your sites.

    Putting links throughout your post won’t stop them from scraping it but at least you will get links back to your site. And if anyone does read the stolen content they might click your link and see that you are actually the one posting good information.

  6. Elliott Cross (7 comments.)
    6th December, 2007 at 12:35 am 

    Great ideas for the robots.txt file. I have been getting hit lately on a couple of my sites by scrapers and have visited their sites…pathetic. You can tell that they scrape their content. Heck, they can’t even get the names right on who wrote the article.

    One scraper has actually done me some good. He just copied my article text, not the photos or anything else, and put a link back to me. He’s sent me about 20 visitors so far and a few new RSS subscribers.

  7. YC
    6th December, 2007 at 1:32 pm 

    Thanks for the robots.txt tip, Josh – have always wondered about this but never quite found anyone writing about it. Would be helpful in dealing with all that content being scraped on my blog.

  8. Josh Spaulding
    6th December, 2007 at 1:54 pm 

    Sorry for the short replies, or the lack of them. I still have limited Internet access and a lot of things to catch up on.

    @ Chris, The #1 result is showing your article now. In many cases the reason other articles or pages from EA show up before the actual article is that those pages list related and/or recent articles in that genre. If those pages are indexed before the page on which your article appears, they will be returned for the exact phrase match. I hope that’s what you mean?

    @ Koozie, There will always be a little duplicate content with WordPress, but as long as you reduce it to a minimum, as I have on this blog, there will rarely be any issue. There is no “penalty” either way.

    @ Everyone else, Thanks for the compliments etc. :) greatly appreciated.

  9. Chris @ ComicHacks.com
    6th December, 2007 at 7:28 pm 

    Thanks for the explanation, Josh.

    Though I’m still not seeing it on the first Google results page, I understand what you’re saying.

    Hopefully, it’ll eventually show up as the first result!

  10. Josh Spaulding
    7th December, 2007 at 10:56 am 

    No problem Chris. SERPS vary depending on the IP/location of the user, so me being in Europe may be the reason why I’m getting different results.

    I wouldn’t worry too much about it though. Just keep writing and submitting and you’ll see results. Getting stuck on small details will only slow you down :)

  11. Dewald - Fighting Duplicate Content on WordPress (2 comments.)
    10th December, 2007 at 4:08 am 

    I’ve implemented a different approach to getting rid of duplicate content; using a WordPress plugin. It’s rather simple in principle, even though it takes some one-time work.

    The approach is to make each page view of the content on the blog unique. That way it doesn’t matter if the same content appears in more than one location on the blog and it doesn’t matter if someone steals a copy of the content, because no two page views of the content will be the same. In other words, no two page views will be duplicates of one another.

    My solution is located at wpspinner.com

  12. Josh Spaulding
    10th December, 2007 at 1:19 pm 

    Dewald,

    That’s a very creative video you have on the sales page ;) However, I really don’t see this as a solution, no offense. I suppose for those who aren’t concerned with what their readers think of their content it could be somewhat of a solution. But if your plugin is “spinning” your own content, I can’t see it as something that would be positive in the long run.

    If I visit a blog that spins its own content and I’m subscribed to that blog, I won’t be for long.

    It also gives spammers more unique content. Now they have content that is just as unique as the version you’re using.

    I’m not a big fan of any kind of spinner/spooler. It’s all about user experience and if you’re providing quality content, your blog is going to rank much higher than those splogs so you really don’t have much to worry about.

  13. Dewald - Fighting Duplicate Content on WordPress (2 comments.)
    10th December, 2007 at 2:31 pm 

    Josh,

    I would agree with you if the WP Spinner plugin performed auto-allocated or auto-generated spinning, but it does not. I steered clear of auto-allocation and auto synonym replacement because that tends to lead to gibberish, or to weird sounding sentences.

    You can set up your content so that the meaning of every alternate content version is exactly the same and grammatically correct; just the words used to convey the message are different.

    I’m doing that on the sales page itself. The alternate content rotation on the sales page is slower (not every single page view) because I’m running WP Cache, so you’ll see content rotation only once every few minutes. Not all spinning has to be as drastic as the Test Drive page.

  14. Stephen Cronin (34 comments.)
    14th December, 2007 at 8:10 am 

    Hi Josh,

    Great article. I use the All In One SEO plugin which manages this in WordPress for you (as well as adding meta tags, etc), but there are lots of approaches. Personally I don’t think it’s worth spinning your own content though (no offense Dewald).

    Relating to the content scrapers, here’s a blatant plug: I recently released a new WordPress plugin called FeedEntryHeader which adds a customizable copyright message, including the URL of the original article, at the TOP of your feed entries. The idea is people will see this BEFORE they read the article, then hopefully visit your site to read the original.

    There are plugins that let you do this at the bottom of entries, but as Trent says in a comment above, scrapers often cut this off. Many also strip all links from the content, so links in the middle won’t help in many cases. In the default message that FeedEntryHeader adds, the URL is used as the anchor text, so even if the link is stripped, your URL is still there.

    Anyway, sorry about the blatant plug. I hope you’re having a great time there in Germany!

  15. Josh Spaulding
    15th December, 2007 at 11:21 am 

    @ Dewald – I guess it’s just over my head and not so interesting to me, no offense!

    @ Stephen – I came across your plugin, but I guess I didn’t realize what it did. Very interesting!! I think I’ll give it a shot, thanks!

  16. G-Man (5 comments.)
    24th December, 2007 at 12:12 am 

    I blogged about how the duplicate content filters work if you’re curious just click on my link above.

    Great article as usual Josh.

    G-Man

  17. Josh Spaulding
    26th December, 2007 at 7:51 pm 

    Thanks G-man, I’ll take a look.

© 2011 Dot Com Solutions, Inc. All Rights Reserved. Syndication is not authorized without consent.


Disclosure Statement | Privacy & Disclaimer