Joomla! Discussion Forums




 


Posted: Tue Aug 23, 2005 7:17 pm 
Hi

I'm managing a Mambo site that uses ExtCalendar.

The problem with this calendar is that most search engines crawl many, many years of calendar pages when indexing the site. This month, search engines have used almost 2 GB of bandwidth!

I can see in the logs, as I write this, that one of them is indexing the year 2022!

And this is not only Google.

These are some of the URLs in the log:
 
Code:
/index.php?option=com_extcalendar&Itemid=27&extmode=addevent&date=2022-06-21
   Http Code: 200    Date: Aug 23 20:49:36    Http Version: HTTP/1.1    Size in Bytes: 60608
   Referer: http://www.cochleaklubben.no/index.php?option=com_extcalendar&Itemid=27&date=2022-06-01
   Agent: Findexa Crawler (http://www.findexa.no/gulesider/article26548.ece)

/index.php?option=com_extcalendar&Itemid=27&extmode=addevent&date=2022-06-22
   Http Code: 403    Date: Aug 23 20:50:34    Http Version: HTTP/1.1    Size in Bytes: -
   Referer: http://www.cochleaklubben.no/index.php?option=com_extcalendar&Itemid=27&date=2022-06-01
   Agent: Findexa Crawler (http://www.findexa.no/gulesider/article26548.ece)

/index.php?option=com_extcalendar&Itemid=27&extmode=addevent&date=2022-06-23
   Http Code: 403    Date: Aug 23 20:51:34    Http Version: HTTP/1.1    Size in Bytes: -
   Referer: http://www.cochleaklubben.no/index.php?option=com_extcalendar&Itemid=27&date=2022-06-01
   Agent: Findexa Crawler (http://www.findexa.no/gulesider/article26548.ece)


This is the robots.txt file I have now:

Code:
User-agent: *
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /editor/
Disallow: /help/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /mambots/
Disallow: /media/
Disallow: /modules/
Disallow: /templates/
Disallow: /installation/
Disallow: /cgi-bin/
Disallow: /mambo/
Disallow: /orghtml/
Disallow: /phpmysqlautobackup/
disallow: /*?
Disallow: /index.php?option=com_extcalendar


As you can see from the last two entries, I have tried some new rules, but they don't seem to make any difference to the crawling.

Does anybody have a solution to this, or is the ONLY way to stop it to stop using any type of calendar on a Mambo site?


Posted: Wed Aug 24, 2005 1:56 am
Joomla! Ace

Joined: Fri Aug 19, 2005 2:26 am
Posts: 1789
Location: Lancaster, Lancashire, United Kingdom
I can only discuss a theoretical solution:
The links produced by the component (or is it a module? I don't know) could be altered in the code to include the rel="nofollow" attribute that has recently been accepted by the large search engines. This would involve changes to the code, or could perhaps be implemented through a content mambot. Beyond the theoretical I can offer no real assistance, as I am not familiar with the component/module in question. I think this is something that future components and modules could offer as a configuration option in their respective admin sections.

Dean Marshall.

_________________
Dean Marshall - Mambo and Joomla Consultant
Dean Marshall Consultancy Limited - http://www.deanmarshall.co.uk/


Posted: Wed Aug 24, 2005 2:11 am
Joomla! Ace

Joined: Thu Aug 18, 2005 9:06 am
Posts: 1465
Continuing from the previous post: it may be easiest to do this in your template. Just add this to your head section:
Code:
<?php if ($option == 'com_extcalendar') { ?>
<meta name="robots" content="noindex,follow" />
<?php } ?>


Of course the search engines will still query this page at least once but they should not index it.

_________________
http://de.siteof.de/


Posted: Wed Aug 24, 2005 2:35 am
Joomla! Ace

Joined: Fri Aug 19, 2005 2:26 am
Posts: 1789
Location: Lancaster, Lancashire, United Kingdom
Don't you hate it when you miss the easier / more elegant solution!
Well spotted 'de'.
I would consider making that a nofollow in the meta though - unless there is a good reason not to.

Dean.

_________________
Dean Marshall - Mambo and Joomla Consultant
Dean Marshall Consultancy Limited - http://www.deanmarshall.co.uk/


Posted: Wed Aug 24, 2005 2:45 am
Joomla! Ace

Joined: Thu Aug 18, 2005 9:06 am
Posts: 1465
Yeah, sometimes I don't see the simpler solution either.
I used follow so that crawlers would still pick up the links on that page in case one of them is worth indexing, but using nofollow probably reduces the load from the search engines. So I can't think of a very good reason to keep follow (the indexable links are probably on other indexable pages anyway).

_________________
http://de.siteof.de/


Posted: Wed Aug 24, 2005 5:40 pm 
Thanks for your answers and suggestion.

I added the code to the template's index.php file, changed to "nofollow".

But when I look at "Source" in my web browser, I see this about robots:

Code:
<meta name="robots" content="index, follow" />


I'm no expert at this, but it looks to me as if the code I placed in template/index.php does not work.
I have looked through some of the files in the Mambo folder, but I cannot find the code that outputs "index, follow".

Does anyone have any idea where this code is, since it's not in the template/index.php file?

URL to site: http://cochleaklubben.no/


Posted: Wed Aug 24, 2005 6:01 pm
Joomla! Ace

Joined: Thu Aug 18, 2005 9:06 am
Posts: 1465
OK, you are right: the page contains both tags. I was afraid that would be the case.
There seems to be no function to remove or overwrite a meta tag, but it should still be possible. You could try the following (which should work with 4.5.2.3 at least; it would be simpler to hack the core, but this way you don't need to):
Code:
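<?php
// Only touch the robots tag on calendar pages
if ($option == 'com_extcalendar') {
  // Take the list of meta tags by reference so the change sticks
  $metaArray =& $mainframe->_head['meta'];
  foreach (array_keys($metaArray) as $key) {
    // Entry [0] is the meta name, entry [1] its content
    if ($metaArray[$key][0] == 'robots') {
      $metaArray[$key][1] = 'noindex,nofollow';
      break;
    }
  }
} ?>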


It assumes that the robots meta value has already been added, which is the case (that's why you had it twice).
Hope it works (I must admit I did not test it at all).

_________________
http://de.siteof.de/


Posted: Wed Aug 24, 2005 6:05 pm 
Thanks for the fast answer  :D

Sorry to ask, but do you mean to put this code in the header of template/index.php?


Posted: Wed Aug 24, 2005 6:07 pm
Joomla! Ace

Joined: Thu Aug 18, 2005 9:06 am
Posts: 1465
macern wrote:
Sorry to ask, but do you mean to put this code in the header of template/index.php?

Yes. It just needs to be placed before the function mosShowHead is called. Replace the previous code with the new one and it should be fine.

_________________
http://de.siteof.de/


Posted: Wed Aug 24, 2005 6:35 pm 
Thanks again.

Sorry, but it does not seem to work.
I still see "index, follow" when looking at the page source in my browser.
I turned off the cache just in case and reloaded several times.

I think this is one of the biggest problems with Mambo: search engines dig through so many pages.
I have also seen it on other Mambo sites, and with other calendar components, not just the one used on this site.

So far I have had to block the IP of the search engine that was indexing the calendar from about 1980 to 2030-something  :o
And it is not only Google, but several other search engines too.


Posted: Wed Aug 24, 2005 6:46 pm
Joomla! Ace

Joined: Thu Aug 18, 2005 9:06 am
Posts: 1465
OK, sorry. This time I looked a bit deeper. It is kind of stupid, but the function mosShowHead adds the robots tag itself, which rather defeats the purpose of the array.

So you cannot do it without hacking the core. Remove the previous code again; no template code is needed anymore.

Then edit /includes/frontend.php
Change line 151:
From:
Code:
   $mainframe->addMetaTag( 'robots', 'index, follow' );

To:
Code:
   if ($option == 'com_extcalendar') {
      $mainframe->addMetaTag( 'robots', 'noindex, nofollow' );
   } else {
      $mainframe->addMetaTag( 'robots', 'index, follow' );
   }


This should work now. If not, you are allowed to call me whatever you want ;-)

By the way, the downside is that you have to redo this change after every Mambo update.
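
If more components than the calendar ever need the same treatment, the if/else could be folded into one small lookup so the core hack stays a single line. This is only a rough sketch: robotsDirectiveFor() is a made-up helper (it is not part of Mambo) that you would have to define yourself before the changed line.

```php
<?php
// Hypothetical helper, NOT part of Mambo: returns the robots directive
// for a component. $blocked lists the components to keep out of the index.
function robotsDirectiveFor($option, $blocked = array('com_extcalendar'))
{
    // Strict comparison so only exact component names match
    if (in_array($option, $blocked, true)) {
        return 'noindex, nofollow';
    }
    return 'index, follow';
}

// In /includes/frontend.php the hacked line would then read:
// $mainframe->addMetaTag( 'robots', robotsDirectiveFor($option) );
```

Adding another component would then just mean adding its name to the $blocked list.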

_________________
http://de.siteof.de/


Posted: Wed Aug 24, 2005 7:04 pm 
Tulling  :P
(Norwegian)

I'm not going to pretend I'm an expert at this, but I imagine this would show up when I look at the source of the page in my web browser?

It still says the same as before.
I even tried a browser that I know has nothing cached from this site.

Any other suggestions?

A simple "hack" like your previous suggestion is not a problem.
I keep a record of every change I make on a site, so I can easily find it again and reuse it on other Mambo sites that need anything special.

Thanks for your time, de  ;)


Posted: Wed Aug 24, 2005 7:10 pm
Joomla! Ace

Joined: Thu Aug 18, 2005 9:06 am
Posts: 1465
Not sure what Tulling means and I am not quite sure I want to know :P

But I just refreshed your page with the calendar and it says noindex,nofollow :-)
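
If you would rather check this with a script than by reading the page source, something like the following might do. It is a quick sketch only: findRobotsMeta() is a name I made up, it uses simple regular expressions rather than a real HTML parser, and the commented usage assumes allow_url_fopen is enabled on the machine running it.

```php
<?php
// Collect the content of every <meta name="robots"> tag in an HTML string.
function findRobotsMeta($html)
{
    $found = array();
    // Grab each <meta ...> tag, then inspect its attributes
    preg_match_all('/<meta\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        $isRobots = preg_match('/\bname\s*=\s*["\']robots["\']/i', $tag);
        if ($isRobots && preg_match('/\bcontent\s*=\s*["\']([^"\']*)["\']/i', $tag, $m)) {
            $found[] = $m[1];
        }
    }
    return $found;
}

// Usage (fetch the calendar URL, not the front page):
// $html = file_get_contents('http://cochleaklubben.no/index.php?option=com_extcalendar&Itemid=27');
// print_r(findRobotsMeta($html));
```

More than one entry in the result means the page carries several robots tags, which is the situation you had with the first template change.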

_________________
http://de.siteof.de/


Posted: Wed Aug 24, 2005 7:32 pm 
Sorry  :'(

I was looking at the source of the front page only!!

If I go to the calendar page, I see the same as you.

I imagined it would show up on the front page  ???

Thank you again, de  ;D

I'll keep an eye on the logs for the next few days and report back on whether this stops the problem or not, OK!

Thanks for your help  ;)


Posted: Wed Aug 24, 2005 7:45 pm
Joomla! Ace

Joined: Thu Aug 18, 2005 9:06 am
Posts: 1465
Well, I guess you want your site to be indexed, which is why every page other than the calendar component itself has to stay indexable.
In your logs you will still see search engines accessing the calendar page, because they cannot know in advance that they should not index it (unlike the directives in robots.txt). I assume they will visit the not-to-be-indexed page less often but will still check back after a while to see whether it should still not be indexed. Also, the nofollow should stop the search engines from following the links they find there, so they should crawl fewer pages.

So in short: you will still see search engines access those pages, but hopefully less often. In addition, you should not find any of those pages indexed.

Using SEF you could probably use robots.txt to disallow, for example, an /events directory completely.

Also possible would be to change all links of the calendar from:
Code:
<a href="http://xyz">Link</a>

to:
Code:
<a href="http://xyz" rel="nofollow">Link</a>

But this is probably some work.
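
Instead of editing every link by hand, the calendar's output could perhaps be run through a small filter. The following is only a sketch under assumptions: addNofollow() is a made-up name, where you would hook it in depends on the component, the regular expression only handles ordinary anchor tags, and the anonymous function needs PHP 5.3 or later.

```php
<?php
// Add rel="nofollow" to every <a> tag whose attributes contain $needle,
// unless the tag already carries a rel attribute.
function addNofollow($html, $needle = 'option=com_extcalendar')
{
    return preg_replace_callback(
        '/<a\b([^>]*)>/i',
        function ($m) use ($needle) {
            $attrs = $m[1];
            // Leave the tag alone if it is not a calendar link or already has rel=
            if (strpos($attrs, $needle) === false || preg_match('/\brel\s*=/i', $attrs)) {
                return $m[0];
            }
            return '<a' . $attrs . ' rel="nofollow">';
        },
        $html
    );
}
```

Links to other components pass through unchanged, so the rest of the site stays crawlable.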

I hope you will still be happy with the results.

_________________
http://de.siteof.de/


Posted: Fri Sep 02, 2005 6:12 pm 
Sorry de, the crawler is back indexing the calendar

Code:
/index.php?option=com_extcalendar&extmode=day&date=2021-09-27
   Http Code: 200    Date: Sep 02 20:04:24    Http Version: HTTP/1.1    Size in Bytes: 50149
   Referer: http://www.cochleaklubben.no/index.php?option=com_content&task=view&id=5&Itemid=0&date=202
   Agent: Findexa Crawler (http://www.findexa.no/gulesider/article26548.ece)

/index.php?option=com_extcalendar&extmode=day&date=2021-09-28
   Http Code: 200    Date: Sep 02 20:05:24    Http Version: HTTP/1.1    Size in Bytes: 50150
   Referer: http://www.cochleaklubben.no/index.php?option=com_content&task=view&id=5&Itemid=0&date=202
   Agent: Findexa Crawler (http://www.findexa.no/gulesider/article26548.ece)

/index.php?option=com_extcalendar&extmode=day&date=2021-09-29
   Http Code: 200    Date: Sep 02 20:06:24    Http Version: HTTP/1.1    Size in Bytes: 50153
   Referer: http://www.cochleaklubben.no/index.php?option=com_content&task=view&id=5&Itemid=0&date=202
   Agent: Findexa Crawler (http://www.findexa.no/gulesider/article26548.ece)

/index.php?option=com_extcalendar&extmode=day&date=2021-09-30
   Http Code: 200    Date: Sep 02 20:07:29    Http Version: HTTP/1.1    Size in Bytes: 50154
   Referer: http://www.cochleaklubben.no/index.php?option=com_content&task=view&id=5&Itemid=0&date=202
   Agent: Findexa Crawler (http://www.findexa.no/gulesider/article26548.ece)


Posted: Fri Sep 02, 2005 7:59 pm
Joomla! Ace

Joined: Thu Aug 18, 2005 9:06 am
Posts: 1465
Maybe the "Findexa Crawler" is just not clever enough to respect the rel attribute. The easiest way may really be to use SEF; once you have a "virtual" subdirectory for the events, you could simply forbid crawling there using robots.txt.

_________________
http://de.siteof.de/


Posted: Thu Sep 15, 2005 5:05 am
Joomla! Apprentice

Joined: Mon Sep 05, 2005 8:09 pm
Posts: 13
de wrote:
The easiest way may really be to use SEF; once you have a "virtual" subdirectory for the events, you could simply forbid crawling there using robots.txt.


I really don't know much about this kind of stuff other than I know I don't want my calendar being crawled  :P ... so, what would this line look like in robots.txt?

Harry

_________________
Alex Person
Lincoln, Nebraska


Posted: Thu Sep 15, 2005 8:04 am
Joomla! Guru

Joined: Fri Aug 19, 2005 5:23 pm
Posts: 553
Location: Gogledd Cymru
There's one way to stop even the most persistent (or stupid) crawler/spider, and that is to serve different content to the search engine!

Use the logic described by macern in your template to output an alternative to mosMainBody without the links, so the robot has nothing to crawl.

Alternatively, you could use .htaccess to deny requests from a robot that attempts to access the calendar.
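
The template variant could look roughly like this. It is a sketch, not tested on a live site: isListedCrawler() is a made-up helper (not part of Mambo), the crawler name is simply the one from the logs earlier in this thread, and stripos() needs PHP 5.

```php
<?php
// Hypothetical helper, NOT part of Mambo: true when the User-Agent header
// contains one of the given crawler names (case-insensitive).
function isListedCrawler($userAgent, $names = array('Findexa Crawler'))
{
    foreach ($names as $name) {
        if (stripos($userAgent, $name) !== false) {
            return true;
        }
    }
    return false;
}

// In the template, in place of the plain mosMainBody() call (sketch):
// $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
// if ($option == 'com_extcalendar' && isListedCrawler($agent)) {
//     echo 'Calendar not available to crawlers.';
// } else {
//     mosMainBody();
// }
```

Normal visitors still get the full calendar; only requests whose User-Agent matches the list get the link-free placeholder.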

Geraint

_________________
email: opensourcematters at copynDOTplusDOTcom


Posted: Fri Sep 16, 2005 12:43 pm
Joomla! Ace

Joined: Thu Aug 18, 2005 9:06 am
Posts: 1465
HarryP103 wrote:
I really don't know much about this kind of stuff other than I know I don't want my calendar being crawled  :P ... so, what would this line look like in robots.txt?

You would just add something like:
Code:
Disallow: /events/


... if the "virtual directory" for your events is "events" (you need SEF URLs for this).
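
To see why such a line covers everything below the directory, here is how a crawler that follows the original robots.txt convention would compare a URL path against the rules: plain prefix matching. This is a simplified sketch. isDisallowed() is a made-up name, the /events/... path is an invented SEF URL, and real crawlers may add extensions such as wildcards on top of this.

```php
<?php
// Plain prefix matching, as in the original robots exclusion convention:
// a URL path is blocked when it starts with any Disallow value.
function isDisallowed($path, $disallowRules)
{
    foreach ($disallowRules as $rule) {
        // An empty Disallow value blocks nothing
        if ($rule !== '' && strpos($path, $rule) === 0) {
            return true;
        }
    }
    return false;
}

// Example calls:
// isDisallowed('/events/2022/06/21', array('/events/'));            // blocked
// isDisallowed('/index.php?option=com_content', array('/events/')); // allowed
```

A crawler that only does this kind of matching treats the * in a rule like /*? as a literal character, so that rule would not block anything for it.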

_________________
http://de.siteof.de/

