Q: We have some private materials on our site, but we’re not able to use robots.txt or a robots meta tag to disallow the pages from indexing. As long as we don’t link to the pages, and only send them out as links in emails, will Google have any way of indexing these pages?
A: As we’ve said many times, the only way to keep materials truly private online is to password-protect them. Even if you don’t link to your pages, here are a few ways that your private URLs might find their way into Google’s index:
- People have reported seeing links from within Gmail messages spidered by Google, although this isn’t something that we have experimented with first-hand. We do know that Google spiders the content of Gmail emails and uses “content extraction” to match advertisements, but there is no documentation of other uses of the spidered content.
- Private URLs can be seen by Google in other odd ways. For example, if somebody clicks from your private page to another website, then your private URL would show up as a referrer in the server logs. Some server logs are public, or find their way into the public realm, and that could expose the private URL.
- If someone visits the private URL with the Google Toolbar activated, then the URL could get collected and find its way into Google’s index.
- If the link is included in a seemingly private listserv email, then that email could be scraped and republished on the web.
- One of your authorized visitors could post a link within a forum post, on Facebook or Twitter, or somewhere else that he/she believes is private or semi-private, and that link could eventually be followed and indexed.
As you can see, there are a lot of possible sources for leaks!
The best safeguard would be to password protect the individual pages, and your second-best approach is to deindex using the robots meta tag. Without password protection, robots.txt or meta robots to prevent indexing, your next best line of defense is to watch for indexing and then do one of two things:
- remove any files that have been indexed and move them to a different URL; or
- place all of the private content in the same folder on your server, and then use Google Webmaster Tools and Bing Webmaster Tools to remove the files or folder from the indexes if/when it gets indexed.
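Watching for indexing can start with your own server logs, since a crawler has to request a page before it can index it. Here is a minimal sketch of that idea, not a definitive tool. It assumes your server writes the common Apache/Nginx "combined" log format, and the folder name `/private/`, the function name `crawler_hits` and the list of crawler tokens are all hypothetical placeholders. Keep in mind that user-agent strings can be faked, so a hit is a warning sign rather than proof that a page has been indexed.

```python
import re

# Hypothetical values for illustration: adjust to your own site.
PRIVATE_PREFIX = "/private/"
CRAWLER_TOKENS = ("Googlebot", "bingbot")

# Assumes the "combined" access log format:
# ip - user [time] "METHOD path protocol" status size "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def crawler_hits(log_lines, prefix=PRIVATE_PREFIX, tokens=CRAWLER_TOKENS):
    """Return (path, user_agent) pairs for crawler requests under prefix."""
    hits = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        path = match.group("path")
        agent = match.group("agent")
        is_crawler = any(token.lower() in agent.lower() for token in tokens)
        if path.startswith(prefix) and is_crawler:
            hits.append((path, agent))
    return hits
```

You could run this over each day's access log (for example, `crawler_hits(open("access.log"))`, where the file name is again just an assumption) and treat any result as a cue to move the file or request its removal.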
These steps should minimize your exposure in Google, but if your materials are truly confidential, you need password protection. And without a doubt, ixnay on the social security umbers-nay!