Matt Cutts #15: Data center comments
Here’s the fifteenth in the series of videos posted by Google’s Matt Cutts to Google Video over the past year. These are important for every web developer to see. Please see Matt’s first!
Transcription
Hey Everybody! Good to see you again!
This time I thought I would talk about data center updates, what to expect from Google over the next few weeks, and stuff like that.
But before I do, I didn't get to talk about fun schwag from the Search Engine Strategies conference. One of my favorites, check it out (holds up a hat), it's a white hat. Oooh! It's got "SEO" in hidden text. Don't say SEOs don't have a sense of humor.
I thought this one was kind of fun (holding a picture), a picture of Jake Baillie with a fake autograph there, and here I got a real autograph. In fact, I got several of them. Oh yes. What can I do with lots of pictures of Jake Baillie? Maybe I can sell them and do some arbitrage or something like that.
Anyway!
Also, there was at least one British SEO who evidently wants to keep me from doing anything productive for a long, long time. Check that out (holds up a stack of three voluminous books). That's three thousand five hundred plus pages of science fiction. Huh. Yes. The funny thing is, in Britain these three books are published as three books, and in the United States they take these three books and publish them as nine books. What does that say about British readers versus American readers? Yes, that's what I thought. So I'll probably donate these to the webspam team, whoever needs some hard SEO, hard sci-fi I should say.
OK! Data Center Updates.
So, there are always updates going on, you know, practically daily, if not daily. A small portion of our index, not a small portion actually, a pretty large fraction of our index, is updated every day as we crawl the web. We also have algorithm and data pushes that go out on a less frequent basis.
So, for example, there was a data push on June 27th, July 27th and then on August 17th. And again, it's an algorithm that's been running for over a year and a half. If you seem to be caught in that, you are more likely to be someone who reads SEO boards. So, you might want to think about ways that you could back your site off, think less about what the SEOs on the boards are saying, and how you can sort of not be optimizing quite as much on your site. That's about as much advice as I can give, I'm afraid!
BigDaddy was a software infrastructure upgrade, and that upgrade was finished around February. And so it was pretty much a refresh of how we crawl the web and, in part, how we index the web. That's been done for several months and things have been working quite smoothly.
There was also a complete refresh or update of our supplemental results index infrastructure. That happened a couple of months after BigDaddy, so it's been done for a month or two, and it was a complete rewrite. So that indexing infrastructure is different from our main indexing infrastructure, and you'd expect to see a few more issues whenever we roll that out. We saw, you know, more small, off-the-beaten-path stuff, like exclusion terms where you use the minus sign, the noindex meta tag, stuff like that. And because of the way the supplemental results worked with the main index, you would often see site: result estimates that were too high.
There was at least one incident where there was a spammer that some people thought had 5 billion pages, and when I looked into it, their biggest domain had under 50,000 pages in total. So people had been adding up these site: estimates and ending up with a really big number that was just way, way off.
So, one nice thing is we have another software infrastructure update, whose main aspect is improving quality, but it also improves our site: result estimates as well. It's just sort of a side benefit. I know that it's not at all data centers, in the sense that it can run in some experimental modes, but it's not fully on at every data center. And they were shooting for the end of the summer to have it live everywhere, but again, that's a hope, not a promise. So, if things need more testing, they will work longer to make sure that everything goes smoothly. And if everything goes great, then they might roll it out faster. But that is a really nice piece of infrastructure. It's just a side benefit that site: result estimates get more accurate.
It's kind of interesting, let me talk about it for a minute, because I saw at least one guy who had said, you know, "what happened with site: result estimates on Google?", and he was comparing two completely different data center IP addresses, and they were different, and he was worried about that. And yet he had exactly one page in Yahoo, and he had no pages in Ask. If you looked at his links page, there were a ton of links to pharmacy sites, not just one pharmacy site, but a lot of pharmacy sites.
And so, I would say, your time, your focus, is better spent looking at your server logs, asking how to improve the quality of your own site, and not worrying about something like site: result estimates.
So let me drill down into some reasons why that's true.
Number one. They are estimates. We don't claim that they are exact. In fact, if you look at them, they are only accurate to three significant digits. And we do that to give people an idea of how many results there are for a 'site:' query. But we don't claim that it's 100% precise.
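To make the "three significant digits" point concrete, here is a minimal sketch of that kind of rounding in Python. It is purely illustrative, not Google's actual code, and the sample counts are made up.

```python
# Minimal sketch (not Google's actual code) of a result count that is
# only meaningful to three significant digits.

def round_to_sig_figs(count: int, sig_figs: int = 3) -> int:
    """Round a result-count estimate to `sig_figs` significant digits."""
    if count == 0:
        return 0
    digits = len(str(count))                 # number of decimal digits
    factor = 10 ** max(digits - sig_figs, 0)
    return round(count / factor) * factor

# Two data centers whose true counts differ only slightly can still
# report visibly different rounded estimates (hypothetical numbers):
print(round_to_sig_figs(48_734))  # 48700
print(round_to_sig_figs(48_781))  # 48800
```

The takeaway is that small differences between estimates at different data centers can simply be rounding noise, which is one reason Matt says not to read much into them.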
And truthfully, I didn’t consider it very high priority. There was recently a change that was pushed out that made the plain old results estimates much more accurate for unigram or single word queries. And I spent about half hour with the guy who did the change. And he even asked me, “well do you think its worth working on making the results estimates for site: more accurate?”.
And this was like five or six months ago, maybe even more. At that time I said, "No, pretty much nobody pays attention to those. You know, they look at their server logs; it's not really a high priority." And it's gotten to where more people are asking about these things, and I'm sure we'll pay more attention to it.
But in general I would spend more time worrying about good content on your site, looking at your server logs to find niches where you can make new pages and make things that are more relevant.
And you know, the whole notion of watching data centers is going to get harder and harder for individuals going forward, because, number one, we have so much stuff launching in various ways. I have seen weeks where a double-digit number of things launch, and these are things that are under the hood, so strictly quality. They are not changing the UI or anything like that. And so, if you are not making a specific search in Russian or Chinese, you might not notice the difference. But it goes to show that we are always going to be rolling out different things, and at different data centers you might have slightly different data.
The other reason why it's not worth watching data centers is because there is an entire set of IP addresses, and if you are a super-duper gung-ho SEO, you'll know, you know, oh, 72.2.14.whatever. But that IP address will typically go to one data center. But that's not a guarantee. If that one data center comes out of rotation, we are going to do something else to it, we are going to actually change the hardware infrastructure. And everything I have been talking about so far is software infrastructure. So if you take that data center out of rotation for some reason, that IP address will then point to a completely different data center. So the currency, the ability to really compare changes and talk to a fellow data center watcher and say, "What do you see at 72.2.14.whatever?", is really pretty limited.
So I would definitely encourage you to spend more time worrying about, you know, the results you rank for, increasing the quality of your content, looking for high-quality people who you think should be linking to you and may not even know about you, and stuff like that.
I just wanted to give people a little bit of an update on where we are with various infrastructure, and the fact of the matter is that we are always going to be working on improving our infrastructure. So you can never guarantee a ranking or a number one spot for any given term, you know, because if we find out that we think we can improve quality by changing our algorithms or data or infrastructure or anything else, we are going to make that change.
So the best SEOs, in my experience, are the ones that can adapt. They would say, "OK, if this is the way the algorithms look right now to me, and if I want to make a good site that will do well in search engines, this is the direction I want to head in next." And if you work on those sorts of skills, then you don't have to worry about being up at 3:00 AM on a forum asking, "What does this data center look like to you? Did they change a whole lot?" and stuff like that.
So that’s the approach that I recommend.
Transcription thanks to Peter T. Davis