by Kaj Kandler
If you run a website of some mild success, then you have come across so-called “scraper” sites. A scraper site copies content from RSS feeds, and potentially from the web pages of a site, and re-publishes it as its own. Tonight I read a blog post about “benign scraper sites” by AK John.
Scraper sites hope to attract visitors who then click on advertisements and so make money for their owners. If they are combined with Search Engine Optimization, they can outrank the original. Scraper sites are certainly a violation of copyright. John thinks that even benign scrapes, those that link back to the original source, are harmful duplication of content that clogs the arteries of the Internet.
When I also read John’s recent post on Google’s ambitions with “AuthorRank and the rel=author verification”, it became clear to me that Google can/will use the author verification of content to know which site has the original content and which site has the copy, because the Google+ author profile will point back only to the original site.
So to outrun the scraper sites, I will claim authorship for my content.
Here is the question for my readers: will Google be able to detect if a scraper site sets up fake Google+ profiles and modifies the author links? Does Google have a way to detect who published first?
by Kaj Kandler
Tonight I happened to read an article that made a claim about the BestBuy.com website and its use of certain semantic web technology. I was curious how they employed the technology so I looked at one of their web pages for a random TV.
I was amused that even such a large retailer could make some simple mistakes. I found numerous places where invalid HTML was used, because reserved characters appear in regular text. Proper HTML should use substitutes called entities. The error is triggered by a TV’s screen size being measured in inches, which is often expressed with the double quote sign ("). However, the double quote is a reserved character in HTML and so needs to be replaced by &quot; wherever it is used.
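As a minimal sketch of the substitution, here is how a page generator written in Python could escape the double quote before it lands in an attribute value. The helper name `meta_tag` is hypothetical and made up for this illustration; nothing here reflects how BestBuy.com actually builds its pages. The quote matters most inside attribute values that are themselves delimited by double quotes.

```python
from html import escape


def meta_tag(name: str, content: str) -> str:
    """Build a <meta> tag whose attribute values are safe to embed in HTML.

    Hypothetical helper for illustration only.
    """
    # quote=True also replaces " with &quot; (and ' with &#x27;),
    # in addition to &, < and >.
    safe_name = escape(name, quote=True)
    safe_content = escape(content, quote=True)
    return f'<meta name="{safe_name}" content="{safe_content}" />'


description = 'DYNEX 42" Class / LED / 1080p / 60Hz / HDTV'
print(meta_tag("description", description))
# <meta name="description" content="DYNEX 42&quot; Class / LED / 1080p / 60Hz / HDTV" />
```

Escaping at the moment the value is written into the markup, rather than when the product data is stored, keeps the stored text readable and avoids escaping it twice.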
Here are a few examples from BestBuy.com
<meta name="keywords" content="DYNEX, 42" Class / LED / 1080p / 60Hz / HDTV, DX-42E250A12, 30"+ Televisions, Televisions" />
<meta name="description" content="DYNEX 42" Class / LED / 1080p / 60Hz / HDTV: 2 HDMI inputs; 1080p resolution; 160-degree horizontal and vertical viewing angles" />
<li class="property included-item">Dynex™ 42" Class / LED / 1080p / 60Hz / HDTV</li>
It’s funny that the page encodes one special character properly (the trademark symbol, as an entity), but not the other. But then in other places it messes up the trademark symbol and encodes the double quote correctly:
<meta content="Dynexâ„¢ 42&quot; Class / LED / 1080p / 60Hz / HDTV" itemprop="name"/>
As it happens this error is in the area of code I was interested in. And yes, in one place both are correct.
<title>Dynex™ - 42&quot; Class / LED / 1080p / 60Hz / HDTV - DX-42E250A12</title>
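The garbled trademark symbol in the itemprop example looks like a classic character-set mix-up. This is my guess, not something I can confirm from the page itself: the symbol was stored as UTF-8 bytes and then read back as Windows-1252. The sketch below reproduces exactly that garbling under this assumption.

```python
# Assumption for illustration: the text was written as UTF-8
# but later decoded as Windows-1252 (cp1252).
trademark = "\u2122"  # the trademark sign

utf8_bytes = trademark.encode("utf-8")  # three bytes: E2 84 A2
garbled = utf8_bytes.decode("cp1252")   # each byte becomes its own character
print(garbled)  # â„¢

# Reversing the two steps recovers the original character.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == trademark)  # True
```

Writing the symbol as a character entity would sidestep this problem, since an entity consists only of plain ASCII characters.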
If you read the source code, it is peppered with things like tracking codes and semantic web data to make it attractive for search engines and other programs that analyze code automatically. I think these encoding mistakes undermine those efforts to a certain degree.
For that reason I check all (well, most) of my pages with an HTML syntax validator. Not that I correct every mistake, because most browsers handle some mistakes just fine (including this one, except for the third example). However, every browser (and every other program reading HTML, such as a search engine crawler) differs in its ability to handle invalid code. So I try to take as few chances as possible.
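To illustrate what a program reading the markup may actually see, here is a small sketch using Python's standard `html.parser` module. It is not a validator, and it is not the tool I use for checking my pages; the class and function names are made up for this example. It only shows that a lenient parser ends the `content` value at the unescaped inch sign.

```python
from html.parser import HTMLParser


class MetaContentCollector(HTMLParser):
    """Collect the content attribute of every <meta> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag == "meta":
            self.contents.append(dict(attrs).get("content"))


def meta_contents(markup: str) -> list:
    parser = MetaContentCollector()
    parser.feed(markup)
    parser.close()
    return parser.contents


broken = '<meta name="description" content="DYNEX 42" Class / LED / 1080p / 60Hz / HDTV" />'
fixed = '<meta name="description" content="DYNEX 42&quot; Class / LED / 1080p / 60Hz / HDTV" />'

print(meta_contents(broken))  # the value stops at the unescaped quote
print(meta_contents(fixed))   # the full description, with &quot; decoded back to "
```

Other parsers may recover differently from the broken tag, which is the point: with the entity in place there is nothing to recover from.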
by Kaj Kandler
Last summer I went to the first BarCamp Boston. I had a great time there and did not want to miss BarCamp Boston 2 this past weekend.
BarCamp Boston 2 was held at MIT’s Stata Center, the famous building by architect Frank O. Gehry.
The rules for a BarCamp, an unconference of geeks, are simple. Every participant can chair a session or discussion, or give a lightning talk. The organizers set aside a few appropriate meeting rooms and a schedule on a blackboard, where one can read the program and add oneself to the offering. In addition, the organizers and sponsors provided us with food and refreshments.
Next, I attended “Open/Collaborative/Green Mapping” by Jerrad Pierce. I had met Jerrad earlier in the hall, where he presented his maps, and had talked him into presenting his experience with this project in a session. He has created a Green Map of Cambridge as part of the GreenMaps initiative. He also wrote his thesis on the subject of a better index to points on the map. Jerrad had 45+ interested listeners, and a lot of questions were asked. How did he get the data from public sources? What tools did he use? What other tools could he recommend, especially ones that were available at no cost?
Amanda Watlington presented before the afternoon break on “Video – How to Make It Found in Search Engines”. She stressed that video and audio files are becoming more important to search as people increasingly use the web to consume media. So she told webmasters that it is important to annotate media assets with internal and external keyword tags and, if possible, to write a transcript of the media and post it on a page that contains the file. In addition, she recommended submitting the media file to specialty search engines in order to make it available to the searching public.
My last session for the day was “Financing your Startup” by David Kaufman. It wasn’t all new, but it was certainly a comprehensive overview of how to finance your startup. I took away the following tidbits of wisdom: “Revenue or advanced financing by your (future) customers is the best way to survive the first phase” and “VC financing is only appropriate if you can show a very fast adoption curve and a large market.” Typically VCs want to invest X million and have that returned tenfold within 3 to 5 years. If your business model does not show a plausible case for this kind of development, do not waste your time talking to VCs. In addition, think about whom the VC could potentially sell its share in the company to. It helps to know who would be a potential buyer, especially as the default exit strategy of an Initial Public Offering (IPO) is not as available as it used to be.
Unfortunately, I was not able to attend the socializing in the evening, as I had prior commitments.