Unicode over 60 percent of the web
Sunday, 05 February 2012 19:37
Computers store every piece of text using a “character encoding,” which gives a number to each character. For example, the byte 61 (in hexadecimal) stands for ‘a’ and 62 stands for ‘b’ in the ASCII encoding, which was launched in 1963. Before the web, computer systems were siloed, and there were hundreds of different encodings. Depending on the encoding, the byte C1 could mean any of ¡, Ё, Ą, Ħ, ‘, ”, or parts of thousands of characters, from æ to 品. If you brought a file from one computer to another, it could come out as gobbledygook.
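To make the ambiguity concrete, here is a minimal Python sketch that decodes the single byte C1 under a handful of legacy encodings. The codec names are Python's own, and the selection is an assumption made for illustration; it is not the list of encodings behind the characters quoted above.

```python
# One byte, several meanings: decode the same byte under different
# legacy encodings. The codec list is an illustrative assumption.
raw = bytes([0xC1])

legacy_encodings = ["latin-1", "iso8859_5", "iso8859_7", "mac_roman", "cp437"]

decoded = {name: raw.decode(name) for name in legacy_encodings}
for name, char in decoded.items():
    print(f"{name:10} 0xC1 -> {char!r} (U+{ord(char):04X})")

# In ASCII, the bytes 0x61 and 0x62 are 'a' and 'b'.
ascii_text = bytes([0x61, 0x62]).decode("ascii")

# On its own, 0xC1 is not valid UTF-8, so a strict decode fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Each codec turns the same byte into a different character, which is why a file written under one encoding and read under another comes out garbled.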
Unicode was invented to solve that problem: to encode all human languages, from Chinese (中文) to Russian (русский) to Arabic (العربية), and even emoji symbols.
Every January, we look at the percentage of the webpages in our index that are in different encodings. Here’s what our data looks like with the latest figures*:
*Your mileage may vary: these figures may differ somewhat from what other search engines find. The graph lumps together encodings by script. We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example. Thanks again to Erik van der Poel for collecting the data.
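As a rough illustration of what sorting pages into encoding buckets can look like, the sketch below classifies a byte string as ASCII, UTF-8, or something else. It is a simplified assumption, not the detection method used for the figures above; the function name `classify_encoding` and the sample data are hypothetical.

```python
def classify_encoding(page_bytes: bytes) -> str:
    """Rough three-way bucket: 'ascii', 'utf-8', or 'other'."""
    if page_bytes.isascii():
        # Only bytes 0x00-0x7F: valid in ASCII and in most other encodings.
        return "ascii"
    try:
        page_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Some legacy encoding; real detectors use statistics to guess which.
        return "other"
    return "utf-8"


# Hypothetical sample "pages" for illustration.
samples = {
    "plain": "hello".encode("ascii"),
    "chinese": "中文".encode("utf-8"),
    "russian_legacy": "русский".encode("koi8_r"),
}
labels = {name: classify_encoding(data) for name, data in samples.items()}
print(labels)
```

Counting ASCII-only pages separately mirrors the footnote: such pages contain nothing that distinguishes one ASCII-compatible encoding from another.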
As you can see, Unicode has experienced an 800 percent increase in “market share” since 2006. Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8). The more documents that are in Unicode, the less likely you are to see mangled characters (what the Japanese call mojibake) when you’re surfing the web.
We’ve long used Unicode as the internal format for all the text Google searches and processes: any other encoding is first converted to Unicode. Unicode Version 6.1 was just released with over 110,000 characters; soon we’ll be updating to that version and to Unicode’s locale data from CLDR 21 (both via ICU). The continued rise in the use of Unicode makes it even easier to do the processing for the many languages that we cover. Without it, our unified index would be nearly impossible. It would be a bit like not being able to convert between the hundreds of currencies in the world; commerce would be, well, difficult. Thanks to Unicode, Google is able to help people find information in almost any language.
Posted by Mark Davis, International Software Architect