Well my theory was right. Facebook blocked API access to Apple for an API that is normally open.

This is what I said just a few hours ago:

If you want to use them at SCALE or have a serious application (like Apple) these companies want you to execute custom terms of service to prevent you from getting kicked off their platform at some point in the future.

And Kara Swisher notes:

According to sources familiar with Facebook’s platform, the social networking giant essentially denied Apple’s Ping access to application programming interfaces that would allow it to search for an iTunes user’s friends on Facebook who also had signed up for Ping.

Of course, if Apple were to DoS Facebook’s resources, Facebook might have needed to block them, as Kara notes:

That is, unless some entity wants to access it a lot. In that case, Facebook requires an agreement for reasons primarily centered on protection of Facebook user data and, of course, infrastructure impact.

I think it would be fair for Facebook to charge for access, but the terms should be open and clear cut from day zero. Facebook has the right to do what they want with their platform of course, but as developers we have the right not to code to it if the API terms are too aggressive (and this looks like the case for Apple).

Apple via Steve Jobs is saying that the terms of integration with Facebook were ‘onerous’ so it was dropped at the last moment.

We’re hearing reports of people who had access to a Facebook Connect feature in Ping earlier, which didn’t work, and has since been removed. So it looks like Apple really did pull Facebook support for Ping very late in the game. It’s still even mentioned on the Ping promo page. (And, via Peter Kafka, here’s a link to the Ping app on Facebook, which doesn’t currently seem to work.)

This is exactly why the social web needs to be fully open. Right now you don’t need to execute a terms of service to index the web, to publish HTML, to link to websites. But you DO need to agree to a terms of service to use Twitter, Facebook, etc.

If you want to use them at SCALE or have a serious application (like Apple) these companies want you to execute custom terms of service to prevent you from getting kicked off their platform at some point in the future.

You’re basically a sharecropper. You are making money while working on someone else’s property. If they want to charge you EXCESSIVE rent you’re either forced to pay it or you have to go out of business.

Not fun.

This is why protocols like Salmon, PuSH, OAuth, and Status.net are so important.

Hopefully in a few years we look back at this situation and laugh at how silly it was – WAY back in 2010 when the social web was proprietary.

Facebook crawls your like button enabled web pages which can include FBML.

Which is super cool of course.

It’s cool that Facebook is crawling the web.

When does Facebook scrape my Page? “Even if you specify a longer time, Facebook will scrape your page every 24 hours.”

The user agent of the scraper is: “facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)”

However, you can’t crawl Facebook. They’re blocking everyone by default (via robots.txt) still.
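Here is a rough sketch of how a well-behaved crawler reads a block-by-default policy, using Python's standard urllib.robotparser. The robots.txt content and the crawler names are made up for illustration; this is not Facebook's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one whitelisted crawler, everyone else blocked.
ROBOTS_TXT = """\
User-agent: bigsearchbot
Disallow: /private/

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if the hypothetical policy lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this policy, allowed("bigsearchbot", ...) succeeds for most paths, while any crawler that is not named explicitly falls through to the catch-all rule and is refused everywhere.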

Unless you’re one of the big boys, you can’t play with Facebook unless you’ve been explicitly whitelisted.

It’s not really the end of the world until you realize that the ENTIRE web would melt if everyone required explicit permission from millions of websites just to crawl the web.

It’s an ask-for-permission model rather than a beg-for-forgiveness model, and that doesn’t scale.

How do sorting algorithms sound?

Spamming Twitter for Fun and Profit.

This guy spams Twitter with book recommendations:

Two winters ago I left a position as a system administrator that was paying pretty well and moved cross-country to a region with less jobs than where I moved from. Three months later, I was still unemployed, broke, and bored. I was talking to my good friend Japhy on IRC one day and he was explaining to me how the tf-idf algorithm works. For reasons involving boredom more than any other reason, I dreamed up an idea: I would write software that would take a given document and generate book suggestions based on its content.

I think that most programmers would agree with me that we put in longer hours on code when we’re not working for anybody. We don’t stop learning, either. To us, unemployment is a brief sprint of academia spent in our home office, the local coffee shop, or our parent’s house.

My imagination dreamed up this fairly straightforward process:

Take a given document and calculate tf-idf scores on all terms
Select X number of the highest scoring terms
Pass these high-scoring terms to an Amazon ItemSearch query
Receive a list of recommended books (with URLs) from Amazon
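The first two steps of that process can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the tokenizer, the smoothed idf formula, and the function names are not from the original author's code, and the Amazon ItemSearch steps are left out entirely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(document, corpus, k=3):
    """Return the k terms of `document` with the highest tf-idf scores.

    `corpus` is a list of background documents (strings) used for idf.
    """
    doc_tokens = tokenize(document)
    tf = Counter(doc_tokens)
    corpus_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_sets) + 1  # background docs plus the document itself
    scores = {}
    for term, count in tf.items():
        # Document frequency counts the document itself, so df >= 1.
        df = 1 + sum(1 for s in corpus_sets if term in s)
        idf = math.log(n_docs / df)
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that appear in every document (like "the") get an idf of zero and drop to the bottom, which is what makes the top of the list usable as search keywords.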

Spam on Twitter is becoming more of a problem because @replies give spammers a very low barrier to entry. There are numerous measures that help block it, but it’s still a growing problem.

Which Twitter users are influential?

The algorithm described in the research is unique in that it incorporates what the authors call “passivity.”
 
The study found that a large majority of Twitter users act as passive information consumers and rarely forward (“retweet”) content to the network.  To become influential, users must not only catch the attention of their followers; they must also overcome their followers’ predisposition to remain passive.

There are other ways to compute this data. Retweets and @ replies can also yield results for finding influential users.
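As a crude illustration of the retweet and @reply approach (not the passivity algorithm from the research), one could simply count how often each user is mentioned by someone else. The (author, text) input shape and the function name are assumptions made for this sketch.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def rank_by_attention(tweets):
    """Rank users by how many tweets from other people mention them.

    `tweets` is an iterable of (author, text) pairs. Old-style retweets
    ("RT @user: ...") count as mentions of the retweeted user.
    """
    counts = Counter()
    for author, text in tweets:
        # Count each mentioned user at most once per tweet.
        for user in set(MENTION.findall(text)):
            if user.lower() != author.lower():  # ignore self-mentions
                counts[user.lower()] += 1
    return counts.most_common()
```

Unlike a raw follower count, this only rewards users whose followers actually react to them.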

Hypertable At Facebook

There’s going to be an interesting Hypertable meetup at Facebook

We (the Hypertable Development team) will be presenting the recently developed Hypertable Hive extension. In this presentation, we will give an overview on how to use it and will provide details on the design.

We will also be discussing the results of the recent Hypertable vs. HBase Performance Evaluation. Hypertable’s relative performance numbers were very impressive. We will describe the test in detail and hold a Q&A session after the presentation.

LinkedIn and Hadoop

Interesting post about LinkedIn’s Data Infrastructure

Much of LinkedIn’s important data is offline – it moves fairly slowly. So they use daily batch processing with Hadoop as an important part of their calculations. For example, they pre-compute data for their “People You May Know” product this way, scoring 120 billion relationships per day in a mapreduce pipeline of 82 Hadoop jobs that requires 16 TB of intermediate data. This job uses a statistical model to predict the probability of two people knowing each other. Interestingly they use bloom filters to speed up large joins, yielding a 10x performance improvement.

They have two engineers who work on this pipeline, and are able to test five new algorithms per week. To achieve this rate of change, they rely on A/B testing to compare new approach to old approaches, using a “fly by instruments” approach to optimize results. To achieve performance improvements, they also need to operate on large scale data – they rely on large scale cluster processing. To achieve that they moved from custom graph processing code to Hadoop mapreduce code – this required some thoughtful design since many graph algorithms don’t translate into mapreduce in a straightforward manner.

The workflow management systems are starting to become interesting as jobs with this many steps and intermediate data can become confusing quickly.
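The bloom filter trick mentioned in the quote is easy to sketch. The version below is a toy, in-memory illustration of the general idea and not LinkedIn's code: build a filter over the keys of the small side of the join, then use it to discard rows from the large side that cannot possibly match. The class, sizes, and hash scheme are all my own assumptions.

```python
import hashlib


class BloomFilter:
    """A small bloom filter: no false negatives, some false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def filtered_join(small, large):
    """Join two lists of (key, value) pairs on key.

    Rows of `large` whose key is definitely not in `small` are skipped
    before the real lookup.
    """
    bloom = BloomFilter()
    index = {}
    for key, value in small:
        bloom.add(key)
        index.setdefault(key, []).append(value)
    out = []
    for key, value in large:
        if key not in bloom:  # definitely no match: skip cheaply
            continue
        for left in index.get(key, []):  # a bloom hit may be a false positive
            out.append((key, left, value))
    return out
```

In a single process the filter buys nothing over the dictionary lookup. The payoff comes in a distributed join, where the compact filter can be shipped to every mapper so that non-matching rows are dropped before they are shuffled across the network.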

Bootstrapping Github

This is my kind of startup:

We offer public and private source code hosting to companies and open source projects using either git or Subversion. What we want to do is lower the barrier of contributing to projects, both public and private. Submitting a patch to an open source project should be about the code, not the process of submitting the patch. Working with your coworkers, either in the same office or across the world, should be about moving your project forward and not about managing clumsy tools.

We also offer git training, provide git learning materials, and sponsor open source projects.

It’s so cool when people can control their own destinies.

On URL Redirection Services

DeWitt writes a great post about the threat of URL redirection services like tinyurl and bit.ly.

I am personally and professionally concerned about their emergence at scale and the negative impacts they could have on our ecosystem and it’s probably time I spoke up.

More specifically, I’d like to touch on and expand on some things that DeWitt mentions.

Potential censorship

Having a central point for URL redirection of such a vast number of URLs means that potential censorship issues can arise.

What happens if Facebook wants to build a search engine and bulk URL resolve against a 3rd party URL redirection service?

What happens if the remote system can’t keep up with the performance requirements for bulk URL resolution?

We’ve already seen this with bit.ly. For a feature we were evaluating at Spinn3r we weren’t able to bulk resolve bit.ly URLs as fast as they were being created.

Scale and Single Point of Failure

This puts a large number of the URLs in the hands of one player and makes their scale requirements higher. It’s even more frustrating that this didn’t need to happen in the first place since the web worked pretty well without URL redirection services to begin with.

140 characters not required for uplevel clients

What I think is most frustrating is that the use of tiny URLs isn’t required for most clients.

Only downlevel SMS clients, which can’t render HTML and have a firm 140 character limit, need to use the tiny URL.

Uplevel/advanced clients can just render HTML with either the tiny URL as the href and the long URL as the text of the link OR they can use the long URL in both the text and the href, add an onClick handler, and record the click in JavaScript.

I would strongly recommend that Twitter adopt this second model. This would still give them most of the benefit from their URL tracking and tiny URL service but at the same time preserve the machine readability of the web and avoid all the pitfalls of these services.
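A minimal sketch of what a server could emit for the second model, written in Python to match the other examples here. The recordClick JavaScript function is hypothetical; a real client would supply its own tracking call.

```python
from html import escape


def render_link(long_url, click_handler="recordClick"):
    """Render an anchor whose href and text are both the long URL.

    Click tracking is delegated to a JavaScript function on the page,
    named by `click_handler` (hypothetical, not a real Twitter API).
    """
    safe = escape(long_url, quote=True)
    return f'<a href="{safe}" onclick="{click_handler}(this.href)">{safe}</a>'
```

The long URL stays visible to crawlers and to anyone reading the markup, while the click is still recorded when a person follows the link.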

We’re open sourcing some command line tools for accessing some modern system calls on Linux.

The project is named linux-ftools and is over on Google Code.

You’ll have to check out the hg repo locally and build it yourself for now.

The tools are fallocate, fadvise, and fincore (plus a few other smaller tools).

They need a bit more cleanup, specifically error handling (and maybe a few more features).

I’d like to try to get these into Debian.

These were designed for use with MySQL but you might be able to use them in your setup as well.

I’ve been meaning to post about this for a while now but I finally have a good tool to help visualize this problem (seekwatcher).

MyISAM continues to append to the .MYD file as you write to it. Which seems pretty easy to manage from a performance standpoint because if you’re writing 1 file on one disk it will be 100% contiguous.

But what happens if you’re writing 100 files? Or 1000? Each file becomes fragmented on disk (even on a fresh disk) because each new write is stacked right after the previous write, which usually belongs to a different file.

What needs to happen is that MyISAM needs to fallocate 5-10MB at a time. This way for at least the next 5MB you have a large chunk of contiguous disk to use.
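To make the idea concrete, here is a rough Python sketch of an appender that reserves space in chunks before writing. It is not MyISAM code. The class name and the 8MB chunk size are my own assumptions, and it uses os.posix_fallocate (Linux/Unix only), which grows the visible file size, so the sketch tracks the logical end of data itself and trims the unused tail on close.

```python
import os

CHUNK = 8 * 1024 * 1024  # assumed preallocation step, within the 5-10MB range


class PreallocatedAppender:
    """Append to a file while reserving disk space one chunk at a time."""

    def __init__(self, path, chunk=CHUNK):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        self.chunk = chunk
        self.length = os.fstat(self.fd).st_size  # logical end of the data
        self.reserved = self.length  # end of the space reserved so far

    def append(self, data):
        end = self.length + len(data)
        # Reserve whole chunks until the new data fits.
        while end > self.reserved:
            os.posix_fallocate(self.fd, self.reserved, self.chunk)
            self.reserved += self.chunk
        written = 0
        while written < len(data):  # pwrite may write fewer bytes than asked
            written += os.pwrite(self.fd, data[written:], self.length + written)
        self.length = end

    def close(self):
        os.ftruncate(self.fd, self.length)  # drop the unused reserved tail
        os.close(self.fd)
```

Whether the reserved chunk is actually contiguous is up to the filesystem's allocator, but asking for it in one call gives the allocator a far better chance than hundreds of interleaved small appends.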

This isn’t just theoretical. Check out the following video. This is on an 11-disk RAID server with 3 tables being written, all MyISAM, two files per table (MYD and MYI) and also the binary log.

This just takes the MYD file and runs dd redirecting to /dev/zero.

I then read the first 4GB from the .MYD, wrote it to a NEW file, and ran cat redirecting THAT to /dev/zero.

As you can see the second video is nearly 100% sequential (and beautiful to watch).

I have to admit that FlashCache for Linux looks pretty cool.

It basically lets you use a block device SSD as cache.

Another hack is to mount the SSD as swap and tell InnoDB to use say 100GB of memory. I haven’t tested this but it might be a fun hack :)

We’ve actually migrated away from using SSD in production – at least for now.

The performance just wasn’t THAT great in our configuration. I’m still semi optimistic but not as much as I was a year ago.

I think if I were to do it again I would drop RAID everywhere with SSD and make sure my database layer can correctly route queries to the right MySQL instance.

Each SSD would need to be on its own database server with its own replication thread.

We’re also ditching the use of caches and instead requiring that all of an application’s data reside in memory. It’s way cheaper on a per-IOP basis this way and just a lot easier to admin.

The High Follower Fallacy

This really isn’t news but it’s nice to see more people talk about this problem:

Cha called her paper, “The Million Follower Fallacy,” a term that comes from work by Adi Avnit. Avnit posited that the number of followers of a Tweeter is largely meaningless, and Cha, after looking at data from all 52 million Twitter accounts (and, more closely, at the 6 million “active users”) seems to have proven Avnit right. “Popular users who have a high indegree [number of followers] are not necessarily influential in terms of spawning retweets or mentions,” she writes.

You can see this in the Spinn3r Social Media Rank that we released a while back.

Seekwatcher is a super cool utility for Linux that allows you to visualize the IO patterns on a running Linux box.

Check out the movies that are on the page.

This is all based on blktrace, which uses debugfs, which is available on modern kernels.

I still need a vfsstat program, along the lines of iostat and blktrace, that shows the file, offset, and length of each read. From this I can then compute VFS page cache efficiency.

All the cool kids are talking about the Real Time Web. There’s a lot of innovation here but one of the main problems is that all the data is being held by two main players – Twitter and Facebook.

It’s not really an open network. If you need to launch a startup on the Real Time Web you have to get the permission of Facebook or Twitter.

What if Google had to get the permission of Microsoft to index the web and launch their search engine?

What if they were still dependent on this data from Microsoft?

I’m sure MS would be throwing them to the wolves and demanding a LARGE revenue share of the search giant.

A lot of people don’t feel that Facebook’s new Open Graph is really open:

Following Facebook’s big Open Graph announcements at f8 a couple days ago, many of the leaders of the so-called “open web” are taking exception to Facebook’s use of the term “open” for its grandiose plans. While the Open Graph may be a lot of things, it is not open, is the feeling many of them have, as Erick laid out earlier.

… I’m looking forward to more and more people noticing that the open spirit of the web that we’ve grown to depend on is slowly being taken from us.

I don’t blame Twitter and Facebook for making these decisions. If I were them I would be tempted to make the same decisions.

However, it’s in the best interest of consumers to insist on the Real Time Web being open. The more Real Time search engines, aggregation companies, ranking companies, Twitter clients, etc., the better for everyone concerned.

An open “like” standard

The social networks are sucking more and more of the attention data on the web.

Perhaps we need an open ‘like’ standard?

Mark Zuckerberg in his keynote at Facebook’s f8 conference this week did his best to convince attendees that the launch of “social plugins” powering a billion or more “Like” buttons across the web was the best thing that could ever happen to the Internet. Not everyone was sold on the idea, however. To some, it sounded like a company that wanted to get its proprietary hooks into every corner of the web and suck users’ activity data back to the mothership. That’s fine for Facebook, of course, but not so fine for anyone else who’s interested in that information, and doesn’t want to have to go to Zuckerberg on bended knee and ask for it.

I spent the last couple of days playing with InnoDB page compression on the latest Percona build.

I’m pretty happy so far with Percona and the latest InnoDB changes.

Compression wasn’t living up to my expectations though.

I think the biggest problem is that the compression can only use one core in replication and ALTER TABLE statements.

We have an 80GB database that was running on 96GB boxes filled with RAM.

I wanted to try to run this on much smaller instances (32GB-48GB boxes) by compressing the database.

Unfortunately, after 24 hours of running an ALTER TABLE which would only use one core per table, the SQL replication thread went to 100% and started falling behind fast.

I think what might be happening is that the InnoDB page buffer is full because it can’t write to the disk fast enough which causes the insert thread to force compression of the pages in the foreground.

Having InnoDB only use one core / thread to compress pages seems like a very bad idea (especially on 8-16 core boxes, I’m testing on an 8 core box now but we have 16 core boxes in production).

The InnoDB page compression documentation doesn’t seem to yield any hints about when InnoDB pages are compressed and in which thread. Nor do there seem to be any configuration variables that we can change in this regard.

Perhaps a ‘compressed buffer pool only’ option could be interesting.

This way InnoDB does not have to maintain an LRU for compressed/decompressed pages. Further, it can read pages off disk, decompress them, and then leave the pages decompressed in a small buffer. Then a worker thread (executing on another core) can compress the pages and move them back into the buffer pool where they can be stored and placed back on disk.

This process could still become disk bottlenecked but at least it would use multiple cores.

It seems to be impossible to page through a full table scan from the client side in MySQL.

For example, say you want to take a 1GB file and page through it 10MB at a time.

With a flat file I could just read 10MB off disk, read the next 10MB, etc.

This would be amazingly efficient as you could write the data sequentially, without updating indexes, and then read the data back out by byte offset.

You could page through MySQL tables by adding an index to a table:

SELECT * FROM FOO WHERE PTR > 10000 ORDER BY PTR LIMIT 10000;

for example … but I REALLY want to avoid an index step because it is not cheap and only required since MySQL doesn’t support cursors.

This index slows down the import stage and I have to buffer the index data in memory which is just a waste.
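For reference, this is roughly what the index-based paging loop looks like from the client. The sketch uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere; the BODY column is hypothetical and PTR is assumed to be a non-negative integer key.

```python
import sqlite3


def page_by_key(conn, page_size):
    """Yield pages of rows from FOO in PTR order.

    Each query seeks past the last key seen, so no OFFSET is needed.
    """
    last = -1  # assumes PTR values are non-negative integers
    while True:
        rows = conn.execute(
            "SELECT PTR, BODY FROM FOO WHERE PTR > ? ORDER BY PTR LIMIT ?",
            (last, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last = rows[-1][0]  # resume after the last key on this page
```

Each page costs one index seek plus a short range scan, which is why it beats OFFSET, but it only works because the index on PTR exists in the first place.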

I could use LIMIT with OFFSET but this isn’t going to be amazingly efficient because it will either require us to use a temporary on disk table or force us to read each row off disk.

Technically, one could implement offset efficiently with a fixed-width table since you can compute the byte offset as row_width*N but I don’t think MySQL implements this optimization.

Further, my tables are variable width.
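For the fixed-width case the offset arithmetic is trivial, as this flat-file sketch shows. The row layout (an 8-byte id plus a 16-byte name) is made up for illustration and has nothing to do with MyISAM's actual on-disk format.

```python
import struct

# Hypothetical fixed-width row: 8-byte little-endian id + 16-byte padded name.
ROW = struct.Struct("<q16s")


def read_row(f, n):
    """Read row number n (0-based) by seeking straight to row_width * n."""
    f.seek(ROW.size * n)
    row_id, name = ROW.unpack(f.read(ROW.size))
    return row_id, name.rstrip(b"\0")
```

With variable-width rows there is no such formula, so finding row N means either scanning from the start or keeping an index of offsets.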

If MySQL supported cursors then I could just tell MySQL to perform a table scan and read the data from a cursor.

I’m implementing something similar to a Map Reduce job in MySQL and Java which would work VERY well on Hadoop but is nearly impossible to implement with MySQL efficiently since I can’t page through data without over-indexing my tables.

Thoughts? I would love it if there’s some stupid solution that I’m just missing.

I suspect this will become more and more of a problem as more developers want to just flat out Map Reduce their data.

Update:

I was wrong. The solution is HANDLER. Ryan Thiessen nailed it in the comments.
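For anyone who hasn't used it, MySQL's HANDLER statement opens a table and lets the client read rows in batches without going through an index. A rough sketch of a paging loop built on it follows. The function name is mine, `cursor` is assumed to be a DB-API cursor from whichever MySQL driver you use, and `table` must be a trusted identifier because it is interpolated directly into the SQL.

```python
def handler_scan(cursor, table, page_size):
    """Yield pages of rows from `table` using MySQL's HANDLER statement.

    READ FIRST starts a scan in natural row order, and READ NEXT
    continues it, so no index on the table is required.
    """
    limit = int(page_size)
    cursor.execute(f"HANDLER {table} OPEN")
    try:
        cursor.execute(f"HANDLER {table} READ FIRST LIMIT {limit}")
        rows = cursor.fetchall()
        while rows:
            yield rows
            cursor.execute(f"HANDLER {table} READ NEXT LIMIT {limit}")
            rows = cursor.fetchall()
    finally:
        cursor.execute(f"HANDLER {table} CLOSE")  # always release the handler
```

As I understand the MySQL documentation, HANDLER reads don't promise a consistent snapshot of the table, so this fits best when the table isn't being modified during the scan.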

Spinn3r is hiring for an experienced Senior Operations and Infrastructure Engineer with solid Linux and MySQL skills and a passion for building scalable and high performance infrastructure.

This role is about 80% engineering in future infrastructure work (so that we can scale into the future) and 20% routine operational tasks.

It’s a great opportunity for the right candidate. The fact that most of this will be engineering and building new systems should make it a lot more challenging than a traditional operations role.

My goal is to spend the right amount of time doing long term infrastructure work so that fires and machine failures are fully routine and require no human interaction.

Right now they require little human interaction but it’s still required in some of the more serious circumstances.
