We’re hiring a Crawl Engineer at Spinn3r. This is a key hire (and will take a lot of work off my shoulders), so we plan to take our time finding the right candidate.
That said, it’s an awesome opportunity to get in early at a rapidly growing startup.
Spinn3r is a licensed weblog crawler used by search engines, weblog analytics companies, and generally anyone who needs access to high-quality weblog and social media data.
We crawl the entire blogosphere in real time, ranking and classifying blogs and removing spam. We then provide this information to our customers in a clean format for use within IR applications.
Spinn3r is rare in the startup world in that we’re actually profitable. We’ve proven our business model, which gives us a significant advantage in future product design and in expanding our current customer base and feature set.
We’ve also been smart and haven’t raised a dime of external VC funding, which gives us a lot more flexibility in terms of how we want to grow the company moving forward.
For more information please visit our website.
Responsibilities:
- Maintain our current crawler.
- Implement and monitor statistics for the current crawler to detect anomalies.
- Implement new features for customers.
- Work on backend architecture to improve performance and stability.
- Implement custom protocol extensions for enhanced metadata and site-specific social media support.
- Work on new products and features using large datasets.
Requirements and Experience:
- Java (though Python, C, C++, etc would work fine).
- HTML, XML, RSS or Atom.
- Distributed systems.
- Databases (MySQL, etc).
- Algorithm design (especially in distributed systems).
- Ability to work (and appreciation for working) in a startup environment.
- Must like cats.
- Past experience running and working with large crawlers.
- Understanding of IR algorithms (k-means, naive Bayes, inverted index compression, etc).
- Experience within the Open Source community.
For bonus points, feel free to answer the following questions in your email:
- You have a live corpus of text and HTML from 25M weblogs. You want to cluster these weblogs into logical communities (tech, politics, entertainment, etc). How (and with what algorithm) would you cluster and rank the content within a reasonable time? What computational resources would this require (memory, CPU, network bandwidth, etc.)?
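(To make the first question concrete: one common baseline, not necessarily the answer we’re looking for, is k-means over per-blog feature vectors such as tf-idf weights. A toy, pure-Python sketch with naive first-k initialization and a hypothetical 2-d "topic" vector per blog; a real 25M-document run would use k-means++ initialization and shard the assignment step across machines.)

```python
import math

def kmeans(vectors, k, iters=20):
    """Tiny k-means on dense feature vectors (illustrative only)."""
    centroids = list(vectors[:k])  # naive init; use k-means++ in practice
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            clusters[i].append(v)
        # Recompute each centroid as the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, clusters

# Toy corpus: 2-d "topic" vectors with two obvious communities.
docs = [(0.9, 0.1), (1.0, 0.0), (0.8, 0.2), (0.1, 0.9), (0.0, 1.0), (0.2, 0.8)]
centroids, clusters = kmeans(docs, k=2)
```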
- You are building a ranking algorithm. This algorithm will execute across a large link graph. How would you store the graph to use the smallest (yet reasonable) amount of computational resources (memory and CPU)?
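(Again to make the question concrete: one standard approach, offered as a sketch rather than our answer key, is to intern URLs to dense integer ids and store adjacency in a compressed-sparse-row layout: one offsets array plus one flat array of fixed-width edge targets. Production systems go further with delta-encoding and variable-length integers over sorted edge lists.)

```python
from array import array

class LinkGraph:
    """CSR-style adjacency storage: an offsets array and a flat array of
    unsigned-int target ids, instead of per-node object lists.
    Illustrative sketch only."""

    def __init__(self, edges):
        self.ids = {}    # url -> dense int id
        self.urls = []   # dense int id -> url
        adj = {}
        for src, dst in edges:
            adj.setdefault(self._intern(src), []).append(self._intern(dst))
        n = len(self.urls)
        self.offsets = array('L', [0] * (n + 1))
        self.targets = array('L')
        for i in range(n):
            out = adj.get(i, [])
            self.offsets[i + 1] = self.offsets[i] + len(out)
            self.targets.extend(out)

    def _intern(self, url):
        if url not in self.ids:
            self.ids[url] = len(self.urls)
            self.urls.append(url)
        return self.ids[url]

    def out_links(self, url):
        i = self.ids[url]
        return [self.urls[t]
                for t in self.targets[self.offsets[i]:self.offsets[i + 1]]]

g = LinkGraph([("a.com", "b.com"), ("a.com", "c.com"), ("b.com", "c.com")])
```

A ranking pass (e.g. a PageRank-style iteration) then scans `targets` sequentially, which is cache-friendly and keeps per-edge cost to one fixed-width integer.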