Former Sub-projects of Existing Top-Level Projects
- Apache Mahout provides scalable implementations of machine learning algorithms on top of Apache Hadoop and other technologies. It offers collaborative filtering, clustering, classification, feature reduction, data mining algorithms, and more. Begun as a sub-project of Lucene in 2008, Mahout's team of nearly a dozen contributors is now actively working towards release 0.4.
- Apache Tika is an embeddable, lightweight toolkit for content detection and analysis. Powered by MIME standards from IANA, advanced language detection features, and the ability to rapidly unify existing parser libraries, Tika provides a one-stop shop for navigating the modern information landscape. Tika entered the Incubator in 2007 and graduated to a Lucene sub-project in 2008. Tika is used in a broad range of Lucene products, from Solr to Nutch and Mahout, and is in deployment at NASA, Day Software, the Internet Archive, and a number of Web startups including Bixo Labs.
- Apache Avro is a fast data serialization system that includes rich and dynamic schemas in all its processing. A sub-project of Apache Hadoop, Avro features rich data structures; a compact, fast, binary data format; a container file to store persistent data; remote procedure call (RPC); and simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols; it is an optional optimization, only worth implementing for statically typed languages.
- Apache HBase is a distributed database modeled after Google's Bigtable. The project started at Powerset and became a sub-project of Apache Hadoop in 2007. Apache HBase adds random read/write access to the Hadoop stack, extending offline processing capabilities and enabling realtime serving of very large datasets. The project's goal is the hosting of big tables -- billions of rows X millions of columns -- running atop commodity hardware. HBase has been successfully deployed at Adobe, Flurry, Meetup, Mozilla, StumbleUpon, Trend Micro, and Twitter, among others, to perform analytics and as a datastore for live Websites.
Additional New Top-Level Projects Created in 2010
- Apache UIMA (Unstructured Information Management Architecture) is a framework for analyzing unstructured information, such as natural language text. It supports the writing, deployment, and reuse of analysis components in a wide variety of settings. Created at IBM and submitted to the Apache Incubator in 2006, UIMA has been adopted as the de facto enabling platform by a significant part of the natural language processing community. Apache UIMA graduated from the Apache Incubator in March 2010.
- Apache Cassandra is an advanced, second-generation “NoSQL” distributed data store that has a shared-nothing architecture. Cassandra's decentralized model provides massive scalability and high availability, with no single point of failure even in worst-case scenarios. Originally developed at Facebook and submitted to the ASF Incubator in 2009, the Project has added more than a half-dozen new committers, and is deployed by dozens of high-profile users such as Cisco WebEx, Cloudkick, Digg, Facebook, Rackspace, Reddit, and Twitter. Apache Cassandra graduated from the Apache Incubator in March 2010.
- Apache Shindig is an OpenSocial container that helps you start hosting OpenSocial apps quickly by providing the code to render gadgets, proxy requests, and handle REST and RPC requests. By providing a language-neutral infrastructure for those wishing to host OpenSocial applications on their Websites, Apache Shindig allows new sites to start hosting social apps in under an hour. Originally created as a port of Google's iGoogle gadget container for hosting OpenSocial-compatible widgets on any Website, Shindig entered the Apache Incubator in 2007 and graduated in January 2010.