Duga3 - an extremely fast bittorrent crawler (and tracker) project
Note
I have recently taken a liking to git and will be using Gitorious for all future updates. Feel free to fork, contribute, and send a pull request for merging.

About
Дуга-3 / Duga-3 / Arc-3 is based on another project I started called "k2", which was itself based on something else I had done a while back. This makes it the third incarnation, so the name fits the project again (in more than one way). It was finally committed to SVN on June 15, 2010.
The initial code should be good enough to crawl a large number of RSS feeds on torrent sites, then parse and store most of each torrent's info, and the like. I managed to get 43 sites working initially, and included 7 plugins, mostly as examples. It uses bz2, cURL, DOM, and MySQLi to achieve its level of speed.
An open tracker is included as part of Duga-3, but it isn't integrated into the crawler in any way. This tracker was forked from the original Whitsoft opentracker code almost three years ago, and has since been almost entirely rewritten to make heavy use of MySQLi and FULLTEXT searching. Right now the tracker supports the draft "IPv6" paper from bittorrent.org, and an unofficial extension known as "compact scraping".

Recent developments
I have started a Drizzle port of this, with no plans to actually release it (yet).

Current "state" of the project
As of June 28, 2010, my best guesses are:
Crawler: beta / stable (mostly stable)
Tracker: alpha / beta
I have also done extensive testing on FreeBSD, Linux, and Win32 installs (specifically using MySQL, nginx, and PHP each time). The only lacking feature is symlinking in the crawler (which can be disabled) on versions of Windows below Vista, because mklink was only introduced in Vista.

Get the code
There are no plans to ever make any tarballed / zipped releases
I am using Subversion to store this project, so it is required in order to get the code. However, Subversion is freely available on a multitude of platforms and is very easy to use.
I also wrote some instructions below for new users. Windows users should use Slik SVN for the instructions below, or something other than TortoiseSVN. Everyone else should follow this link for instructions on installing Subversion for any given OS.

Recommended
Get the entire project by running the checkout:
    svn checkout http://duga3.googlecode.com/svn/trunk/ duga3
Since there are usually daily updates, stay up to date by moving your console into the directory you checked out into and running:
    svn update

DIY
Otherwise, if you can handle it yourself, you can also use export to "checkout" the entire project without the .svn folders:
    svn export http://duga3.googlecode.com/svn/trunk/ duga3
If you want just the crawler:
    cd /your/web/root/location
    #example search interface
    svn export http://duga3.googlecode.com/svn/trunk/index.php
    #admin interface, can be run from anywhere
    svn export http://duga3.googlecode.com/svn/trunk/admin/index.php admin/index.php
    #the crawler itself, make this forbidden
    svn export http://duga3.googlecode.com/svn/trunk/lib/crawler lib/crawler
...or maybe just the tracker:
    cd /your/web/root/location
    mkdir tracker #optional
    cd tracker
    #client announce file
    svn export http://duga3.googlecode.com/svn/trunk/announce.php
    #client scrape file
    svn export http://duga3.googlecode.com/svn/trunk/scrape.php
    #the "stats" page you could use as an example to make a bnbt style front-end
    svn export http://duga3.googlecode.com/svn/trunk/tracker.php
    #the tracker itself, make this forbidden
    svn export http://duga3.googlecode.com/svn/trunk/lib/opentracker lib/opentracker

Additional info

Final notes
Please take note of the README and TODO files in both lib/crawler/ and lib/opentracker/!
Known "bug" in crawler: it's possible for fullscrape files to not get deleted, so be sure to clean your CACHEDIR manually every once in a while.

Contact
Thank you to everyone who has sent me positive feedback or just a thanks, but I have removed my email from this page due to increasing levels of spam. My username is on the right ("Owners"); I think you can figure out how to send me an email from there ;)
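
Regarding the known crawler "bug" mentioned in the final notes, where fullscrape files can linger in CACHEDIR: the manual cleanup could be automated with something like the sketch below. This is not part of Duga-3. The function name clean_cachedir, the 7-day default, and the idea that every old regular file in the cache directory is safe to remove are all assumptions made for illustration, so check what your CACHEDIR actually holds first. By default the function only lists what it would remove.

```shell
#!/bin/sh
# Hypothetical helper, NOT shipped with Duga-3.
# Lists (or, with --delete, removes) regular files older than MAX_AGE_DAYS
# inside the crawler's cache directory.
# Assumption: every old regular file under DIR is disposable cache data.
# Usage: clean_cachedir DIR [MAX_AGE_DAYS] [--delete]
clean_cachedir() {
    dir=$1
    max_age_days=${2:-7}
    mode=${3:-}

    # Refuse to run on an empty or missing path.
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "clean_cachedir: not a directory: $dir" >&2
        return 1
    fi

    if [ "$mode" = "--delete" ]; then
        # Remove files last modified more than max_age_days days ago.
        find "$dir" -type f -mtime +"$max_age_days" -exec rm -f {} +
    else
        # Dry run: only print the files that would be removed.
        find "$dir" -type f -mtime +"$max_age_days" -print
    fi
}
```

Calling clean_cachedir /path/to/your/CACHEDIR 7 prints the candidates, and adding --delete as a third argument removes them. The path here is a placeholder for whatever CACHEDIR is set to in your install. Defaulting to a dry run is a deliberate choice, since a wrong path combined with rm is hard to undo.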
ROME is a set of Java tools for parsing, generating and publishing RSS and Atom feeds. The core ROME library depends only on the JDOM XML parser and supports parsing, generating and converting all of the popular RSS and Atom formats including RSS 0.90, RSS 0.91 Netscape, RSS 0.91 Userland, RSS 0.92, RSS 0.93, RSS 0.94, RSS 1.0, RSS 2.0, Atom 0.3, and Atom 1.0. You can parse to an RSS object model, an Atom object model or an abstract SyndFeed model that can model either family of formats.
Tiki Wiki CMS Groupware is a full-featured, web-based, multilingual (35+ languages), tightly integrated, all-in-one Wiki+CMS+Groupware using PHP, MySQL, Zend Framework, jQuery and Smarty. Actively developed by a very large international community, Tiki can be used to create all kinds of Web applications, sites, portals, knowledge bases, intranets, and extranets.
Yioop is an open source PHP search engine capable of crawling, indexing, and providing search results for hundreds of millions of pages on relatively low end hardware. It can index a variety of text formats (HTML, RSS, PDF, RTF, DOC) and image formats (GIF, JPEG, PNG, etc.). It can import data from ARC, WARC, MediaWiki, and Open Directory RDF. It is easily localized to many languages. It has built-in support for new feeds, discussion groups, blogs, and wikis. It also supports mixing indexes to create mash ups.
Large scale server deploys using BitTorrent and the BitTornado library
Pligg is an open source CMS (Content Management System) which provides social publishing software that encourages visitors to register on your website so that they can submit content and connect with other users. Our software creates websites where stories are created and voted on by members, not website editors. It is a user-driven CMS that relies on independent authors' content and participation to manage news articles.
Sphinx is a free, open-source SQL full-text search engine. How do you implement full-text search for that 10+ million row table, keep up with the load, and stay relevant? Sphinx is good at those kinds of riddles.
Simple Machines Forum (SMF) is a free, professional grade software package that allows you to set up your own online community within minutes. Its powerful template engine provides a unique look and feel to the site.