When your torrent client joins the swarm to share and gather file pieces, how exactly does it know where all its peers are? Read on as we poke around inside the mechanisms that undergird the BitTorrent protocol.
Today’s Question & Answer session comes to us courtesy of SuperUser—a subdivision of Stack Exchange, a community-driven grouping of Q&A web sites.
The Question
SuperUser reader Steve V. had a very specific question about the Distributed Hash Table (DHT) system within the BitTorrent protocol:
His question in turn prompted a really detailed reply about the different functions of the BitTorrent system; let’s take a look at it now.
I understand the idea of a tracker: clients connect to a central server which maintains a list of peers in a swarm.
I also understand the idea of peer exchange: clients already in a swarm send the complete list of their peers to each other. If new peers are discovered, they are added to the list.
My question is, how does DHT work? That is, how can a new client join a swarm without either a tracker or the knowledge of at least one member of the swarm to exchange peers with?
(Note: simple explanations are best.)
The Answer
SuperUser contributor Allquixotic offers an in depth explanation:
Not only did we learn the answer to the original question but we also learned quite a bit about the nature of the BitTorrent system and its vulnerabilities.
You can’t. It is impossible.*
* (Unless a node on your local area network happens to already be a node in the DHT. In this case, you could use a broadcasting mechanism, such as Avahi, to “discover” this peer, and bootstrap from them. But how did they bootstrap themselves? Eventually, you’ll hit a situation where you need to connect to the public Internet. And the public Internet is unicast-only, not multicast, so you’re stuck with using pre-determined lists of peers.)
References
Bittorrent DHT is implemented via a protocol known as Kademlia, which is a special case of theoretical concept of a Distributed hash table.
Exposition
With the Kademlia protocol, when you join the network, you go through a bootstrapping procedure, which absolutely requires that you know, in advance, the IP address and port of at least one node already participating in the DHT network. The tracker that you connect to, for instance, may be itself a DHT node. Once you are connected to one DHT node, you then proceed to download information from the DHT, which provides you connectivity information for more nodes, and you then navigate that “graph” structure to obtain connections to more and more nodes, who can provide both connectivity to other nodes, and payload data (chunks of the download).
I think your actual question in bold — that of how to join a Kademlia DHT network without knowing anyother members — is based on a false assumption.
The simple answer to your question in bold is, you don’t. If you do not know ANY information at all about even one host which might contain DHT metadata, you are stuck — you can’t even get started. I mean, sure, you could brute force attempt to discover an IP on the public internet with an open port that happens to broadcast DHT information. But more likely, your BT client is hard-coded to some specific static IP or DNS which resolves to a stable DHT node, which just provides the DHT metadata.
Basically, the DHT is only as decentralized as the joining mechanism, and because the joining mechanism is fairly brittle (there’s no way to “broadcast” over the entire Internet! so you have to unicastto an individual pre-assigned host to get the DHT data), Kademlia DHT isn’t really decentralized. Not in the strictest sense of the word.
Imagine this scenario: Someone who wants P2P to stop goes out and prepares an attack on all commonly used stable DHT nodes which are used for bootstrapping. Once they’ve staged their attack, they spring it on all nodes all at once. Wham; every single bootstrapping DHT node is down all in one fell swoop. Now what? You’re stuck with connecting to centralized trackers to download traditional lists of peers from those. Well, if they attack the trackers too, then you’re really, really up a creek. In other words, Kademlia and the entire BT network is constrained by the limitations of the Internet itself, in that, there is a finite (and relatively small) number of computers that you would have to successfully attack or take offline to prevent >90% of users from connecting to the network.
Once the “pseudo-centralized” bootstrapping nodes are all gone, the interior nodes of the DHT, which are not bootstrapping because nobody on the outside of the DHT knows about the interior nodes, are useless; they can’t bring new nodes into the DHT. So, as each interior node disconnects from the DHT over time, either due to people shutting down their computers, rebooting for updates, etc., the network would collapse.
Of course, to get around this, someone could deploy a patched BitTorrent client with a new list of pre-determined stable DHT nodes or DNS addresses, and loudly advertise to the P2P community to use this new list instead. But this would become a “whack-a-mole” situation where the aggressor (the node-eater) would progressively download these lists themselves, and target the brave new bootstrapping nodes, then take them offline, too.
Have something to add to the explanation? Sound off in the the comments. Want to read more answers from other tech-savvy Stack Exchange users? Check out the full discussion thread here.