Merkle, URLs, etc

Discussion:

Merkle, URLs, etc

Joseph Ashwood

2005-03-04 00:27:30 UTC

I apologize for my aparent falling off the earth act being an entrepreneur I
work when I have to, and I can only dedicate some free time to this.

Speed ratings for hashes. Wei Dai maintains benchmarks for all the
operations that Crypto++ supports available via
http://www.eskimo.com/~weidai/benchmarks.html . These benchmarks are:
32-bit machine
SHA-1 68 MB/sec
SHA-256 44 MB/sec
SHA-512 11 MB/sec

64-bit machine
SHA-1 100 MB/sec
SHA-256 58 MB/sec
SHA-512 93 MB/sec

While this clearly demonstrates the linear effect of moving to the larger
hashes, but in the current situation there is actually a second linear
problem, as the hash size grows the size of the data to be hashed at the
next level grows, leading to slow but exponential growth. The security
benefits though are more substantial and with SHA-1 broken there is plenty
of reason to move up.

URL formatting: It seems to me that for this it would be necessary for the
URL to include at the very least a tracker, and a root hash or filename.
Beyond that a tracker could supply the extra information. However this
potentially makes poor use of the available room in a URL, and as such the
URL should include a filesize, blocksize, and branching rate for the tree.

Merkle Trees: For verification of my suggestion I took some time and
implemented my proposed format (although I am not ready to call it finished
it does work). In the current implementation it will scale to 2^128 bytes of
original file size, but that should probably be shrunk some to save
overhead. I should say that it takes a huge amount of time build the tree
from a file, as I type I'm 29 minutes into processing a 180 MB file at 4kb
pieces ~ 95% done, and it occupies huge amounts of RAM, the test will occupy
about 4 times the storage space, but has 2 copies of the original file along
with the tree in memory at one point.

I had always assumed that there are two ways to properly address a node, by
the {file, nodeID (the {tree level, nodeOnLevel} pair)} and by the hash.
Alternately in order to expand the maximum size to infinite the nodes can be
addressed as {parentHash, childNumber}, actually this will only create an
addressable space of sqrt(output space of hash) nodes (by birthday paradox),
and so with SHA-256 the limit would be 2^128 nodes, but I don't consider
2^128 bytes to be too small by any measure, but if the gain of a factor of
nodeSize is important this can be considered. The advantage of addressing
purely based on the hash is simple, multiple torrents can be created that
share subtrees, or more plainly a multi-file torrent and a series of
singe-file torrents have the majority of the same nodes, assuming we are
willing to accept variable length nodes in the middle of the tree, and have
multiple nodes that claim to be root internally (multi-file torrent root +
for_each(file, file root)). Additionally, multi-torrent connections become
reasonable, especially when used with the MerklePool concept below.

Placing a piece of the tree in the open is difficult, the only thing that
can be dependably done (without completely monotonic serialization) is to
place the node in a pool of data. In a theoretical, but not practical sense
this can be used to build a pool of peers with data, even though they may
have no idea what to do with it except share. Realistically the usefulness
of this is that peer1 can feed peer2 pieces of the file that peer1 knows
peer2 will need (verified by descendancy from the internal/root node that
peer2 requested). This descendancy trait allows SIMD pipelining, that is,
one request results in multiple responses at the discretion of the serving
peer.

Binary Merkle Trees: Are heavily flawed, and the only suggestion made to fix
it was to send the grandchildren as well, which completely defeats the point
of the child nodes, making the child nodes pure overhead, they serve no
purpose except to enlarge the tranfer and raise the CPU time.

MerklePool: It occured to me during the implementation that in many ways the
most efficient way to implement the storage of the Merkle encoded files
(including the partials) is to store them in one central pool structure,
with search mechanisms used to find nodes by hash, this allows rare blocks
to be distributed widely, and for blocks to be multicast among systems that
should be interested at a later date.

Have maps: Actually with the MerkleTree since it is known that the root must
be retrieved first (at least be designation), there can be a child Have Map
associated with each request, that is specifically that when serving up an
internal node a bitmap of the immediate children is supplied as well. This
map will generally be only a few bytes and will be on the order of 0.2% of
the overall transfers. Also moving the requests for these maps to immediate
descendants of a node (or nodelist to reduce overhead) will have the same
effect.

This works because sending the client a list of haves for nodes they know
nothing about is pointless. The downside is that all requests are now based
on node instead of file location, something that can be troublesome.

Tracker to Tracker discussion: By doing this we're starting to approach what
Bram has referred to as a "gossip" network, as such a certain degree of
dependable, verifiable information has to be included. This is currently
done by only communicating it with trusted trackers. Later diversifying this
to include a cloud of trackers handling all the clients, this gets a bit
chaotic.

Instead what I would like to see is tracker redundancy. So where it is used
for clustering, this would still be the case, but all the peers would by
default connect to tracker1, if tracker1 is unavailable tracker2, etc. The
benefit of this keeping all the immediately used data on one system, so the
sync signals are minimal. It is only once the swarm reaches the size where a
single tracker cannot handle the load that distribution needs to take place.
With this size load, make a decision based on the swarm size, if the problem
is that the current tracker is overloaded, make a different tracker
tracker1. If the problem is the swarm itself it too big, split the swarm in
half and there is no reason to maintain a perfect list because the only
collisions will happen is trackerA goes down and all the clients default to
trackerB. The bulk of the information flow will be from trackerA to peers
during connection that the new default tracker (for the peer not
neccessarily for the whole swarm) is now trackerB. This load balancing while
rather primitive reduces the backend load on the trackers. This design is
based in large part of the experiences I have gleaned from others as well as
myself that dictate that a single BitTorrent tracker generally maxes out
around 90,000 peers, but this max-out is due to the lookups for the
peer-list, so the first bottleneck is actually the peer database, not the
other bandwidths. By clustering the trackers you will only bring down more
trackers at once in these lookups, but by splitting the peer database it
scales linearly (potentially super-linearly if the split is super-smart).

End-game overhead: I can see little possibility of improving this with a
purely random acquisition strategy. Moving to a method that is not
completely random does have advantages. So for example choosing rare pieces
first, and taking recommendations. So peers can recommend to each other an
order to grab descendants. This should be out of the main pipe and optional
both for send and recieve, but where implemented I can see no benefit to
suggesting badly to another peer due to the short lived preferencing in
BitTorrent. The result of this is that k peers will effectively team up to
get the file, and partially depend on between themselves transfers to finish
the end game.

This moves the end game problem but does not eliminate it. It reduces the
likelihood of not having a piece among team members, but the lack of benefit
actually drops off during end-game where tit-for-tat is no longer a concern.
This could also be implemented as a BEG command, that is the BEGgar is
asking for help in acquiring the piece listed, the BEGgee has the option of
helping in the search. It is also important that BEG misuse be detected, so
for example if badPeer uses BEG exclusively as a way of greedily getting
pieces, this can create substantial overhead for the BEGgees as they attempt
to track who has BEGged what, this is DOS detection. Used sparingly it can
be used to reduce the end-game overhead as BEG signals that a piece is rare
and should be acquired as quickly as possible, thus making it common.

THEX: The only time thex is a good thing is when you have a speech
impediment. THEX as a design lacks security, fails to make anything
resembling proper use of the available space (7/8 of the overhead bits are
wasted immediately), scales poorly and again the only suggestion to improve
it is to send the grandchildren as well which makes the children pure
overhead, having no redeeming value what so ever. THEX is wasteful,
pointless and insecure for BitTorrent..
Joe

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

Olaf van der Spek

2005-03-04 11:49:51 UTC

Permalink

Post by Joseph Ashwood
Binary Merkle Trees: Are heavily flawed, and the only suggestion made to fix
it was to send the grandchildren as well, which completely defeats the point
of the child nodes, making the child nodes pure overhead, they serve no
purpose except to enlarge the tranfer and raise the CPU time.

A merkle tree with the same granularity as BT1 does have about the same
overhead.

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

Justin Cormack

2005-03-04 12:56:25 UTC

Permalink

Post by Olaf van der Spek

A merkle tree with the same granularity as BT1 does have about the same
overhead.

But then its basically equivalent to your get info hash extension... Or
any other scheme where the first thing you do is get a full hash tree...

A pure hash-based scheme is quite nice, though whether it has real benefits
is less clear.

j

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

John Hoffman

2005-03-04 12:55:26 UTC

Permalink

Post by Joseph Ashwood
While this clearly demonstrates the linear effect of moving to the larger
hashes, but in the current situation there is actually a second linear
problem, as the hash size grows the size of the data to be hashed at the
next level grows, leading to slow but exponential growth. The security
benefits though are more substantial and with SHA-1 broken there is plenty
of reason to move up.

I'm not completely convinced it's necessary to switch from SHA-1. It's
been partially solved, but it's not THAT broken, especially not for
BitTorrent's purpose. We may also see some new 20-byte hashes that
don't have SHA's weakness. "160 bits ought to be enough for everyone." ;-)

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

Joseph Ashwood

2005-03-05 01:41:21 UTC

Permalink

----- Original Message -----
From: "John Hoffman" <***@shambala.net>
Subject: Re: [BitTorrent] Merkle, URLs, etc

{SHA-1 is] not THAT broken, especially not for
BitTorrent's purpose. We may also see some new 20-byte hashes that
don't have SHA's weakness. "160 bits ought to be enough for everyone."
;-)

It is unlikely that there will be any good new 160 bit hashes. The 2^80
security offered by them has been considered just barely good enough for a
decade, and an increasing number of cryptanalysts (including myself) are
pushing that with the breaks seemingly coming faster and faster now we might
as well just jump to the stronger hashes now, and prepare for a quick jump
in the future. If you have your heart set on a 160 bit hash though take a
look toward RIPEMD-160, it is the only other 160-bit hash that has been
beaten on by cryptanalysts, then look to Tiger, and SHA-2 series. You could
also simply take a larger hash and trim the output, so trimming SHA-1 to 138
bits would balance the output with the most recent attack (although there is
a chance it will open up new attacks, but that is highly unlikely). While
everyone has their favorite method of doing this (mine if to drop the last
bits) the absolute truth is that if the hash is strong, nothing you can do
to the output will weaken it (except by the amount shortened) so feel free
to use 1M time Whirlpool (using the previous output as an IV for the new
hash), XOR all the hashes together, and then trim off bits, you won't affect
the strength of the hash for good or bad.
Joe

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

John Hoffman

2005-03-07 00:52:41 UTC

Permalink

Post by Joseph Ashwood

{SHA-1 is] not THAT broken, especially not for
BitTorrent's purpose. We may also see some new 20-byte hashes that
don't have SHA's weakness. "160 bits ought to be enough for everyone."
;-)

Well, please remember that, when it comes to BitTorrent not only is it
not 2^80 but 2^160 security, but BitTorrent has a fixed block size where
many of these attacks require variable block sizes, and I don't think
the new attack is even relevant to SHA-1's strong anti-collision
capability. It might cause havoc with other systems needing signatures,
but I don't think it affects BT any.

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

Joseph Ashwood

2005-03-07 03:05:44 UTC

Permalink

----- Original Message -----
From: "John Hoffman" <***@shambala.net>
Subject: Re: [BitTorrent] Merkle, URLs, etc

Post by John Hoffman

Post by Joseph Ashwood
It is unlikely that there will be any good new 160 bit hashes. The 2^80
security offered by them has been considered just barely good enough for a
decade, and an increasing number of cryptanalysts (including myself) are
pushing that with the breaks seemingly coming faster and faster now we might
as well just jump to the stronger hashes now, and prepare for a quick jump
in the future.

That's all well and good, but cryptanalysts still are unlikely to work on
something that will be percieved as a waste of time, something, to be rather
pointed, like a 160-bit hash function.

Any attack always affects every system, although I freely admit that the
hash is not the weakest link in BitTorrent security. That the current SHA-1
attack only finds free collisions is fairly irrelevant, attacks never get
worse. SHA-1 is probably good enough for now, but plans need to be put in
place to retire it and move to something better. It is always good to
prepare to simply roll over to a new primitive at a moments notice. The
method of doing this that I am in favor of is to prepend a primitive
identifier to all hashes, in the case of BitTorrent use descendency to save
time and simply put the hash identifier in the torrent, and implement
identifiers and code for SHA-245, SHA-512 and Whirlpool. This means that any
users who want higher security can already roll over to the stronger
functions right now, and at a later date when it becomes important to do so,
the entire system can roll over to a new hash quickly, even silently. This
is just good design, and will cost all of 15 bytes (maximum assuming I'm
counting correctly) once bencoded.
Joe

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

Bill Cox

2005-03-08 17:44:26 UTC

Permalink

Post by Joseph Ashwood
----- Original Message -----
Subject: Re: [BitTorrent] Merkle, URLs, etc

...

Post by Joseph Ashwood
Any attack always affects every system, although I freely admit that the
hash is not the weakest link in BitTorrent security. That the current SHA-1
attack only finds free collisions is fairly irrelevant, attacks never get
worse. SHA-1 is probably good enough for now, but plans need to be put in
place to retire it and move to something better. It is always good to
prepare to simply roll over to a new primitive at a moments notice. The
method of doing this that I am in favor of is to prepend a primitive
identifier to all hashes, in the case of BitTorrent use descendency to save
time and simply put the hash identifier in the torrent, and implement
identifiers and code for SHA-245, SHA-512 and Whirlpool. This means that any
users who want higher security can already roll over to the stronger
functions right now, and at a later date when it becomes important to do so,
the entire system can roll over to a new hash quickly, even silently. This
is just good design, and will cost all of 15 bytes (maximum assuming I'm
counting correctly) once bencoded.
Joe

I agree. There isn't any agreed method for extending the BT protcol,
but there are several un-official extensions floating around (like
Olaf's Merkle trees).

So, to be more exact, you'd add a new key to the .torrent files in the
'info' dictionary. How about calling it 'hash_function'? How about
restricting legal values for now to "SHA-1", "SHA-256" or "SHA-512"?
The nice thing about that is that we probably all already have SHA-1
through SHA-512 linked into existing clients. If there's a reason,
Whirlpool, or any other function, could be added to the list later.

Bill

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

Joseph Ashwood

2005-03-09 01:50:04 UTC

Permalink

----- Original Message -----
From: "Bill Cox" <***@viasic.com>
Subject: Re: [BitTorrent] Merkle, URLs, etc

[Joe: Add new key to torrent]

Post by Bill Cox
So, to be more exact, you'd add a new key to the .torrent files in the
'info' dictionary.

Not sure if it must go there, but generally yes.

Post by Bill Cox
How about calling it 'hash_function'?

Seems perfectly reasonable.

Post by Bill Cox
How about
restricting legal values for now to "SHA-1", "SHA-256" or "SHA-512"?

I'd call them "supported" or "standard" values, but basically yeah.

Post by Bill Cox
The nice thing about that is that we probably all already have SHA-1
through SHA-512 linked into existing clients. If there's a reason,
Whirlpool, or any other function, could be added to the list later.

If not I have a SHA-256 and a SHA-512 implementation in C that are public
domain (acquired from coderpunks list via anonymous remailer about 4 years
ago).
Joe

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

Olaf van der Spek

2005-03-04 16:12:05 UTC

Permalink

Post by Olaf van der Spek

A merkle tree with the same granularity as BT1 does have about the same
overhead.

But then its basically equivalent to your get info hash extension... Or

You mean get_info/info?
No, it's not.

any other scheme where the first thing you do is get a full hash tree...

No, because you do not get the entire tree up front.

A pure hash-based scheme is quite nice, though whether it has real benefits
is less clear.
j
Yahoo! Groups Links

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

Justin Cormack

2005-03-04 16:10:40 UTC

Permalink

Post by Olaf van der Spek

A merkle tree with the same granularity as BT1 does have about the same
overhead.

But then its basically equivalent to your get info hash extension... Or

You mean get_info/info?
No, it's not.

any other scheme where the first thing you do is get a full hash tree...

No, because you do not get the entire tree up front.

If its flat you have to get it up front, you cant do anything else.

j

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

Olaf van der Spek

2005-03-04 16:29:17 UTC

Permalink

Post by Justin Cormack

Post by Olaf van der Spek

A merkle tree with the same granularity as BT1 does have about the same
overhead.

But then its basically equivalent to your get info hash extension... Or

You mean get_info/info?
No, it's not.

any other scheme where the first thing you do is get a full hash tree...

No, because you do not get the entire tree up front.

If its flat you have to get it up front, you cant do anything else.

Who said it's flat?

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/

Justin Cormack

2005-03-04 16:38:37 UTC

Permalink

Post by Olaf van der Spek

Post by Justin Cormack

Post by Olaf van der Spek

A merkle tree with the same granularity as BT1 does have about the same
overhead.

But then its basically equivalent to your get info hash extension... Or

You mean get_info/info?
No, it's not.

any other scheme where the first thing you do is get a full hash tree...

No, because you do not get the entire tree up front.

If its flat you have to get it up front, you cant do anything else.

Who said it's flat?

Sorry, misinterpreted your earlier comment. Ignore it.

Yahoo! Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/BitTorrent/

<*> To unsubscribe from this group, send an email to:
BitTorrent-***@yahoogroups.com

<*> Your use of Yahoo! Groups is subject to:
http://docs.yahoo.com/info/terms/