<?xml version="1.0" encoding="UTF-8"?><discussion><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Sun Jun 28 15:16:20 +0200 2009</updated_at><permalink>anrfnq</permalink><title>Caveats of Evaluating Databases</title><lines><line><color></color><content><![CDATA[ This is part two in a small series about measuring software performance. There's a lot of common sense covered, but I feel it necessary to shed some light.]]></content><position>1</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>2</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ If you haven't, check out [part one][dir].]]></content><position>3</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>4</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [dir]: http://jan.prima.de/~jan/plok/archives/175-Benchmarks-You-are-Doing-it-Wrong.html]]></content><position>5</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>6</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ * * *]]></content><position>7</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>8</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Say you want to find out what's behind the buzz of all these new [#nosql][nosql] databases. There's a large number to choose from today. All options come in varying degrees of maturity and characteristics so it'd be nice to know what solves your problem best. A non inclusive list of these databases or storage systems include [Memcache][mc][[DB]][mcd], [Tokyo Cabinet][tc] / [Tyrant][tt], [Project Voldemort][pv], [Scalaris][sc], [Dynamite][dy], [Redis][red], [Persevere][psvr], [MongoDB][mo], [Solr][solr] or my favourite [CouchDB][cdb]. And these are just some of the open source ones.]]></content><position>9</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Wed Jun 17 00:01:16 +0200 2009</updated_at><version>2</version><user_name></user_name><comments><comment><created_at>Wed Jun 17 00:01:16 +0200 2009</created_at><updated_at>Wed Jun 17 00:01:16 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Say you want to find out what's behind the buzz of all these new [#nosql][nosql] databases. There's a large number to choose from today. All options come in varying degrees of maturity and characteristics so it'd be nice to know what solves your problem best. A non inclusive list of these databases or storage systems include [Memcache][mc][[DB]][mcd], <del class="diffdel">[BerkleyDB][bdb], </del>[Tokyo Cabinet][tc] / [Tyrant][tt], [Project Voldemort][pv], [Scalaris][sc], [Dynamite][dy], [Redis][red], [Persevere][psvr], [MongoDB][mo], [Solr][solr] or my favourite [CouchDB][cdb]. And these are just some of the open source ones.]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 16:53:06 +0200 2009</created_at><updated_at>Tue Jun 16 16:53:06 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ bdb + new = ;-)]]></content><user_name>Andres</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>10</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [cdb]: http://couchdb.apache.org/]]></content><position>11</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [solr]: http://lucene.apache.org/solr/]]></content><position>12</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [mo]: http://www.mongodb.org/]]></content><position>13</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [psvr]: http://www.persvr.org/]]></content><position>14</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [red]: http://code.google.com/p/redis/]]></content><position>15</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [dy]: http://github.com/cliffmoon/dynomite/tree/master]]></content><position>16</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [sc]: http://code.google.com/p/scalaris/]]></content><position>17</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [pv]: http://project-voldemort.com/]]></content><position>18</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [tt]: http://tokyocabinet.sourceforge.net/tyrantdoc/]]></content><position>19</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [tc]: http://tokyocabinet.sourceforge.net/]]></content><position>20</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [bdb]: http://www.oracle.com/technology/products/berkeley-db/index.html]]></content><position>21</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [mc]: http://www.danga.com/memcached/]]></content><position>22</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [mcd]: http://memcachedb.org/]]></content><position>23</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>24</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [nosql]: http://blog.oskarsson.nu/2009/06/nosql-debrief.html]]></content><position>25</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>26</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ This article is *not* a comprehensive comparison of any of the mentioned systems. Instead it tries to give you an idea about what to look for when evaluating a storage system	or how to take into perspective evaluations and benchmarks others have done.]]></content><position>27</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>28</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ##### Astounding Numbers]]></content><position>29</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>30</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ From time to time you see some crazy numbers posted to the reddit of the internet that claim fantastic performance. ]]></content><position>31</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Sun Jun 28 15:16:20 +0200 2009</updated_at><version>2</version><user_name>Lalit</user_name><comments><comment><created_at>Sun Jun 28 15:16:20 +0200 2009</created_at><updated_at>Sun Jun 28 15:16:20 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> From time to time you see some crazy numbers posted to the <del class="diffmod">reddits </del><ins class="diffmod">reddit </ins>of the <del class="diffmod">internets </del><ins class="diffmod">internet </ins>that claim fantastic performance. ]]></content><user_name>Lalit</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>32</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ >“The (imaginary) SuperfastDB can store 450,000 items per second!”.]]></content><position>33</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>34</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Wow.]]></content><position>35</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>36</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ No word on where the items are stored (in memory? on a harddrive? Spindles? Solid State?), what an *item* is exactly and how big it is, the rest of the hardware this was run on and how to reproduce it.]]></content><position>37</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>38</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ But boy, 450,000 a second!]]></content><position>39</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>40</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ My shoes can do 650,000 a second, but you've got to figure out what.]]></content><position>41</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 15:02:23 +0200 2009</created_at><updated_at>Tue Jun 16 15:02:23 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ lol]]></content><user_name>langalex</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>42</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:17 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Context is as important as reproducibility. The [last article][last] here established that finding out that my system and your system come up with different numbers is not much of a help. Any sort of serious test must come with a set of scripts or programs and comprehensive instructions on how the tests were run.]]></content><position>43</position><created_at>Tue Jun 16 14:11:17 +0200 2009</created_at><updated_at>Tue Jun 16 18:12:39 +0200 2009</updated_at><version>2</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 18:12:39 +0200 2009</created_at><updated_at>Tue Jun 16 18:12:39 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Context is as important as reproducibility. The <del class="diffmod">last article </del><ins class="diffmod">[last article][last] </ins>here established that finding out that my system and your system come up with different numbers is not much of a help. Any sort of serious test must come with a set of scripts or programs and comprehensive instructions on how the tests were run.]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 17:27:13 +0200 2009</created_at><updated_at>Tue Jun 16 17:27:13 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ Needless to say that different numbers on different machines have to be expected]]></content><user_name>CK</user_name></comment></comments></line><line><color></color><content><![CDATA[ [last]: http://jan.prima.de/~jan/plok/archives/175-Benchmarks-You-are-Doing-it-Wrong.html#variables]]></content><position>44</position><created_at>Tue Jun 16 18:12:38 +0200 2009</created_at><updated_at>Tue Jun 16 18:12:39 +0200 2009</updated_at><version>1</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 18:12:39 +0200 2009</created_at><updated_at>Tue Jun 16 18:12:39 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> added new line]]></content><user_name>jan</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>45</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ * * *]]></content><position>46</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>47</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Everything “cool” in computer science has been around for 25+ years. Actual innovation is rare. Advancements in hardware and new combinations of existing solutions make for new stuff coming out each day (that's a good thing), but the fundamental rules are the same for all. We're all running [von Neumann][vn] machines, [quicksort][qs] is still pretty quick and [hashes][has] and [b-trees][btr] rule the storage world.]]></content><position>48</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>49</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [vn]: http://en.wikipedia.org/wiki/Von_Neumann_architecture]]></content><position>50</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Let's recap.]]></content><position>51</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [qs]: http://en.wikipedia.org/wiki/Quicksort]]></content><position>52</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [has]: http://en.wikipedia.org/wiki/Hash_function]]></content><position>53</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [btr]: http://en.wikipedia.org/wiki/B-tree]]></content><position>54</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>55</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ##### Hashes & Trees]]></content><position>56</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>57</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Hashing revolves around the idea of [`O(1)`][o] lookups. Allocate a number of buckets, create a function that gives you a number of a bucket for any data item you might want to store, make sure no two data items hit the same bucket (or work around that). Runtime characteristics include that you only need to ask your function where to look for or store your data and the allocation of your set of buckets: If you need to store more items than you have buckets, some more work is required which gives you O(N)` operations that you can't ignore in practice.]]></content><position>58</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 18:13:22 +0200 2009</updated_at><version>3</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 18:13:23 +0200 2009</created_at><updated_at>Tue Jun 16 18:13:23 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Hashing revolves around the idea of [`O(1)`][o] lookups. Allocate a number of buckets, create a function that gives you a number of a bucket for any data item you might want to store, make sure no two data items hit the same bucket (or work around that). Runtime characteristics include that you only need to ask your function where to look for or store your data and the allocation of your set of buckets: If you need to store more items than you have buckets, some more work is required which gives you <del class="diffmod">some non-`O(1)` </del><ins class="diffmod">O(N)` </ins>operations that you can't ignore in practice.]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 18:13:23 +0200 2009</created_at><updated_at>Tue Jun 16 18:13:23 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Hashing revolves around the idea of [`O(1)`][o] lookups. Allocate a number of buckets, create a function that gives you a number of a bucket for any data item you might want to store, make sure no two data items hit the same bucket (or work around that). Runtime characteristics include that you only need to ask your function where to look for or store your data and the allocation of your set of buckets: If you need to store more items than you have buckets, some more work is required which gives you <del class="diffmod">some non-`O(1)` </del><ins class="diffmod">O(N)` </ins>operations that you can't ignore in practice.]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 18:13:22 +0200 2009</created_at><updated_at>Tue Jun 16 18:13:22 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Hashing revolves around the idea of [`O(1)`][o] lookups. Allocate a number of buckets, create a function that gives you a number of a bucket for any data item you might want to store, make sure no two data items hit the same bucket (or work around that). Runtime characteristics include that you only need to ask your function where to look for or store your data and the allocation of your set of buckets: If you need to store more items than you have buckets, some more work is required which gives you <del class="diffmod">some non-`O(1)` </del><ins class="diffmod">O(N)` </ins>operations that you can't ignore in practice.]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 18:13:22 +0200 2009</created_at><updated_at>Tue Jun 16 18:13:22 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Hashing revolves around the idea of [`O(1)`][o] lookups. Allocate a number of buckets, create a function that gives you a number of a bucket for any data item you might want to store, make sure no two data items hit the same bucket (or work around that). Runtime characteristics include that you only need to ask your function where to look for or store your data and the allocation of your set of buckets: If you need to store more items than you have buckets, some more work is required which gives you <del class="diffmod">some non-`O(1)` </del><ins class="diffmod">O(N)` </ins>operations that you can't ignore in practice.]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 17:31:13 +0200 2009</created_at><updated_at>Tue Jun 16 17:31:13 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ Maybe you want to add that hashing in worst case is a O(n) operation]]></content><user_name>CK</user_name></comment><comment><created_at>Tue Jun 16 16:57:26 +0200 2009</created_at><updated_at>Tue Jun 16 16:57:26 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ It may be interesting to describe that rebalancing is *hard*]]></content><user_name>Andres</user_name></comment><comment><created_at>Tue Jun 16 16:55:31 +0200 2009</created_at><updated_at>Tue Jun 16 16:55:31 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ Letzter Satz hört sich noch ein wenig sonderbar an.]]></content><user_name>Andres</user_name></comment><comment><created_at>Tue Jun 16 16:55:22 +0200 2009</created_at><updated_at>Tue Jun 16 16:55:22 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Hashing revolves around the idea of [`O(1)`][o] lookups. Allocate a number of buckets, create a function that gives you a number of a bucket for any data item you might want to store, make sure no two data items hit the same bucket (or work around that). Runtime characteristics include that you only need to ask your function where to look for or store your data and the allocation of your set of buckets: If you need to store more items than you have buckets, some <ins class="diffins">more </ins>work is required which gives you some non-`O(1)` operations that you can't ignore in practice.]]></content><user_name>Andres</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>59</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ <div style="text-align:center;"><a href="http://en.wikipedia.org/wiki/File:Hash_table_4_1_0_0_0_0_0_LL.svg"><img src="http://jan.prima.de/~jan/plok/uploads/D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" alt="D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" border="0" width="240" height="168" /></a></div>]]></content><position>60</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 17:31:34 +0200 2009</updated_at><version>2</version><user_name>CK</user_name><comments><comment><created_at>Tue Jun 16 17:31:34 +0200 2009</created_at><updated_at>Tue Jun 16 17:31:34 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span><div style="text-align:center;"><a href="http://en.wikipedia.org/wiki/File:Hash_table_4_1_0_0_0_0_0_LL.svg"><del class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads//D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" alt="D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" border="0" width="240" height="168" /></del><ins class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads/D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" alt="D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" border="0" width="240" height="168" /></ins></a></div>]]></content><user_name>CK</user_name></comment><comment><created_at>Tue Jun 16 17:31:34 +0200 2009</created_at><updated_at>Tue Jun 16 17:31:34 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span><div style="text-align:center;"><a href="http://en.wikipedia.org/wiki/File:Hash_table_4_1_0_0_0_0_0_LL.svg"><del class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads//D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" alt="D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" border="0" width="240" height="168" /></del><ins class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads/D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" alt="D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" border="0" width="240" height="168" /></ins></a></div>]]></content><user_name>CK</user_name></comment><comment><created_at>Tue Jun 16 17:31:34 +0200 2009</created_at><updated_at>Tue Jun 16 17:31:34 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span><div style="text-align:center;"><a href="http://en.wikipedia.org/wiki/File:Hash_table_4_1_0_0_0_0_0_LL.svg"><del class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads//D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" alt="D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" border="0" width="240" height="168" /></del><ins class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads/D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" alt="D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" border="0" width="240" height="168" /></ins></a></div>]]></content><user_name>CK</user_name></comment><comment><created_at>Tue Jun 16 17:31:34 +0200 2009</created_at><updated_at>Tue Jun 16 17:31:34 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span><div style="text-align:center;"><a href="http://en.wikipedia.org/wiki/File:Hash_table_4_1_0_0_0_0_0_LL.svg"><del class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads//D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" alt="D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" border="0" width="240" height="168" /></del><ins class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads/D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" alt="D5563B63-7B48-4280-A31F-EDB37DB78416.jpg" border="0" width="240" height="168" /></ins></a></div>]]></content><user_name>CK</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>61</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [o]: http://en.wikipedia.org/wiki/Big_O_notation]]></content><position>62</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>63</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ The other elephant in the room are [b-trees][btr]. The fundamental idea here is to get to your data in a minimal number of steps traversing a tree because making a step is expensive, but reading your data is very fast comparatively. Steps are expensive because they translate to a head seek (that is the time your spinning hard drive needs to position the reading arm to find the spot to read your data from), but reading from a harddrive once the reading head is in place is fast.]]></content><position>64</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 15:07:17 +0200 2009</updated_at><version>3</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 16:56:55 +0200 2009</created_at><updated_at>Tue Jun 16 16:56:55 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ Complexity is missing - if you refer to it on one place...]]></content><user_name>Andres</user_name></comment><comment><created_at>Tue Jun 16 16:49:34 +0200 2009</created_at><updated_at>Tue Jun 16 16:49:34 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ more on what a step is and btrees in general, please]]></content><user_name>chris</user_name></comment><comment><created_at>Tue Jun 16 15:07:17 +0200 2009</created_at><updated_at>Tue Jun 16 15:07:17 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> The other elephant in the room are [b-trees][btr]. The fundamental idea here is to get to your data in a minimal number of steps traversing a tree because making a step is expensive, but reading your data is very fast comparatively. Steps are expensive because they translate to a head seek (that is the time your spinning hard drive needs to position the reading arm <del class="diffmod">that hovers over the plate finds </del><ins class="diffmod">to find </ins>the spot to read your data from), but reading from a harddrive once the reading head is in place is fast.]]></content><user_name>langalex</user_name></comment><comment><created_at>Tue Jun 16 15:07:17 +0200 2009</created_at><updated_at>Tue Jun 16 15:07:17 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> The other elephant in the room are [b-trees][btr]. The fundamental idea here is to get to your data in a minimal number of steps traversing a tree because making a step is expensive, but reading your data is very fast comparatively. Steps are expensive because they translate to a head seek (that is the time your spinning hard drive needs to position the reading arm <del class="diffmod">that hovers over the plate finds </del><ins class="diffmod">to find </ins>the spot to read your data from), but reading from a harddrive once the reading head is in place is fast.]]></content><user_name>langalex</user_name></comment><comment><created_at>Tue Jun 16 15:04:54 +0200 2009</created_at><updated_at>Tue Jun 16 15:04:54 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> The other elephant in the room are [b-trees][btr]. The fundamental idea here is to get to your data in a minimal number of steps traversing a tree because making a step is expensive, but reading your data is very fast comparatively. Steps are expensive because <del class="diffmod">the </del><ins class="diffmod">they </ins>translate to a head seek (that is the time your spinning hard drive needs to position the reading arm that hovers over the plate finds the spot to read your data from), but reading from a harddrive once the reading head is in place is fast.]]></content><user_name>langalex</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>65</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ <div style="text-align:center;"><a href="http://www.scholarpedia.org/article/Image:Btree-Insert-Fig2.GIF"><img src="http://jan.prima.de/~jan/plok/uploads/6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" alt="6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" border="0" width="400" height="158" /></a></div>]]></content><position>66</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 17:32:46 +0200 2009</updated_at><version>2</version><user_name>CK</user_name><comments><comment><created_at>Tue Jun 16 17:32:46 +0200 2009</created_at><updated_at>Tue Jun 16 17:32:46 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span><div style="text-align:center;"><a href="http://www.scholarpedia.org/article/Image:Btree-Insert-Fig2.GIF"><del class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads//6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" alt="6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" border="0" width="400" height="158" /></del><ins class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads/6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" alt="6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" border="0" width="400" height="158" /></ins></a></div>]]></content><user_name>CK</user_name></comment><comment><created_at>Tue Jun 16 17:32:46 +0200 2009</created_at><updated_at>Tue Jun 16 17:32:46 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span><div style="text-align:center;"><a href="http://www.scholarpedia.org/article/Image:Btree-Insert-Fig2.GIF"><del class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads//6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" alt="6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" border="0" width="400" height="158" /></del><ins class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads/6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" alt="6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" border="0" width="400" height="158" /></ins></a></div>]]></content><user_name>CK</user_name></comment><comment><created_at>Tue Jun 16 17:32:46 +0200 2009</created_at><updated_at>Tue Jun 16 17:32:46 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span><div style="text-align:center;"><a href="http://www.scholarpedia.org/article/Image:Btree-Insert-Fig2.GIF"><del class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads//6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" alt="6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" border="0" width="400" height="158" /></del><ins class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads/6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" alt="6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" border="0" width="400" height="158" /></ins></a></div>]]></content><user_name>CK</user_name></comment><comment><created_at>Tue Jun 16 17:32:46 +0200 2009</created_at><updated_at>Tue Jun 16 17:32:46 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span><div style="text-align:center;"><a href="http://www.scholarpedia.org/article/Image:Btree-Insert-Fig2.GIF"><del class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads//6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" alt="6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" border="0" width="400" height="158" /></del><ins class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads/6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" alt="6720EE64-4DFC-4298-B3BA-0145746C6523.jpg" border="0" width="400" height="158" /></ins></a></div>]]></content><user_name>CK</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>67</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ There are a bunch of more interesting lookup structure like [R-Trees][r] for spacial queries, but they are mostly used for secondary indexes on top a regular data set that lives in a hash or b-tree.]]></content><position>68</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>69</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [r]: http://en.wikipedia.org/wiki/R-tree]]></content><position>70</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [o]: http://o]]></content><position>71</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>72</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ##### Concurrency vs. Speed]]></content><position>73</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>74</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Concurrency is hard. The devil lies in the details and when briefly looking at things, the details are often overlooked. Suites the devil.]]></content><position>75</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>76</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Creating storage systems that assume only one access occurs at a time is relatively easy. If resources are shared concurrently, things become tricky. The two larger schools of thought (and practice) are locking and no-locking (heh).]]></content><position>77</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 17:34:12 +0200 2009</updated_at><version>2</version><user_name>CK</user_name><comments><comment><created_at>Tue Jun 16 17:34:13 +0200 2009</created_at><updated_at>Tue Jun 16 17:34:13 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Creating storage systems that assume only one access occurs at a time is relatively easy. If <del class="diffmod">resource </del><ins class="diffmod">resources </ins>are shared concurrently, things become tricky. The two larger schools of thought (and practice) are locking and no-locking (heh).]]></content><user_name>CK</user_name></comment><comment><created_at>Tue Jun 16 17:34:13 +0200 2009</created_at><updated_at>Tue Jun 16 17:34:13 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Creating storage systems that assume only one access occurs at a time is relatively easy. If <del class="diffmod">resource </del><ins class="diffmod">resources </ins>are shared concurrently, things become tricky. The two larger schools of thought (and practice) are locking and no-locking (heh).]]></content><user_name>CK</user_name></comment><comment><created_at>Tue Jun 16 17:34:13 +0200 2009</created_at><updated_at>Tue Jun 16 17:34:13 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Creating storage systems that assume only one access occurs at a time is relatively easy. If <del class="diffmod">resource </del><ins class="diffmod">resources </ins>are shared concurrently, things become tricky. The two larger schools of thought (and practice) are locking and no-locking (heh).]]></content><user_name>CK</user_name></comment><comment><created_at>Tue Jun 16 17:34:12 +0200 2009</created_at><updated_at>Tue Jun 16 17:34:12 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Creating storage systems that assume only one access occurs at a time is relatively easy. If <del class="diffmod">resource </del><ins class="diffmod">resources </ins>are shared concurrently, things become tricky. The two larger schools of thought (and practice) are locking and no-locking (heh).]]></content><user_name>CK</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>78</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Locking means that the database has to maintain information for everybody who wants to write to a part of the database, and what part it is.]]></content><position>79</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 17:37:49 +0200 2009</created_at><updated_at>Tue Jun 16 17:37:49 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ Not really satisfied with this paragraph. The DBMS has to do a lot more than maintain the information, e.g. maintain access (shared locking vs. exclusive locking), queueing the readers/writers, etc, pp. Not sure if one would write about it in a blog entry like this.]]></content><user_name>CK</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>80</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ No locking, or [optimistic locking][ol] or [MVCC][] moves that burden to the person who is trying to write to the database. She must prove that she won't be overwriting any existing data.]]></content><position>81</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>82</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [MVCC]: http://en.wikipedia.org/wiki/Multiversion_concurrency_control]]></content><position>83</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [ol]: http://en.wikipedia.org/wiki/Optimistic_concurrency_control]]></content><position>84</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>85</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ The [trade-offs][troff] here are a leaner request handing on the server that works well with remote & concurrent clients at the expense of more complexity on the client (the person who wants to store something in our database).]]></content><position>86</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>87</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Hybrid approaches are possible too: While MVCC is used internally, the database's clients can rely on database-side locking (e.g. [PostgreSQL][post] or  [InnoDB][inno]).]]></content><position>88</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 18:14:18 +0200 2009</updated_at><version>2</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 18:14:18 +0200 2009</created_at><updated_at>Tue Jun 16 18:14:18 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Hybrid approaches are possible too: While MVCC is used internally, the database's clients can rely on database-side locking (e.g. <ins class="diffins">[PostgreSQL][post] or  </ins>[InnoDB][inno]).]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 16:58:36 +0200 2009</created_at><updated_at>Tue Jun 16 16:58:36 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ As usual, I have to say, that postgres did this a long time before innodb...]]></content><user_name>Andres</user_name></comment></comments></line><line><color></color><content><![CDATA[ [post]: http://postgresql.org/]]></content><position>89</position><created_at>Tue Jun 16 18:14:18 +0200 2009</created_at><updated_at>Tue Jun 16 18:14:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 18:14:18 +0200 2009</created_at><updated_at>Tue Jun 16 18:14:18 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> added new line]]></content><user_name>jan</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>90</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [inno]: http://www.mysqlperformanceblog.com/?s=innodb+mvcc+storage]]></content><position>91</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [troff]: http://jan.prima.de/~jan/plok/archives/175-Benchmarks-You-are-Doing-it-Wrong.html#trade-offs]]></content><position>92</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>93</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ##### Networks]]></content><position>94</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>95</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Just a quick note: We already talk about *client* and *server* here. There is a strong case for embedded databases like [SQLite][sqlite] that don't expose a concurrent user model to the outside. The program that needs an embedded database just includes it.]]></content><position>96</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>97</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [sqlite]: http://www.sqlite.org/]]></content><position>98</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>99</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Another approach to using databases is having a dedicated computer running a database system and sharing it over the network with any number of *clients* using this database *server*. They can often be “a bunch of servers” or a *cluster*. More on that later.]]></content><position>100</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>101</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ A separate database server (networked or not) will need to spend some time to deal with connections, network failures, unspecified client behaviour and so on. The upside is a piece of infrastructure that can be maintained separately. An embedded database will thus be faster but probably won't solve all of your problems and it will always be tied to your application.]]></content><position>102</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>103</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>104</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ##### fsync(): Reliability vs. Speed]]></content><position>105</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>106</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ When people tell me “SuperfastDB does 450,000 a second!” I ask “How many `fsync()`s is that?”. Let me explain:]]></content><position>107</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>108</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ A database system uses operating system services to use any hardware. The operating systems exposes a harddrive through a filesystem. The database systems talks to the filesystem and asks it to store or retrieve data in its behalf. The filesystem then goes ahead and tries to satisfy the database's requests. ]]></content><position>109</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>110</position><created_at>Tue Jun 16 14:11:18 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:18 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ (I'll not talk about databases that can use raw block devices to store data. They exist but they are not as common as those who use the filsystem.)]]></content><position>111</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 17:43:10 +0200 2009</created_at><updated_at>Tue Jun 16 17:43:10 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ DB2 can, about oracle I don't know]]></content><user_name>CK</user_name></comment><comment><created_at>Tue Jun 16 17:00:32 +0200 2009</created_at><updated_at>Tue Jun 16 17:00:32 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ Oracle?]]></content><user_name>Andres</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>112</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ The filesystem also tries to be clever – for good reasons. When the database requests a piece of data, the filesystem will not only find that piece and return it, it will also store it in a *cache* to avoid having to actually talk to the harddrive the next time this piece of data gets requested. When the data changes, the filesystem either removes it from the cache or updates it with the harddrive. It might even go further and only store the new data that comes in with a write request into the cache and rely on a periodic task to write all of the cache back to the drive. Writing a bunch of of pieces at once is more efficient than storing each one on its own.]]></content><position>113</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 17:08:07 +0200 2009</created_at><updated_at>Tue Jun 16 17:08:07 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ Well, at all caching to avoid disk operations when reading the written piece of data is not all. Filesystems (and even hard disks!) cache data to write larger chunks of data. Writing data physically to hard disk is very expensive, so writing one larger chunk of data is much more efficient than writing a lot of small chunks.]]></content><user_name>CK</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>114</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ More efficient equals to faster and faster is good, right? Well, it depends: If all goes well, this approach is a nice one. But you know computers, things will not go well 100% of the time. The failure scenarios are endless, but they boil down to the question: “What happens when your machine dies and you have data that has only been written to memory?” — The answer isn't too hard: That data is lost. If there is a delay between a write request finishing and data being written (or “flushed”) to disk any data that has been “written” during the delay period is subject to lost.]]></content><position>115</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>116</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ There are cases where this is not a problem; in other cases it is. A developer should have the chance to decide. (Note that even your hardware could be lying to you about having stored data, but I'll punt on this one, get proper hardware).]]></content><position>117</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>118</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ So, flushing to disk needs to happen before you can rest assured your data has been stored. Your operating system has an API call that forces the filesystem to write its cache to disk. It is called `fsync()` (on UNIX systems) and it is an expensive operation. You can only do so many `fsync()`s in a second and it is not a great many.]]></content><position>119</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 17:11:21 +0200 2009</created_at><updated_at>Tue Jun 16 17:11:21 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ Also it might be interesting that minimizing fsync()'s might loose of importance if server-grade flash drives get more common/payable]]></content><user_name>Andres</user_name></comment><comment><created_at>Tue Jun 16 17:02:46 +0200 2009</created_at><updated_at>Tue Jun 16 17:02:46 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ Perhaps mention that an estimated upper bound is (rpm/60)*nr_disks]]></content><user_name>Andres</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>120</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ The 450,000 items were most likely just written to memory and not to disk.]]></content><position>121</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>122</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ##### Space & In-Place]]></content><position>123</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>124</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ When writing files to disk (at the end of the day, your data ends up in one file or another on the filesystem) that represents what lives in a database, there are multiple options to handle *updates*.]]></content><position>125</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>126</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ An update is a change to your data item, for example, a new phone number. The intuitive way to handle this is to go and find the old phone number in the file, and overwrite it with the new number. Easy.]]></content><position>127</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>128</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ There are several problems with this approach: What to do if the new phone number is longer than the old one (say you added an international calling prefix)? The new number needs to be written to a different place and the change in location must be recorded. Not too big of an issue.]]></content><position>129</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>130</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Back to failure scenarios: Again, the reasons can be manifold, but what happens when we've (over-)written the first 4 digits of the old with the new number and then the server dies, power goes away or the database server crashes? The next time you want to read the phone number you get a mix of the old and the new one (if you are lucky) and you don't exactly know that this is the case and which parts are missing. Your database file is *inconsistent* and you need to run a *integrity check* to find missing bits and correct half-written bytes. In the worst case that means scanning your entire database file a few times before you resolved all inconstancies. If you have a lot of data, that can take *days*.]]></content><position>131</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 18:16:23 +0200 2009</updated_at><version>2</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 18:16:23 +0200 2009</created_at><updated_at>Tue Jun 16 18:16:23 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Back to failure scenarios: Again, the reasons can be manifold, but what happens when we've (over-)written the first 4 digits of the old with the new number and then the server dies, power goes away or the database server crashes? The next time you want to read the phone number you get a mix of the old and the new one (if you are lucky) and you don't exactly know that this is the case and which parts are missing. Your database file is *inconsistent* and you need to run a *integrity check* to find missing bits and correct half-written bytes. In the worst case that means scanning your entire database file a few times before you resolved all inconstancies. If you have a lot of data, that <del class="diffmod">takes a while.</del><ins class="diffmod">can take *days*.</ins>]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 18:16:23 +0200 2009</created_at><updated_at>Tue Jun 16 18:16:23 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Back to failure scenarios: Again, the reasons can be manifold, but what happens when we've (over-)written the first 4 digits of the old with the new number and then the server dies, power goes away or the database server crashes? The next time you want to read the phone number you get a mix of the old and the new one (if you are lucky) and you don't exactly know that this is the case and which parts are missing. Your database file is *inconsistent* and you need to run a *integrity check* to find missing bits and correct half-written bytes. In the worst case that means scanning your entire database file a few times before you resolved all inconstancies. If you have a lot of data, that <del class="diffmod">takes a while.</del><ins class="diffmod">can take *days*.</ins>]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 16:53:00 +0200 2009</created_at><updated_at>Tue Jun 16 16:53:00 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ it can take DAYS]]></content><user_name>chris</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>132</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ To solve this, you always write the new phone number to a new place in the database file and only when it has been `fsync()`ed to disk, you update the location of the phone number (and then flush that update to disk as well). You will never end up in a scenario where your database file can end up an inconsistent state and after a crash you are back online without an integrity check.]]></content><position>133</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 18:16:05 +0200 2009</updated_at><version>3</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 18:16:05 +0200 2009</created_at><updated_at>Tue Jun 16 18:16:05 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> To solve this, you always write the new phone number to a new place in the database file and only when it has been `fsync()`ed to disk, you update the location of the phone number (and then flush that update to disk as well). You will never end up in a scenario where your database file can end up an inconsistent state and after a crash you are back online without an integrity <del class="diffmod">check that could take days.</del><ins class="diffmod">check.</ins>]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 18:16:05 +0200 2009</created_at><updated_at>Tue Jun 16 18:16:05 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> To solve this, you always write the new phone number to a new place in the database file and only when it has been `fsync()`ed to disk, you update the location of the phone number (and then flush that update to disk as well). You will never end up in a scenario where your database file can end up an inconsistent state and after a crash you are back online without an integrity <del class="diffmod">check that could take days.</del><ins class="diffmod">check.</ins>]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 18:15:55 +0200 2009</created_at><updated_at>Tue Jun 16 18:15:55 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> To solve this, you always write the new phone number to a new place in the database file and only when it has been `fsync()`ed to disk, you update the location of the phone number (and then flush that update to disk as well). You will never end up in a scenario where your database file can end up an inconsistent state and after a crash you are back online without an integrity <del class="diffmod">check.</del><ins class="diffmod">check that could take days.</ins>]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 18:15:55 +0200 2009</created_at><updated_at>Tue Jun 16 18:15:55 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> To solve this, you always write the new phone number to a new place in the database file and only when it has been `fsync()`ed to disk, you update the location of the phone number (and then flush that update to disk as well). You will never end up in a scenario where your database file can end up an inconsistent state and after a crash you are back online without an integrity <del class="diffmod">check.</del><ins class="diffmod">check that could take days.</ins>]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 17:04:23 +0200 2009</created_at><updated_at>Tue Jun 16 17:04:23 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ Without an integrity check, but with a replay of some sort. (undo, redo, whatever)]]></content><user_name>Andres</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>134</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ The trade-off for consistency is write-speed (remember `fsync()`s are expensive) for consistency-check-speed after a failure.]]></content><position>135</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>136</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ A nice bonus is that if the “new place in the database” is the end of the file, you keep your disk-drive head busy with writing data to disk instead of seeking all over the place (remember: seeks are expensive).]]></content><position>137</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>138</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>139</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ##### Distribution, Sharding & Resharding]]></content><position>140</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>141</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ So far, we've been looking at scenarios that involve a single database. We learned a great deal (I hope), but in reality we often deal with more than one database. The simplest reason to have two databases is for redundancy. Failures can bring down your database temporarily or even permanently. If it is a temporary issue, waiting a bit (or a bit longer) to get up and running again might be an option, but often, an application or service should be available at all times. A fatal failure where a database server is lost beyond repair, your data is gone if you haven't stored it in a second place.]]></content><position>142</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>143</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ “I'll just make two copies, easy!”. Yup easy, until you look at the details (that damn devil again!). ]]></content><position>144</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>145</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ It's all about failures again. Consider a single read request. A client connects to a server and asks for a data item. The server looks it up and returns the data to the client. All is well. At any point things can go wrong. The network connection can drop (or slow down so much that client or server assume it dropped), the client can disappear (because of a network failure or crash) as can the server. Clients, servers and the protocols they speak need to be built around the assumption that any of these things (and many more) can go wrong. If any parts is not designed to handle error cases, your system will do funny things, but it won't reliably store and manage your data.]]></content><position>146</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>147</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Add complexity: With each write target (store in two places) the possibility of error and the need for proper error handling [grows exponentially][law]. When evaluating a distributed storage system, looking at how errors are handled is vital.]]></content><position>148</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>149</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [law]: http://en.wikipedia.org/wiki/Metcalfe's_law]]></content><position>150</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>151</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ * * *]]></content><position>152</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>153</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Another reason to distribute data among multiple servers is capacity. The three metrics of interest here are *read requests*, *write requests* and *data*. If you have more requests or data than a single machine can handle, you need to move to multiple machines. Each metric calls for different strategies, but they often go along with each other. The need for fault tolerance that I discussed above needs to be considered alongside.]]></content><position>154</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>155</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Growing read capacity is relatively easy once you covered the base case where the source for reading data might not be the same as the the target for writing data and that there can be a mismatch (cf. [eventual consistency][ec]).]]></content><position>156</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>157</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [ec]: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html]]></content><position>158</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>159</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Distributing writes and data works by designating two machines with 50% of the operations. A clever intermediate, a proxy server for example, decides which request goes where and all is well, we can store twice as much and we can store at twice the speed. When we need to grow bigger yet, we add another server and tell the proxy server to distribute the load equally among them. Adding a proxy for distribution introduces a single point of failure and you don't want these; there's added complexity with this approach.]]></content><position>160</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 18:23:23 +0200 2009</updated_at><version>2</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 18:23:23 +0200 2009</created_at><updated_at>Tue Jun 16 18:23:23 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Distributing writes and data works by designating two machines with 50% of the operations. A clever intermediate, a proxy server for example, decides which request goes where and all is well, we can store twice as much and we can store at twice the speed. When we need to grow bigger yet, we add another server and tell the proxy server to distribute the load equally among <del class="diffmod">them.</del><ins class="diffmod">them. Adding a proxy for distribution introduces a single point of failure and you don't want these; there's added complexity with this approach.</ins>]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 18:23:23 +0200 2009</created_at><updated_at>Tue Jun 16 18:23:23 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span> Distributing writes and data works by designating two machines with 50% of the operations. A clever intermediate, a proxy server for example, decides which request goes where and all is well, we can store twice as much and we can store at twice the speed. When we need to grow bigger yet, we add another server and tell the proxy server to distribute the load equally among <del class="diffmod">them.</del><ins class="diffmod">them. Adding a proxy for distribution introduces a single point of failure and you don't want these; there's added complexity with this approach.</ins>]]></content><user_name>jan</user_name></comment><comment><created_at>Tue Jun 16 17:06:49 +0200 2009</created_at><updated_at>Tue Jun 16 17:06:49 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ Mention, that this is one method of distributing writes and that this method has a SPOF.]]></content><user_name>Andres</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>161</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ <div style="text-align:center;"><img src="http://jan.prima.de/~jan/plok/uploads/resharding.png" alt="resharding.png" border="0" width="477" height="522" /></div>]]></content><position>162</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 17:20:55 +0200 2009</updated_at><version>2</version><user_name>CK</user_name><comments><comment><created_at>Tue Jun 16 17:20:55 +0200 2009</created_at><updated_at>Tue Jun 16 17:20:55 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span><div style="text-align:center;"><del class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads//resharding.png" alt="resharding.png" border="0" width="477" height="522" /></del><ins class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads/resharding.png" alt="resharding.png" border="0" width="477" height="522" /></ins></div>]]></content><user_name>CK</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>163</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ The diagram shows that there is another step needed that wasn't included in the above description. The new “node” needs to have a copy of all data items that are assigned to him and are currently living on the two existing nodes. The process of moving data items to new nodes is called *resharding* and needs to happen every time a new node is added.]]></content><position>164</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>165</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ Resharding can be an expensive operation if you have a lot of data. Techniques like [consistent hashing][ch] help with minimising the amount of items that need to move. If you are looking at a sharding database, you want to understand how the sharding is performed and if you like the trade-offs.]]></content><position>166</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>167</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [ch]: http://en.wikipedia.org/wiki/Consistent_hashing]]></content><position>168</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>169</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 17:13:09 +0200 2009</created_at><updated_at>Tue Jun 16 17:13:09 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ It might also be interesting to describe things like split-brain...]]></content><user_name>Andres</user_name></comment></comments></line><line><color></color><content><![CDATA[ ##### CAP Theorem]]></content><position>170</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>171</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ The [CAP Theorem][cap] states that out of *consistency*, *availability* and *partition tolerance*, a system can choose to support two at any given moment, but never three.]]></content><position>172</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>173</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ <div style="text-align:center;"><img src="http://jan.prima.de/~jan/plok/uploads/cap.png" alt="cap.png" border="0" width="407" height="415" /></div>]]></content><position>174</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 17:22:20 +0200 2009</updated_at><version>2</version><user_name>CK</user_name><comments><comment><created_at>Tue Jun 16 17:22:20 +0200 2009</created_at><updated_at>Tue Jun 16 17:22:20 +0200 2009</updated_at><line_update>true</line_update><content><![CDATA[ <span class="info">edit:</span><div style="text-align:center;"><del class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads//cap.png" alt="cap.png" border="0" width="407" height="415" /></del><ins class="diffmod"><img src="http://jan.prima.de/~jan/plok/uploads/cap.png" alt="cap.png" border="0" width="407" height="415" /></ins></div>]]></content><user_name>CK</user_name></comment></comments></line><line><color></color><content><![CDATA[ ]]></content><position>175</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ *Consistency* guarantees that all clients that talk to cluster of nodes will always get to read the same data. Write operations are *atomic* on all nodes.]]></content><position>176</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>177</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ *Availability* guarantees that in any (reasonable) failure scenario, clients are still able to access their data.]]></content><position>178</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>179</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ *Partition tolerance* guarantees that when nodes in the cluster lose their network connection and two or more completely separated sub-clusters emerge, the system will still be able to store and retrieve data.]]></content><position>180</position><created_at>Tue Jun 16 14:11:19 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:19 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>181</position><created_at>Tue Jun 16 14:11:20 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:20 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ [cap]: http://citeseer.ist.psu.edu/544596.html]]></content><position>182</position><created_at>Tue Jun 16 14:11:20 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:20 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>183</position><created_at>Tue Jun 16 14:11:20 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:20 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ##### Please Talk! (To Developers)]]></content><position>184</position><created_at>Tue Jun 16 14:11:20 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:20 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ ]]></content><position>185</position><created_at>Tue Jun 16 14:11:20 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:20 +0200 2009</updated_at><version>1</version><user_name></user_name><comments></comments></line><line><color></color><content><![CDATA[ If you are aiming for a comparative benchmark of two or more systems, you should run your procedure by they authors. I found developers are happy to help out with benchmarks by clearing up misconceptions or sharing tricks to speed things up (which you can choose to ignore, if you are looking for out-of-the box comparison, but this is rarely useful).]]></content><position>186</position><created_at>Tue Jun 16 14:11:20 +0200 2009</created_at><updated_at>Tue Jun 16 14:11:20 +0200 2009</updated_at><version>1</version><user_name></user_name><comments><comment><created_at>Tue Jun 16 22:57:29 +0200 2009</created_at><updated_at>Tue Jun 16 22:57:29 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ ansonsten: super!]]></content><user_name>langalex</user_name></comment><comment><created_at>Tue Jun 16 22:57:17 +0200 2009</created_at><updated_at>Tue Jun 16 22:57:17 +0200 2009</updated_at><line_update>false</line_update><content><![CDATA[ fehlt da noch das ende? recap?]]></content><user_name>langalex</user_name></comment></comments></line></lines></discussion>
