[Arjen Markus] This lengthy title is meant to introduce a subject for the mathematically interested. It has to do with replacing strings by other strings in a short text and suddenly finding memory has exhausted. ---- Superexponential growth in text manipulation This note discusses a small problem in the area of text manipulation. To get some feeling for this subject, it is perhaps good to know the circumstances under which it arose. I was trsting a small utility that would edit text files based on a few simple commands: you could replace strings that matched a certain pattern, by something else, you could insert the contents of another text file after a line containing a string pattern or replace that text by it. That was (and is) about all the utility did. While I was testing it, I used the same small text file (some 20 lines of text) over and over again and though I had noticed that its size grew, I had not expected to run out of memory - at last the utility needed some 100 MB of memory all of a sudden. This led me to consider the mathematics of the process in more detail. We can strip it down to an example like below: * Take the string "AbbAbb" * Insert after each "A" the original string and repeat this procedure * The resulting strings are: Initial string: AbbAbb After one round: AAbbAbbbbAAbbAbbbb After two rounds: AAAbbAbbbbAAbbAbbbbAAAbbAbbbbAAbbAbbbbbbAAAbbAbbbb- AAbbAbbbbbbbbAAAbbAbbbbAAbbAbbbbAAAbbAbbbbAAbbAbbbbbb- AAAbbAbbbbAAbbAbbbbbbbb As you can see, it is growing very fast! There are a number of variations possible: 1. Insert a fixed string that does not contain "A" 2. Insert a fixed string that does contain one or more "A"s 3. Replace "A" by a such a fixed string 4. Replace "A" by the string from the previous iteration, rather than insert that string after the "A". Let us examine what happens in each of these cases: 1. As the number of "A"s does not change, the length of the resulting string is a linear function of the iteration step: at each step ''n.m'' characters are added, where n is the number of "A"s in the original string and l is the length of the inserted string: N(k+1) = N(k) L(k+1) = L(k) + N m 2. If, at step k, the string contains N(k) "A"s and the fixed string contains ''n'' "A"s and has a total length m, then the resulting string from the next round has N(k) extra copies of the fixed string. So: N(k+1) = N(k) + N(k).n = (n+1) N(k) L(k+1) = L(k) + N(k) m The solution is: N(k) = (n+1)^(k-1) N(0) L(k) = L(0) + N(0) m ((n+1)^k -1)/n 3. When replacing the A, the formula changes, as N(k) "A"s will be deleted: N(k+1) = N(k) + N(k) .n - N(k) = n N(k) L(k+1) = L(k) + N(k) (m-1) And the solution is: N(k) = n^(k-1) N(0) L(k) = L(0) + N(0) m (n^k -1)/(n-1) 4. In this case N(k) "A"s will be replaced by a string of length L(k) and containing N(k) "A"s as well. So: So: N(k+1) = N(k).N(k) = N(k)^2 L(k+1) = L(k) + N(k) (L(k)-1) = N(k) L(k) A closed solution for N(k) is easy to find: N(k) = N(0)^P(k) , P(k) = 2^k The solution for L(k) is found when you take the logarithm of the above equation: ln L(k+1) = ln L(k) + ln N(k) = ln L(k) +2^k ln N(0) Therefore: L(k) = L(0) + N(0)P'(k) , P'(k) = 2^(2^(k+1)-1) 5. The original problem is described by the following equations: N(k+1) = N(k) + N(k).N(k) = (1+N(k)) N(k) L(k+1) = L(k) + N(k).L(k) = (1+N(k)) L(k) As far as I know this problem has no closed solution. Below is a table that compares these five problems (N(0)=n=2, L(0) = m = 6): Iteration Case 1 Case 2 Case 3 Case 4 Case 5 k Nk, Lk Nk, Lk Nk, Lk Nk, Lk Nk, Lk --------- ------ ------ ------ ------ ------ 0 2, 6 2, 6 2, 6 2, 6 2, 6 1 2, 18 6, 18 4, 16 4, 12 6, 18 2 2, 30 18, 54 8, 36 16, 48 42, 126 3 2, 42 54, 162 16, 76 256, 748 1806, 5418 4 2, 54 162, 486 32, 156 65536, 196608 3263442, 9790326