/**************************************************************** Copyright (C) Lucent Technologies 1997 All Rights Reserved Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that the copyright notice and this permission notice and warranty disclaimer appear in supporting documentation, and that the name Lucent Technologies or any of its entities not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission. LUCENT DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL LUCENT OR ANY OF ITS ENTITIES BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. ****************************************************************/ /* Suffix sort Peter M. McIlroy M. Douglas McIlroy Prototype int ssort(int a[], int s[]); Purpose Return in a[] a suffix array for the original contents of a[]. (The original values in a[] are typically serial numbers of distinct tokens in some list.) Optionally return in s[] an array of shared lengths between adjacent sorted suffixes. Precondition Array a[] holds n values, with n>=1. Exactly k distinct values, in the range 0..k-1, are present. Value 0, an endmark, appears exactly once, at a[n-1]. Pointer s is 0 or points to an array of n elements. Postcondition Array a[] is a copy of the internal array p[] that records the sorting permutation: if i0, permutation p[] lists h-grams in lexicographic order: if i enum { ORIG = ~(~0u>>1), /* sign bit */ BUCK = ~(~0u>>1) }; void shared(int a[], int n, int p[], int q[], int s[], int h); #define pred(i, h) ((t=(i)-(h))<0? t+n: t) #define succ(i, h) ((t=(i)+(h))>=n? t-n: t) int ssort(int a[], int s[]) { int h, i, j, l, n, t; int k = 0; /* initialized for lint */ int *p = 0; int result = 0; int *q = 0; # define al a # define pl p # define finish(r) if(1){result=r; goto out;}else for(j=n=0; a[n]>0; n++) /* find n */ if(a[n] > j) j = a[n]; /* and max element */ if(a[n++]<0 || j>=n) finish(2); p = malloc(n*sizeof(int)); if(p == 0) finish(1); for(i=0; i=0; ) { /* (2) order */ j = pl[k]; do p[--i] = j; while(!((j=al[j]) & ORIG)); p[i] |= BUCK; } for(i=0; i= n) break; } for(i=0; i0. For each new bucket i created in the current pass, q[i] is (notionally) made to point to the previous bucket, j. The shared length between the i-1st and i-th smallest suffix is stored in s[i]. The shared length between two arbitrary buckets is the minimum of the stored shared lengths on the paths to their nearest common ancestor (excluding that ancestor). For efficient search, paths are compressed by making each bucket point to the next bucket on the path to the root with a strictly smaller stored shared length. This may change the identity of the nearest common ancestor, but does not affect shared lengths. Cost There are 3 serial passes and 1 random pass over size-n arrays per call, plus the inner loops of steps (2) and (3), each of which is executed n times fpr each call of ssort(). The asymptotic worst case has not been analyzed, however the code has always been seen to be at least as fast as a guaranteed O(n log n) method that takes twice as much code. (That method keeps in element i of an integer array the minimum value of s[] over the largest aligned power-of-2 interval beginning at i.) (0) On the first pass, with h=0, the shared length between buckets is zero, and every bucket bcomes a child of the root node, 0. (1) On subsequent passes the old bucket that each potential new bucket refines is remembered in a[]. Array a[] is ordered by substring places, while q[] is ordered by substring values. If the values placed in a[] were kept in another array, the loops of shared() could be combined with loops of lssort(). (2) When an h-gram bucket is split, the shared lengths between adjacent subbuckets is h+u, where u is the shared length between the h-successors that differed (causing the split). To find u, we locate the nearest common ancestor. The common ancestor is found by searching the path from the higher-numbered bucket, k1, until reaching a bucket whose number does not exceed that of the lower-numbered bucket, k0. This method is justified by two observations. All the new subbuckets have strictly longer shared lengths than all the old buckets. All the subbuckets of a bucket fall between the bucket head and its next higher numbered sibling. (3) After new buckets have been processed, the tree q is updated. New buckets are recognized by having mark BUCK and q[i]<0. */ void shared(int a[], int n, int p[], int q[], int s[], int h) { int i, j, k0, k1, t, u; if(h == 0) { /* (0) initialize */ for(i=0; i= 0) j = i; a[p[i]&~BUCK] = j; } for(i=0; i= 0) continue; for(u=n; k1>k0; k1=q[k1]) if(s[k1] < u) u = s[k1]; s[i] = h + u; } for(i=0; i s[t]) break; } j = i; } }