Differences between revisions 7 and 8

Word Design

Since the Sage-Combinat days in Orsay last January, Franco Saliola, Vincent Delecroix and Sébastien Labbé have been rethinking the design of the sage-words library. We call this new design words_next_generation, or simply words_ng. We did a lot of work during the Sage-Combinat days but did not finish it, and the new code quickly bit-rotted because of the Sphinxification that went into Sage in February. On June 9th, the ReST documentation was finished in all the files of words_ng. We hope to finish this big piece of work really soon.

GOAL

The goal of the new design of the words library is to separate the data structures from the mathematical objects, which should greatly improve the efficiency of what is currently in Sage.

Mathematical Objects :

Classes of words :
- Combinatorial class of all words
- Combinatorial class of all words over a given alphabet
Words :
- Finite words
- Infinite words

Data Structures :

Python lists
Python string
Python tuple
Python functions
Python iterators
C++ vector (by Vincent Delecroix, Marseille)
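
As a rough illustration of this separation (a minimal sketch in plain Python, not the words_ng code: the names ListData, CallableData and FiniteWordSketch are hypothetical), the mathematical methods can be written once against len() and indexing, and the storage can then be swapped freely:

```python
class ListData:
    """Storage backed by a Python list (hypothetical name)."""
    def __init__(self, letters):
        self._letters = list(letters)

    def __len__(self):
        return len(self._letters)

    def __getitem__(self, i):
        return self._letters[i]


class CallableData:
    """Storage backed by a function i -> letter plus an explicit length (hypothetical name)."""
    def __init__(self, func, length):
        self._func = func
        self._length = length

    def __len__(self):
        return self._length

    def __getitem__(self, i):
        if not 0 <= i < self._length:
            raise IndexError(i)
        return self._func(i)


class FiniteWordSketch:
    """Mathematical object: it only relies on len() and indexing of its data."""
    def __init__(self, data):
        self._data = data

    def __len__(self):
        return len(self._data)

    def __iter__(self):
        return (self._data[i] for i in range(len(self._data)))

    def is_palindrome(self):
        n = len(self._data)
        return all(self._data[i] == self._data[n - 1 - i] for i in range(n // 2))


# The same word, stored in two different ways.
w1 = FiniteWordSketch(ListData("abba"))
w2 = FiniteWordSketch(CallableData(lambda i: "abba"[i], 4))
```

A faster data structure (for instance a C++ vector) could then override individual methods without changing the mathematical interface.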

TO BE DONE

1. Concatenation (done)

Create a class for concatenating words; we want to at least be able to do what the old code could do.

DONE by Franco :

"I implemented a class called CallableFromListOfWords, which creates a callable object from a list/tuple of words (it is just a tuple with the __call__ method defined). This is an improved version of what was there before. Take a look at the patch called words_ng_concatenation-fs.patch."

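To make the idea of "a tuple with the __call__ method defined" concrete, here is a minimal sketch in plain Python. It is not the code of the patch: the class name CallableFromWordsSketch and the lookup by bisection are assumptions made for illustration only.

```python
from bisect import bisect_right
from itertools import accumulate


class CallableFromWordsSketch(tuple):
    """A tuple of finite words that can be called like a function i -> letter.

    Calling the object with a position i returns the i-th letter of the
    concatenation of the stored words, without building the concatenation.
    """
    def __new__(cls, words):
        self = super().__new__(cls, (tuple(w) for w in words))
        # _ends[k] is the position just after the last letter of the k-th word.
        self._ends = list(accumulate(len(w) for w in self))
        return self

    def __call__(self, i):
        total = self._ends[-1] if self._ends else 0
        if not 0 <= i < total:
            raise IndexError("position out of range")
        k = bisect_right(self._ends, i)           # word containing position i
        start = self._ends[k - 1] if k else 0     # where that word begins
        return self[k][i - start]


c = CallableFromWordsSketch(["ab", [0, 1, 2], "c"])
letters = [c(i) for i in range(6)]
```

The concatenation is never copied, so adding a word costs only the bookkeeping of its end position.
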
2. Add doctests

A bunch of stuff is missing doctests.

Here is the coverage as of June 11th:

~/sage-4.0/devel/sage-combinat/sage/combinat/words$ sage -coverage .
alphabet.py: 100% (27 of 27)
morphism.py: 100% (35 of 35)
shuffle_product.py: 100% (14 of 14)
suffix_trees.py: 97% (46 of 47)
word.py: 97% (146 of 150)
word_datatypes.pyx: 0% (0 of 61)
word_generators.py: 95% (19 of 20)
word_infinite_datatypes.py: 81% (18 of 22)
word_options.py: 100% (1 of 1)
words.py: 100% (38 of 38)

Overall weighted coverage score:  82.6%
Total number of functions:  415
We need    9 more function to get to 85% coverage.
We need   30 more function to get to 90% coverage.
We need   51 more function to get to 95% coverage.

There are already many doctests inside word_datatypes.pyx, but they are not seen by the coverage script. I think that this problem is related to http://trac.sagemath.org/sage_trac/ticket/1795, which has a patch but still needs work.

3. ReST the documentation (done)

Convert the documentation to the ReST format.

DONE (finished June 9th 2009)

4. Run the old doctests (done)

Run all the old doctests against the new code and see what breaks. This is mainly to test for backwards compatibility, and to see whether we deleted some methods (in which case we have to deprecate them).

DONE by Franco :

"Today I re-ran the old doctests, and posted a small patch (called words_ng_small_fix-fs.patch) that dealt with one issue that I found. There is nothing left to do with the old doctests.

[Note that if you run the old tests, then you will see lots of errors. What I did was to go through each error and decide whether it was really an error. If it was, then it got fixed. Some of the old tests break because the new representation is different than the old one. Some doctests that test internal functions break as well, but that is okay since they are internal functions that are not available to the user.]"

5. Performance testing

We should compare the timings of the new and old code. Here is a start:

sage: s = [0,1,2,3,0,1,2,3]*10
sage: w1 = wold.Word(s)
sage: time w1.critical_exponent()
CPU times: user 1.11 s, sys: 0.00 s, total: 1.11 s
Wall time: 1.11 s
20
sage: w2 = Word(s)
sage: time w2.critical_exponent()
CPU times: user 0.26 s, sys: 0.00 s, total: 0.26 s
Wall time: 0.27 s
20
sage: w3 = Word(s, datatype="cpp_basic_string")
sage: time w3.critical_exponent()
CPU times: user 0.23 s, sys: 0.00 s, total: 0.23 s
Wall time: 0.23 s
20
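
For an independent check of the value 20 outside of Sage, and a timing that is easy to repeat, here is a naive reference in plain Python. It is only a sketch: it assumes that the critical exponent of a finite word is the largest value of |u| / period(u) over its nonempty factors u, and the function names border_table and critical_exponent_naive are ours, not part of the library.

```python
from fractions import Fraction
from timeit import timeit


def border_table(u):
    """border[m - 1] = length of the longest proper border of u[:m] (KMP failure function)."""
    border = [0] * len(u)
    k = 0
    for i in range(1, len(u)):
        while k and u[i] != u[k]:
            k = border[k - 1]
        if u[i] == u[k]:
            k += 1
        border[i] = k
    return border


def critical_exponent_naive(w):
    """Largest |u| / period(u) over all nonempty factors u of the finite word w."""
    w = list(w)
    best = Fraction(0)
    for start in range(len(w)):
        suffix = w[start:]
        border = border_table(suffix)
        for m in range(1, len(suffix) + 1):
            # The prefix of length m of this suffix has minimal period m - border[m - 1].
            best = max(best, Fraction(m, m - border[m - 1]))
    return best


s = [0, 1, 2, 3, 0, 1, 2, 3] * 10
seconds = timeit(lambda: critical_exponent_naive(s), number=3) / 3
print(critical_exponent_naive(s), "computed in about", seconds, "s per call")
```

This quadratic version is not meant to compete with the library; it only gives a baseline against which the old code, the new code and the C++ datatype can be compared on the same input.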

6. Remove the directory sage/combinat/words_old

7. Make the words_ng patches commutable in the sage-combinat tree

Currently, they do not commute with the generalized permutations patches because of small conflicts in the setup.py file.

8. Remove side effects of words_ng

Currently, the words_ng patches create the empty directory sage/combinat/words_ng.

9. Fold all the patches together!!

10. Create a ticket on the sage trac

Discussions made at Orsay

This page discusses the specifications of the methods of the Word class. Most of the time the methods do what we want, but this page could help to define standards. Specifications must be precise because we accept that any Word_datatype may override methods defined in Word_all or Finiteword_all...

This page could also serve for discussing improvements to the word algorithms and to the vocabulary of the library.

In particular:

Does the library agree with the standard vocabulary for functions on strings (C string library, C++ string, Python)?
Does the type matter in comparisons, or must we impose type(self) == type(other)? For now, the limit is not quite clear and depends on the method.
Do we allow flexibility in arguments, as in:

sage: w = wng.Word("abbaba")
sage: w.has_suffix("aba")
True
sage: w == "aba"
True
sage: w = wng.Word([0,1,0,0,1])
sage: w.has_suffix([0,0,1])
True

(We should think about calling self._parent(other) at the beginning of each comparison method. Sadly, it is not a good idea: sometimes it is very expensive (a whole copy), and sometimes impossible (for example with an infinite word as argument).)
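
One way to get this flexibility without coercing the argument is to compare letter by letter. The following is only a sketch in plain Python: has_suffix is written here as a free function, and it assumes that both arguments are finite and support len() and indexing (so it does not address the infinite case).

```python
def has_suffix(word, other):
    """Return True if `other` is a suffix of `word`.

    Both arguments may be any finite sequences (str, list, tuple, ...);
    no copy and no conversion to a common type is made.
    """
    n, m = len(word), len(other)
    if m > n:
        return False
    return all(word[n - m + i] == other[i] for i in range(m))


# Arguments of different types can be mixed freely.
mixed = has_suffix(list("abbaba"), "aba")
```

The cost is proportional to the length of the suffix being tested, whatever the types of the two arguments.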

Vocabulary of other libraries

Agreed from the python standard:

__add__(self, other): concatenation
__cmp__(self, other): three-way comparison (this includes equality, less than and greater than)

Ambiguous :

count(self, sub[, start[, stop]]): Return the number of non-overlapping occurrences of substring sub in string self[start:stop]
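
The ambiguity is whether occurrences may overlap. A small sketch in plain Python shows the two possible meanings side by side (the function names count_overlapping and count_nonoverlapping are hypothetical; only the non-overlapping one agrees with Python's str.count):

```python
def _occurs_at(word, sub, i):
    """True if sub occurs in word starting at position i."""
    return all(word[i + j] == sub[j] for j in range(len(sub)))


def count_overlapping(word, sub):
    """Number of positions of word at which sub occurs."""
    if len(sub) == 0:
        raise ValueError("empty subword")
    return sum(1 for i in range(len(word) - len(sub) + 1) if _occurs_at(word, sub, i))


def count_nonoverlapping(word, sub):
    """Number of occurrences found greedily from the left, skipping past each match."""
    if len(sub) == 0:
        raise ValueError("empty subword")
    count = i = 0
    while i <= len(word) - len(sub):
        if _occurs_at(word, sub, i):
            count += 1
            i += len(sub)
        else:
            i += 1
    return count


overlapping = count_overlapping("aaaa", "aa")        # positions 0, 1 and 2
nonoverlapping = count_nonoverlapping("aaaa", "aa")  # positions 0 and 2 only
```

In combinatorics on words the number of occurrences of a factor usually counts overlapping ones, which is why the string convention cannot be adopted without discussion.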

Not used :

partition(sep) -> (head, sep, tail): Searches for the separator sep in S, and returns the part before it, the separator itself, and the part after it. If the separator is not found, returns S and two empty strings.
rpartition(sep) -> (tail, sep, head): Searches for the separator sep in S, starting at the end of S, and returns the part before it, the separator itself, and the part after it. If the separator is not found, returns two empty strings and S.
replace(old, new[, count]) -> string: Return a copy of string S with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
split([sep [,maxsplit]]) -> list of strings: Return a list of the words in the string S, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done. If sep is not specified or is None, any whitespace string is a separator.
rsplit([sep [,maxsplit]]) -> list of strings: Return a list of the words in the string S, using sep as the delimiter string, starting at the end of the string and working to the front. If maxsplit is given, at most maxsplit splits are done. If sep is not specified or is None, any whitespace string is a separator.
translate(table [,deletechars]) -> string: Return a copy of the string S, where all characters occurring in the optional argument deletechars are removed, and the remaining characters have been mapped through the given translation table, which must be a string of length 256.

Modified:

find(self, string[, start[, stop]]) -> int: find the first occurrence of the string as a substring of self[start:stop] and return its index (return -1 if not found)
rfind(self, string[, start[, stop]]) -> int: find the last occurrence of the string as a substring of self[start:stop] and return its index (return -1 if not found)
index(self, string[, start[, stop]]) -> int: same as find but raise a ValueError if not found
rindex(self, string[, start[, stop]]) -> int: same as rfind but raise a ValueError if not found
startswith(self, sub[, start[, stop]]) -> bool: tests if sub is a prefix of self[start:stop]
endswith(self, sub[, start[, stop]]) -> bool: tests if sub is a suffix of self[start:stop]
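
To fix ideas on the find/index pair, here is a sketch in plain Python of the semantics described above, written as free functions working on any finite sequence. It is an illustration under our own assumptions (in particular, start and stop are clamped the way Python slices are), not the library code.

```python
def find(word, sub, start=0, stop=None):
    """Index of the first occurrence of sub in word[start:stop], or -1 if not found."""
    # slice.indices clamps start and stop exactly as slicing would.
    start, stop, _ = slice(start, stop).indices(len(word))
    m = len(sub)
    for i in range(start, stop - m + 1):
        if all(word[i + j] == sub[j] for j in range(m)):
            return i
    return -1


def index(word, sub, start=0, stop=None):
    """Same as find, but raise a ValueError if sub is not found."""
    i = find(word, sub, start, stop)
    if i < 0:
        raise ValueError("subword not found")
    return i


first = find([0, 1, 0, 0, 1], [0, 1], 1)   # search from position 1 onwards
```

The returned index refers to the whole word, not to the slice, as for Python strings.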

Comparison methods

A comparison involves two words, but a word can be many different kinds of object (in particular from a data-structure point of view).

Flexibility or not?
What exactly is allowed?

Improvement

We can improve the speed of a lot of algorithms by providing two functions

def find_first_different(self, other)
def find_last_different(self, other)

which should return the index of the first (respectively last) position at which the two words differ, and -1 if one of the two words ends before a difference is found. These could be really fast in the Cythonized version.
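
A pure Python sketch of the two functions could look as follows. Two points are assumptions of ours, since they are not settled above: the functions are written as free functions on finite sequences, and find_last_different returns an offset counted from the right end (0 for the last letter).

```python
def find_first_different(self, other):
    """Index of the first position where the two words differ.

    Return -1 if one of the words ends before a difference is found
    (that is, if one is a prefix of the other).
    """
    for i, (a, b) in enumerate(zip(self, other)):
        if a != b:
            return i
    return -1


def find_last_different(self, other):
    """Offset from the right end of the last position where the two words differ.

    Return -1 if one of the words ends before a difference is found
    (that is, if one is a suffix of the other).
    """
    for i, (a, b) in enumerate(zip(reversed(self), reversed(other))):
        if a != b:
            return i
    return -1


def has_prefix(word, other):
    """Example of a method that reduces to find_first_different."""
    return len(other) <= len(word) and find_first_different(word, other) == -1
```

Lexicographic comparison, has_prefix, has_suffix and equality tests can all be reduced to these two loops, so speeding them up in one place would benefit many methods.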

Revision 7 as of 2009-06-11 18:20:43 (size: 8634, editor: slabbe)
Revision 8 as of 2009-06-11 18:37:43 (size: 9198, editor: slabbe)