Since GSoC is finally wrapping up, I spent most of this week reading over code in the PR and writing documentation. I added a new section on table indexing to the docs, after the "Table operations" section, which should give a good introduction to the indexing functionality. It also links to an IPython notebook I wrote (http://nbviewer.ipython.org/github/mdmueller/astropy-notebooks/blob/master/table/indexing-profiling.ipynb) that shows profiling results for indexing in different scenarios, e.g. testing different engines and using regular columns vs. mixins. I also ran the asv benchmarking tool on features relevant to indexing, and fixed a performance issue in which Table sorting slowed down while sorting a primary index.
There's not much else to describe in terms of final changes, although I do worry about areas where index copying or relabeling comes up unexpectedly and hurts performance. For example, using the `loc` attribute on Table is very slow for an indexing engine like FastRBT (which is slow to copy), since the returned rows of the Table are retrieved via a slice that relabels indices. This is necessary if the user wants indices in the returned slice, but I doubt that's usually needed. I guess the two alternatives here are either to have `loc` return something else (like a non-indexed slice) or to simply advise in the documentation that the index mode 'discard_on_copy' is appropriate in such a scenario.
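To make the relabeling cost concrete, here is a rough pure-Python sketch. It is not the astropy implementation: `relabel_index`, `slice_rows` and the `discard_index` flag are hypothetical names, and the index is modeled as a plain list of (key, row) pairs. The point is only that building an index for a slice touches every entry of the parent index, while discarding the index skips that work.

```python
def relabel_index(index, start, stop):
    """Build the index for rows[start:stop] from the parent index.

    `index` is a list of (key, row) pairs sorted by key (a toy model,
    not astropy's data structure). Every entry is visited, so the cost
    grows with the parent index even when the slice is tiny.
    """
    return [(key, row - start) for key, row in index if start <= row < stop]


def slice_rows(rows, index, start, stop, discard_index=False):
    """Slice `rows`; relabel the index unless the caller opts out."""
    sliced = rows[start:stop]
    if discard_index:
        # Similar in spirit to the 'discard_on_copy' mode: no index work.
        return sliced, None
    return sliced, relabel_index(index, start, stop)


rows = ['c', 'd', 'e', 'f']
index = [(1, 0), (2, 1), (3, 2), (4, 3)]
print(slice_rows(rows, index, 1, 3))
print(slice_rows(rows, index, 1, 3, discard_index=True))
```

In this toy model the second call does no per-entry work at all, which is the trade-off behind recommending 'discard_on_copy' when the returned slice does not need indices.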
Tuesday, August 18, 2015
Monday, August 10, 2015
Week 11
This week I implemented a nice bit of functionality that Tom suggested, inspired by a similar feature in Pandas: retrieving index information via the Table attributes `loc` and `iloc`. The idea is to provide a mechanism for row retrieval in between a high-level `query()` method and dealing with Index objects directly. Here's an example:
```
In [1]: from astropy.table.table_helpers import simple_table
In [2]: t = simple_table(10)
In [3]: print(t)
a b c
--- ---- ---
1 1.0 c
2 2.0 d
3 3.0 e
4 4.0 f
5 5.0 g
6 6.0 h
7 7.0 i
8 8.0 j
9 9.0 k
10 10.0 l
In [4]: t.add_index('a')
In [5]: t.add_index('b')
In [6]: t.loc[4:9] # 'a' is the implicit primary key
Out[6]:
<Table length=6>
a b c
int32 float32 str1
----- ------- ----
4 4.0 f
5 5.0 g
6 6.0 h
7 7.0 i
8 8.0 j
9 9.0 k
In [7]: t.loc['b', 1.5:7.0]
Out[7]:
<Table length=6>
a b c
int32 float32 str1
----- ------- ----
2 2.0 d
3 3.0 e
4 4.0 f
5 5.0 g
6 6.0 h
7 7.0 i
In [8]: t.iloc[2:4]
Out[8]:
<Table length=2>
a b c
int32 float32 str1
----- ------- ----
3 3.0 e
4 4.0 f
```
The `loc` attribute is used for retrieval by column value, while `iloc` is used for retrieval by position in the sorted order of an index. This involves the designation of a primary key, which for now is just the first index added to the table. Also, indices can now be retrieved by column name(s):
```
In [9]: t.indices['b']
Out[9]:
b rows
---- ----
1.0 0
2.0 1
3.0 2
4.0 3
5.0 4
6.0 5
7.0 6
8.0 7
9.0 8
10.0 9
```
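The difference between `loc` and `iloc` can be illustrated with a small stand-alone sketch. This is a toy model, not the code in the PR: the `SortedIndex` class and its method signatures are made up for illustration, and it uses the standard `bisect` module on a sorted key list. As in the session above, the value range passed to `loc` is inclusive at both ends, while `iloc` takes ordinary half-open positions in sorted order.

```python
from bisect import bisect_left, bisect_right


class SortedIndex:
    """Toy index: sorted keys stored next to the row numbers they came from."""

    def __init__(self, values):
        pairs = sorted((value, row) for row, value in enumerate(values))
        self.keys = [value for value, _ in pairs]
        self.rows = [row for _, row in pairs]

    def loc(self, lo, hi):
        """Row numbers whose key lies in [lo, hi], both ends inclusive."""
        left = bisect_left(self.keys, lo)
        right = bisect_right(self.keys, hi)
        return self.rows[left:right]

    def iloc(self, start, stop):
        """Row numbers at positions [start, stop) in sorted-key order."""
        return self.rows[start:stop]


index_b = SortedIndex([float(i) for i in range(1, 11)])
print(index_b.loc(1.5, 7.0))  # [1, 2, 3, 4, 5, 6]
print(index_b.iloc(2, 4))     # [2, 3]
```

Rows 1 through 6 are the ones holding `b` values 2.0 through 7.0, which matches the six rows returned by `t.loc['b', 1.5:7.0]` above.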
Aside from this, I've been adding in miscellaneous changes to the PR, such as getting `np.lexsort` to work with Time objects, reworking the `SortedArray` class to use a `Table` object instead of a list of ndarrays (for working with mixins), putting `index_mode` in `Table`, etc. Tom noted some performance issues when working with indices, which I've been working on as well.
Tuesday, August 4, 2015
Week 10
This week wasn't terribly eventful; I spent time documenting code, expanding tests, etc. for the pull request. Docstrings are now in numpydoc format, and I fixed a few bugs including one that Tom noticed when taking a slice of a slice:
```
from astropy import table
from astropy.table import table_helpers
t = table_helpers.simple_table(10)
t.add_index('a')
t2 = t[1:]
t3 = t2[1:]
print(t3.indices[0])
```
The former output was "Index slice (2, 10, 2) of [[ 1 2 3 4 5 6 7 8 9 10], [0 1 2 3 4 5 6 7 8 9]]" while now the step size is 1, as it should be. The SlicedIndex system seems to be working fine otherwise, except for a Python 3 bug I found involving the new behavior of the `/` operator (i.e. it returns a float even for integer operands), though this is fixed now.
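Both issues are easy to reproduce outside of astropy. The sketch below is not the SlicedIndex code; `compose_slices` and `sliced_length` are hypothetical helpers that lean on Python's built-in `range` to compose two slices, and on `//` to keep the length computation in integers under Python 3.

```python
def compose_slices(outer, inner, length):
    """Return (start, stop, step) for applying `inner` to the result of `outer`.

    Slicing a range yields another range, so the built-in arithmetic
    does the composition for us.
    """
    r = range(length)[outer][inner]
    return r.start, r.stop, r.step


def sliced_length(start, stop, step):
    """Number of rows selected, for a positive step.

    Uses // (floor division); under Python 3 a plain / would return a float.
    """
    return max(0, (stop - start + step - 1) // step)


# t[1:][1:] on a 10-row table: the step stays 1, only the start moves.
print(compose_slices(slice(1, None), slice(1, None), 10))  # (2, 10, 1)
print(sliced_length(2, 10, 1))                             # 8
```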
Another new change is to the `index_mode` context manager: the "copy_on_getitem" mode now properly affects only the supplied table rather than tampering with `BaseColumn` directly. Michael's workaround is to change the `__class__` attribute of each relevant column to a subclass (either `_GetitemColumn` or `_GetitemMaskedColumn`) with the correct `__getitem__` method, and this should rule out possible unlikely side effects. Aside from this, I've also been looking into improving the performance of the engines other than `SortedArray`. The main issue I see is that there's a lot of Python object creation in the engine initialization, which unfortunately seems to be unavoidable given the constraints of the bintrees library. The success of `SortedArray` really lies in the fact that it deals with numpy arrays, so I'm looking into creating an ndarray-based binary search tree.