Arrays | hstack | NumPy | Python Methods and Functions

**numpy.hstack()** is used to stack a sequence of input arrays horizontally (i.e. column-wise) into a single array.

Syntax: numpy.hstack(tup)

Parameters:

tup: [sequence of ndarrays] Tuple containing arrays to be stacked. The arrays must have the same shape along all but the second axis, except 1-D arrays, which can be any length.

Return: [stacked ndarray] The stacked array of the input arrays.

**Code #1:**

```
# Python program explaining
# hstack() function

import numpy as geek

# input arrays
in_arr1 = geek.array([1, 2, 3])
print("1st Input array:", in_arr1)

in_arr2 = geek.array([4, 5, 6])
print("2nd Input array:", in_arr2)

# Stack the two arrays horizontally
out_arr = geek.hstack((in_arr1, in_arr2))
print("Output horizontally stacked array:", out_arr)
```

**Output:**

```
1st Input array: [1 2 3]
2nd Input array: [4 5 6]
Output horizontally stacked array: [1 2 3 4 5 6]
```

**Code #2:**

```
# Python program explaining
# hstack() function

import numpy as geek

# input arrays
in_arr1 = geek.array([[1, 2, 3], [-1, -2, -3]])
print("1st Input array:", in_arr1)

in_arr2 = geek.array([[4, 5, 6], [-4, -5, -6]])
print("2nd Input array:", in_arr2)

# Stack the two arrays horizontally
out_arr = geek.hstack((in_arr1, in_arr2))
print("Output stacked array:", out_arr)
```

**Output:**

```
1st Input array: [[ 1  2  3]
 [-1 -2 -3]]
2nd Input array: [[ 4  5  6]
 [-4 -5 -6]]
Output stacked array: [[ 1  2  3  4  5  6]
 [-1 -2 -3 -4 -5 -6]]
```

The simplest way to get row counts per group is by calling `.size()`, which returns a `Series`:

```
df.groupby(["col1","col2"]).size()
```

Usually you want this result as a `DataFrame` (instead of a `Series`) so you can do:

```
df.groupby(["col1", "col2"]).size().reset_index(name="counts")
```

If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.

Consider the following example dataframe:

```
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
```

First let's use `.size()` to get the row counts:

```
In [3]: df.groupby(["col1", "col2"]).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
```

Then let's use `.size().reset_index(name="counts")` to get the counts as a `DataFrame`:

```
In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
```

When you want to calculate statistics on grouped data, it usually looks like this:

```
In [5]: (df
   ...:  .groupby(["col1", "col2"])
   ...:  .agg({
   ...:      "col3": ["mean", "count"],
   ...:      "col4": ["median", "min", "count"]
   ...:  }))
Out[5]:
             col4             col3
           median   min count      mean count
col1 col2
A    B     -0.810 -1.32     4 -0.372500     4
C    D     -0.110 -1.65     3 -0.476667     3
E    F      0.475 -0.47     2  0.455000     2
G    H     -0.630 -0.63     1  1.480000     1
```

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using `join`. It looks like this:

```
In [6]: gb = df.groupby(["col1", "col2"])
...: counts = gb.size().to_frame(name="counts")
...: (counts
...: .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
...: .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
...: .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
```

The code used to generate the test data is shown below:

```
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ["A", "B"],
...: ["A", "B"],
...: ["A", "B"],
...: ["A", "B"],
...: ["C", "D"],
...: ["C", "D"],
...: ["C", "D"],
...: ["E", "F"],
...: ["E", "F"],
...: ["G", "H"]
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
...: )
...:
...: df[["col3", "col4", "col5", "col6"]] = \
...:     df[["col3", "col4", "col5", "col6"]].astype(float)
...:
```

**Disclaimer:**

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean, because pandas will drop `NaN` entries in the mean calculation without telling you about it.
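A minimal sketch of that pitfall, using hypothetical data: `.size()` counts all rows in a group, while `.count()` counts only non-null values, so the two can disagree without any warning.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: group (A, B) has one NaN in col3
df = pd.DataFrame({
    "col1": ["A", "A", "A", "C"],
    "col2": ["B", "B", "B", "D"],
    "col3": [1.0, np.nan, 3.0, 4.0],
})

gb = df.groupby(["col1", "col2"])
row_counts = gb.size()             # counts rows, NaN included
value_counts = gb["col3"].count()  # counts non-null values only

print(row_counts[("A", "B")])    # 3
print(value_counts[("A", "B")])  # 2
```

The mean for group (A, B) is computed from only 2 values, even though the group has 3 rows.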

You can consider shapely:

```
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
point = Point(0.5, 0.5)
polygon = Polygon([(0, 0), (0, 1), (1, 1), (1, 0)])
print(polygon.contains(point))
```

From the methods you've mentioned I've only used the second, `path.contains_points`, and it works fine. In any case, depending on the precision you need for your test, I would suggest creating a numpy bool grid with all nodes inside the polygon set to True (False if not). If you are going to test a lot of points this might be faster (**although notice this relies on you making a test within a "pixel" tolerance**):

```
from matplotlib import path
import matplotlib.pyplot as plt
import numpy as np

first = -3
size = (3 - first) / 100

xv, yv = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))

# square with side length 1 and bottom-left corner at the origin
p = path.Path([(0, 0), (0, 1), (1, 1), (1, 0)])
flags = p.contains_points(np.hstack((xv.flatten()[:, np.newaxis], yv.flatten()[:, np.newaxis])))

grid = np.zeros((101, 101), dtype="bool")
grid[((xv.flatten() - first) / size).astype("int"), ((yv.flatten() - first) / size).astype("int")] = flags

xi, yi = np.random.randint(-300, 300, 100) / 100, np.random.randint(-300, 300, 100) / 100
vflag = grid[((xi - first) / size).astype("int"), ((yi - first) / size).astype("int")]

plt.imshow(grid.T, origin="lower", interpolation="nearest", cmap="binary")
plt.scatter(((xi - first) / size).astype("int"), ((yi - first) / size).astype("int"), c=vflag, cmap="Greens", s=90)
plt.show()
```

The result looks like this:
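For comparison, here is a minimal sketch of the exact `contains_points` test, without the grid approximation, against the same unit square:

```python
from matplotlib import path
import numpy as np

# Test a few points directly against the unit square
p = path.Path([(0, 0), (0, 1), (1, 1), (1, 0)])
pts = np.array([[0.5, 0.5], [2.0, 2.0]])
inside = p.contains_points(pts)
print(inside)  # [ True False]
```

This is slower per point than a precomputed grid lookup, but it has no pixel tolerance.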

I would try this:

```
import numpy as np
from PIL import Image

list_im = ["Test1.jpg", "Test2.jpg", "Test3.jpg"]
imgs = [Image.open(i) for i in list_im]

# pick the image which is the smallest, and resize the others to match it
# (can be arbitrary image shape here)
min_shape = sorted([(np.sum(i.size), i.size) for i in imgs])[0][1]
imgs_comb = np.hstack([np.asarray(i.resize(min_shape)) for i in imgs])

# save that beautiful picture
Image.fromarray(imgs_comb).save("Trifecta.jpg")

# for a vertical stacking it is simple: use vstack
imgs_comb = np.vstack([np.asarray(i.resize(min_shape)) for i in imgs])
Image.fromarray(imgs_comb).save("Trifecta_vertical.jpg")
```

It should work as long as all images are of the same variety (all RGB, all RGBA, or all grayscale). It shouldn't be difficult to ensure this is the case with a few more lines of code. Here are my example images, and the result:
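One way to ensure that, sketched here with synthetic in-memory images rather than files: convert everything to a common mode before stacking, so each array has the same number of channels.

```python
import numpy as np
from PIL import Image

# Mixed modes (RGB, grayscale, RGBA) normalized to RGB before stacking
imgs = [Image.new("RGB", (4, 4)), Image.new("L", (4, 4)), Image.new("RGBA", (4, 4))]
imgs = [im.convert("RGB") for im in imgs]
combined = np.hstack([np.asarray(im) for im in imgs])
print(combined.shape)  # (4, 12, 3)
```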

`np.r_[ ... ]` and `np.c_[ ... ]` are useful alternatives to `vstack` and `hstack`, with square brackets [] instead of round ().

A couple of examples:

```
: import numpy as np
: N = 3
: A = np.eye(N)

: np.c_[ A, np.ones(N) ]              # add a column
array([[ 1.,  0.,  0.,  1.],
       [ 0.,  1.,  0.,  1.],
       [ 0.,  0.,  1.,  1.]])

: np.c_[ np.ones(N), A, np.ones(N) ]  # or two
array([[ 1.,  1.,  0.,  0.,  1.],
       [ 1.,  0.,  1.,  0.,  1.],
       [ 1.,  0.,  0.,  1.,  1.]])

: np.r_[ A, [A[1]] ]                  # add a row
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.]])

: # not np.r_[ A, A[1] ]

: np.r_[ A[0], 1, 2, 3, A[1] ]        # mix vecs and scalars
array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])

: np.r_[ A[0], [1, 2, 3], A[1] ]      # lists
array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])

: np.r_[ A[0], (1, 2, 3), A[1] ]      # tuples
array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])

: np.r_[ A[0], 1:4, A[1] ]            # same, 1:4 == arange(1,4) == 1,2,3
array([ 1.,  0.,  0.,  1.,  2.,  3.,  0.,  1.,  0.])
```

(The reason for square brackets [] instead of round () is that Python expands e.g. 1:4 in square -- the wonders of overloading.)

I think a more straightforward solution and faster to boot is to do the following:

```
import numpy as np
N = 10
a = np.random.rand(N,N)
b = np.zeros((N,N+1))
b[:,:-1] = a
```

And timings:

```
In [23]: N = 10
In [24]: a = np.random.rand(N,N)
In [25]: %timeit b = np.hstack((a,np.zeros((a.shape[0],1))))
10000 loops, best of 3: 19.6 us per loop
In [27]: %timeit b = np.zeros((a.shape[0],a.shape[1]+1)); b[:,:-1] = a
100000 loops, best of 3: 5.62 us per loop
```
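A related sketch using `np.pad`, which expresses the same zero-column padding in a single call (timings not shown; the preallocation approach above is typically at least as fast):

```python
import numpy as np

a = np.random.rand(10, 10)
# pad_width is ((before_rows, after_rows), (before_cols, after_cols)):
# here, one zero column appended on the right
b = np.pad(a, ((0, 0), (0, 1)), mode="constant")
print(b.shape)  # (10, 11)
```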

In general you can concatenate a whole sequence of arrays along any axis:

```
numpy.concatenate( LIST, axis=0 )
```

but you **do** have to worry about the shape and dimensionality of each array in the list (for a 2-dimensional 3x5 output, you need to ensure that they are all 2-dimensional n-by-5 arrays already). If you want to concatenate 1-dimensional arrays as the rows of a 2-dimensional output, you need to expand their dimensionality.
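For example, expanding 1-D arrays so they concatenate as the rows of a 2-D result:

```python
import numpy as np

rows = [np.arange(5), np.arange(5, 10), np.arange(10, 15)]
# 1-D arrays must gain a leading axis before row-wise concatenation
out = np.concatenate([r[np.newaxis, :] for r in rows], axis=0)
print(out.shape)  # (3, 5)
```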

As Jorge's answer points out, there is also the function `stack`, introduced in numpy 1.10:

```
numpy.stack( LIST, axis=0 )
```

This takes the complementary approach: it creates a new view of each input array and adds an extra dimension (in this case, on the left, so each `n`-element 1-D array becomes a 1-by-`n` 2-D array) before concatenating. It will only work if all the input arrays have the same shape, even along the axis of concatenation.
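A quick sketch of that behavior:

```python
import numpy as np

a = np.arange(3)
b = np.arange(3, 6)
# stack adds a new leading axis, so two 3-vectors become a 2x3 array
out = np.stack([a, b], axis=0)
print(out.shape)  # (2, 3)
```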

`vstack` (or equivalently `row_stack`) is often an easier-to-use solution because it will take a sequence of 1- and/or 2-dimensional arrays and expand the dimensionality automatically where necessary and only where necessary, before concatenating the whole list together. Where a new dimension is required, it is added on the left. Again, you can concatenate a whole list at once without needing to iterate:

```
numpy.vstack( LIST )
```
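A minimal sketch of that automatic promotion, mixing a 1-D and a 2-D array:

```python
import numpy as np

# vstack promotes the 1-D array to shape (1, 3) before concatenating
out = np.vstack([np.ones(3), np.zeros((2, 3))])
print(out.shape)  # (3, 3)
```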

This flexible behavior is also exhibited by the syntactic shortcut `numpy.r_[ array1, ...., arrayN ]` (note the square brackets). This is good for concatenating a few explicitly-named arrays but is no good for your situation because this syntax will not accept a sequence of arrays, like your `LIST`.

There is also an analogous function `column_stack` and shortcut `c_[...]`, for horizontal (column-wise) stacking, as well as an *almost*-analogous function `hstack`, although for some reason the latter is less flexible (it is stricter about input arrays' dimensionality, and tries to concatenate 1-D arrays end-to-end instead of treating them as columns).
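The difference is easy to see on two 1-D inputs:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.column_stack([a, b]).shape)  # (3, 2): 1-D inputs treated as columns
print(np.hstack([a, b]).shape)        # (6,):   1-D inputs joined end-to-end
```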

Finally, in the specific case of vertical stacking of 1-D arrays, the following also works:

```
numpy.array( LIST )
```

...because arrays can be constructed out of a sequence of other arrays, adding a new dimension to the beginning.
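For instance:

```python
import numpy as np

rows = [np.arange(3), np.arange(3, 6)]
# equal-length 1-D arrays become the rows of a new 2-D array
out = np.array(rows)
print(out.shape)  # (2, 3)
```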

Some of the main advantages of HDF5 are its hierarchical structure (similar to folders/files), optional arbitrary metadata stored with each item, and its flexibility (e.g. compression). This organizational structure and metadata storage may sound trivial, but it's very useful in practice.

Another advantage of HDF is that the datasets can be either fixed-size *or* flexibly sized. Therefore, it's easy to append data to a large dataset without having to create an entirely new copy.

Additionally, HDF5 is a standardized format with libraries available for almost any language, so sharing your on-disk data between, say, Matlab, Fortran, R, C, and Python is very easy with HDF. (To be fair, it's not too hard with a big binary array, too, as long as you're aware of the C vs. F ordering and know the shape, dtype, etc. of the stored array.)

**Just as the TL/DR:** For an ~8GB 3D array, reading a "full" slice along any axis took ~20 seconds with a chunked HDF5 dataset, and 0.3 seconds (best-case) to *over three hours* (worst case) for a memmapped array of the same data.

Beyond the things listed above, there's another big advantage to a "chunked"* on-disk data format such as HDF5: reading an arbitrary slice (emphasis on arbitrary) will typically be much faster, as the on-disk data is more contiguous on average.

`*` (HDF5 doesn't have to be a chunked data format. It supports chunking, but doesn't require it. In fact, the default for creating a dataset in `h5py` is not to chunk, if I recall correctly.)

Basically, your best case disk-read speed and your worst case disk read speed for a given slice of your dataset will be fairly close with a chunked HDF dataset (assuming you chose a reasonable chunk size or let a library choose one for you). With a simple binary array, the best-case is faster, but the worst-case is *much* worse.

One caveat: if you have an SSD, you likely won't notice a huge difference in read/write speed. With a regular hard drive, though, sequential reads are much, much faster than random reads (i.e. a regular hard drive has a long `seek` time). HDF still has an advantage on an SSD, but it's more due to its other features (e.g. metadata, organization, etc.) than to raw speed.

First off, to clear up confusion, accessing an `h5py` dataset returns an object that behaves fairly similarly to a numpy array, but does not load the data into memory until it's sliced. (Similar to memmap, but not identical.) Have a look at the `h5py` introduction for more information.

Slicing the dataset will load a subset of the data into memory, but presumably you want to do something with it, at which point you'll need it in memory anyway.

If you do want to do out-of-core computations, you can do so fairly easily for tabular data with `pandas` or `pytables`. It is possible with `h5py` (nicer for big N-D arrays), but you need to drop down to a touch lower level and handle the iteration yourself.

However, the future of numpy-like out-of-core computations is Blaze. Have a look at it if you really want to take that route.

First off, consider a 3D C-ordered array written to disk (I'll simulate it by calling `arr.ravel()` and printing the result, to make things more visible):

```
In [1]: import numpy as np
In [2]: arr = np.arange(4*6*6).reshape(4,6,6)
In [3]: arr
Out[3]:
array([[[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[ 12, 13, 14, 15, 16, 17],
[ 18, 19, 20, 21, 22, 23],
[ 24, 25, 26, 27, 28, 29],
[ 30, 31, 32, 33, 34, 35]],
[[ 36, 37, 38, 39, 40, 41],
[ 42, 43, 44, 45, 46, 47],
[ 48, 49, 50, 51, 52, 53],
[ 54, 55, 56, 57, 58, 59],
[ 60, 61, 62, 63, 64, 65],
[ 66, 67, 68, 69, 70, 71]],
[[ 72, 73, 74, 75, 76, 77],
[ 78, 79, 80, 81, 82, 83],
[ 84, 85, 86, 87, 88, 89],
[ 90, 91, 92, 93, 94, 95],
[ 96, 97, 98, 99, 100, 101],
[102, 103, 104, 105, 106, 107]],
[[108, 109, 110, 111, 112, 113],
[114, 115, 116, 117, 118, 119],
[120, 121, 122, 123, 124, 125],
[126, 127, 128, 129, 130, 131],
[132, 133, 134, 135, 136, 137],
[138, 139, 140, 141, 142, 143]]])
```

The values would be stored on disk sequentially as shown in the `In [4]` output below. (Let's ignore filesystem details and fragmentation for the moment.)

```
In [4]: arr.ravel(order="C")
Out[4]:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,
104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143])
```

In the best-case scenario, let's take a slice along the first axis. Notice that these are just the first 36 values of the array. This will be a *very* fast read! (one seek, one read)

```
In [5]: arr[0,:,:]
Out[5]:
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
```

Similarly, the next slice along the first axis will just be the next 36 values. To read a complete slice along this axis, we only need one `seek` operation. If all we're going to be reading is various slices along this axis, then this is the perfect file structure.

However, let's consider the worst-case scenario: a slice along the last axis.

```
In [6]: arr[:,:,0]
Out[6]:
array([[ 0, 6, 12, 18, 24, 30],
[ 36, 42, 48, 54, 60, 66],
[ 72, 78, 84, 90, 96, 102],
[108, 114, 120, 126, 132, 138]])
```

To read this slice in, we need 36 seeks and 36 reads, as all of the values are separated on disk. None of them are adjacent!

This may seem pretty minor, but as we get to larger and larger arrays, the number and size of the `seek` operations grow rapidly. For a large-ish (~10GB) 3D array stored in this way and read in via `memmap`, reading a full slice along the "worst" axis can easily take tens of minutes, even with modern hardware. At the same time, a slice along the best axis can take less than a second. For simplicity, I'm only showing "full" slices along a single axis, but the exact same thing happens with arbitrary slices of any subset of the data.

Incidentally, there are several file formats that take advantage of this and basically store three copies of *huge* 3D arrays on disk: one in C-order, one in F-order, and one in the intermediate between the two. (An example of this is Geoprobe's D3D format, though I'm not sure it's documented anywhere.) Who cares if the final file size is 4TB, storage is cheap! The crazy thing about that is that because the main use case is extracting a single sub-slice in each direction, the reads you want to make are very, very fast. It works very well!

Let's say we store 2x2x2 "chunks" of the 3D array as contiguous blocks on disk. In other words, something like:

```
nx, ny, nz = arr.shape
slices = []
for i in range(0, nx, 2):
    for j in range(0, ny, 2):
        for k in range(0, nz, 2):
            slices.append((slice(i, i+2), slice(j, j+2), slice(k, k+2)))
chunked = np.hstack([arr[chunk].ravel() for chunk in slices])
```

So the data on disk would look like `chunked`:

```
array([ 0, 1, 6, 7, 36, 37, 42, 43, 2, 3, 8, 9, 38,
39, 44, 45, 4, 5, 10, 11, 40, 41, 46, 47, 12, 13,
18, 19, 48, 49, 54, 55, 14, 15, 20, 21, 50, 51, 56,
57, 16, 17, 22, 23, 52, 53, 58, 59, 24, 25, 30, 31,
60, 61, 66, 67, 26, 27, 32, 33, 62, 63, 68, 69, 28,
29, 34, 35, 64, 65, 70, 71, 72, 73, 78, 79, 108, 109,
114, 115, 74, 75, 80, 81, 110, 111, 116, 117, 76, 77, 82,
83, 112, 113, 118, 119, 84, 85, 90, 91, 120, 121, 126, 127,
86, 87, 92, 93, 122, 123, 128, 129, 88, 89, 94, 95, 124,
125, 130, 131, 96, 97, 102, 103, 132, 133, 138, 139, 98, 99,
104, 105, 134, 135, 140, 141, 100, 101, 106, 107, 136, 137, 142, 143])
```

And just to show that they're 2x2x2 blocks of `arr`, notice that these are the first 8 values of `chunked`:

```
In [9]: arr[:2, :2, :2]
Out[9]:
array([[[ 0, 1],
[ 6, 7]],
[[36, 37],
[42, 43]]])
```

To read in any slice along an axis, we'd read in either 6 or 9 contiguous chunks (twice as much data as we need) and then only keep the portion we wanted. That's a worst-case maximum of 9 seeks vs a maximum of 36 seeks for the non-chunked version. (But the best case is still 6 seeks vs 1 for the memmapped array.) Because sequential reads are very fast compared to seeks, this significantly reduces the amount of time it takes to read an arbitrary subset into memory. Once again, this effect becomes larger with larger arrays.

HDF5 takes this a few steps farther. The chunks don't have to be stored contiguously, and they're indexed by a B-tree. Furthermore, they don't have to be the same size on disk, so compression can be applied to each chunk.

`h5py`

By default, `h5py` doesn't create chunked HDF files on disk (I think `pytables` does, by contrast). If you specify `chunks=True` when creating the dataset, however, you'll get a chunked array on disk.

As a quick, minimal example:

```
import numpy as np
import h5py

data = np.random.random((100, 100, 100))
with h5py.File("test.hdf", "w") as outfile:
    dset = outfile.create_dataset("a_descriptive_name", data=data, chunks=True)
    dset.attrs["some key"] = "Did you want some metadata?"
```

Note that `chunks=True` tells `h5py` to automatically pick a chunk size for us. If you know more about your most common use-case, you can optimize the chunk size/shape by specifying a shape tuple (e.g. `(2,2,2)` in the simple example above). This allows you to make reads along a particular axis more efficient or optimize for reads/writes of a certain size.
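A small sketch of passing an explicit chunk shape (hypothetical file and dataset names; a shape like this favors reads along the first axis):

```python
import os
import tempfile
import numpy as np
import h5py

data = np.random.random((20, 20, 20))
fname = os.path.join(tempfile.mkdtemp(), "chunked.hdf5")

with h5py.File(fname, "w") as f:
    # Explicit chunk shape instead of chunks=True
    dset = f.create_dataset("vol", data=data, chunks=(20, 5, 5))
    print(dset.chunks)  # (20, 5, 5)

with h5py.File(fname, "r") as f:
    sl = f["vol"][:, :, 0]  # only the chunks touching this slice are read
print(sl.shape)  # (20, 20)
```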

Just to emphasize the point, let's compare reading in slices from a chunked HDF5 dataset and from a large (~8GB), Fortran-ordered 3D array containing the exact same data.

I've cleared all OS caches between each run, so we're seeing the "cold" performance.

For each file type, we'll test reading in a "full" x-slice along the first axis and a "full" z-slice along the last axis. For the Fortran-ordered memmapped array, the "x" slice is the worst case, and the "z" slice is the best case.

The code used is in a gist (including creating the `hdf` file). I can't easily share the data used here, but you could simulate it with an array of zeros of the same shape `(621, 4991, 2600)` and type `np.uint8`.

The `chunked_hdf.py` script looks like this:

```
import sys
import h5py

def main():
    data = read()
    if sys.argv[1] == "x":
        x_slice(data)
    elif sys.argv[1] == "z":
        z_slice(data)

def read():
    f = h5py.File("/tmp/test.hdf5", "r")
    return f["seismic_volume"]

def z_slice(data):
    return data[:,:,0]

def x_slice(data):
    return data[0,:,:]

main()
```

`memmapped_array.py` is similar, but has a touch more complexity to ensure the slices are actually loaded into memory (by default, another `memmapped` array would be returned, which wouldn't be an apples-to-apples comparison).

```
import numpy as np
import sys

def main():
    data = read()
    if sys.argv[1] == "x":
        x_slice(data)
    elif sys.argv[1] == "z":
        z_slice(data)

def read():
    big_binary_filename = "/data/nankai/data/Volumes/kumdep01_flipY.3dv.vol"
    shape = 621, 4991, 2600
    header_len = 3072
    data = np.memmap(filename=big_binary_filename, mode="r", offset=header_len,
                     order="F", shape=shape, dtype=np.uint8)
    return data

def z_slice(data):
    dat = np.empty(data.shape[:2], dtype=data.dtype)
    dat[:] = data[:,:,0]
    return dat

def x_slice(data):
    dat = np.empty(data.shape[1:], dtype=data.dtype)
    dat[:] = data[0,:,:]
    return dat

main()
```

Let's have a look at the HDF performance first:

```
jofer at cornbread in ~
$ sudo ./clear_cache.sh
jofer at cornbread in ~
$ time python chunked_hdf.py z
python chunked_hdf.py z 0.64s user 0.28s system 3% cpu 23.800 total
jofer at cornbread in ~
$ sudo ./clear_cache.sh
jofer at cornbread in ~
$ time python chunked_hdf.py x
python chunked_hdf.py x 0.12s user 0.30s system 1% cpu 21.856 total
```

A "full" x-slice and a "full" z-slice take about the same amount of time (~20 sec). Considering this is an 8GB array, that's not too bad.

And if we compare this to the memmapped array times (it's Fortran-ordered: a "z-slice" is the best case and an "x-slice" is the worst case):

```
jofer at cornbread in ~
$ sudo ./clear_cache.sh
jofer at cornbread in ~
$ time python memmapped_array.py z
python memmapped_array.py z 0.07s user 0.04s system 28% cpu 0.385 total
jofer at cornbread in ~
$ sudo ./clear_cache.sh
jofer at cornbread in ~
$ time python memmapped_array.py x
python memmapped_array.py x 2.46s user 37.24s system 0% cpu 3:35:26.85 total
```

Yes, you read that right. 0.3 seconds for one slice direction and ~3.5 *hours* for the other.

The time to slice in the "x" direction is *far* longer than the amount of time it would take to load the entire 8GB array into memory and select the slice we wanted! (Again, this is a Fortran-ordered array. The opposite x/z slice timing would be the case for a C-ordered array.)

However, if we're always wanting to take a slice along the best-case direction, the big binary array on disk is very good. (~0.3 sec!)

With a memmapped array, you're stuck with this I/O discrepancy (or perhaps anisotropy is a better term). However, with a chunked HDF dataset, you can choose the chunk size such that access is either equal or is optimized for a particular use-case. It gives you a lot more flexibility.

Hopefully that helps clear up one part of your question, at any rate. HDF5 has many other advantages over "raw" memmaps, but I don't have room to expand on all of them here. Compression can speed some things up (the data I work with doesn't benefit much from compression, so I rarely use it), and OS-level caching often plays more nicely with HDF5 files than with "raw" memmaps. Beyond that, HDF5 is a really fantastic container format. It gives you a lot of flexibility in managing your data, and can be used from more or less any programming language.

Overall, try it and see if it works well for your use case. I think you might be surprised.
