
I have recently been using Python for a variety of data science projects. Python is famous for its ease of use: someone with coding experience can learn to use it effectively in a few days.
That sounds great, but if you use Python alongside other languages, such as C, you may run into a problem.
Let me give an example from my own experience. I am proficient in imperative languages such as C and C++. I am also proficient in older, classic languages such as Lisp and Prolog. In addition, I have used Java, Javascript and PHP for some time. So Python should be easy for me to learn, right? It only looks easy. I dug a hole for myself: I tended to use Python as if it were C.
Here are the details.
On a recent project I had to deal with geospatial data. I was given a gps track of about 25,000 points, and I needed to repeatedly find the point on that track closest to a given latitude and longitude. My first reaction was to look for existing code that computes the distance between two points given their latitudes and longitudes. I found such code, written by John D. Cook and available in the public domain.
All set! All I had to do was write a Python function that returns the index (into the array of 25,000 points) of the point closest to the input coordinates:
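def closest_distance(lat, lon, trkpts):
    d = 100000.0   # shortest distance found so far
    best = -1      # index of the closest point found so far
    r = trkpts.index
    for i in r:
        # .ix was the pandas indexer at the time; newer pandas uses .loc
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = distance_on_unit_sphere(lat, lon, lati, loni)
        if d > md:
            best = i
            d = md
    return best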
Here, distance_on_unit_sphere is John D. Cook's function, and trkpts is an array containing the coordinates of the gps track (actually a pandas data frame; pandas is a third-party data analysis package for Python).
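For readers who do not have that function at hand, here is a minimal sketch of a great-circle distance on a unit sphere, using the spherical law of cosines. It is my own illustration and not necessarily identical to John D. Cook's code; the constant EARTH_RADIUS_KM is an assumption added to turn the result into kilometres.

```python
import math

def distance_on_unit_sphere(lat1, lon1, lat2, lon2):
    """Great-circle distance on a unit sphere, in radians.

    Latitudes and longitudes are given in degrees.
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    # Spherical law of cosines; clamp to [-1, 1] to guard against rounding.
    c = (math.sin(phi1) * math.sin(phi2)
         + math.cos(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.acos(max(-1.0, min(1.0, c)))

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius (assumption)

# One degree of longitude along the equator, in kilometres.
print(distance_on_unit_sphere(0.0, 0.0, 0.0, 1.0) * EARTH_RADIUS_KM)
```

Since only the closest point matters here, the result does not need to be multiplied by the Earth radius: scaling every distance by the same constant does not change which one is smallest.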
The function above is basically what I would have written in C. It iterates over the trkpts array and keeps, in the local variable best, the index of the point closest to the given coordinates found so far.
So far, so good. Python syntax differs from C in many ways, but writing this code did not take me much time.
The code was fast to write, but very slow to run. For example, I was given 428 points, called waypoints, and for each waypoint I had to find the closest point on the track. Finding the closest point for all 428 waypoints took 3 minutes and 6 seconds on my laptop.
I then changed the query to compute the Manhattan distance, which is an approximation. Instead of computing the exact distance between two points, I add the distance along the east-west axis to the distance along the north-south axis. The Manhattan distance function is as follows:
import math

def manhattan_distance(lat1, lon1, lat2, lon2):
    # Scale the longitude difference by the cosine of the mean latitude.
    lat = (lat1 + lat2) / 2.0
    return abs(lat1 - lat2) + abs(math.cos(math.radians(lat)) * (lon1 - lon2))
In fact, I used an even simpler function, ignoring the fact that a 1 degree difference in latitude corresponds to a larger distance than a 1 degree difference in longitude. The simplified function is as follows:
def manhattan_distance1(lat1, lon1, lat2, lon2):
    return abs(lat1 - lat2) + abs(lon1 - lon2)
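To see what the simplification ignores, here is a small sketch that evaluates both functions on a hypothetical pair of points one degree of longitude apart, first on the equator and then at 60 degrees of latitude. The coordinates are made up for illustration only.

```python
import math

def manhattan_distance(lat1, lon1, lat2, lon2):
    lat = (lat1 + lat2) / 2.0
    return abs(lat1 - lat2) + abs(math.cos(math.radians(lat)) * (lon1 - lon2))

def manhattan_distance1(lat1, lon1, lat2, lon2):
    return abs(lat1 - lat2) + abs(lon1 - lon2)

# Hypothetical points: same latitude, one degree of longitude apart.
for lat in (0.0, 60.0):
    scaled = manhattan_distance(lat, 10.0, lat, 11.0)
    plain = manhattan_distance1(lat, 10.0, lat, 11.0)
    print(lat, scaled, plain)
```

On the equator the two functions agree. At 60 degrees of latitude the scaled version counts the same degree of longitude as roughly half a degree, while the simplified version still counts it as one full degree.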
The closest function is changed as follows:
def closest_manhattan_distance1(lat, lon, trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = manhattan_distance1(lat, lon, lati, loni)
        if d > md:
            best = i
            d = md
    return best
It can be made a little faster still by inlining the body of manhattan_distance1:
def closest_manhattan_distance2(lat, lon, trkpts):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            best = i
            d = md
    return best
When it comes to finding the closest point, this function does as well as John's function. I was hoping my intuition was right: the simpler, the faster. The program now took 2 minutes and 37 seconds, an 18% speedup. Good, but not exciting enough.
I then decided to use Python properly. This means taking advantage of the array operations that pandas supports. These operations come from the numpy package. With them, the code becomes much more concise:
import numpy

def closest(lat, lon, trkpts):
    # Manhattan distance from (lat, lon) to every track point at once.
    cl = numpy.abs(trkpts.Lat - lat) + numpy.abs(trkpts.Lon - lon)
    return cl.idxmin()
This function returns the same result as the previous one. It took 0.5 seconds to run on my laptop. A full 300 times faster! 300 times, that is 30,000%. Incredible. The reason for the speedup is that numpy array operations are implemented in C. We therefore get the best of both worlds: the speed of C and the simplicity of Python.
The lesson is clear: do not write Python code the C way. Use numpy array operations instead of iterating over arrays. For me, this was a change in thinking.
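The comparison can be reproduced in miniature with the sketch below. It is not my original experiment: the track is replaced by 25,000 random points, only a single query is timed, and the loop uses .loc, the indexer that newer pandas versions provide in place of .ix. The function names closest_loop and closest_vectorized are chosen for this illustration, and timings will vary from one machine to another.

```python
import time
import numpy as np
import pandas as pd

# Synthetic stand-in for the gps track: 25,000 random points.
rng = np.random.default_rng(0)
trkpts = pd.DataFrame({
    "Lat": rng.uniform(40.0, 50.0, 25000),
    "Lon": rng.uniform(0.0, 10.0, 25000),
})

def closest_loop(lat, lon, trkpts):
    # C-style version: visit the points one by one.
    d = float("inf")
    best = -1
    for i in trkpts.index:
        md = abs(lat - trkpts.loc[i, "Lat"]) + abs(lon - trkpts.loc[i, "Lon"])
        if d > md:
            best = i
            d = md
    return best

def closest_vectorized(lat, lon, trkpts):
    # Array version: one expression over whole columns.
    cl = np.abs(trkpts.Lat - lat) + np.abs(trkpts.Lon - lon)
    return cl.idxmin()

start = time.perf_counter()
i_loop = closest_loop(45.0, 5.0, trkpts)
t_loop = time.perf_counter() - start

start = time.perf_counter()
i_vec = closest_vectorized(45.0, 5.0, trkpts)
t_vec = time.perf_counter() - start

print("same index:", i_loop == i_vec)
print("loop: %.4f s, vectorized: %.4f s" % (t_loop, t_vec))
```

Both versions keep the first point that reaches the minimum distance, so they should agree on the index they return.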
Update on July 2, 2015. This post is being discussed on Hacker News. Some commenters missed the fact that I was using a pandas data frame. I use it mainly because it is very commonly used in data analysis. If all I wanted was a fast closest-point query, and if I had the time, I would use a quadtree implemented in C or C++.
Second update on July 2, 2015. Some comments also mentioned that numba could speed up the code. I tried it.
Here is what I did; it will not necessarily be the same in your case. First, note that results may differ from one Python installation to another. My test environment is Anaconda installed on Windows, with some additional packages. There may be interference between these packages and numba.
First, I entered the following command to install numba:
$ conda install numba
Judging from the response on the command line, numba was already part of my Anaconda installation. The installation instructions may need to change at some point.
The recommended way to use numba is:
from numba import jit

@jit
def closest_func(lat, lon, trkpts, func):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            #print d, dlat, dlon, lati, loni
            best = i
            d = md
    return best
I did not see any improvement in running time. I also tried a more aggressive compilation setting:
@jit(nopython=True)
def closest_func(lat, lon, trkpts, func):
    d = 100000.0
    best = -1
    r = trkpts.index
    for i in r:
        lati = trkpts.ix[i, 'Lat']
        loni = trkpts.ix[i, 'Lon']
        md = abs(lat - lati) + abs(lon - loni)
        if d > md:
            #print d, dlat, dlon, lati, loni
            best = i
            d = md
    return best
Running this code raised an error.
It seems that the pandas code is more than numba can handle.
Of course, I could spend time changing the data structure so that numba compiles the code correctly. But why should I? The code written with numpy runs fast enough. I use numpy and pandas all the time anyway. Why not keep using them?
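For the curious, changing the data structure could look roughly like the sketch below: the function receives plain numpy arrays instead of a pandas data frame, which is the kind of input numba is designed to compile. This is an illustration only, not something I timed. The name closest_arrays and the three sample points are made up, and the code falls back to plain Python when numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # optional: used only if numba is installed
except ImportError:
    def njit(func):
        # Fallback: leave the function as plain Python.
        return func

@njit
def closest_arrays(lat, lon, lats, lons):
    # Same loop as before, but over plain numpy arrays.
    d = np.inf
    best = -1
    for i in range(lats.shape[0]):
        md = abs(lat - lats[i]) + abs(lon - lons[i])
        if d > md:
            best = i
            d = md
    return best

# Made-up sample points; with a data frame one would pass
# trkpts.Lat.values and trkpts.Lon.values instead.
lats = np.array([48.85, 43.70, 45.76])
lons = np.array([2.35, 7.26, 4.84])
print(closest_arrays(45.0, 5.0, lats, lons))
```

Note that this version returns a position in the arrays rather than an index label of the data frame, so the calling code would have to change as well.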
Someone suggested that I use PyPy. That certainly makes sense, but... I use Jupyter notebooks (a browser-based interactive Python environment) on a hosted server. I use the Python kernel it provides, which is the regular Python 2.7.x kernel. It does not offer PyPy as an option.
Cython was suggested too. Well, if I am going back to compiled code, I might as well write it directly in C or C++. I use Python because it offers notebook-based interactivity, which allows rapid prototyping. That is not what Cython was designed for.


