[prog] Converting definitions into 'on the fly' code...

Thu May 6 23:37:29 EST 2004

On Fri, May 07, 2004 at 12:10:45AM +1000 or so it is rumoured hereabouts, 
Rasjid Wilcox thought:
> On Thursday 06 May 2004 00:26, Conor Daly wrote:
> >
> > It may be that this is the way to go.  The question there is how will this
> > scale in the face of large amounts of data?  Can I write a Python module
> > for the testing bit and call this from a compiled core program (yes, I
> > probably can)?
> 
> Yes, you can embed Python into your C program, however, see
> http://www.twistedmatrix.com/users/glyph/rant/extendit.html for a rant on why 
> it may be better to *extend* Python with your C program, rather than embed 
> Python into it.

I suppose this becomes a question of whether Python is to manage the
entire process (in which case, extending Python with C modules looks
sensible) or just handle the test parsing element (in which case it should
be a module called by whatever process manager is running).

> > Spec:
> >
> > Given this string:
> >
> > "a < mean ( b[1] : b[10] ) / 2"
> >
> > 1. Analyse the tokens 'a', 'b[]' and locate the data
> > 2. Analyse the expression
> > 3. Conduct the specified operation on the data
> > 4. Return a result
> 
> Questions:
> (a) Is the syntax already defined, or is that just a sample you made up along 
> the lines what you had in mind?

No, the syntax isn't defined yet; that was just a sample.  Elements like
max() or () precedence operators would be desirable so as to make a test
read like a mathematical equation.  Elements like b[] and b[1]:b[10]
would denote array style operations: ':' is used in spreadsheets for a
range, and ',' might be used for "b[1] and b[10]".  Operators such as
'=', '>' will be interpreted as the 'test', with the two sides of the
equation each to be calculated.  It is not necessarily the case that the
LHS of an equation will be a single element.
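
As a rough illustration of the first step (picking the tokens out of a test
string), here is a minimal sketch of a regex tokenizer in Python.  The token
names and the split_test() helper are my own invention for the example, not
a settled grammar, and a chained test such as "x <= tx <= y" would need more
than a single split.

```python
import re

# Hypothetical token grammar for the test language; nothing here is a
# settled syntax, it only covers the examples discussed in this thread.
TOKEN_SPEC = [
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("NAME",    r"[A-Za-z_]\w*"),
    ("COMPARE", r"<=|>=|<|>|="),
    ("ARITH",   r"[+\-*/]"),
    ("RANGE",   r":"),
    ("COMMA",   r","),
    ("LBRACK",  r"\["),
    ("RBRACK",  r"\]"),
    ("LPAREN",  r"\("),
    ("RPAREN",  r"\)"),
    ("SKIP",    r"\s+"),
    ("ERROR",   r"."),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, value) pairs, dropping whitespace."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError(
                f"unexpected character {match.group()!r} at {match.start()}")
        tokens.append((kind, match.group()))
    return tokens


def split_test(tokens):
    """Split at the first comparison operator into (lhs, operator, rhs)."""
    for i, (kind, value) in enumerate(tokens):
        if kind == "COMPARE":
            return tokens[:i], value, tokens[i + 1:]
    raise ValueError("no comparison operator found")


tokens = tokenize("a < mean ( b[1] : b[10] ) / 2")
lhs, op, rhs = split_test(tokens)
```

Each side of the split can then be handed to whatever evaluates
expressions, so an LHS that is more than a single element is no special
case.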

> (b) Any chance you could send me a little sample data (offlist, perhaps two or 
> three tables with 10 rows per table)?  Or just make up some data if you like.

Will do, see below...

> (c) At least one, preferably two actual formulas, and *precisely* what they 
> mean.  In your above example, I don't know if you mean then mean of b[1] and 
> b[10], or the mean of b[1], b[2], .... , b[9], b[10].  I'm assuming you mean 
> the latter.

Yep, x:y means all elements between x and y.  Typical tests might be:

"Is the 3-hour pressure change supplied equal to the difference between
current pressure and pressure read three hours ago?"

This might be denoted by:

a = ppp[h] - ppp[h-3]

or:

"Does the sum of minute by minute rainfall amounts equal the hourly
rainfall value?"

as:

rrr[h] = sum(r[0]:r[59])

or:

"Is the max temperature at station x within 5 degrees of the highest of
the other stations?"

as:

max( all(tx)) - 5 <= tx <= max( all(tx)) + 5
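
To pin down what the last test would compute, here is a small Python
sketch.  It reads "within 5 degrees" as |tx - max(others)| <= 5, which is
my interpretation of the wording; the function name and the temperatures
are made up for the example and are not real observations.

```python
def within_of_max_elsewhere(readings, station, tolerance=5.0):
    """True if the station's value is within `tolerance` of the highest
    value reported by the other stations."""
    others = [value for stno, value in readings.items() if stno != station]
    if not others:
        raise ValueError("need at least one other station to compare with")
    return abs(readings[station] - max(others)) <= tolerance


# Made-up maximum temperatures keyed by station number.
tx = {353: 17.2, 373: 19.0, 376: 11.5}

ok_353 = within_of_max_elsewhere(tx, 353)   # compared against 19.0
ok_376 = within_of_max_elsewhere(tx, 376)   # compared against 19.0
```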

> (d) What is the database backend being used?  This is mainly relevant since 
> there is little point if the database concerned does not have any Python 
> drivers.

Currently Ingres, but I want to remain as DBMS independent as possible so
we can move to MySQL or PostgreSQL or whatever later...  In any case, if
the 'Index of Tables' discussed below is in use, it should be possible to
construct a module that will translate:

"give me minute rainfall for station x on date blah"

into the relevant SQL query and hand back the data.  In that case, it
wouldn't matter that Ingres didn't have Python drivers since the
'getdatafromdb' module could be written in whatever there were drivers for.
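
A minimal sketch of what such a module might look like in Python, assuming
the index can be loaded as a list of rows.  Everything here is hypothetical:
the INDEX entries are copied from the sample Index table further down,
sqlite3 stands in for the real DBMS purely so the example runs, and the
timestamps are stored in ISO form (reading the sample's 12-03-1995 as
day-month-year), which the real tables may not do.

```python
import sqlite3
from datetime import date, timedelta

# Two entries copied from the sample Index table (hypothetical in-memory form).
INDEX = [
    {"stno": 353, "parameter": "rrr", "start": date(1990, 1, 1),
     "end": date(2002, 4, 24), "table": "autohrs", "column": "rainfall"},
    {"stno": 353, "parameter": "r", "start": date(1990, 1, 1),
     "end": date(2002, 4, 24), "table": "automin", "column": "rainmin"},
]


def locate(stno, parameter, day):
    """Find the table and column holding a parameter for a station on a day."""
    for row in INDEX:
        if (row["stno"] == stno and row["parameter"] == parameter
                and row["start"] <= day
                and (row["end"] is None or day <= row["end"])):
            return row["table"], row["column"]
    raise LookupError(f"no index entry for {stno}/{parameter} on {day}")


def build_query(stno, parameter, day):
    """Translate 'parameter for station on day' into SQL plus parameters."""
    table, column = locate(stno, parameter, day)
    # Table and column names cannot be bound as parameters, so they are
    # interpolated; they come from the index, never from user input.
    sql = (f"SELECT date, {column} FROM {table} "
           "WHERE stno = ? AND date >= ? AND date < ? ORDER BY date")
    next_day = day + timedelta(days=1)
    return sql, (stno, day.isoformat(), next_day.isoformat())


def getdatafromdb(conn, stno, parameter, day):
    sql, params = build_query(stno, parameter, day)
    return conn.execute(sql, params).fetchall()


# Stand-in database with three of the sample minute rainfall rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE automin (stno INTEGER, date TEXT, rainmin REAL)")
conn.executemany("INSERT INTO automin VALUES (?, ?, ?)", [
    (353, "1995-03-12 16:01:00", 0.1),
    (353, "1995-03-12 16:02:00", 0.0),
    (353, "1995-03-12 16:03:00", 0.1),
])
rows = getdatafromdb(conn, 353, "r", date(1995, 3, 12))
```

Only getdatafromdb() touches a driver, so swapping the DBMS later should
mean rewriting that one function rather than the tests.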

> My guess:  You have a whole lot of studies that have been done (and new ones 
> will be continued to be done in the future) that have *broadly* similar info, 
> but by no means exactly the same.  Some studies may have extra data that 
> other studies don't have.  Studies in the future may have extra data 
> (columns) that have not even been thought of yet.  And there is little 

Essentially this.  In our case, the 'studies' are weather observations and
the variations in info stem from the type of observing station (e.g. once
per day 'climate' station, hourly 'synoptic' station, continuously
observing automatic station).  A particular station may change status over
the years.  Our longest running 'climate' station, at Phoenix Park in
Dublin, has been in operation since 1829 and has recently been upgraded
to a continuously observing automatic station.

> For what it is worth, the program that is my core development role at work has 
> a 'temporary' table database where one table keeps an index of all names of 
> the temporary tables, how long to keep them and what dataset they contain.  

This may be the way to go and I'm reasonably happy to step on the DB
designers' toes a bit if it'll help...  It is envisaged that there will be
numerous temporary tables in existence.

> Actually, I can think of a way that you may not even need to have the 'index' 
> table.
> 
> Suppose that Test X has parameters 'a' (a single column) and b[] (an array of 
> columns of varying length depending on the study).
> 
> User says: I want to run Test X, on table Y.

Here it would be "I want to run Test X, on station Y"

> Program says: Select column that is parameter 'a'.
> Program says: Select columns that are 'b[]'.

"Select column from table that has parameter 'a' for station Y."
"Select columns from table that has 'b[]' for station Y."

> User hits 'GO'.

The user is most likely to be an automatic process.  

> Program show records that fail the test.

Data that fails the test(s) will be flagged for later review by an
operator.  Of course, the operator might wish to try out some other test
so I'll have to feed their test requests back to the parser also...

> Is this what you would like, or is it better to have the index table as you 
> have given above, since there is less chance of user error?

Not sure there.  Given my mods, it may not be feasible to do without the
index.

Sample tables...

Tablename: Index
Stno | parameter | description | start_date |   end_date | table   | column
-----+-----------+-------------+------------+------------+---------+-------
373  |    td     |   drybulb   | 1993-04-03 | 1995-12-31 | hours   | td
373  |    td     |   drybulb   | 1996-01-01 |            | minutes | td1
373  |    td     |   drybulb   | 1996-01-01 |            | minutes | td2
376  |    td     |   drybulb   | 1941-01-01 |            | days    | dblb
353  |    td     |   drybulb   | 1990-01-01 | 2002-04-24 | autohrs | drybulb
353  |    tx     |   max       | 1990-01-01 | 2002-04-24 | automax | maxt
353  |    a      |   tendency  | 1990-01-01 | 2002-04-24 | autohrs | tend   
353  |    ppp    |   pressure  | 1990-01-01 | 2002-04-24 | autohrs | msl    
353  |    rrr    |   rainfall  | 1990-01-01 | 2002-04-24 | autohrs | rainfall
353  |    r      |   rainfall  | 1990-01-01 | 2002-04-24 | automin | rainmin
-----+-----------+-------------+------------+------------+---------+-------

Tablename: autohrs
Stno | date                |   drybulb | dewpoint   | rainfall | msl    | tend
-----+---------------------+-----------+------------+----------+--------+------
353  | 12-03-1995 14:00:00 |    9.7    |     4.0    |     0    | 1033.1 | -2.5    
353  | 12-03-1995 15:00:00 |    9.4    |     4.0    |     0    | 1032.4 | -0.3    
353  | 12-03-1995 16:00:00 |    9.0    |     3.8    |     1.6  | 1031.0 | -3.0    
353  | 12-03-1995 17:00:00 |    8.7    |     3.5    |     4.3  | 1029.4 | -3.7    
-----+---------------------+-----------+------------+----------+--------+------

Tablename: automin
stno | date                | rainmin
-----+---------------------+---------
353  | 12-03-1995 16:01:00 |    0.1  
353  | 12-03-1995 16:02:00 |    0.0  
353  | 12-03-1995 16:03:00 |    0.1  
353  | 12-03-1995 16:04:00 |    0.1  
353  | 12-03-1995 16:05:00 |    0.0  
353  | 12-03-1995 16:06:00 |    0.0  
353  | 12-03-1995 16:07:00 |    0.0  
353  | 12-03-1995 16:08:00 |    0.1  
353  | 12-03-1995 16:09:00 |    0.2  
353  | 12-03-1995 16:10:00 |    0.2  
353  | 12-03-1995 16:11:00 |    0.2  
353  | 12-03-1995 16:12:00 |    0.3  
353  | 12-03-1995 16:13:00 |    0.2  
353  | 12-03-1995 16:14:00 |    0.1  
353  | 12-03-1995 16:15:00 |    0.3  
353  | 12-03-1995 16:16:00 |    0.2  
353  | 12-03-1995 16:17:00 |    0.3  
353  | 12-03-1995 16:18:00 |    0.1  
353  | 12-03-1995 16:19:00 |    0.2  
353  | 12-03-1995 16:20:00 |    0.1  
353  | 12-03-1995 16:21:00 |    0.1  
-----+---------------------+---------
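
For what it's worth, here is a quick Python sketch of two of the tests run
against the sample rows above.  It takes the tendency as current pressure
minus the pressure three hours earlier, which is the sign the 17:00 row
appears to follow.  The function names and the 0.05 tolerance are
assumptions for the example only.

```python
import math

# msl and tend values from the sample autohrs rows, keyed by hour of day.
msl = {14: 1033.1, 15: 1032.4, 16: 1031.0, 17: 1029.4}
tend = {14: -2.5, 15: -0.3, 16: -3.0, 17: -3.7}


def tendency_ok(h, tol=0.05):
    """Pressure test: supplied tendency equals ppp[h] - ppp[h-3]."""
    return math.isclose(tend[h], msl[h] - msl[h - 3], abs_tol=tol)


# rainmin values from the sample automin rows, 16:01 to 16:21 in order.
rainmin = [0.1, 0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.1, 0.2, 0.2, 0.2,
           0.3, 0.2, 0.1, 0.3, 0.2, 0.3, 0.1, 0.2, 0.1, 0.1]


def rainfall_ok(hourly_total, minute_values, tol=0.05):
    """Rainfall test: hourly value equals the sum of the minute values."""
    return math.isclose(hourly_total, sum(minute_values), abs_tol=tol)


pressure_check = tendency_ok(17)
# Only 21 of the 60 minutes are in the sample, so this cannot pass.
rain_check = rainfall_ok(4.3, rainmin)
```

Only the 17:00 row can be checked for pressure, since the earlier rows
have no reading three hours before them in the sample.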

Thanks for taking the trouble!

Conor
-- 
Conor Daly <conor.daly at oceanfree.net>

Domestic Sysadmin :-)
---------------------
Faenor.cod.ie
 10:53pm  up 196 days, 15:01,  0 users,  load average: 0.02, 0.15, 0.17
Hobbiton.cod.ie
 10:43pm  up 45 days,  2:09,  1 user,  load average: 0.00, 0.03, 0.05