Introduction

I have developed a python utility to process the Greenplum in objects in bulk and in parallel. This utility can perform any task that we can do using SQL queries like VACUUM, VACUUM FULL, ANALYZE, REINDEX, changing permissions and many.

This utility uses 3 modules such as process, objects and configuration module. You can find these modules in config directory.

Process module

In this module, We can define what task we have to perform such VACUUM, REINDEX or any other task you would like to do.

Configuration Module

In this module, We have to specify the database name and maximum parallel threads

Object module

In this module, We can fetch the database objects using SQL query. If fetching a set of objects using SQL query is not possible, We can mention all objects in single file (one object per line) and --ad-hoc option to specify to file name.

For Example:

python gpopb.py --ad-hoc <path/to/file>

Installation

you can clone GPOPB using below command start using it.

git clone https://github.com/pgyogesh/GPOPB.git

Utility

gpopb.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
from multiprocessing import Pool, Value
from config import configuration
from config import objects
from config import process
from gppylib import gplog
import optparse

# Logger
logger = gplog.get_default_logger()
gplog.setup_tool_logging("gpopb", '', "gpadmin")

# Command line option parser
parser = optparse.OptionParser()
parser.add_option('--ad-hoc', dest='adhoc', action='store', help="Specify filename for list of objects")
options, args = parser.parse_args()

max_processes = configuration.MAX_THREADS

logger.info("Getting list of objects:")
db_objects = []
if options.adhoc:
        for line in open(options.adhoc,'r'):
                db_objects.append(line.rstrip('\n'))
else:
        db_objects = objects.get_objects()

logger.info("Objects to be processed: %s" %',\n'.join(db_objects))

def init(args):
    ''' store the counter for later use '''
    global counter
    counter = args

pool = Pool(initializer=init, initargs=(process.counter, ), processes=max_processes)
pool.map(process.task, db_objects)

pool.close()
pool.join()
logger.info("Object processing completed")

This the main program file we have to run.

Help:

1
2
3
4
5
6
[gpadmin@mdw GPOPB]$ python gpopb.py --help
Usage: gpopb.py [options]

Options:
  -h, --help      show this help message and exit
  --ad-hoc=ADHOC  Specify filename for list of objects

--ad-hoc option allows us to specify the file which contains list of objects.

process.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
from multiprocessing import Value
from pygresql.pg import DB
from gppylib import gplog
import configuration
logger = gplog.get_default_logger()
gplog.setup_tool_logging("gpopb", '', "gpadmin")

database = configuration.DATABASE
counter = Value('i', 0)
def task(db_object):
        global counter
        con = DB(dbname = database)
        con.query("vacuum %s" %(db_object))
        con.close()
        with counter.get_lock():
                counter.value += 1
        logger.info(str(counter.value) + " objects completed")

if __name__ == '__main__':
        print("this program should not be running alone :P")

The above process module is example for VACUUM task (Line 13). Similar way, we can write queries for other tasks. See few examples below:

1
2
3
4
5
6
7
8
9
10
11
-- ANALYZE
con.query("ANALYZE %s" %(db_object))

-- VACUUM ANALYZE
con.query("VACUUM ANALYZE %s" %(db_object))

-- CREATE TABLE AS SELECT -- Alternative for VACUUM FULL
con.query("CREATE TABLE %s_temp AS SELECT * FROM %s" %(db_object,db_object))
con.query("ALTER TABLE %s RENAME TO %s_hold" %(db_object))
con.query("ALTER TABLE %s_temp RENAME TO %s" %(db_object,db_object))
con.query("DROP TABLE %s_hold" %s(db_object))

objects.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import configuration
from pygresql.pg import DB

def get_objects():
        database = configuration.DATABASE
        con = DB(dbname=database)
        tables = con.query("select schemaname||'.'||tablename as tablename from pg_tables where schemaname='public'")
        tabledict = tables.dictresult()
        tablelist = []
        con.close()
        for dict in tabledict:
                tablelist.append(dict.get('tablename')) # You should replace the 'tablename' with column name from your SQL query in tables variable
        return tablelist

if __name__ == '__main__':
        print(get_objects())

This module is to fetch the database objects like tables and indexes (line 7).

configuration.py

1
2
MAX_THREADS = 8
DATABASE = 'gpadmin'

In configuration module, We can configure the maximum parallel processes and database name.

Demo

Without –ad-hoc option

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
[gpadmin@mdw GPOPB]$ python gpopb.py
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-Getting list of objects:
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-Objects to be processed: public.tdbschemasize1,
public.tdbschemasize2,
public.vacuum_history,
public.ttblsize,
public.test11,
public.t3,
public.test12,
public.gpcrondump_history_test,
public.reorg_history,
public.gpcrondump_history,
public.test_yogesh,
public.ajm_xx_master_delete,
public.test_after_upgrade,
public.tdbsize,
public.test_sravan,
public.sales_1_prt_1,
public.sales_1_prt_2,
public.sales_1_prt_3,
public.sales_1_prt_4,
public.sales_1_prt_5,
public.sales_1_prt_6,
public.sales_1_prt_7,
public.sales_1_prt_8,
public.sales_1_prt_9,
public.sales_1_prt_10,
public.sales_1_prt_11,
public.sales_1_prt_12,
public.sales_1_prt_13,
public.sales
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-1 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-2 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-3 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-4 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-5 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-6 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-7 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-8 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-9 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-10 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-11 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-12 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-13 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-14 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-15 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-16 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-17 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-18 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-19 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-20 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-21 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-22 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-23 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-24 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-25 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-26 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-27 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-28 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-29 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-Object processing completed

pg_stat_activity

pg_stat_activity during above run.

1
2
3
4
5
6
7
8
9
10
 database | process_id | session_id | username |             current_query             | query_duration  | session_duration | is_waiting
----------+------------+------------+----------+---------------------------------------+-----------------+------------------+------------
 gpadmin  |      70739 |    4526832 | gpadmin  | vacuum public.tdbschemasize2          | 00:00:00.974272 | 00:00:00.980811  | f
 gpadmin  |      70741 |    4526834 | gpadmin  | vacuum public.ttblsize                | 00:00:00.973872 | 00:00:00.97985   | f
 gpadmin  |      70744 |    4526837 | gpadmin  | vacuum public.test12                  | 00:00:00.973427 | 00:00:00.978471  | f
 gpadmin  |      70738 |    4526831 | gpadmin  | vacuum public.tdbschemasize1          | 00:00:00.971366 | 00:00:00.981336  | f
 gpadmin  |      70740 |    4526833 | gpadmin  | vacuum public.vacuum_history          | 00:00:00.971037 | 00:00:00.980355  | f
 gpadmin  |      70742 |    4526835 | gpadmin  | vacuum public.test11                  | 00:00:00.970573 | 00:00:00.979375  | f
 gpadmin  |      70743 |    4526836 | gpadmin  | vacuum public.t3                      | 00:00:00.969172 | 00:00:00.978906  | f
 gpadmin  |      70745 |    4526838 | gpadmin  | vacuum public.gpcrondump_history_test | 00:00:00.968709 | 00:00:00.978118  | f

With –ad-hoc option

1
2
3
4
5
6
7
[gpadmin@mdw GPOPB]$ python gpopb.py --ad-hoc test.list
20180724:11:45:47:074087 gpopb::gpadmin-[INFO]:-Getting list of objects:
20180724:11:45:47:074087 gpopb::gpadmin-[INFO]:-Objects to be processed: public.vacuum_history,
public.tdbschemasize1
20180724:11:45:48:074087 gpopb::gpadmin-[INFO]:-1 objects completed
20180724:11:45:48:074087 gpopb::gpadmin-[INFO]:-2 objects completed
20180724:11:45:48:074087 gpopb::gpadmin-[INFO]:-Object processing completed

Let me know if you have any issues, suggestions in comment box below. Also, I would like to see what you achieve with this. Thanks :smiley: