Introduction
I have developed a python utility to process the Greenplum in objects in bulk and in parallel. This utility can perform any task that we can do using SQL queries like VACUUM, VACUUM FULL, ANALYZE, REINDEX, changing permissions and many.
This utility uses 3 modules such as process, objects and configuration module. You can find these modules in config
directory.
Process module
In this module, We can define what task we have to perform such VACUUM, REINDEX or any other task you would like to do.
Configuration Module
In this module, We have to specify the database name and maximum parallel threads
Object module
In this module, We can fetch the database objects using SQL query. If fetching a set of objects using SQL query is not possible, We can mention all objects in single file (one object per line) and --ad-hoc
option to specify to file name.
For Example:
python gpopb.py --ad-hoc <path/to/file>
Installation
you can clone GPOPB using below command start using it.
git clone https://github.com/pgyogesh/GPOPB.git
Utility
gpopb.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
from multiprocessing import Pool, Value
from config import configuration
from config import objects
from config import process
from gppylib import gplog
import optparse
# Logger
logger = gplog.get_default_logger()
gplog.setup_tool_logging("gpopb", '', "gpadmin")
# Command line option parser
parser = optparse.OptionParser()
parser.add_option('--ad-hoc', dest='adhoc', action='store', help="Specify filename for list of objects")
options, args = parser.parse_args()
max_processes = configuration.MAX_THREADS
logger.info("Getting list of objects:")
db_objects = []
if options.adhoc:
for line in open(options.adhoc,'r'):
db_objects.append(line.rstrip('\n'))
else:
db_objects = objects.get_objects()
logger.info("Objects to be processed: %s" %',\n'.join(db_objects))
def init(args):
''' store the counter for later use '''
global counter
counter = args
pool = Pool(initializer=init, initargs=(process.counter, ), processes=max_processes)
pool.map(process.task, db_objects)
pool.close()
pool.join()
logger.info("Object processing completed")
This the main program file we have to run.
Help:
1
2
3
4
5
6
[gpadmin@mdw GPOPB]$ python gpopb.py --help
Usage: gpopb.py [options]
Options:
-h, --help show this help message and exit
--ad-hoc=ADHOC Specify filename for list of objects
--ad-hoc
option allows us to specify the file which contains list of objects.
process.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
from multiprocessing import Value
from pygresql.pg import DB
from gppylib import gplog
import configuration
logger = gplog.get_default_logger()
gplog.setup_tool_logging("gpopb", '', "gpadmin")
database = configuration.DATABASE
counter = Value('i', 0)
def task(db_object):
global counter
con = DB(dbname = database)
con.query("vacuum %s" %(db_object))
con.close()
with counter.get_lock():
counter.value += 1
logger.info(str(counter.value) + " objects completed")
if __name__ == '__main__':
print("this program should not be running alone :P")
The above process module is example for VACUUM task (Line 13). Similar way, we can write queries for other tasks. See few examples below:
1
2
3
4
5
6
7
8
9
10
11
-- ANALYZE
con.query("ANALYZE %s" %(db_object))
-- VACUUM ANALYZE
con.query("VACUUM ANALYZE %s" %(db_object))
-- CREATE TABLE AS SELECT -- Alternative for VACUUM FULL
con.query("CREATE TABLE %s_temp AS SELECT * FROM %s" %(db_object,db_object))
con.query("ALTER TABLE %s RENAME TO %s_hold" %(db_object))
con.query("ALTER TABLE %s_temp RENAME TO %s" %(db_object,db_object))
con.query("DROP TABLE %s_hold" %s(db_object))
objects.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import configuration
from pygresql.pg import DB
def get_objects():
database = configuration.DATABASE
con = DB(dbname=database)
tables = con.query("select schemaname||'.'||tablename as tablename from pg_tables where schemaname='public'")
tabledict = tables.dictresult()
tablelist = []
con.close()
for dict in tabledict:
tablelist.append(dict.get('tablename')) # You should replace the 'tablename' with column name from your SQL query in tables variable
return tablelist
if __name__ == '__main__':
print(get_objects())
This module is to fetch the database objects like tables and indexes (line 7).
configuration.py
1
2
MAX_THREADS = 8
DATABASE = 'gpadmin'
In configuration module, We can configure the maximum parallel processes and database name.
Demo
Without –ad-hoc option
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
[gpadmin@mdw GPOPB]$ python gpopb.py
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-Getting list of objects:
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-Objects to be processed: public.tdbschemasize1,
public.tdbschemasize2,
public.vacuum_history,
public.ttblsize,
public.test11,
public.t3,
public.test12,
public.gpcrondump_history_test,
public.reorg_history,
public.gpcrondump_history,
public.test_yogesh,
public.ajm_xx_master_delete,
public.test_after_upgrade,
public.tdbsize,
public.test_sravan,
public.sales_1_prt_1,
public.sales_1_prt_2,
public.sales_1_prt_3,
public.sales_1_prt_4,
public.sales_1_prt_5,
public.sales_1_prt_6,
public.sales_1_prt_7,
public.sales_1_prt_8,
public.sales_1_prt_9,
public.sales_1_prt_10,
public.sales_1_prt_11,
public.sales_1_prt_12,
public.sales_1_prt_13,
public.sales
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-1 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-2 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-3 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-4 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-5 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-6 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-7 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-8 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-9 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-10 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-11 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-12 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-13 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-14 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-15 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-16 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-17 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-18 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-19 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-20 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-21 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-22 objects completed
20180724:11:45:57:074134 gpopb::gpadmin-[INFO]:-23 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-24 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-25 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-26 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-27 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-28 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-29 objects completed
20180724:11:45:58:074134 gpopb::gpadmin-[INFO]:-Object processing completed
pg_stat_activity
pg_stat_activity during above run.
1
2
3
4
5
6
7
8
9
10
database | process_id | session_id | username | current_query | query_duration | session_duration | is_waiting
----------+------------+------------+----------+---------------------------------------+-----------------+------------------+------------
gpadmin | 70739 | 4526832 | gpadmin | vacuum public.tdbschemasize2 | 00:00:00.974272 | 00:00:00.980811 | f
gpadmin | 70741 | 4526834 | gpadmin | vacuum public.ttblsize | 00:00:00.973872 | 00:00:00.97985 | f
gpadmin | 70744 | 4526837 | gpadmin | vacuum public.test12 | 00:00:00.973427 | 00:00:00.978471 | f
gpadmin | 70738 | 4526831 | gpadmin | vacuum public.tdbschemasize1 | 00:00:00.971366 | 00:00:00.981336 | f
gpadmin | 70740 | 4526833 | gpadmin | vacuum public.vacuum_history | 00:00:00.971037 | 00:00:00.980355 | f
gpadmin | 70742 | 4526835 | gpadmin | vacuum public.test11 | 00:00:00.970573 | 00:00:00.979375 | f
gpadmin | 70743 | 4526836 | gpadmin | vacuum public.t3 | 00:00:00.969172 | 00:00:00.978906 | f
gpadmin | 70745 | 4526838 | gpadmin | vacuum public.gpcrondump_history_test | 00:00:00.968709 | 00:00:00.978118 | f
With –ad-hoc option
1
2
3
4
5
6
7
[gpadmin@mdw GPOPB]$ python gpopb.py --ad-hoc test.list
20180724:11:45:47:074087 gpopb::gpadmin-[INFO]:-Getting list of objects:
20180724:11:45:47:074087 gpopb::gpadmin-[INFO]:-Objects to be processed: public.vacuum_history,
public.tdbschemasize1
20180724:11:45:48:074087 gpopb::gpadmin-[INFO]:-1 objects completed
20180724:11:45:48:074087 gpopb::gpadmin-[INFO]:-2 objects completed
20180724:11:45:48:074087 gpopb::gpadmin-[INFO]:-Object processing completed
Let me know if you have any issues, suggestions in comment box below. Also, I would like to see what you achieve with this. Thanks