cancel
Showing results for 
Search instead for 
Did you mean: 

Credit where credit’s due (Python)

Anonymous
Not applicable

Credit where credit’s due (Python)

I’m a C++ myself and have been for years but I had a need today to process an Excel file, extract the sheets then load each sheets’ data in to a table in my database, so I thought I’d give Python a spin. By the way, there’s no real reason for this post apart from the fact that I know others on the Forum use Python so I thought I’d mention this.

So some time reading and some Python code later and it’s all done! The script can process a small 3.3MB file with 6 sheets of data out to CSVs then load these from CSV in to their own table all in under 9 seconds!

Quite impressed I have to admit as I suspect it would have taken me way more than a few hours to do this is C++ land.

 

11 REPLIES 11
chenks76
All Star
Posts: 3,274
Thanks: 338
Fixes: 12
Registered: ‎24-10-2013

Re: Credit where credit’s due (Python)

wouldn't it have been much easier just to do it via a stored procedure within SQL itself?
Anonymous
Not applicable

Re: Credit where credit’s due (Python)

Form me @chenks76 in this instance not really. Some of the sheets have over 40 columns in them and writing a procedure for them would be tedious, then add in clock cost of the transactional overhead as well. What I needed was a Q'n'D solution and this is what I have and to be fair works really well.

The code bulks the data into tables contained in an import schema and from there via a stored procedure they’re moved to the application schema if all goes well.

MauriceC
Resting Legend
Posts: 4,085
Thanks: 929
Fixes: 17
Registered: ‎10-04-2007

Re: Credit where credit’s due (Python)

Interesting @Anonymous.  I've got a couple of eXcel problems that need some data manipulation / Migration.  I'll give Python a look over.

 

Maurice

Superusers are not staff, but they do have a direct line of communication into the business in order to raise issues, concerns and feedback from the community.

Anonymous
Not applicable

Re: Credit where credit’s due (Python)

I am on my iPad (in bed) at the moment @MauriceC, but if you want I will post the script tomorrow if it helps.

MauriceC
Resting Legend
Posts: 4,085
Thanks: 929
Fixes: 17
Registered: ‎10-04-2007

Re: Credit where credit’s due (Python)

Thanks @Anonymous  though I think mine is a simpler problem than yours.  An example script is often easier to follow than a tutorial.

 

M

Superusers are not staff, but they do have a direct line of communication into the business in order to raise issues, concerns and feedback from the community.

Anonymous
Not applicable

Re: Credit where credit’s due (Python)

Well here it is I thought I’d better comment it as well as it can be quite frustrating coming back to the code at a later date and scratching your head when you can’t remember what its doing!

Apologies for the size of the post but I’ve no other way to make it available. You should be made aware that this script uses a PostgreSQL backend but there is no reason why that couldn’t be modified for Oracle, MySQL or SSQ Server.

#!/usr/bin/python

import os, sys, csv, getopt, psycopg2
from timeit import default_timer as timer
from openpyxl import Workbook, load_workbook
from openpyxl.utils.exceptions import CellCoordinatesException, IllegalCharacterError, InsufficientCoordinatesException, InvalidFileException, NamedRangeException, SheetTitleException

#Define the Path to write the outputs to
outPath = "../CSVs"
#Define the SQL used to load the data
sqlLoad = """COPY Imports.%s FROM STDIN WITH HEADER DELIMITER ',' CSV"""
#Name or IP address of DB Server
pgServer = "postgresql"
#Custom string to use to format output file names
csvFileName = "%s/%s-%d-%d.csv"
#List used to store the processed sheet / table names
tableNames = list()
#Default User and Group IDs of data owner
uid = 1
gid = 1
#Variable used for filename passed on CLI
fileName = ""

def usage():
    print("[[-h] | [--help]]\t This help screen")
    print("[[-s] | [--server]]\t PostgreSQL Server (default 192.168.0.128)")
    print("-f | --file\t\t Name (and path) of the xlsx file to process")
    print("-u | --uid\t\t User ID to insert into output")
    print("-g | --gid\t\t Users' Group ID to insert into output")

#Function to laod and parse the Excel file's sheets in to CSV files
def loadFile(fileName):
    inputFile = fileName
    sheets = 0
    numSheets = 0
    print ("Processing Excel File %s" % inputFile)
    try:
        wb = load_workbook(filename = inputFile, data_only=True)
        numSheets = len(wb.get_sheet_names())
        for name in wb.get_sheet_names():
            #Open CSV file for writing
            with open(csvFileName % (outPath, name, uid, gid), 'wb') as f:
                #Tell the CSV writer to use this stream
                csvWriter = csv.writer(f, quotechar="'")
                #Get a sheet from the Excel file
                sheet = wb.get_sheet_by_name(name)
                #Iterate the sheet rows
                for r in sheet.rows:
                    #Define an empty array to store the rows values
                    dataRow = []
                    #Iterate the columns of the row and load then in the array
                    for col in r:
                        dataRow.append(col.value)
                    #Insert the UID and GID values into their positions needed to load
                    dataRow.insert(1, uid) ;
                    dataRow.insert(2, gid) ;
                    #Write the row out to the file
                    csvWriter.writerow(dataRow)
            #Add the sheet name to the list
            tableNames.append(name) ;
    #Catch any of the possible exceptions printing them to the console
    except (CellCoordinatesException, IllegalCharacterError, InsufficientCoordinatesException, InvalidFileException, NamedRangeException, SheetTitleException) as ex:
        print ("Exception (%s) while processing %s in %s" % (ex, name, inputFile))
    #Return the number of sheets found minus those processed 0 == Success
    return (numSheets - len(tableNames))

#Function used to load database, taking a connection object, table name, stream object
def loadDB(dbConn, tableName, fileObject):
    #Define a cursor
    cursor = dataBase.cursor()
    #Delete any existing data from the table that matches the UID/GID pair
    cursor.execute("DELETE FROM Imports.%s WHERE ((uid = %d) AND (gid = %d))" % (tableName, uid, gid))
    #Bulk the data from the CSV file into the database
    cursor.copy_expert(sqlLoad % tableName, fileObject) ;
    #Commit both the delete and load to the database
    dbConn.commit()
    #Count the number of rows inserted
    cursor.execute("SELECT COUNT(1) FROM Imports.%s WHERE ((UID = %d) AND (GID = %d))" % (tableName, uid, gid))
    #Fetch the result of the above COUNT
    rows = cursor.fetchone()
    #Close the cursor
    cursor.close()
    #Return the number of rows counted
    return rows

#Function used to transfer data across schemas
def transferImports(dbConn):
    #Define a cursor
    cursor = dbConn.cursor()
    #Call the stored procedure used to move the data from the Imports to the Application Schema
    cursor.execute("SELECT * FROM transferImported(%d, %d)" % (uid, gid))
    #Fetch the result of the query
    rows = cursor.fetchone()
    #Commit it
    dbConn.commit()
    #Close the cursor
    cursor.close()
    #Return the result
    return rows

#Display the Usage if there are no parameters on the Command Line
if (len(sys.argv) <= 1):
    usage()
    sys.exit(1)
#Read the parameters on the Command line setting out values as needed
try:
    opts, args = getopt.getopt(sys.argv[1:], "hf:s:u:g:r", ["help","file=","server=","uid=", "gid=", "remote"])
    for opt, arg in opts:
        if opt in ("-h","--help"):
            usage()
            sys.exit(0)
        elif opt in ("-f", "--file"):
            fileName = arg
        elif opt in ("-s", "--server"):
            pgServer = arg
        elif opt in ("-u", "--uid"):
            uid = int(arg)
        elif opt in ("-g", "--gid"):
            gid = int(arg)
except getopt.GetoptError as ex:
    print(ex)
    sys.exit(1)

if (fileName == ""):
    print("No File Name")
    usage()
    exit(1)
elif (fileName.endswith(".xls")):
    print("Unsupported Excel format. Please provide an .xlsx format file")
    sys.exit(2)

#Execution starts here so start timer
loadStart = timer()
#Call the Excel reader to create the CSV files 
numSheets = loadFile(fileName)
#Make sure all went well and we have something to load
if ((numSheets == 0) and (len(tableNames) > 0)):
    #Create a connection object to the database
    dataBase = psycopg2.connect(database="*****", user="*****", password="*****", host=pgServer, port="5432")
    #Counter used for total rows processed
    totalRows = 0
    try:
        #Iterate the list of table names and load from CSV
        for table in tableNames:
            #Define variable to store number of returned rows
            rowCount = (0,)
            #Format the file name
            file = csvFileName % (outPath, table, uid, gid)
            #Open the file
            csvFile = open(file, 'ro')
            try:
                #Load the database passing in the conn object, file name and stream object
                rowCount = loadDB(dataBase, table, csvFile)
                totalRows += rowCount[0]
            except (psycopg2, OSError) as e:
                print("Error %s while loading %s", (e, table))
            finally:
                #Close the input file
                csvFile.close()
                #Delete the CSV file regardless
                os.remove(file)
                print("Loaded %d row for %s." % (rowCount[0], table))
        #Move the imported data to the application schema
        rowsTrans = transferImports(dataBase)
    except psycopg2 as e:
        print("Exception %s" % e)
    finally:
        dataBase.close()
    print("Loaded %d rows and Transfered %d Rows" % (totalRows, rowsTrans[0]))
else:
    print("Load failed for %s" % fileName)

loadEnd = timer()
print ("Parsed and loaded %d data sets in %ds" % (len(tableNames), (loadEnd - loadStart)))

 Edit:- Changed arg to int(arg) for the UID / GID CL Paramaters.

Anonymous
Not applicable

Re: Credit where credit’s due (Python)

It’s a bit like eating Pringles, once you start you can’t stop!

I’ve modified the source so it will now process a list of files passed on the command line along with adding a new command line parameter to set the output directory to where the CSV files are written to and a check to ensure it exists.

If you, or anyone, would like a copy please send me a PM, rather than I bloat the thread with source code.

MauriceC
Resting Legend
Posts: 4,085
Thanks: 929
Fixes: 17
Registered: ‎10-04-2007

Re: Credit where credit’s due (Python)

Thanks @Anonymous  been busy on other things today so not even looked at the initial code yet!  Sad  Manana

Superusers are not staff, but they do have a direct line of communication into the business in order to raise issues, concerns and feedback from the community.

Anonymous
Not applicable

Re: Credit where credit’s due (Python)

I don’t know what, if anything, you’ll learn from it as my Python skills are limited, but I am always open to comment both good and bad so if anyone has any thoughts I’d like to hear them. I am of the opinion that learning from ones peers is always a worthwhile exercise.

VileReynard
Hero
Posts: 12,616
Thanks: 582
Fixes: 20
Registered: ‎01-09-2007

Re: Credit where credit’s due (Python)

I've seen a similar, but simpler thing done with BASH. Grin

"In The Beginning Was The Word, And The Word Was Aardvark."

Anonymous
Not applicable

Re: Credit where credit’s due (Python)

You mean like this?

Nano - Shell Loader