Credit where credit’s due (Python)
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page
- Plusnet Community
- :
- Forum
- :
- Other forums
- :
- Tech Help - Software/Hardware etc
- :
- Re: Credit where credit’s due (Python)
Credit where credit’s due (Python)
22-05-2017 4:05 PM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
I’m a C++ myself and have been for years but I had a need today to process an Excel file, extract the sheets then load each sheets’ data in to a table in my database, so I thought I’d give Python a spin. By the way, there’s no real reason for this post apart from the fact that I know others on the Forum use Python so I thought I’d mention this.
So some time reading and some Python code later and it’s all done! The script can process a small 3.3MB file with 6 sheets of data out to CSVs then load these from CSV in to their own table all in under 9 seconds!
Quite impressed I have to admit as I suspect it would have taken me way more than a few hours to do this is C++ land.
Re: Credit where credit’s due (Python)
22-05-2017 4:12 PM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
Re: Credit where credit’s due (Python)
22-05-2017 4:25 PM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
Form me @chenks76 in this instance not really. Some of the sheets have over 40 columns in them and writing a procedure for them would be tedious, then add in clock cost of the transactional overhead as well. What I needed was a Q'n'D solution and this is what I have and to be fair works really well.
The code bulks the data into tables contained in an import schema and from there via a stored procedure they’re moved to the application schema if all goes well.
Re: Credit where credit’s due (Python)
22-05-2017 5:36 PM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
Interesting @Anonymous. I've got a couple of eXcel problems that need some data manipulation / Migration. I'll give Python a look over.
Maurice
Superusers are not staff, but they do have a direct line of communication into the business in order to raise issues, concerns and feedback from the community.
Re: Credit where credit’s due (Python)
22-05-2017 9:13 PM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
I am on my iPad (in bed) at the moment @MauriceC, but if you want I will post the script tomorrow if it helps.
Re: Credit where credit’s due (Python)
22-05-2017 9:53 PM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
Thanks @Anonymous though I think mine is a simpler problem than yours. An example script is often easier to follow than a tutorial.
M
Superusers are not staff, but they do have a direct line of communication into the business in order to raise issues, concerns and feedback from the community.
Re: Credit where credit’s due (Python)
23-05-2017 9:03 AM - edited 23-05-2017 9:20 AM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
Well here it is I thought I’d better comment it as well as it can be quite frustrating coming back to the code at a later date and scratching your head when you can’t remember what its doing!
Apologies for the size of the post but I’ve no other way to make it available. You should be made aware that this script uses a PostgreSQL backend but there is no reason why that couldn’t be modified for Oracle, MySQL or SSQ Server.
#!/usr/bin/python import os, sys, csv, getopt, psycopg2 from timeit import default_timer as timer from openpyxl import Workbook, load_workbook from openpyxl.utils.exceptions import CellCoordinatesException, IllegalCharacterError, InsufficientCoordinatesException, InvalidFileException, NamedRangeException, SheetTitleException #Define the Path to write the outputs to outPath = "../CSVs" #Define the SQL used to load the data sqlLoad = """COPY Imports.%s FROM STDIN WITH HEADER DELIMITER ',' CSV""" #Name or IP address of DB Server pgServer = "postgresql" #Custom string to use to format output file names csvFileName = "%s/%s-%d-%d.csv" #List used to store the processed sheet / table names tableNames = list() #Default User and Group IDs of data owner uid = 1 gid = 1 #Variable used for filename passed on CLI fileName = "" def usage(): print("[[-h] | [--help]]\t This help screen") print("[[-s] | [--server]]\t PostgreSQL Server (default 192.168.0.128)") print("-f | --file\t\t Name (and path) of the xlsx file to process") print("-u | --uid\t\t User ID to insert into output") print("-g | --gid\t\t Users' Group ID to insert into output") #Function to laod and parse the Excel file's sheets in to CSV files def loadFile(fileName): inputFile = fileName sheets = 0 numSheets = 0 print ("Processing Excel File %s" % inputFile) try: wb = load_workbook(filename = inputFile, data_only=True) numSheets = len(wb.get_sheet_names()) for name in wb.get_sheet_names(): #Open CSV file for writing with open(csvFileName % (outPath, name, uid, gid), 'wb') as f: #Tell the CSV writer to use this stream csvWriter = csv.writer(f, quotechar="'") #Get a sheet from the Excel file sheet = wb.get_sheet_by_name(name) #Iterate the sheet rows for r in sheet.rows: #Define an empty array to store the rows values dataRow = [] #Iterate the columns of the row and load then in the array for col in r: dataRow.append(col.value) #Insert the UID and GID values into their positions needed to load dataRow.insert(1, uid) ; dataRow.insert(2, gid) ; #Write the row out to the file csvWriter.writerow(dataRow) #Add the sheet name to the list tableNames.append(name) ; #Catch any of the possible exceptions printing them to the console except (CellCoordinatesException, IllegalCharacterError, InsufficientCoordinatesException, InvalidFileException, NamedRangeException, SheetTitleException) as ex: print ("Exception (%s) while processing %s in %s" % (ex, name, inputFile)) #Return the number of sheets found minus those processed 0 == Success return (numSheets - len(tableNames)) #Function used to load database, taking a connection object, table name, stream object def loadDB(dbConn, tableName, fileObject): #Define a cursor cursor = dataBase.cursor() #Delete any existing data from the table that matches the UID/GID pair cursor.execute("DELETE FROM Imports.%s WHERE ((uid = %d) AND (gid = %d))" % (tableName, uid, gid)) #Bulk the data from the CSV file into the database cursor.copy_expert(sqlLoad % tableName, fileObject) ; #Commit both the delete and load to the database dbConn.commit() #Count the number of rows inserted cursor.execute("SELECT COUNT(1) FROM Imports.%s WHERE ((UID = %d) AND (GID = %d))" % (tableName, uid, gid)) #Fetch the result of the above COUNT rows = cursor.fetchone() #Close the cursor cursor.close() #Return the number of rows counted return rows #Function used to transfer data across schemas def transferImports(dbConn): #Define a cursor cursor = dbConn.cursor() #Call the stored procedure used to move the data from the Imports to the Application Schema cursor.execute("SELECT * FROM transferImported(%d, %d)" % (uid, gid)) #Fetch the result of the query rows = cursor.fetchone() #Commit it dbConn.commit() #Close the cursor cursor.close() #Return the result return rows #Display the Usage if there are no parameters on the Command Line if (len(sys.argv) <= 1): usage() sys.exit(1) #Read the parameters on the Command line setting out values as needed try: opts, args = getopt.getopt(sys.argv[1:], "hf:s:u:g:r", ["help","file=","server=","uid=", "gid=", "remote"]) for opt, arg in opts: if opt in ("-h","--help"): usage() sys.exit(0) elif opt in ("-f", "--file"): fileName = arg elif opt in ("-s", "--server"): pgServer = arg elif opt in ("-u", "--uid"): uid = int(arg) elif opt in ("-g", "--gid"): gid = int(arg) except getopt.GetoptError as ex: print(ex) sys.exit(1) if (fileName == ""): print("No File Name") usage() exit(1) elif (fileName.endswith(".xls")): print("Unsupported Excel format. Please provide an .xlsx format file") sys.exit(2) #Execution starts here so start timer loadStart = timer() #Call the Excel reader to create the CSV files numSheets = loadFile(fileName) #Make sure all went well and we have something to load if ((numSheets == 0) and (len(tableNames) > 0)): #Create a connection object to the database dataBase = psycopg2.connect(database="*****", user="*****", password="*****", host=pgServer, port="5432") #Counter used for total rows processed totalRows = 0 try: #Iterate the list of table names and load from CSV for table in tableNames: #Define variable to store number of returned rows rowCount = (0,) #Format the file name file = csvFileName % (outPath, table, uid, gid) #Open the file csvFile = open(file, 'ro') try: #Load the database passing in the conn object, file name and stream object rowCount = loadDB(dataBase, table, csvFile) totalRows += rowCount[0] except (psycopg2, OSError) as e: print("Error %s while loading %s", (e, table)) finally: #Close the input file csvFile.close() #Delete the CSV file regardless os.remove(file) print("Loaded %d row for %s." % (rowCount[0], table)) #Move the imported data to the application schema rowsTrans = transferImports(dataBase) except psycopg2 as e: print("Exception %s" % e) finally: dataBase.close() print("Loaded %d rows and Transfered %d Rows" % (totalRows, rowsTrans[0])) else: print("Load failed for %s" % fileName) loadEnd = timer() print ("Parsed and loaded %d data sets in %ds" % (len(tableNames), (loadEnd - loadStart)))
Edit:- Changed arg to int(arg) for the UID / GID CL Paramaters.
Re: Credit where credit’s due (Python)
23-05-2017 12:20 PM - edited 24-05-2017 8:02 AM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
It’s a bit like eating Pringles, once you start you can’t stop!
I’ve modified the source so it will now process a list of files passed on the command line along with adding a new command line parameter to set the output directory to where the CSV files are written to and a check to ensure it exists.
If you, or anyone, would like a copy please send me a PM, rather than I bloat the thread with source code.
Re: Credit where credit’s due (Python)
23-05-2017 7:56 PM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
Thanks @Anonymous been busy on other things today so not even looked at the initial code yet! Manana
Superusers are not staff, but they do have a direct line of communication into the business in order to raise issues, concerns and feedback from the community.
Re: Credit where credit’s due (Python)
24-05-2017 8:04 AM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
I don’t know what, if anything, you’ll learn from it as my Python skills are limited, but I am always open to comment both good and bad so if anyone has any thoughts I’d like to hear them. I am of the opinion that learning from ones peers is always a worthwhile exercise.
Re: Credit where credit’s due (Python)
24-05-2017 11:04 AM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
I've seen a similar, but simpler thing done with BASH.
"In The Beginning Was The Word, And The Word Was Aardvark."
Re: Credit where credit’s due (Python)
24-05-2017 11:17 AM
- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Report to Moderator
You mean like this?
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page
- Plusnet Community
- :
- Forum
- :
- Other forums
- :
- Tech Help - Software/Hardware etc
- :
- Re: Credit where credit’s due (Python)