CGI: Unix / Windows text file compatibility

  • Peter Vaughan
« on 10/04/2004, 20:51 »
This tutorial aims to explain the use of Carriage Return (CR) and Line Feed (LF) characters in text-based files, and in particular what happens when you transfer files containing them between Windows and Unix/Linux (*nix) based systems. By text-based files I mean HTML web pages, Perl, PHP and other scripts, and standard .txt files.

CR and LF are traditionally used as line termination characters to mark where one line finishes and the next starts. They are normally invisible when viewing text files because they are control characters, i.e. they are interpreted differently from normal printable text (a-z, A-Z, 0-9, !"$% etc). The main problem is that the two systems (Windows and *nix) treat CR and LF differently and the two methods are not always compatible. Windows terminates lines with a CR followed by an LF, whereas *nix terminates lines with a single LF and treats any CR as part of the normal text. Because of this difference, transferring files between the systems can have unexpected results: files such as PHP or Perl scripts may run perfectly on one system but fail or generate errors on the other, without any obvious difference in the file contents. The same difference can also affect text data files transferred between the systems, because a script may read the lines in a data file differently depending on the operating system it is run on.
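
If you want to see the normally invisible terminators for yourself, something like the following may help. It is only a sketch: the file names windows.txt and unix.txt are made up for the example, and it assumes the standard od and wc tools are available.

```shell
# Hypothetical scratch files; any writable directory will do.
tmpdir=$(mktemp -d)
printf 'one\r\ntwo\r\n' > "$tmpdir/windows.txt"   # CR+LF line endings
printf 'one\ntwo\n'     > "$tmpdir/unix.txt"      # LF-only line endings

# od -c prints each byte; CR appears as \r and LF as \n.
od -c "$tmpdir/windows.txt"
od -c "$tmpdir/unix.txt"

# The Windows copy is larger by exactly one byte per line.
windows_size=$(wc -c < "$tmpdir/windows.txt" | tr -d ' ')
unix_size=$(wc -c < "$tmpdir/unix.txt" | tr -d ' ')
echo "windows.txt: $windows_size bytes, unix.txt: $unix_size bytes"

rm -rf "$tmpdir"
```

The same od -c command can be pointed at any real file to check which format it is in.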


How do the CR/LF differences cause problems

Here is an example that shows the difference between the two systems - the <CR> and <LF> represent the invisible line termination control characters that exist in the file.

First a Windows file called file1.txt:
Quote
This is a line of text<CR><LF>
This is a line of text<CR><LF>
<CR><LF>
This is the final line of text<CR><LF>


Now the same file in Linux called file2.txt:
Quote
This is a line of text<LF>
This is a line of text<LF>
<LF>
This is the final line of text<LF>


Now to show what happens when a script reads each line in turn from the two files above, after the Windows file has been transferred to a Linux system 'as is', i.e. still containing CR and LF. Note: normally when a line of text is read, the line terminating characters are stripped off (by the language's I/O layer or by a call such as Perl's chomp) and only the actual line text is left for the script to work with.

First reading each line of file2.txt (created on Linux)
Quote
line 1 = 'This is a line of text'
line 2 = 'This is a line of text'
line 3 = ''
line 4 = 'This is the final line of text'

Now reading each line from file1.txt (created on Windows)
Quote
line 1 = 'This is a line of text<CR>'
line 2 = 'This is a line of text<CR>'
line 3 = '<CR>'
line 4 = 'This is the final line of text<CR>'

You will notice the <CR> is actually returned to the script. This is because on Linux only the LF is recognised as the line terminating character and stripped off; all other text, including the CR, is returned. If the script was not expecting the <CR> in the returned text (because it would have been removed had the script been run on Windows), the script may fail. For example, if the script compares the line with some other text, the comparison would match on Windows (because the CR would not be there) but not on Linux (because the CR is there).
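
The matching problem can be reproduced at a shell prompt. This is a minimal sketch, assuming a POSIX shell; the variable names are invented for the example and file1.txt here is a one-line stand-in for the Windows file above.

```shell
tmpdir=$(mktemp -d)
printf 'This is a line of text\r\n' > "$tmpdir/file1.txt"

# A variable holding one literal CR character.
cr=$(printf '\r')

# Read the first line; read removes the LF but leaves the CR in place.
IFS= read -r line < "$tmpdir/file1.txt"

if [ "$line" = "This is a line of text" ]; then
    raw_result="match"
else
    raw_result="no match"
fi

# Strip one trailing CR (if present) and compare again.
clean=${line%"$cr"}
if [ "$clean" = "This is a line of text" ]; then
    clean_result="match"
else
    clean_result="no match"
fi

echo "raw: $raw_result, cleaned: $clean_result"
rm -rf "$tmpdir"
```

The raw line fails the comparison because of the hidden CR on the end, while the cleaned copy matches.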

Another example shows what happens to a Perl script written on Windows (say in Notepad) and transferred to Linux. Perl scripts (like PHP and shell scripts) are interpreted, meaning each line in the script file is read in turn and its contents parsed to determine what actions to take. The first line (#!) is also important in Linux as it tells the shell what interpreter to use to execute the script.
Quote
#!/usr/bin/perl<CR><LF>
use strict;<CR><LF>
<CR><LF>
if ((! defined $ARGV[0])||($ARGV[0] eq "")) {<CR><LF>
   print<<_END_;<CR><LF>
undos [file]: remove dos line breaks from a script<CR><LF>
_END_<CR><LF>
   exit; <CR><LF>
}<CR><LF>


When run on Windows, as the Perl interpreter reads each line in, the CR and LF are stripped off leaving just the command line, and the script runs fine. Windows does not use the #! line to locate the interpreter; Perl treats it as a comment, apart from honouring any switches (such as -w) that appear on it.

When run on Linux, the shell (normally bash) reads the first #! line and determines the script should be run by perl. However, because the shell is expecting the script to be in *nix text format, the first line actually looks like #!/usr/bin/perl<CR> (perl<carriage return>). The shell tries to find a program called perl<CR>, which it can't, so it fails to run the script.

Quote
username@shellx username $ ./test.pl
bash: ./test.pl: No such file or directory
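
Because the error message gives no hint about the real cause, it can be worth checking the first line of the script directly. The function below is only a sketch: first_line_has_cr is a name made up for this example, and it sets the shell variables cr and first as a side effect.

```shell
# Succeeds (exit status 0) if the first line of the named file ends with CR.
first_line_has_cr() {
    cr=$(printf '\r')
    # Ignore read's status so a one-line file without a final LF still works.
    IFS= read -r first < "$1" || true
    case $first in
        *"$cr") return 0 ;;
        *)      return 1 ;;
    esac
}

tmpdir=$(mktemp -d)
printf '#!/usr/bin/perl\r\nuse strict;\r\n' > "$tmpdir/test.pl"

if first_line_has_cr "$tmpdir/test.pl"; then
    echo "test.pl: first line ends with CR, convert it before running"
fi

rm -rf "$tmpdir"
```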


Looking at this from the other direction, if you wrote a script on Linux (which only contained LF) and transferred it to Windows, it would still run fine even though each line only contained LF as the line terminator. This is because Perl is aware of both Linux and Windows text formats, so it can interpret the lines correctly whether they use LF or CR/LF as the line terminator. The #! line is again not used to find the interpreter.

One solution that can be used with Perl is to turn on Perl's warning mode by using #!/usr/bin/perl -w. This means Perl will be a bit stricter about interpreting your code and will warn you when you do silly things, but that's no bad thing and you should be using it anyway. This works because instead of the CR being after perl, it's after the -w option. The shell will look for perl (without a CR), which it will find, and because Perl interprets the -w option itself it will ignore the CR and continue executing the script. For other scripts (shell, PHP etc) just leaving an extra space after the program name may also stop this situation occurring.


How can you solve this compatibility problem

This is actually very easy to do, or more precisely it can be done for you. The most common method for transferring files between systems is FTP (File Transfer Protocol), a protocol that allows data (files) to be transferred reliably between two systems. FTP clients offer several types or modes of transfer: binary, ASCII and sometimes auto.
  • binary - When a file is transferred, all data in the file is transferred unchanged. The source and destination files will be exact copies of each other.
  • ascii - When a file is transferred, CR/LF conversion will be performed on the file. This means files copied from Windows to Linux will have the CR character stripped, so the remote copy will be in Linux text format. Files copied from Linux to Windows will have a CR inserted before each LF character, so the destination copy will be in Windows text format.
  • auto - This mode tries to determine if the file to be transferred is text (i.e. a script) or non-text (i.e. pdf, zip or raw data). How it determines this may be program dependent: some clients look at the file contents and others look at the file extension. If it determines the file is text it uses ascii file transfer, otherwise it uses binary file transfer. Note: this method may not always work and could result in scripts being transferred in binary, so no CR/LF conversion will be performed. Use it with caution.
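
To make the ascii conversion concrete, here is a rough imitation of it using awk. This is not what an FTP client actually runs; to_unix and to_windows are names invented for this sketch, and both read from standard input.

```shell
# Windows to *nix: drop one CR at the end of each line.
to_unix() {
    awk '{ sub(/\r$/, ""); print }'
}

# *nix to Windows: end every line with CR+LF. Any existing trailing CR is
# dropped first so lines that are already converted are not doubled up.
to_windows() {
    awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }'
}

# od -c shows the terminators left after each conversion.
printf 'one\r\ntwo\r\n' | to_unix    | od -c
printf 'one\ntwo\n'     | to_windows | od -c
```

Running either function over a zip or image file would corrupt it, which is why binary mode exists for non-text files.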

How can I correct files with the wrong CR/LF on Linux

What follows is a Perl script called undos that removes any CR characters from a file. It converts a Windows format text file using CR+LF line terminators to a Linux format text file using LF line terminators.

Code:
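#!/usr/bin/perl -w
use strict;

# Print usage and stop if no file name was given.
if ((! defined $ARGV[0]) || ($ARGV[0] eq "")) {
    print <<_END_;
undos [file]: remove dos line breaks from a script
_END_
    exit;
}
# Refuse anything that is not a plain file.
if (! -f $ARGV[0]) {
    print <<_END_;
$ARGV[0] is not a file!
_END_
    exit;
}
# Open for reading and writing without truncating, then read every line.
# System call failures are reported in $!, not $@.
open(my $fh, '+<', $ARGV[0])
    or die "Opening: $!\n";
my @file = <$fh>;
# Remove every CR character from every line.
s/\r//g for @file;
# Rewind, write the cleaned lines back, then cut off the leftover tail.
seek($fh, 0, 0)
    or die "Seeking: $!\n";
print {$fh} @file
    or die "Printing: $!\n";
truncate($fh, tell($fh))
    or die "Truncating: $!\n";
close($fh)
    or die "Closing: $!\n";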


To get the file into your cgi space use the following commands at a cgi $ prompt:
Code:
$ wget http://www.tutorialsteam.plus.com/cgi/crlf/undos.gz
$ gunzip undos.gz
$ chmod 705 undos

Alternatively, highlight and copy the Perl script text above and write it to a text file on the cgi server. You can do this by creating the file with vi (vi undos), entering insert mode by pressing i, then pasting the copied text into the screen. You then press escape, write the file by typing :wq, and set the file permissions using chmod 705 undos as shown above.

You then run it as follows:
Code:
$ ./undos filename

where filename is the name of the windows text file you want to convert. The result will be a file of the same name with the CR characters removed.
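
To confirm a conversion worked, you can count the lines that still contain a CR. This is just a sketch: count_cr_lines is a name made up for the example, and tr -d '\r' stands in for running ./undos so that the snippet works on its own.

```shell
# Print the number of lines in the named file that contain a CR character.
count_cr_lines() {
    cr=$(printf '\r')
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c "$cr" "$1" || true
}

tmpdir=$(mktemp -d)
printf 'one\r\ntwo\r\nthree\n' > "$tmpdir/sample.txt"
echo "lines with CR before: $(count_cr_lines "$tmpdir/sample.txt")"

# Stand-in for ./undos: delete every CR, as the Perl script does.
tr -d '\r' < "$tmpdir/sample.txt" > "$tmpdir/sample.fixed"
echo "lines with CR after:  $(count_cr_lines "$tmpdir/sample.fixed")"

rm -rf "$tmpdir"
```

A count of 0 means the file is in Linux text format.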

Tutorial written by petervaughan
----
This version is based on an original document written by Alex Hudson who gave his permission for it to be used here. The undos script was also written by Alex Hudson with a correction by myself.
« Reply #1 on 21/07/2004, 13:25 »
Just to add to the above. Instead of going to the trouble of downloading the above perl script you could use sed.

sed is a powerful tool for Unix text processing. In this context I am going to use it to perform a substitution: we are looking for a carriage return character at the end of a line, i.e. immediately before the new line character.

In bash this would look like:

Quote
username@cgi0x username $ sed 's/^M$//' dos-filename.txt > unix-filename.txt

Caveat: the ^M shown above is a single carriage return character, not a caret followed by M. To enter it in a bash shell you must type CTRL-V then CTRL-M. CTRL-V tells bash not to interpret the next character typed at the terminal; for example, you can also use it to insert an actual tab rather than having bash interpret the TAB key to auto-complete commands.

As an explanation of what is happening I will break down the command:

Firstly, to prevent bash from interpreting the command, we open with a single quote. The next character is the actual sed command, in this case s for substitution. The forward slash character is used to delimit the pattern we wish to find and the text we wish to replace it with. Finally we close the single quotes.

The pattern is a regular expression: what we are searching for is a carriage return character (^M) immediately before the end of the line ($), and we replace it with nothing. Note that this does not remove the new line itself; the $ only anchors the match to the end of the line.
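
If typing CTRL-V CTRL-M is awkward, one possible variation is to let printf produce the carriage return and hand it to sed through a variable. This is a sketch rather than part of the command above; strip_trailing_cr is a name invented for the example.

```shell
# Hold one literal CR character in a variable.
cr=$(printf '\r')

# Strip a CR that sits at the end of a line; input comes from stdin.
# Inside double quotes \$ gives sed a plain $ (end of line anchor).
strip_trailing_cr() {
    sed "s/${cr}\$//"
}

# od -c confirms that only LF terminators remain.
printf 'one\r\ntwo\r\n' | strip_trailing_cr | od -c
```

It can be used in the same way as the original command, for example strip_trailing_cr < dos-filename.txt > unix-filename.txt.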
Duncan Scotland
Plusnet Network Operations Engineer
  • Peter Vaughan
« Reply #2 on 21/07/2004, 13:42 »
The people who are likely to have this issue, and at whom this tutorial is targeted, are those new to FTP and, in particular, new to Unix, who may be unfamiliar with shell access and entering commands. It is for this reason I created a simple script which can be downloaded and run without the user needing to understand what it does.

There are many ways in which the carriage returns could be stripped, and yours is one of them. However, explaining how to enter control characters, and especially how regular expressions work, and then having to do it each time, would be beyond most newbies and liable to be entered or interpreted wrongly.

I appreciate your comments and I'm sure the more experienced users can use your method.