28} How to convert a file written in IBM PC characters into LATIN1? (And vice versa)
I'll insert here at the top the essence of the latest solution, which
I use in my personal scripts. It utilizes the 32-bit UNIX port of tr
from UnxUpdates, which I have renamed unxtr.exe. Here are my own two
subroutines.
(My note: C:\_F\XTOOLS\CSVNAMES.CMD)
:Latin1ToIbm
set filein_=%~1
set fileout_=%~2
set octset_=\206\204\224\201\202\207\221\244\217\216\231\232\220\200\222\245\240\242
"C:\_F\FTOOLS\unxtr.exe" åäöüéçæñÅÄÖÜÉÇÆÑáó %octset_% < "%filein_%" > "%fileout_%"
goto :EOF
:IbmToLatin1
set filein_=%~1
set fileout_=%~2
set octset_=\206\204\224\201\202\207\221\244\217\216\231\232\220\200\222\245\240\242
"C:\_F\FTOOLS\unxtr.exe" %octset_% åäöüéçæñÅÄÖÜÉÇÆÑáó < "%filein_%" > "%fileout_%"
goto :EOF
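For readers without a tr port, the same byte-for-byte translation can be sketched in Python. This is only an illustration, not part of my scripts. The names OCTSET, IBM_BYTES, LATIN1_BYTES, latin1_to_ibm and ibm_to_latin1 are made up for the example, and the octal list is the one from octset_ above.

```python
# Byte values of å ä ö ü é ç æ ñ Å Ä Ö Ü É Ç Æ Ñ á ó in the IBM PC (OEM)
# code page, copied from the octal list in octset_.
OCTSET = r"\206\204\224\201\202\207\221\244\217\216\231\232\220\200\222\245\240\242"
IBM_BYTES = bytes(int(code, 8) for code in OCTSET.split("\\") if code)

# The same characters, in the same order, as Latin1 bytes.
LATIN1_BYTES = "åäöüéçæñÅÄÖÜÉÇÆÑáó".encode("latin-1")

# 256-entry translation tables, the counterpart of the two sets given to tr.
LATIN1_TO_IBM = bytes.maketrans(LATIN1_BYTES, IBM_BYTES)
IBM_TO_LATIN1 = bytes.maketrans(IBM_BYTES, LATIN1_BYTES)


def latin1_to_ibm(data: bytes) -> bytes:
    """Translate Latin1 bytes to IBM PC bytes; unlisted bytes pass through."""
    return data.translate(LATIN1_TO_IBM)


def ibm_to_latin1(data: bytes) -> bytes:
    """Translate IBM PC bytes to Latin1 bytes; unlisted bytes pass through."""
    return data.translate(IBM_TO_LATIN1)


if __name__ == "__main__":
    sample = "Lisää: åäö ÅÄÖ".encode("latin-1")
    converted = latin1_to_ibm(sample)
    print(converted.hex(" "))
    print(ibm_to_latin1(converted) == sample)
```

As with tr, only the listed characters are touched, so the round trip is exact as long as the text contains no other characters above 127.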
Previous information:
This is another task that is best suited for SED, the Stream EDitor.
@echo off & setlocal enableextensions
::
:: Convert IBM PC characters into LATIN1 characters
:: Requires SED.EXE
::
:: Make a test file with PC characters
echo This is a test file in Finnish (which uses Scandinavian characters)>testin.txt
echo Tämä on testitiedosto ääkösten testaamiseksi.>>testin.txt
echo Lisää: åäö ÅÄÖ>>testin.txt
echo åäöüéçæñÅÄÖÜÉÇÆÑ>>testin.txt
::
:: Optionally, create a SED command file
echo.y/åäöüéçæñÅÄÖÜÉÇÆÑ/\xE5\xE4\xF6\xFC\xE9\xE7\xE6\xF1\xC5\xC4\xD6\xDC\xC9\xC7\xC6\xD1/>"%TEMP%\IBM2LAT1.SED"
::
:: Do the conversion
sed -f"%TEMP%\IBM2LAT1.SED" testin.txt > testout.txt
::
:: See that the result is what one expected
notepad testout.txt
::
:: Clean up
for %%f in ("%TEMP%\IBM2LAT1.SED" testin.txt testout.txt) do (
if exist %%f del %%f)
endlocal & goto :EOF
The contents of testout.txt:
This is a test file in Finnish (which uses Scandinavian characters)
Tämä on testitiedosto ääkösten testaamiseksi.
Lisää: åäö ÅÄÖ
åäöüéçæñÅÄÖÜÉÇÆÑ
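If Python happens to be at hand, its built-in codecs can do the whole conversion without listing individual characters. The following is a minimal sketch, assuming that the "IBM PC characters" are code page 437 (pass "cp850" if that is the console code page; the characters used in this item have the same codes in both). The function name oem_to_latin1 is hypothetical.

```python
def oem_to_latin1(data: bytes, oem_codepage: str = "cp437") -> bytes:
    """Re-encode OEM code page bytes as Latin1.

    Characters that have no Latin1 equivalent (for instance the
    box-drawing characters) become '?' instead of raising an error.
    """
    text = data.decode(oem_codepage)
    return text.encode("latin-1", errors="replace")


if __name__ == "__main__":
    ibm_line = b"T\x84m\x84 on testitiedosto"   # "Tämä on testitiedosto" in OEM bytes
    print(oem_to_latin1(ibm_line))
```

Unlike the sixteen-character SED table, this converts every character the two character sets have in common.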
There is a catch if you are using the sed.exe from sed15x.zip, as is
mostly done in this FAQ: the file names must be in the SFN 8+3 format.
If we use the GnuWin32 sed.exe instead, that will not pose a problem.
Below, the sed.exe has been renamed unxsed.exe.
@echo off & setlocal enableextensions
::
:: Convert IBM PC characters into LATIN1 characters
:: Requires SED.EXE (renamed here to UNXSED.EXE)
::
:: Make a test file with PC characters
echo This is a test file in Finnish (which uses Scandinavian characters)>"testin.txt"
echo Tämä on testitiedosto ääkösten testaamiseksi.>>"testin.txt"
echo Lisää: åäö ÅÄÖ>>"testin.txt"
echo åäöüéçæñÅÄÖÜÉÇÆÑ>>"testin.txt"
::
:: Another way to build the SED command file
set temp_=%temp%
if defined mytemp if exist "%mytemp%\" set temp_=%mytemp%
set sedcmd=%temp_%\sedcmd.tmp
echo s/å/\xE5/g >"%sedcmd%"
echo s/ä/\xE4/g >>"%sedcmd%"
echo s/ö/\xF6/g >>"%sedcmd%"
echo s/ü/\xFC/g >>"%sedcmd%"
echo s/é/\xE9/g >>"%sedcmd%"
echo s/ç/\xE7/g >>"%sedcmd%"
echo s/æ/\xE6/g >>"%sedcmd%"
echo s/ñ/\xF1/g >>"%sedcmd%"
echo s/Å/\xC5/g >>"%sedcmd%"
echo s/Ä/\xC4/g >>"%sedcmd%"
echo s/Ö/\xD6/g >>"%sedcmd%"
echo s/Ü/\xDC/g >>"%sedcmd%"
echo s/É/\xC9/g >>"%sedcmd%"
echo s/Ç/\xC7/g >>"%sedcmd%"
echo s/Æ/\xC6/g >>"%sedcmd%"
echo s/Ñ/\xD1/g >>"%sedcmd%"
::
:: Do the conversion
unxsed --text -f "%sedcmd%" "testin.txt" > "testout.txt"
::
:: See that the result is what one expected
notepad "testout.txt"
::
:: Clean up
for %%f in ("%sedcmd%" "testin.txt" "testout.txt") do (
if exist %%f del %%f)
endlocal & goto :EOF
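The sixteen echo lines above all follow one pattern, so the command file could also be generated from a table. A rough Python sketch under stated assumptions: the helper name build_sed_script is made up, code page 437 is assumed for the IBM side, and both sides are written as \xHH escapes (which GNU sed understands) so that the command file itself is plain ASCII.

```python
CHARS = "åäöüéçæñÅÄÖÜÉÇÆÑ"


def build_sed_script(chars: str = CHARS, oem_codepage: str = "cp437") -> str:
    """Return one s/IBM/LATIN1/g line per character, using \\xHH escapes."""
    lines = []
    for ch in chars:
        ibm = ch.encode(oem_codepage)[0]   # the single byte in the OEM code page
        lat = ch.encode("latin-1")[0]      # the single byte in Latin1
        lines.append(f"s/\\x{ibm:02X}/\\x{lat:02X}/g")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the command file; redirect the output to a file to use it with sed -f
    print(build_sed_script(), end="")
```

The first generated line is s/\x86/\xE5/g, which corresponds to the s/å/\xE5/g line of the batch version.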
Let's write the solution again with a couple of twists:
@echo off & setlocal enableextensions
::
:: Convert IBM PC characters into LATIN1 characters
:: Requires a TR.EXE Unix port (renamed here to UNXTR.EXE)
::
:: Assign a value for the temporary folder variable temp_
call :AssignTemp temp_
::
:: Make a test file with PC characters
> "%temp_%\testin.txt" echo This is a test file in Finnish (which uses Scandinavian characters)
>>"%temp_%\testin.txt" echo Tämä on testitiedosto ääkösten testaamiseksi.
>>"%temp_%\testin.txt" echo Lisää: åäö ÅÄÖ
>>"%temp_%\testin.txt" echo åäöüéçæñÅÄÖÜÉÇÆÑ
::
:: Which text filtering program to use (UnxUtils tr)
set filter_=unxtr.exe
call :IsAtPath "%filter_%" file_
if not defined file_ (
echo.
echo File "%filter_%" not found at path or in the current folder
call :CleanUp
goto :EOF)
::
:: Do the conversion (octal coding)
rem å ä ö ü é ç æ ñ Å Ä Ö Ü É Ç Æ Ñ á ó
set octset_=\206\204\224\201\202\207\221\244\217\216\231\232\220\200\222\245\240\242
"%filter_%" %octset_% åäöüéçæñÅÄÖÜÉÇÆÑáó < "%temp_%\testin.txt" > "%temp_%\testout.txt"
::
:: See that the result is what we expected
notepad "%temp_%\testout.txt"
::
call :CleanUp
endlocal & goto :EOF
::
:: ==========================================================
:AssignTemp
setlocal
set return_=%temp%
if defined mytemp if exist "%mytemp%\" set return_=%mytemp%
endlocal & set "%1=%return_%" & goto :EOF
::
:IsAtPath SearchFor found_
setlocal enableextensions disabledelayedexpansion
set found_=
for %%f in ("%~1") do set found_="%%~$PATH:f"
if exist "%~1" set found_="%~1"
if [%found_%]==[""] set found_=
endlocal & set "%~2=%found_%" & goto :EOF
::
:CleanUp
for %%f in ("%temp_%\testin.txt" "%temp_%\testout.txt") do (
if exist %%f del %%f)
goto :EOF
The contents of "%temp_%\testout.txt":
This is a test file in Finnish (which uses Scandinavian characters)
Tämä on testitiedosto ääkösten testaamiseksi.
Lisää: åäö ÅÄÖ
åäöüéçæñÅÄÖÜÉÇÆÑ
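For comparison only, the check made by the :IsAtPath subroutine has a close counterpart in Python's standard library. A small sketch; find_tool is a hypothetical name. Like the subroutine, it prefers a file in the current folder and otherwise searches the folders listed in PATH.

```python
import os
import shutil
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return a path to name from the current folder or PATH, else None."""
    if os.path.isfile(name):           # current folder (or an explicit path)
        return os.path.abspath(name)
    return shutil.which(name)          # search the folders listed in PATH


if __name__ == "__main__":
    tool = find_tool("unxtr.exe")
    if tool is None:
        print('File "unxtr.exe" not found at path or in the current folder')
    else:
        print(tool)
```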
The problem can also be solved with a Visual Basic Script.
VBScript has the advantage of being a part of the original XP
command environment. On the other hand the solution is clearly more
complicated than the very simple SED solution. Anyway, first cut
and paste the following script. Name it e.g. IBM2LAT1.VBS and then
call it using
CSCRIPT //NOLOGO "IBM2LAT1.VBS" < "MYIBM.TXT" > "MYLATIN1.TXT"
' IBM2LAT1.VBS by Prof. Timo Salmi
'
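' Define the relevant characters: the same sixteen characters in the same
' order, IbmChar stored as IBM PC (OEM) bytes and Lat1Char as Latin1 bytes
Const IbmChar = "åäöüéçæñÅÄÖÜÉÇÆÑ"
Const Lat1Char = "åäöüéçæñÅÄÖÜÉÇÆÑ"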
'
' Define StandardIn and StandardOut
Dim StdIn, StdOut
Set StdIn = WScript.StdIn
Set StdOut = WScript.StdOut
'
' Convert one IBM character to Latin1
Function CharIbm2Lat1(char)
Dim p
' Binary comparison keeps the lookup case-sensitive
p = Instr (1, IbmChar, char, vbBinaryCompare)
If p > 0 Then
CharIbm2Lat1 = Mid(Lat1Char, p, 1)
Else
CharIbm2Lat1 = char
End If
End Function
'
' Convert a string
Function Ibm2Lat1(str1)
Dim str2
For i = 1 To Len(str1)
str2 = str2 & CharIbm2Lat1(Mid(str1,i,1))
Next
Ibm2Lat1 = str2
End Function
'
' Convert the input
Dim str
Do While Not StdIn.AtEndOfStream
str = StdIn.ReadLine
StdOut.WriteLine Ibm2Lat1(str)
Loop
It is easy to see that other, similar conversion tasks can be done
with the same methods after just some slight customization. To take
the most obvious example, consider the conversion into the other
direction:
' C:\_F\CMD\LAT12IBM.VBS by Prof. Timo Salmi
' Usage: CSCRIPT //NOLOGO "LAT12IBM.VBS" < "MYLATIN1.TXT" > "MYIBM.TXT"
'
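' Define the relevant characters: the same sixteen characters in the same
' order, Lat1Char stored as Latin1 bytes and IbmChar as IBM PC (OEM) bytes
Const Lat1Char = "åäöüéçæñÅÄÖÜÉÇÆÑ"
Const IbmChar = "åäöüéçæñÅÄÖÜÉÇÆÑ"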
'
' Define StandardIn and StandardOut
Dim StdIn, StdOut
Set StdIn = WScript.StdIn
Set StdOut = WScript.StdOut
'
' Convert one Latin1 character to IBM
Function CharLat12Ibm(char)
Dim p
p = Instr (1, Lat1Char, char, vbBinaryCompare)
If p > 0 Then
CharLat12Ibm = Mid(IbmChar, p, 1)
Else
CharLat12Ibm = char
End If
End Function
'
' Convert a string
Function Lat12Ibm(str1)
Dim str2
For i = 1 To Len(str1)
str2 = str2 & CharLat12Ibm(Mid(str1,i,1))
Next
Lat12Ibm = str2
End Function
'
' Convert the input
Dim str
Do While Not StdIn.AtEndOfStream
str = StdIn.ReadLine
StdOut.WriteLine Lat12Ibm(str)
Loop
ANSI vs. UNICODE
There is another feature that factors in, namely how a new instance of
the Windows XP command interpreter is invoked. See CMD /? for the
options. Essentially
/A Causes the output of internal commands to a pipe or file to be ANSI
/U Causes the output of internal commands to a pipe or file to be Unicode
It appears that /A is the default.
Using the example of this item we could write
@echo off & setlocal enableextensions
::
:: Convert ANSI characters into UNICODE characters
::
:: Make a test file with PC characters (assuming that CMD has been thus invoked)
echo This is a test file in Finnish (which uses Scandinavian characters)>"testin.txt"
echo Tämä on testitiedosto ääkösten testaamiseksi.>>"testin.txt"
echo Lisää: åäö ÅÄÖ>>"testin.txt"
echo åäöüéçæñÅÄÖÜÉÇÆÑ>>"testin.txt"
::
:: Do a conversion to Unicode (/a would be to ANSI)
cmd /u /c type "testin.txt" > "testout.txt"
::
:: See that the result is what one expected
notepad "testout.txt"
::
:: Clean up
for %%f in ("testin.txt" "testout.txt") do if exist %%f del %%f
endlocal & goto :EOF
With Notepad we get
[screenshot not reproduced]
but in a CLI opened with the defaults we get
[screenshot not reproduced]
The above raises an additional question: what is actually in the above
file? In hex
[hex dump not reproduced]
It is obvious that there is much padding with the nul 00 characters.
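The nul padding is what one would expect if the file is UTF-16 little-endian text, where each of these characters takes two bytes and the second byte is 00. That is my assumption about what CMD /U writes. The short Python sketch below illustrates the encoding itself rather than dumping the actual file, and sample is just an example string.

```python
sample = "Lisää"

ansi_bytes = sample.encode("latin-1")        # one byte per character
unicode_bytes = sample.encode("utf-16-le")   # two bytes per character

print(ansi_bytes.hex(" "))      # 4c 69 73 e4 e4
print(unicode_bytes.hex(" "))   # 4c 00 69 00 73 00 e4 00 e4 00
```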
If you wish to filter them, the easiest solution is to use a UNIX tr
port. Let's rename it unxtr.exe for identification. Then
@echo off & setlocal enableextensions
type "testout.txt"|unxtr -d \000
endlocal & goto :EOF
will give the text without the nul padding.
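What tr -d \000 does can be mimicked in Python as shown below. A sketch only; strip_nuls is a hypothetical name. Note the limitation it shares with the tr approach: dropping the 00 bytes recovers readable text only while every character code is below 256, as is the case for the characters in this item.

```python
def strip_nuls(data: bytes) -> bytes:
    """Delete every 00 byte, like tr -d '\\000'."""
    return data.translate(None, b"\x00")


if __name__ == "__main__":
    padded = "åäö ÅÄÖ".encode("utf-16-le")
    print(strip_nuls(padded) == "åäö ÅÄÖ".encode("latin-1"))
```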
However, if a TR.EXE port is not available, a Visual Basic Script
(VBScript) aided command line script can be applied:
@echo off & setlocal enableextensions
::
:: Build a Visual Basic Script and run it
set vbs_="%temp%\tmp$$$.vbs"
set skip=
findstr "'%skip%VBS" "%~f0" > %vbs_%
cscript //nologo %vbs_%
::
:: Clean up
for %%f in (%vbs_%) do if exist %%f del %%f
endlocal & goto :EOF
'
'The Visual Basic Script
Dim StdIn, StdOut, char, chr0 'VBS
Set StdIn = WScript.StdIn 'VBS
Set StdOut = WScript.StdOut 'VBS
'
chr0 = Chr(0) 'VBS
Do While Not StdIn.AtEndOfStream 'VBS
char = StdIn.Read(1) 'VBS
If char <> chr0 Then 'VBS
StdOut.Write char 'VBS
End If 'VBS
Loop 'VBS
Usage: cmdfaq < "testout.txt"
You might also find the information given by the Windows Character
Map, C:\WINDOWS\system32\charmap.exe, of interest.