CSV file UTF-8 (with BOM) to ANSI / Windows-1251

Question:

I'm looking to create a batch file / macro that removes the first line of an auto-generated UTF-8 CSV and converts the file to Windows code page 1251 ("ANSI"). I've been looking all over the internet and have tried a lot of things, but I just can't find one that works.

Removing the first line is simple:

@echo off
set "csv=test.csv"
more +1 "%csv%" >"%csv%.new"
move /y "%csv%.new" "export\%csv%" >nul

After that I'm lost. I've tried using the TYPE command from DOS:

cmd /a /c TYPE test.csv > ansi.csv

and many variations on this, but it either returns an empty CP1251 file or just another UTF file.

I've tried using VBScript, but this returned another UTF-8 file, now without a BOM:

Option Explicit

Private Const adReadAll = -1
Private Const adSaveCreateOverWrite = 2
Private Const adTypeBinary = 1
Private Const adTypeText = 2
Private Const adWriteChar = 0

Private Sub UTF8toANSI(ByVal UTF8FName, ByVal ANSIFName)
    Dim strText

    With CreateObject("ADODB.Stream")
        .Open
        .Type = adTypeBinary
        .LoadFromFile UTF8FName
        .Type = adTypeText
        .Charset = "utf-8"
        strText = .ReadText(adReadAll)
        .Position = 0
        .SetEOS
        .Charset = "_autodetect" 'Use current ANSI codepage.
        .WriteText strText, adWriteChar
        .SaveToFile ANSIFName, adSaveCreateOverWrite
        .Close
    End With
End Sub

UTF8toANSI "UTF8-wBOM.txt", "ANSI1.txt"
UTF8toANSI "UTF8-noBOM.txt", "ANSI2.txt"
MsgBox "Complete!", vbOKOnly, WScript.ScriptName

I also tried converting to iso-8859-1 instead of cp1251 using VBScript:

Option Explicit

Private Const adReadAll = -1
Private Const adSaveCreateOverWrite = 2
Private Const adTypeBinary = 1
Private Const adTypeText = 2
Private Const adWriteChar = 0

Private Sub UTF8toANSI(ByVal UTF8FName, ByVal ANSIFName)
  Dim strText

  With CreateObject("ADODB.Stream")
    .Open
    .Type = adTypeBinary
    .LoadFromFile UTF8FName
    .Type = adTypeText
    .Charset = "utf-8"
    strText = .ReadText(adReadAll)
    .Position = 0
    .SetEOS
    .Charset = "iso-8859-1"
    .WriteText strText, adWriteChar
    .SaveToFile ANSIFName, adSaveCreateOverWrite
    .Close
  End With
End Sub

UTF8toANSI WScript.Arguments(0), WScript.Arguments(1)

But this doesn't work either.

EDIT 2: I found a way to convert the files from UTF to ANSI using stringconverter.exe (downloaded from http://www.computerperformance.co.uk/ezine/tools.htm):

Setlocal
Set _source=C:\Users\lloyd.EVD\delFirstBat\import
Set _dest=C:\Users\lloyd.EVD\delFirstBat\export
For /F "Tokens=*" %%I In ('dir /b /a-d "%_source%\*.CSV"') Do stringconverter "%_source%\%%~nxI" "%_dest%\%%~nxI" /ANSI

However, now when I remove the first line of the file (either before or after the conversion, it doesn't matter), it becomes UTF-8 without BOM again.

So all I should need now is a script that deletes the first row without changing the charset.

Answer:

The following VBScript could help: the procedure UTF8toANSI converts a UTF-8 encoded text file to another encoding.

Option Explicit

Private Const adReadAll = -1
Private Const adSaveCreateOverWrite = 2
Private Const adTypeBinary = 1
Private Const adTypeText = 2
Private Const adWriteChar = 0

Private Sub UTF8toANSI(ByVal UTF8FName, ByVal ANSIFName, ByVal ANSICharSet)
  Dim strText

  With CreateObject("ADODB.Stream")
    .Type = adTypeText

    .Charset = "utf-8"
    .Open
    .LoadFromFile UTF8FName
    strText = .ReadText(adReadAll)
    .Close

    .Charset = ANSICharSet
    .Open
    .WriteText strText, adWriteChar
    .SaveToFile ANSIFName, adSaveCreateOverWrite
    .Close
  End With
End Sub

'UTF8toANSI WScript.Arguments(0), WScript.Arguments(1)
UTF8toANSI "D:\test\SO\38835837utf8.csv", "D:\test\SO\38835837ansi1250.csv", "windows-1250"
UTF8toANSI "D:\test\SO\38835837utf8.csv", "D:\test\SO\38835837ansi1251.csv", "windows-1251"
UTF8toANSI "D:\test\SO\38835837utf8.csv", "D:\test\SO\38835837ansi1253.csv", "windows-1253"
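
For comparison, the same read-as-UTF-8, write-as-cp1251 idea can be combined with the first-line removal in one step. The following is only a rough sketch in Python rather than the VBScript used in this thread; the function name convert_csv, its parameters, and the file names in the demo are hypothetical and not part of the original answer.

```python
import os
import tempfile
from pathlib import Path


def convert_csv(src, dst, target_encoding="cp1251", skip_lines=1):
    """Decode a UTF-8 file (BOM optional), drop leading lines, re-encode."""
    raw = Path(src).read_bytes()
    # "utf-8-sig" strips a leading BOM if present and is harmless otherwise.
    text = raw.decode("utf-8-sig")
    # Split off the first skip_lines lines; the rest is kept untouched,
    # including its original CRLF or LF line endings.
    parts = text.split("\n", skip_lines)
    body = parts[-1] if len(parts) > skip_lines else ""
    # errors="replace" writes "?" for characters the target code page lacks.
    Path(dst).write_bytes(body.encode(target_encoding, errors="replace"))


if __name__ == "__main__":
    # Hypothetical demo data: a BOM, a header row, and one Cyrillic row.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "test.csv")
        dst = os.path.join(tmp, "ansi.csv")
        Path(src).write_bytes("\ufeffheader\r\nрусский;1\r\n".encode("utf-8"))
        convert_csv(src, dst)
        print(Path(dst).read_bytes().decode("cp1251"), end="")
```

Working on bytes for both input and output keeps the encoding fully under the script's control, so removing the line cannot silently change the charset.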

For a list of the character set names known to a system, see the subkeys of HKEY_CLASSES_ROOT\MIME\Database\Charset in the Windows Registry:

for /F "tokens=5* delims=\" %# in ('reg query HKCR\MIME\Database\Charset') do @echo "%#"

Data (file 38835837utf8.csv):

1st Line    1250    852 čeština (Čechie)
2nd Line    1251    966 русский (Россия)
3rd Line    1253    737 ελληνικά (Ελλάδα)

The output shows that characters that can't be converted to a particular character set are converted using Character Decomposition Mapping (č=>c, š=>s, Č=>C, etc.); where that is not applicable, they are all converted to a ? question mark (the common replacement character):

==> chcp 1250
Active code page: 1250

==> type D:\test\SO\38835837ansi1250.csv
1st Line        1250    852     čeština (Čechie)
2nd Line        1251    966     ??????? (??????)
3rd Line        1253    737     ???????? (??????)

==> chcp 1251
Active code page: 1251

==> type D:\test\SO\38835837ansi1251.csv
1st Line        1250    852     cestina (Cechie)
2nd Line        1251    966     русский (Россия)
3rd Line        1253    737     ???????? (??????)

==> chcp 1253
Active code page: 1253

==> type D:\test\SO\38835837ansi1253.csv
1st Line        1250    852     cestina (Cechie)
2nd Line        1251    966     ??????? (??????)
3rd Line        1253    737     ελληνικά (Ελλάδα)
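
The fallback seen in the output above (strip the accent where possible, otherwise write ?) can be approximated outside ADODB.Stream. The following is a hedged Python sketch, not the mechanism Windows actually uses: it assumes that Unicode canonical decomposition (NFD) followed by dropping combining marks is close enough to the observed behaviour, and the function name encode_with_fallback is hypothetical.

```python
import unicodedata


def encode_with_fallback(text, encoding):
    """Encode text; for unencodable characters try the unaccented base, else '?'."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
            continue
        except UnicodeEncodeError:
            pass
        # Decompose (e.g. "č" becomes "c" plus a combining caron), drop the marks.
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        try:
            out += base.encode(encoding) if base else b"?"
        except UnicodeEncodeError:
            # Neither the character nor its base letter exists in the code page.
            out += b"?"
    return bytes(out)


if __name__ == "__main__":
    for sample in ("čeština (Čechie)", "русский (Россия)", "ελληνικά (Ελλάδα)"):
        print(encode_with_fallback(sample, "cp1251").decode("cp1251"))
```

Only characters that fail to encode are decomposed, so letters such as й, which exist in cp1251 but also have a canonical decomposition, are written unchanged.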
