Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


javascript - How can I improve the performance of this CSV parsing code?

I'm trying to parse a large CSV file (this one, to be exact) into a Map from numbers to objects. Because the file is big and may take a while to download, the code parses it while it is still downloading, to avoid doing all the work at once after the download finishes. Here's the code:

const unicodeDataReader = (await fetch("data/ucd/UnicodeData.txt")).body.getReader();

const decoder = new TextDecoder();
let chunk, done;
let codePoint, column = 0, fieldBytes = [], codePointObj = {};
while ({ value: chunk, done } = await unicodeDataReader.read(), !done) {
  for (const byte of chunk) {
    if (byte === 0x3B) { // ;
      const a = new Uint8Array(fieldBytes);
      const field = decoder.decode(a);
      switch (column) {
        case 0:
          codePoint = Number.parseInt(field, 16);
          break;
        case 1:
          codePointObj.name = field;
          break;
        case 2:
          codePointObj.generalCategory = field;
          break;
      }
      fieldBytes.length = 0;
      column++;
    } else if (byte === 0x0A) { // \n
      ucd.codePoints.set(codePoint, codePointObj);
      fieldBytes.length = 0;
      column = 0;
      codePointObj = {};
    } else {
      fieldBytes.push(byte);
    }
  }
}

However, this code performs very poorly (even when downloading from localhost) and I don't know why. Chrome DevTools says that the lines that take the most time to execute are:

const a = new Uint8Array(fieldBytes);
const field = decoder.decode(a);

The strangest part is that the following similar approach performs much better, although it may break if a character is split between two chunks. (That doesn't happen with this file, because it contains only ASCII characters, but I'm planning to adapt this code for other similar files.)

const unicodeDataReader = (await fetch("data/ucd/UnicodeData.txt")).body.getReader();

const decoder = new TextDecoder();
let chunk, done;
let codePoint, column = 0, field = "", codePointObj = {};
while ({ value: chunk, done } = await unicodeDataReader.read(), !done) {
  for (const char of decoder.decode(chunk)) {
    if (char === ";") { // ;
      switch (column) {
        case 0:
          codePoint = Number.parseInt(field, 16);
          break;
        case 1:
          codePointObj.name = field;
          break;
        case 2:
          codePointObj.generalCategory = field;
          break;
      }
      field = "";
      column++;
    } else if (char === "\n") { // \n
      ucd.codePoints.set(codePoint, codePointObj);
      field = "";
      column = 0;
      codePointObj = {};
    } else {
      field += char;
    }
  }
}
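
On the split-character concern: TextDecoder accepts a { stream: true } option that holds back an incomplete trailing byte sequence until the next call. A minimal sketch, assuming UTF-8 input; decodeChunks and the sample bytes are made up for illustration:

```javascript
// Sketch: decode chunks with { stream: true } so a multi-byte character
// split across two chunks is reassembled instead of becoming U+FFFD.
// decodeChunks is a hypothetical helper, not part of the code above.
function decodeChunks(chunks) {
  const decoder = new TextDecoder(); // UTF-8 by default
  let text = "";
  for (const chunk of chunks) {
    // stream: true keeps an incomplete trailing sequence buffered
    text += decoder.decode(chunk, { stream: true });
  }
  // A final call without arguments flushes anything still buffered
  text += decoder.decode();
  return text;
}

// "é" is 0xC3 0xA9 in UTF-8; it is deliberately split between two chunks.
const firstChunk = new Uint8Array([0x61, 0x3B, 0xC3]); // "a;" + first byte of "é"
const secondChunk = new Uint8Array([0xA9, 0x0A]);      // second byte of "é" + "\n"
const decodedText = decodeChunks([firstChunk, secondChunk]);
console.log(JSON.stringify(decodedText));
```

In the loop above, this would mean calling decoder.decode(chunk, { stream: true }) instead of decoder.decode(chunk).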

I thought the problem was that decoder.decode() was being called too often, or that creating a Uint8Array was slow. However, that doesn't seem to be the issue, because this code runs very fast:

const td = new TextDecoder();
const decoded= [];
for (let i=0;i<10000;i++) {
  // Generate a Uint8Array with 100 random bytes
  const a = new Uint8Array(function*(){for(let i=0;i<100;i++)yield Math.floor(256*Math.random())}());
  const b = [];
  for (const byte of a)
    b.push(byte);
  const c = new Uint8Array(b);
  decoded.push(td.decode(c));
}

How can I improve the performance of my code?

P.S.: I don't have network throttling enabled. The code is slow and the main thread freezes for seconds.



1 Reply


Have you already tried using streams? Take a look at:
https://www.npmjs.com/package/csv-stream

npm install csv-stream

var csv = require('csv-stream');
var request = require('request');
 
// All of these arguments are optional.
var options = {
    delimiter : '\t', // default is ,
    endLine : '\n', // default is \n
    columns : ['columnName1', 'columnName2'], // by default read the first line and use values found as columns
    columnOffset : 2, // default is 0
    escapeChar : '"', // default is an empty string
    enclosedChar : '"' // default is an empty string
}
 
var csvStream = csv.createStream(options);
request('http://mycsv.com/file.csv').pipe(csvStream)
    .on('error',function(err){
        console.error(err);
    })
    .on('header', function(columns) {
        console.log(columns);
    })
    .on('data',function(data){
        // outputs an object containing a set of key/value pair representing a line found in the csv file.
        console.log(data);
    })
    .on('column',function(key,value){
        // outputs the column name associated with the value found
        console.log('#' + key + ' = ' + value);
    });
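
The example above relies on Node.js-style require calls, while the question runs in the browser. Here is a rough sketch of the same streaming idea without extra dependencies: it buffers decoded text and splits it on whole lines instead of collecting bytes field by field. parseRecords, onRecord and sampleChunks are hypothetical names, and the sample lines only imitate the UnicodeData.txt layout. No timing has been measured, so check any speed-up with a profiler.

```javascript
// Sketch: parse semicolon-separated records line by line from byte chunks.
// byteChunks is any async iterable of Uint8Array (an assumption).
async function parseRecords(byteChunks, onRecord) {
  const decoder = new TextDecoder();
  let pending = ""; // text received after the last complete line
  for await (const chunk of byteChunks) {
    pending += decoder.decode(chunk, { stream: true });
    const lines = pending.split("\n");
    pending = lines.pop(); // last piece is an incomplete line (or "")
    for (const line of lines) {
      if (line !== "") onRecord(line.split(";"));
    }
  }
  pending += decoder.decode(); // flush bytes still held by the decoder
  if (pending !== "") onRecord(pending.split(";"));
}

// Stand-in for a network stream: two sample lines cut at an arbitrary byte.
async function* sampleChunks() {
  const bytes = new TextEncoder().encode(
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n" +
    "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;\n"
  );
  yield bytes.subarray(0, 10);
  yield bytes.subarray(10);
}

const codePoints = new Map();
const parsing = parseRecords(sampleChunks(), ([hex, name, generalCategory]) => {
  codePoints.set(Number.parseInt(hex, 16), { name, generalCategory });
});
parsing.then(() => console.log(codePoints.size));
```

With fetch, the response body can be passed directly in runtimes where ReadableStream is async-iterable. Elsewhere, a small async generator around reader.read() would serve the same purpose.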
