Daily Learning Notes for February 4, 2016

Today I’ll start writing my very first assembler and hopefully finish the first half of the NAND2Tetris course! I read chapter 6 and thoroughly reviewed the specifications for the week 6 project yesterday, so now I’m ready to get my hands dirty with translating assembly into machine code! Woohoo!

See all the code in my GitHub repository here. (Don’t look at it until after you’ve already completed the project yourself, though!)

Version 1 assembler without symbols

Following the suggestion in the book, I’m first building a version of the assembler that does not handle symbols at all. But where do I start? Ugh, my brain isn’t working today… I already had some caffeine, but it isn’t helping, and I can’t have more caffeine yet because it will probably upset my stomach!

OK, time to get my pen and paper. That usually helps get my brain unstuck.

Stuff I need to do in my assembler:

  • Ignore comments and whitespace
  • Classify command types for A-instructions and C-instructions (then handle user-defined labels in version 2)
  • Extract numbers from A-instructions
  • Convert numbers from decimal to binary
  • Extract fields from C-instructions
  • Convert fields to binary codes
  • Assemble binary codes into machine code instructions, one per line

OK, I’m going to tackle each of these functions one at a time.

Identifying whitespace with JavaScript

I realize that JavaScript probably isn’t the best language for writing an assembler; it really wasn’t designed for this! But I can make it work. I’m pretty rusty on my JavaScript, so this is a good excuse to review. And I totally forgot how to check for whitespace!

First, there’s the trim() method, which conveniently removes whitespace from both ends of a string. But I could write my assembly code like D = D+1; JEQ and it should still work, so I want to remove all whitespace.

To do that, I need regular expressions! The special character \s matches any whitespace character, including tabs and new lines and other stuff like that. I’ll use the replace() method with the global flag to remove every instance of one or more whitespace characters in my string. (To remove a piece of a string, I’m just replacing it with an empty string.)

So here’s my function:

parser.removeWhitespace = function(str) {
    return str.replace(/\s+/g, '');
};

I tested it out in my browser’s console and it works like a charm! This function will definitely come in handy for many future projects. So, note to self: be sure to talk about whitespace and regular expressions in any introductory programming classes I teach!

Identifying comments

In the Hack assembly language, comments begin with // and can appear on their own line or at the end of a line of assembly code. My assembler will have to remove the comments, because comments are only meant for human eyes and have no impact on how the code gets translated into machine language.

I can use the replace() method again with a regular expression to match everything from the // to the end of the string, and just replace it with an empty string to remove it:

parser.removeComments = function(str) {
    return str.replace(/\/\/.*$/g, '');
};

Regular expressions always look ridiculous, haha. I’m using backslashes to escape my forward slashes (\/\/) because JavaScript uses forward slashes to mark the start and end of a regular expression. The .* matches zero or more of any character and the $ anchors it to the end of the string, so this regular expression matches anything beginning with // and going until the end of the string.

OK, that wasn’t so hard after all! I’ve done regular expression before, so it’s good to know that I haven’t forgotten them completely.

Classifying Hack assembly commands

As specified in the book, the Hack computer’s assembly language has A-instructions like @1337 and C-instructions like M=D-1. I need to identify which is which, because that will determine how I parse the instruction code.

Since I’m not worrying about labels and I’ve already stripped out all the whitespace and comments, I think I can get away with classifying the instructions based solely on whether the first character in the string is @ or not:

// takes a command, returns A or C for instruction types. TODO: return L for labels
parser.commandType = function(str) {
    if (str.charAt(0) === '@') {
        return 'A';
    } else {
        return 'C';
    }
};

JavaScript’s charAt() method returns the single character at the given index in the string, index 0 being the first character in the string. So that was easy!

Extracting mnemonics from commands

Next, I have four functions for extracting the assembly code mnemonic out of an instruction string: one for the A-instructions, and one for each of the three fields of a C-instruction.

The first one seems super easy; I can just use the slice() method to chop off the first character:

// takes an instruction string, returns decimal number of A-instruction. TODO: handle symbols and labels
parser.symbol = function(str) {
    return str.slice(1);
};

Extracting fields from a C-instruction won’t be quite so easy, but it shouldn’t be difficult either.

I forgot exactly what the C-instructions can look like, so I dug through the book again and found some clarification:

  • The default form of a C-instruction is dest=comp;jump
  • The dest or jump fields can be empty
  • If the dest field is empty, the = is omitted
  • If the jump field is empty, the ; is omitted

So D=D+1;JEQ is valid, but so is D+1;JEQ and D=D+1. So my parser needs to look for optional equal signs and optional semicolons. Hmm…

Maybe I’m missing something, but this seems very simple: I’m always going to have two or three fields, and they will always be separated by either an equal sign, a semicolon, or one of each. If that’s true, I can just use the split() method a couple times to filter everything out and identify them this way:

// takes an instruction string, returns object identifying its fields
parser.operationFields = function(str) {
    var fields = {};
    var equalsSplit = str.split(/=/);
    if (equalsSplit.length == 1) {
        // no equal sign, so this is a comp;jump code
        var semicolonSplit = equalsSplit[0].split(/;/);
        fields.comp = semicolonSplit[0];
        fields.jump = semicolonSplit[1];
        fields.dest = null;
    } else if (equalsSplit.length == 2) {
        var semicolonSplit = equalsSplit[1].split(/;/);
        if (semicolonSplit.length == 1) {
            // no semicolon, so this is a dest=comp code
            fields.dest = equalsSplit[0];
            fields.comp = semicolonSplit[0];
            fields.jump = null;
        } else if (semicolonSplit.length == 2) {
            // has equals and semicolon, so this is a dest=comp;jump code
            fields.dest = equalsSplit[0];
            fields.comp = semicolonSplit[0];
            fields.jump = semicolonSplit[1];
        }
        
    }
    return fields;
};

This doesn’t seem like the most efficient way to do it, but it makes intuitive sense to me: I split it once to identify one possibility, then split it again to identify the other two possibilities.

Now I should be able to translate the fields into machine code with my other not-yet-written functions like this:

// takes an object containing C-instruction fields, returns full machine code instruction
translator.getCInstructMachineCode = function(fields) {
    var destCode = translator.dest(fields.dest);
    var compCode = translator.comp(fields.comp);
    var jumpCode = translator.jump(fields.jump);
    // append complete C-instruction machine code as a line in the assembler's output string
    return '111' + compCode + destCode + jumpCode;
};

Next, I need to implement those little helper functions to take the Hack assembly mnemonics for destinations, computations and jumps and translate them into machine code.

Translating assembly mnemonics into machine code

They mentioned the idea of a hash table in the book for storing pairs of mnemonics and their associated machine codes, and I think the equivalent data structure in JavaScript is an objec or associative array. So I’ll just copy the codes from the book’s specifications and throw them into my functions like this:

// takes a mnemonic field from C instruction, returns 3-bit binary code for dest
translator.dest = function(field) {
    var dest = {
        'null': '000',
        'M':    '001',
        'D':    '010',
        'MD':   '011',
        'A':    '100',
        'AM':   '101',
        'AD':   '110',
        'AMD':  '111'
    };
    return dest[field];
};

// takes a mnemonic field from C instruction, returns 7-bit binary code for comp
translator.comp = function(field) {
    var comp = {
        '0':    '0101010',
        '1':    '0111111',
        '-1':   '0111010',
        'D':    '0001100',
        'A':    '0110000',
        '!D':   '0001101',
        '!A':   '0110001',
        '-D':   '0001111',
        '-A':   '0110011',
        'D+1':  '0011111',
        'A+1':  '0110111',
        'D-1':  '0001110',
        'A-1':  '0110010',
        'D+A':  '0000010',
        'D-A':  '0010011',
        'A-D':  '0000111',
        'D&A':  '0000000',
        'D|A':  '0010101',
        'M':    '1110000',
        '!M':   '1110001',
        '-M':   '1110011',
        'M+1':  '1110111',
        'M-1':  '1110010',
        'D+M':  '1000010',
        'D-M':  '1010011',
        'M-D':  '1000111',
        'D&M':  '1000000',
        'D|M':  '1010101'        
    };
    return comp[field];
};

// takes a mnemonic field from C instruction, returns 3-bit binary code for jump
translator.jump = function(field) {
    var jump = {
        'null': '000',
        'JGT':  '001',
        'JEQ':  '010',
        'JGE':  '011',
        'JLT':  '100',
        'JNE':  '101',
        'JLE':  '110',
        'JMP':  '111'
    };
    return jump[field];
};

Phew, that was tedious! But very easy.

Next, I’m going to make a little helper function for translating the addresses extracted from A-instructions from deciaml to binary. For this, I’ll use the parseInt() function and the toString() method to convert to decimal and then just pad the string with leading zeros as needed:

// takes a string representing a decimal number, returns a string representing a 16-bit binary number
translator.getBinary16 = function(decimalString) {
    var decimalNum = parseInt(decimalString, 10);
    var binaryString = decimalNum.toString(2);
    if (binaryString.length > 16) {
        // if larger than 16 bits in binary, truncate the string
        return binaryString.slice(0,16);
    }
    // pad with leading zeros if needed
    while (binaryString.length < 16) {
        binaryString = '0' + binaryString;
    }
    return binaryString;
};

Hurray, it works! I forgot how to pad numbers at first, so that took me a couple tries.

Putting it all together!

Now I get to put all these functions to use! Hmm, but now that I think about it, the way I set up all these helper functions as methods of objects won’t play nice with my forEach loop. I really don’t need those extra objects, anyhow. I’ll just rename everything…

OK, everything is renamed. Just a bunch of functions now. Here’s the code for my assembler using all those helper functions:

var assemblerOutput = '';
// run through the assembly code one line at a time
var instructionArray = asmProgram.split('\n');
instructionArray.forEach( function (instruction) {
    instruction = removeWhitespace(instruction);
    instruction = removeComments(instruction);			
    if (commandType(instruction) === 'A') {
        // convert A-instructions to binary representation
        assemblerOutput += getBinary16( getSymbol(instruction) ) + '\n';				
    } else if (commandType(instruction) === 'C') {
        // convert C-instructions to binary representation
        assemblerOutput += getCInstructMachineCode( operationFields(instruction) ) + '\n';
    }
});	
return assemblerOutput;

Time to test it! I’m just going to run this code in my browser’s console by pasting the entire thing in there and typing in a string containing a couple of made-up assembly instructions.

Uh oh! It’s not working as expected. If I run D=D+1;JEQ through my assembler, the output is "1110011111010undefined\n". That isn’t right…

First, a quick fix for one issue: I should strip whitespace from the final string. OK, done.

Now for the weird part: why am I getting “undefined”?! Clearly, I need to throw in some console.log() statements to see what’s going on in there.

15 minutes later: Found it! I made a typo of sorts and was trying to match the jump mnemonic codes against the destination mnemonic codes, and so my helper function was returning undefined because it didn’t find a match in my hash table thingy. Fixed!

Now to test it with the provided test scripts, which include comments and whitespace…

Oh no! It isn’t happy with comments. Well, I need to stop for now and go to an event tonight. I’ll commit my changes and pick up where I left off later.

Learning summary (#TIL)

  • Finished version 1 of my first assembler. It’s broken, but it’s a start!

Next steps

Time breakdown

  • Study time: 3 hours 34 min
    • Working on project: 3 hours 34 min