How to create a simple verilog module

To use C-to-Verilog.com you don't need to learn a new syntax or programming language. Simply write C code. Your C function parameter-list will become your module interface. Consider the problem of calculating the sum of numbers in an array. On an FPGA the array will reside on a BRAM module. To create a hardware circuit, we would write the following code:

 unsigned int sum_array(char* A, unsigned int n) {
	unsigned int sum=0;
	for (unsigned int i=0;i<n; i++) {
		sum += A[i];
	}
	return sum;
} 

The code above will be pipelined so that on each cycle the sum will increase by the correct value.

Where can I find more test programs ?

The CHStone benchmark suite is a collection of programs from various domains, some of which originally belong to other benchmark suites. CHStone can be used to test High-Level Synthesis tools.

How to implement a CRC32 module

The Cyclic Redundancy Check (CRC32) can be used as a checksum to detect the accidental alteration of data during transmission or storage. It is used in Ethernet packets to verify the content and detect errors. It is a common belief that CRC32 is best implemented manually in hardware. There are most certainly optimized CRC32 Verilog modules for FPGAs and ASIC in every Ethernet controller. However, CRC32 is a good use-case for using "C-to-Verilog". It demonstrates usage of memory ports and pipelining of loops.
This code is derived from Zlib:

/* crc32.c -- compute the CRC-32 of a data stream
 * Copyright (C) 1995-1996 Mark Adler
 * For conditions of distribution and use, see copyright notice in zlib.h 
 */

typedef unsigned long ulong;

ulong crc32(ulong crc, unsigned char* buf, ulong offset, ulong len, ulong *crc_table) {
    crc = crc ^ 0xffffffffL;
    for (unsigned int i=0; i<len; i++) {
        crc = crc_table[(crc ^ buf[offset + i]) & 0xff] ^ (crc >> 8);
    }
    return crc ^ 0xffffffffL;
}

CRC32 has an 8-bit lookup table (for 32-bit) words. In our code we declared it as a parameter. This means that this memory will be stored on an external BRAM on your FPGA. As a rule, look-up tables should reside on external BRAMs unless their content can be expressed in logical gates. In our CRC32 module we added an offset parameter. This parameter was added for a reason. We plan to use our function from other C functions and only then synthesize them into hardware. If we want to verify the CRC32 of an Ethernet packet then, we will need to instruct the function to look at the packet in an offset. Since we cannot "cast" pointers to array locations on "C-to-Verilog.com" we will have to pass the function two parameter. The first is the arrays, and the second is the offset of the packet within the array.
You can use this function in your code in addition to other methods if you add "static inline" before the function. This will ensure that the code will be inlined into one module.

How to sort the content of a BRAM

If you want to sort the content of a BRAM, you need to write "C" code in exactly the same way as you would on a CPU. On an FPGA, you would need to connect the BRAM to the first memory port. The second parameter in the code below, N, will be passed to the module and will indicate the range to sort.
Notice that the swap will be twice as fast if you assign two memory ports to the BRAM. However, this code cannot be pipelined and will be mostly serial. If you want to implement a parallel sort then you will need to use several arrays with different memory ports for each of them. This code implements "Bubble sort", which is efficient only for very small sizes.

 void bubble_sort(unsigned int ar[], unsigned int n) {
    unsigned int i,j;
    unsigned int temp;
    //sort!
    for (i=0;i<n;i++) {
        for (j=0;j<n-1;j++) {
            if (ar[j] > ar[j+1]) {
                // swap
                temp = ar[j+1];
                ar[j+1] = ar[j];
                ar[j] = temp;
            }
        }
    }
}

How to parse Ethernet packets

Ethernet (IEEE 802.3) is a family of network protocols, and writing software for parsing Ethernet packets and synthesizing it into hardware can save you time. The same idea can be implemented with other protocols such as IP or UDP (which have similar headers). The TCP/IP protocol is more complicated and the use of an external core or CPU is preferred. In the code below, we defined two functions. The first function checks if the header of the Ethernet packet is correct. It looks only at the preamble and the delimiter of the packet. You can add code to examine the source and destination fields. It's easy to do it in C code. If the packet header is correct, the second function will calculate the packet size and if the size is reasonable then it will copy the data into the 'decoded' buffer. C-to-Verilog will create a module from this C code. Notice that you can easily use the CRC32 function from the previous HOW-TO to verify the content of this packet.

// Offsets of fields inside IEEE 802.3 Packet
#define MAC_SRC 8  	// MAC source
#define MAC_DST 14 	// MAC destination
#define MAC_LEN 20 	// Packet Length
#define MAC_PAYLOAD 22 	// Packet payload
// Protocol constants (ETH header)
#define MAC_PREAMBLE 170
#define MAC_DELIM    171

#define MAX_PACKET_SIZE 1500

#define ERROR 1
#define OKAY  0

typedef unsigned char octate;

static inline int assert_packet_header(octate* packet) {
    for (int i=0; i<7; i++) if (packet[i] != MAC_PREAMBLE) return ERROR;
    if (packet[7] != MAC_DELIM) return ERROR;
    return OKAY;
}

// This method is the entry point to the module. Our module will strip only one
// Ethernet packet, starting at raw_offset. You can write another C function
// to call this function several times over a buffer of packets. 
int strip_one_packet(octate* raw, unsigned int raw_base, octate* decoded, unsigned int decoded_base) {

    if (assert_packet_header(raw)) return ERROR; //Error in packet header

    // Read the packet size
    unsigned int packet_len = (raw[raw_base + MAC_LEN]<<8 )
        + raw[raw_base + MAC_LEN +1]; 

    // We do not support jumbo packets
    if (packet_len > MAX_PACKET_SIZE) return ERROR; 

    // Copy the content of the package into the "decoded" port
    for (int i=0; i<packet_len; i++) {
        decoded[decoded_base++] = raw[raw_base + MAC_PAYLOAD + i]; 
    }
    return OKAY;
} 

How to implement Hadamard Transform

Hadamard transform (Also known as Walsh-Transform) is a generalized class of Fourier transforms. It decomposes an arbitrary input vector into a superposition of Walsh functions. This function is used in signal processing, data compression and video compression applications (MPEG-4). The code below is from the NVidia CUDA SDK. In order to get the best results, assign multiple memory ports to the arrays. This will enable C-to-Verilog to perform multiple reads and multiple writes on each cycle.

 unsigned int fastWalshTransform(unsigned int *h_Output, unsigned int *h_Input, unsigned int log2N){
    const unsigned int N = 1 << log2N;
    for(unsigned int pos = 0; pos < N; pos++) h_Output[pos] = h_Input[pos];

    //Cycle through stages with different butterfly strides
    for(unsigned int stride = N / 2; stride >= 1; stride >>= 1){
        //Cycle through subvectors of (2 * stride) elements
        for(unsigned int base = 0; base < N; base += 2 * stride)
            //Butterfly index within subvector of (2 * stride) size
            for(unsigned int j = 0; j < stride; j++){
                unsigned int i0 = base + j +      0;
                unsigned int i1 = base + j + stride;
                unsigned int T1 = h_Output[i0];
                unsigned int T2 = h_Output[i1];
                h_Output[i1] = T1 - T2;
                h_Output[i0] = T1 + T2;
            }
    }
    return 0;
} 

How to implement wavelet transform

Discrete Wavelet Transform (DWT) has a number of applications in engineering and science. Most notably, it is used for signal coding, to represent a discrete signal in a more redundant form, often as a precondition for data compression. The code below is an ideal candidate for pipelining. The read-source and the write-target are two different arrays. In other words, there is no logical dependency between the two arrays. This means that every cycle one result word will be written to the output buffer.
The Haar wavelet:

 int wavelet_transform(unsigned int* input, unsigned int* output,unsigned int length)  {
    //This function assumes input.length=2^n, n>1
    for (unsigned int len = length >> 1; len > 1 ; len >>= 1) {
        //length=2^n, WITH DECREASING n
        for (unsigned int i = 0; i < len; i++) {
            unsigned int sum = input[i*2]+input[i*2+1];
            unsigned int difference = input[i*2]-input[i*2+1];
            output[i] = sum;
            output[len+i] = difference;
        }
    }
}

How to implement color conversion

There are multiple color representation models. YUV and RGBA are both very popular color models. The ability to convert from one format to another is needed by image and video compressors and display systems. The code below converts from YUV to RGB and it is taken from the libOil software package. The Y,U, and V values are read from three arrays and the RGB values are stored into a single array. The verilog design will create one or more memory ports, depending on your hardware configuration, for each one of the arrays.
In each loop iteration there are several multiplication operations. These operations will be pipelined. They will be executed one after the other, one cycle apart. The multiplication latency is decided by the ALU latency. Multiple ALU units can increase the size of the design but can reduce the time necessary to complete the arithmetic operations.

#define YUV_TO_R_INT(y,u,v) (((y)*256 + 358*((v)-128))>>8)
#define YUV_TO_G_INT(y,u,v) (((y)*256 - 88*((u)-128) - 183*((v)-128))>>8)
#define YUV_TO_B_INT(y,u,v) (((y)*256 + 454*((u)-128))>>8)

void yuv2rgba_u8 (unsigned char *rgbp, unsigned char *yp, unsigned char *up, unsigned char *vp, int n) {
	for(int i=0;i<n;i++){
		rgbp[4*i+0] = YUV_TO_R_INT(yp[i], up[i], vp[i]);
		rgbp[4*i+1] = YUV_TO_G_INT(yp[i], up[i], vp[i]);
		rgbp[4*i+2] = YUV_TO_B_INT(yp[i], up[i], vp[i]);
		rgbp[4*i+3] = 0;
	}
} 

How to get the pop-count of each word in an array

The population count (or pop count) of a machine word is the number of 1's in a word. For example, the number 5 in binary form is 101, which has the pop-count value of 2. Calculating the pop-count of words is used in data coding algorithms. In the code below we have two functions. The first function calculates the pop-count of a single number. The input number is shifted to the right by one in each loop iteration. If the least important bit is equal to one, then we add it to the sum. Notice that we avoid using if-statements. If-statements, especially inside loop, prevent us from doing many optimizations.
The function which calculates the pop-count is decorated with 'static inline'. This is important because otherwise, two verilog functions will be generated. The second function in the code iterates over an array and saves the pop-count result of each element into a second array. This code is pipelined by "C-to-Verilog". The pipeline which is created in hardware is one that takes a word of data from the memory-port and runs it into a funnel of adders. This tree of adders is always full of data. On each loop iteration (of the outer-loop), a new word is stored into the memory. Without pipelining, the calculation would take between 5 and 32 cycles for each word. Pipelining is done automatically by the software and the latency is overlapped. The code below can be compiled and synthesized into an FPGA and used with one or more memory ports per array.

//returns the number of 1's in a word 
static inline unsigned int popCnt(unsigned int input) { 
    unsigned int sum = 0; 
    for (int i = 0; i < 32; i++) {
        sum += (input) & 1; 
        input = input/2; 
    } 
    return sum; 
} 
// This program will put the popcount
// of each word of B[] in A[]
void my_main(unsigned int* A, unsigned int *B, unsigned int N) { 
    for (int i=0; i<N; i++)  A[i] = popCnt(B[i]); 
}

How to implement SHA-1

The SHA hash functions are cryptographic hash functions designed by the NSA. SHA-1 is the most commonly used of the SHA series. SHA-1 can be used in a variety of applications. It is used by the SSH, SSL and IPSEC protocols and can be found in many other modern protocols.
The "Transform" function is the heart of the SHA1 algorithm. In the transform function the second loop cannot be pipelined well because there is a hard data dependency in the W[i] array. All other loops can be pipelined. The rest of the loops in the program have about 20 iterations each. The C-to-Verilog optimizer will try to unroll this code. This will produce a big circuit. If you disable unrolling the compiler will be able to create smaller circuits. In our case pipelining is also not beneficial. Preparing the operation is too expensive for the low trip count.
Set Unroll=0 Pipeline=NO before synthesizing this program.

/* NIST Secure Hash Algorithm */
/* heavily modified by Uwe Hollerbach uh@alumni.caltech edu */
/* from Peter C. Gutmann's implementation as found in */
/* Applied Cryptography by Bruce Schneier */

typedef unsigned int LONG;

/* SHA f()-functions */
#define f1(x,y,z)	((x &y) | (~x & z))
#define f2(x,y,z)	(x ^ y ^ z)
#define f3(x,y,z)	((x & y) | (x & z) | (y & z))
#define f4(x,y,z)	(x ^ y ^ z)

/* SHA constants */
#define CONST1		0x5a827999L
#define CONST2		0x6ed9eba1L
#define CONST3		0x8f1bbcdcL
#define CONST4		0xca62c1d6L

/* 32-bit rotate */
#define ROT32(x,n)	((x << n) | (x >> (32 - n)))

#define FUNC(n,i)						\
    temp = ROT32(A,5) + f##n(B,C,D) + E + W[i] + CONST##n;	\
    E = D; D = C; C = ROT32(B,30); B = A; A = temp

/* do SHA transformation */
int sha_transform(LONG digest[5], LONG data[16], LONG W[80]) {
    int i;
    LONG temp, A, B, C, D, E;
    for (i = 0; i < 16; ++i) {
	W[i] = data[i];
    }
    for (i = 16; i < 80; ++i) {
	W[i] = W[i-3] ^ W[i-8] ^ W[i-14] ^ W[i-16];
	W[i] = ROT32(W[i], 1);
    }
    A = digest[0]; B = digest[1];
    C = digest[2]; D = digest[3]; E = digest[4];
    
    for (i = 0; i < 20; ++i)  { FUNC(1,i); }
    for (i = 20; i < 40; ++i) { FUNC(2,i); }
    for (i = 40; i < 60; ++i) { FUNC(3,i); }
    for (i = 60; i < 80; ++i) { FUNC(4,i); }

    digest[0] += A; digest[1] += B;
    digest[2] += C; digest[3] += D; digest[4] += E;
    return 0;
} 

How to implement GCD

Greatest common divisor (GCD) of two non-zero integers, is the largest positive integer that divides both numbers without remainder. Greatest common divisors can in principle be computed by determining the prime factorizations of the two numbers and comparing factors. This program works on 32 bit numbers only but can be extended to 64 bits.
Set Pipeline=NO before synthesizing this program.

 unsigned int gcd(unsigned int u, unsigned int v)
{
    unsigned int shift;
    /* GCD(0,x) := x */
    if (u == 0 || v == 0)
        return u | v;
    /* Let shift := lg K, where K is the greatest power of 2
     *        dividing both u and v. */
    for (shift = 0; ((u | v) & 1) == 0; ++shift) {
        u >>= 1;
        v >>= 1;
    }   
    while ((u & 1) == 0)
        u >>= 1;
    /* From here on, u is always odd. */
    do {
        while ((v & 1) == 0)  /* Loop X */
            v >>= 1;
        /* Now u and v are both odd, so diff(u, v) is even.
         *            Let u = min(u, v), v = diff(u, v)/2. */
        if (u <= v) {
            v -= u;
        } else {
            unsigned int diff = u - v;
            u = v;
            v = diff;
        }
        v >>= 1;
    } while (v != 0); 
    return u << shift;
}