<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>jonisalonen.com</title>
	<atom:link href="http://jonisalonen.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://jonisalonen.com</link>
	<description>Articles on computing, mathematics, and anything in between</description>
	<lastBuildDate>Wed, 22 May 2013 12:21:45 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Sorting a linked list</title>
		<link>http://jonisalonen.com/2013/sorting-a-linked-list/</link>
		<comments>http://jonisalonen.com/2013/sorting-a-linked-list/#comments</comments>
		<pubDate>Tue, 21 May 2013 23:31:13 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=717</guid>
		<description><![CDATA[For God&#8217;s sake, don&#8217;t try sorting a linked list. &#8211; Steve Yegge Why would it be a bad idea to sort a linked list? A few things come to mind: To locate the kth item you have to traverse the k-1 items that precede it. This makes the formulation of many sorting algorithms unusable for [...]]]></description>
				<content:encoded><![CDATA[<blockquote><p>For God&#8217;s sake, don&#8217;t try sorting a linked list.</p>
<p style="text-align: right;"><a href="http://steve-yegge.blogspot.com.es/2008/03/get-that-job-at-google.html">&#8211; Steve Yegge</a></p>
</blockquote>
<p>Why would it be a bad idea to sort a linked list? A few things come to mind:</p>
<ol>
<li>To locate the kth item you have to traverse the k-1 items that precede it. This makes the usual formulation of many sorting algorithms unusable for linked lists: the basic &#8220;<code>swap</code>&#8221; operation no longer takes constant time.</li>
<li>Linked lists are inefficient in general: going over the items in a list means jumping around memory at random. This is bad for the CPU caches, and it might be faster to copy the linked list into an array and then sort the array.</li>
</ol>
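<p>To make the second point concrete, here is a minimal sketch in C of the &#8220;copy into an array, sort the array&#8221; alternative. The <code>struct node</code> type and the function names are assumptions made for this illustration only; the values are copied out, sorted with the standard <code>qsort</code>, and written back in list order.</p>

```c
#include <stdlib.h>

/* Hypothetical node type for this sketch: a singly linked list of ints. */
struct node {
    int value;
    struct node *next;
};

/* qsort comparator for ints; avoids the overflow of "return x - y". */
static int compare_ints(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Sorts the values of the list in place by going through an array.
   Returns 0 on success and -1 if the temporary array can't be allocated. */
int sort_list_via_array(struct node *head)
{
    size_t n = 0, i = 0;
    struct node *p;
    int *values;

    for (p = head; p != NULL; p = p->next) n++;      /* pass 1: count */
    if (n < 2) return 0;                             /* already sorted */

    values = malloc(n * sizeof *values);
    if (values == NULL) return -1;

    for (p = head; p != NULL; p = p->next)           /* pass 2: copy out */
        values[i++] = p->value;

    qsort(values, n, sizeof *values, compare_ints);  /* sort contiguous memory */

    i = 0;
    for (p = head; p != NULL; p = p->next)           /* pass 3: copy back */
        p->value = values[i++];

    free(values);
    return 0;
}
```

<p>This traverses the links only three times, at the cost of O(<em>n</em>) extra memory.</p>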
<p>But let&#8217;s not be discouraged: in defiance of Steve, let&#8217;s sort a linked list, and do it efficiently, with minimal cache misses!</p>
<h3>The Algorithm</h3>
<p>What we can do is use a modified merge sort algorithm. The basic idea behind merge sort is:</p>
<ol>
<li>If the input list is empty or has exactly one element, return because it&#8217;s already sorted.</li>
<li>Otherwise, break the list into two roughly equal sized parts.</li>
<li>Sort the two parts recursively using merge sort.</li>
<li>Merge the two sorted lists into one sorted list.</li>
</ol>
<p>Since we don&#8217;t know how long the list is, breaking the list into two equal parts is not trivial. It can be done in a single pass over the items, but that means traversing many links, each of which is a potential cache miss. Instead we&#8217;ll use the following variation of the algorithm: first sort the first 2 items of the list with the <code>merge</code> operation, then sort the next 2 and merge the two pairs so that the first 4 items are sorted, then sort the next 4 and merge so that the first 8 items are sorted, and so on until the end. In pseudocode:</p>
<ol>
<li>k := 1, p := head[list].next</li>
<li>while p ≠ nil:</li>
<li>    Sort k items of the list that starts at p</li>
<li>    Merge two lists of size k that start at head[list] and at p</li>
<li>    k := 2k, p := next item of the last node that was merged</li>
</ol>
<h3>The Implementation</h3>
<p>The above pseudocode looks simple, but can be tricky to implement. For example, you have to remember that sorting a list can move the node that used to be its head into the middle of the list, so the sort operation has to return the new head of the list somehow. This may seem obvious, but if you start coding without thinking things through it may come as a surprise!</p>
<p>First we&#8217;ll need some representation for linked list nodes. A class that holds the item and a pointer to the next node does nicely:</p>
<pre>    static class Node&lt;T&gt; {
        T value;
        Node&lt;T&gt; next;
    }</pre>
<p>The <code>merge</code> operation as we use it has to return both the first node of the merged list, and the node past the end of the merged list. Let&#8217;s call this construct a <em>bracket</em>:</p>
<pre>    private static class Bracket&lt;T&gt; {
        Node&lt;T&gt; top;
        Node&lt;T&gt; bot;
    }</pre>
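<p>Now let&#8217;s define the topmost sort method, following the pseudocode. It returns the new head of the list:</p>
<pre>    // Sorts the list and returns its new head (null for an empty list).
    public static &lt;T&gt; Node&lt;T&gt; sort(Comparator&lt;T&gt; cmp, 
                                   Node&lt;T&gt; head) {
        if (head == null) return null;
        Bracket&lt;T&gt; bracket = new Bracket&lt;T&gt;();
        int k = 1;               // the first k items are already sorted
        Node&lt;T&gt; p = head.next; // first item of the unsorted rest
        while (p != null) {
            bracket.top = p;
            sort(cmp, k, bracket);  // sort the next k items
            merge(cmp, head, k, bracket.top, k, bracket);
            head = bracket.top; p = bracket.bot;
            k = 2*k;
        }
        return head;
    }</pre>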
<p>What&#8217;s left is the method that sorts the first <code>k</code> items in a linked list with a recursive merge sort, and the merge operation that actually does the work of putting items in order. Both fill in the bracket with the head of the sorted list and the item just past its end. No surprises here, pretty standard stuff. I&#8217;ll use whitespace creatively to save space though, so you don&#8217;t have to scroll so much to skip ahead:</p>
<pre>private static &lt;T&gt; void sort(Comparator&lt;T&gt; cmp, int k,
                             Bracket&lt;T&gt; bracket) {
    if (k &lt; 2) { bracket.bot = bracket.top.next; return; }
    sort(cmp, k/2, bracket);
    if (bracket.bot != null) {
        Node&lt;T&gt; top = bracket.top;
        bracket.top = bracket.bot;
        sort(cmp, k/2, bracket);
        Node&lt;T&gt; bot = bracket.top;
        merge(cmp, top, k/2, bot, k/2, bracket);
    }
}

private static &lt;T&gt; void merge(Comparator&lt;T&gt; comparator, 
                              Node&lt;T&gt; top, int ctop, 
                              Node&lt;T&gt; bot, int cbot,
                              Bracket&lt;T&gt; bracket) {
    Node&lt;T&gt; head=null, tail=null, next;
    while (ctop + cbot &gt; 0) {
        if (cbot == 0)      { next = top; top = top.next; ctop--; }
        else if (ctop == 0) { next = bot; bot = bot.next; cbot--; }
        else {
            int cmp = comparator.compare(top.value, bot.value);
            if (cmp &gt; 0) { next = bot; bot = bot.next; cbot--; }
            else         { next = top; top = top.next; ctop--; }
        }
        if (head == null) head = next;
        else tail.next = next;
        tail = next;
        if (bot == null) cbot=0;
    }
    tail.next = bot;

    bracket.top=head;
    bracket.bot=tail.next;
}</pre>
<h3>The Analysis</h3>
<p>Like any serious comparison-based sorting algorithm, this one makes O(<em>n</em> log <em>n</em>) comparisons between list items when sorting a list of length <em>n</em>. <a href="http://www.chiark.greenend.org.uk/~sgtatham/algorithms/listsort.html">Merge sort can be implemented in constant space</a> as shown by Simon Tatham, but this implementation will end up using O(log <em>n</em>) space due to the use of the stack for recursive calls. How about our goal of minimizing cache misses?</p>
<p>List items are being traversed only in the merge method, so that&#8217;s the only place where cache misses can be a problem. (Let&#8217;s assume for simplicity that the local variables can fit in registers, or never leave the cache). The algorithm will make:</p>
<ul>
<li><em>n</em>/2 merges of size 2, for a total of <em>n</em> potential cache misses</li>
<li><em>n</em>/4 merges of size 4, for a total of <em>n</em> potential cache misses</li>
<li>&#8230;</li>
<li>1 merge of size 2<sup>log<sub>2</sub> <em>n</em></sup>, for a total of <em>n</em> potential cache misses</li>
</ul>
<p>As there are log <em>n</em> levels of merges, in total there are <em>n</em> log <em>n</em> potential cache misses. Every merge sort on linked lists has to do this same number of merges (with the exception of natural merge sort perhaps), in addition to any other list traversal they may do, so in this sense the algorithm presented here is optimal for CPU cache usage.</p>
<p>Whether this algorithm is measurably better than the many alternatives (see for example <a href="http://www.chiark.greenend.org.uk/~sgtatham/algorithms/listsort.html">Tatham</a>, <a href="http://www.geeksforgeeks.org/merge-sort-for-linked-list/">GeeksForGeeks</a>) remains to be seen.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2013/sorting-a-linked-list/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Counting combinations modulo power of 2</title>
		<link>http://jonisalonen.com/2013/calculating-combinations/</link>
		<comments>http://jonisalonen.com/2013/calculating-combinations/#comments</comments>
		<pubDate>Wed, 20 Mar 2013 23:24:38 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=627</guid>
		<description><![CDATA[It is pretty simple to calculate the number of ways to choose k items out of n: There are n ways to choose the first item, n-1 ways to choose the second item, and so on, until there are (n-k+1) ways to choose the kth item. By the rule of product there are n&#160;(n-1)···(n-k+1) ways [...]]]></description>
				<content:encoded><![CDATA[<p><!-- MathJax --><br />
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script></p>
<p>It is pretty simple to calculate the number of ways to choose k items out of n: There are n ways to choose the first item, n-1 ways to choose the second item, and so on, until there are (n-k+1) ways to choose the kth item. By the <a href="http://en.wikipedia.org/wiki/Rule_of_product">rule of product</a> there are n&nbsp;(n-1)···(n-k+1) ways to choose the k items in total.</p>
<p>But hold on, this process considers picking first A and then B distinct from picking first B and then A. If you don&#8217;t care about the order in which the items are picked you have to remove the duplicates by dividing by the number of ways there are to order k items: this is k! = k·(k-1)···3·2·1. So, the number of ways you can choose k items out of n is:</p>
<p>\[<br />
\binom{n}{k} = \frac{n(n-1) \cdots (n-k+1)}{k(k-1)\cdots 3 \cdot 2 \cdot 1} = \frac{n!}{k!(n-k)!}<br />
\]</p>
<p>Of course, implementing this formula on a computer as it is is not a very good idea. If you use integers, the multiplications rapidly overflow fixed sized integers, and then the division doesn&#8217;t return the correct value because integer <a href="http://jonisalonen.com/2013/mathematical-foundations-of-computer-integers/" title="Mathematics of computer integers">division is not properly defined</a> modulo a power of two. If you use floating point types you&#8217;ll have a different set of problems. So let&#8217;s devise an algorithm that calculates \(\binom{n}{k}\) correctly whenever it fits into a computer word, or in general, modulo a power of two.</p>
<p>We&#8217;ll replace the division with multiplication by the inverse modulo 2<sup>m</sup>. Here it is important to keep in mind that even numbers don&#8217;t have a multiplicative inverse, so we remove any powers of 2 from the numerator and denominator and keep track of the exponent of 2 separately. Having done this, we can safely accumulate the denominator and numerator through multiplication and calculate only one modular inverse, at the end.</p>
<p>Without further ado, here&#8217;s a C program that implements the algorithm for 32 bit <code>int</code>s.</p>
<pre>
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

unsigned int inv(unsigned int n);   /* multiplicative inverse mod 2^32 */
int ord2(int n);                    /* exponent of 2 in n */

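/* C(n,k) mod 2^32, assuming 0 <= k <= n */
unsigned int binom(int n, int k)
{
    unsigned int num=1, den=1;   /* odd parts of numerator and denominator */
    int ord=0, aux, i;           /* ord: exponent of 2 in the result */
    if (n-k < k) k = n-k;
    for (i = 1; i <= k; i++) {
        aux = ord2(i);           /* strip the powers of 2 from i */
        ord -= aux;
        den = den*(i>>aux);

        aux = ord2(n+1-i);       /* and from n+1-i */
        ord += aux;
        num = num*((n+1-i) >> aux);
    }
    return num*inv(den) << ord;  /* den is odd, so it has an inverse */
}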

int main(int argc, char *argv[])
{
    int a=3, b=1;
    if (argc > 2) {
        a = atoi(argv[1]);
        b = atoi(argv[2]);
        printf("C(%d,%d) = %u (mod 2^32)\n", a, b, binom(a,b));
    } else  {
        printf("%s a b - calculate C(a,b) mod 2^32", argv[0]);
    }
    return 0;
}

unsigned int inv(unsigned int n) {
    unsigned int x = 1; /* n is odd, so 1*n = 1 (mod 2) */
    x = x*(2-x*n); /* Newton's method: each step doubles the correct bits */
    x = x*(2-x*n); /* 5 steps enough for 32 bits */
    x = x*(2-x*n);
    x = x*(2-x*n);
    x = x*(2-x*n);
    return x;
}

int ord2(int n) {
    int i, e;
    i = n &#038; -n; /* isolate rightmost 1-bit */
    for (e=0; i > 1; e++) i >>= 1;
    return e;
}
</pre>
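<p>One way to sanity-check a program like the one above is to compare it against a slower method that needs no division at all. The following sketch uses Pascal&#8217;s rule, C(n,k) = C(n-1,k) + C(n-1,k-1), with additions that wrap around modulo 2<sup>32</sup>. The function name <code>binom_pascal</code> is made up for this example, and <code>uint32_t</code> is used so the word size does not depend on the compiler.</p>

```c
#include <stdint.h>
#include <stdlib.h>

/* Computes C(n,k) mod 2^32 using only additions (Pascal's rule).
   Returns 0 when k is out of range or when allocation fails. */
uint32_t binom_pascal(int n, int k)
{
    uint32_t *row, result;
    int i, j;

    if (k < 0 || k > n) return 0;
    row = calloc((size_t)k + 1, sizeof *row);
    if (row == NULL) return 0;

    row[0] = 1;                          /* row 0 of the triangle: C(0,0) = 1 */
    for (i = 1; i <= n; i++) {
        /* Turn row i-1 into row i in place, going from right to left
           so that row[j-1] still holds the value from row i-1. */
        for (j = (i < k ? i : k); j >= 1; j--)
            row[j] += row[j - 1];        /* unsigned addition wraps mod 2^32 */
    }
    result = row[k];
    free(row);
    return result;
}
```

<p>This takes O(nk) additions instead of O(k) multiplications, so it is only useful as a cross-check for small arguments.</p>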
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2013/calculating-combinations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Infinite integers</title>
		<link>http://jonisalonen.com/2013/infinite-integers/</link>
		<comments>http://jonisalonen.com/2013/infinite-integers/#comments</comments>
		<pubDate>Fri, 01 Mar 2013 19:00:15 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=536</guid>
		<description><![CDATA[Lately we&#8217;ve discussed how integers work in computers when their size is limited to a fixed number of bits. What if there is no such limitation and we can use as many bits as we want? As computers get more and more powerful we&#8217;d like to derive this case by looking at what happens when [...]]]></description>
				<content:encoded><![CDATA[<p><!-- MathJax --><br />
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script></p>
<p>Lately we&#8217;ve discussed how integers work in computers when their size is limited to a fixed number of bits. What if there is no such limitation and we can use as many bits as we want?</p>
<p>As computers get more and more powerful we&#8217;d like to derive this case by looking at what happens when we add more and more bits to the binary representation. For example, if you use 2 bits, -1 = 11<sub>2</sub> because 1+11<sub>2</sub> = 100<sub>2</sub>, and 100<sub>2</sub> ≡ 0 (mod 2²). When you move to 3 bits and more you start to see a pattern emerge:</p>
<pre>
m=3:  -1 =    111 because    111 + 1 =    1000 ≡ 0 (mod 2³) 
m=4:  -1 =   1111 because   1111 + 1 =   10000 ≡ 0 (mod 2⁴) 
m=5:  -1 =  11111 because  11111 + 1 =  100000 ≡ 0 (mod 2⁵)
m=6:  -1 = 111111 because 111111 + 1 = 1000000 ≡ 0 (mod 2⁶)
</pre>
<p>So, if we have an infinite number of bits, would -1 be &#8230;111111<sub>2</sub>? Can a number that extends infinitely to the left, <em>an infinitely large number</em>, have any meaning?</p>
<p>It turns out that, given the right definitions, it can.</p>
<h3>Introduction</h3>
<p>Recall that <a href="http://jonisalonen.com/2013/mathematical-foundations-of-computer-integers/" title="Mathematics of computer integers">m-bit integers are integers modulo 2<sup>m</sup></a>. This means that two m-bit integers are equal if their difference is a multiple of 2<sup>m</sup>; for example -1 is 2<sup>m</sup>-1. When we use m+1 bits this changes: the numbers are equal if their difference is a multiple of 2<sup>m+1</sup>. To make the jump from m bits to m+1 bits we define a new concept of distance that makes two numbers &#8220;close&#8221; if their difference is divisible by a high power of two: the greater the power the closer they are. In fact, if 2<sup>k</sup> is the highest power of two that divides the difference, we make the distance 2<sup>-k</sup>. This distance makes m-bit numbers approximations to m+1 bit numbers: 1111<sub>2</sub> is a good approximation to -1 because their difference is 16 = 2<sup>4</sup>, so their distance is 2<sup>-4</sup> = 1/16. Adding one bit, 11111<sub>2</sub> is even better: its distance to -1 is 2<sup>-5</sup> = 1/32, twice as close.</p>
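<p>The distance described here is easy to compute for ordinary machine integers. The sketch below returns the exponent k, so that the distance is 2<sup>-k</sup>; the function name is an assumption of this example, and it works on 64-bit values only.</p>

```c
#include <stdint.h>

/* Returns k such that 2^k is the highest power of two dividing a - b,
   which makes the 2-adic distance between a and b equal to 2^(-k).
   Returns -1 when a == b (distance zero). */
int two_adic_closeness(int64_t a, int64_t b)
{
    uint64_t d = (uint64_t)a - (uint64_t)b;  /* difference, wraps mod 2^64 */
    int k = 0;

    if (d == 0) return -1;
    while ((d & 1) == 0) {                   /* count trailing zero bits */
        d >>= 1;
        k++;
    }
    return k;
}
```

<p>For instance, <code>two_adic_closeness(15, -1)</code> looks at the difference 16 and <code>two_adic_closeness(31, -1)</code> at the difference 32, so the second pair is one power of two closer.</p>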
<p>It turns out that this way of defining a distance between two numbers, though it sounds strange at first, fulfills all the <a href="http://en.wikipedia.org/wiki/Metric_%28mathematics%29">usual requirements</a> we ask of distance functions (metrics in mathematics). Having a metric means it becomes meaningful to talk about an extension of integers where numbers like &#8230;111111<sub>2</sub> arise as limits of sequences such as (1<sub>2</sub>,11<sub>2</sub>,111<sub>2</sub>,1111<sub>2</sub>,&#8230;)<a href="#footnote1"><sup>(*)</sup></a>. We call this extended number set the <strong>2-adic integers</strong><a href="#footnote2"><sup>(*)</sup></a>.</p>
<h3>Arithmetic</h3>
<p>Arithmetic with these &#8220;infinite&#8221; numbers works pretty much like you would expect. You add and multiply finite integers just like before. To add infinite integers you use the exact same methods. Since there&#8217;s an infinite number of digits you can never finish, but you can calculate the result up to any number of digits you like. If this seems objectionable, consider a more familiar case: if you calculate 1/3 in decimal numbers you get 0.333333&#8230;: you never finish, but you can calculate as many digits as you like.</p>
<p>For reference, calculating the sum of two 2-adic integers is simple (the digit sums a<sub>k</sub>+b<sub>k</sub> that exceed 1 carry over to the next power of two):<br />
\[<br />
\sum_{k=0}^\infty a_k 2^k + \sum_{k=0}^\infty b_k 2^k = \sum_{k=0}^\infty (a_k+b_k) 2^k<br />
\]</p>
<p>Calculating the product is slightly more complicated since you have to collect the powers of two:<br />
\[<br />
\left(\sum_{i=0}^\infty a_i 2^i \right)\left(\sum_{i=0}^\infty b_i 2^i\right)<br />
= \sum_{k=0}^\infty \left(\sum_{i+j=k} a_i b_j \right) 2^k<br />
\]</p>
<h3>Example</h3>
<p>To demonstrate, let&#8217;s calculate the sum of &#8230;111101<sub>2</sub> and 3 = 11<sub>2</sub>, and the product of &#8230;0101011<sub>2</sub> and 3 = 11<sub>2</sub>:</p>
<pre>
  ...111111111111111111111101           ...10101010101010101011
+ ...000000000000000000000011         × ...00000000000000000011
= ...000000000000000000000000         = ...10101010101010101011 
= 0                                   + ...01010101010101010110
                                      = ...00000000000000000001 = 1
</pre>
<p>It turns out that the sum is 0, so &#8230;111101<sub>2</sub> = -3. Also, the product is 1, so in a sense &#8230;0101011<sub>2</sub> = 1/3.</p>
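<p>Since an m-bit integer is just the lowest m digits of a 2-adic integer, these two results can be checked on a real machine. The sketch below does so for 32 bits and fewer; the constant and function names are made up for this example.</p>

```c
#include <stdint.h>

/* The low 32 bits of ...111101 (the 2-adic integer -3). */
const uint32_t MINUS_THREE = 0xFFFFFFFDu;
/* The low 32 bits of ...0101011 (the 2-adic integer 1/3). */
const uint32_t ONE_THIRD = 0xAAAAAAABu;

/* Addition and multiplication modulo 2^32. */
uint32_t add32(uint32_t a, uint32_t b) { return a + b; }
uint32_t mul32(uint32_t a, uint32_t b) { return a * b; }

/* Keeps only the lowest m bits of x (for 0 <= m <= 32), that is,
   the m-bit approximation of the number. */
uint32_t low_bits(uint32_t x, int m)
{
    return m >= 32 ? x : x & ((UINT32_C(1) << m) - 1);
}
```

<p>With these definitions <code>add32(MINUS_THREE, 3)</code> is 0 and <code>mul32(ONE_THIRD, 3)</code> is 1. The same holds for shorter approximations: the low 8 bits of <code>ONE_THIRD</code> are 171, and 171 × 3 = 513, whose low 8 bits are 1.</p>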
<h3>Representation of negative integers</h3>
<p>In the example we found a way to represent -3 using binary digits. Let&#8217;s see if we can generalize this.</p>
<p>Suppose x is a positive integer and we want to find a representation for -x. One way to approach this problem is using the identity &#8230;111111<sub>2</sub> = -1 from our introduction: -x = -1-(x-1) = (&#8230;111111<sub>2</sub>-x)+1. The subtraction is easy to calculate because all the bits in &#8230;111111<sub>2</sub> are 1: there is never a need to borrow.</p>
<p>In fact, you can see &#8230;111111-x as the bitwise complement operation &#8220;~x&#8221; which returns x with all of its bits flipped: where x has a 1, &#8230;111111-x has 1-1 = 0, and where x has a 0, &#8230;111111-x has 1-0 = 1. Therefore the <a href="http://jonisalonen.com/2013/why-we-use-2s-complement/" title="Why We Use 2′s Complement">2&#8242;s complement rule</a> extends to integers in general: again we have -x = ~x+1.</p>
<p>For example, 1234 = 10011010010<sub>2</sub>, so -1234 is &#8230;111101100101110<sub>2</sub>:</p>
<pre>
-1234 =  ...111111111111111                    To verify:
        -...000010011010010 + 1                  ...111101100101110
      =  ...111101100101101 + 1                +        10011010010
      =  ...111101100101110                    =                  0
</pre>
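<p>The rule -x = ~x+1 does not care how many digits there are, so it also works on numbers that span several machine words. Here is a minimal sketch, assuming the number is stored as an array of 32-bit &#8220;limbs&#8221; with the least significant limb first; this layout and the function name are choices made for this example.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Negates, in place, the number stored in limbs[0..n-1]
   (least significant limb first) using -x = ~x + 1. */
void negate_limbs(uint32_t *limbs, size_t n)
{
    uint32_t carry = 1;                /* the "+ 1" */
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t flipped = ~limbs[i];  /* flip all bits of this limb */
        limbs[i] = flipped + carry;
        /* The addition wrapped around exactly when the sum is smaller
           than what we started with: carry into the next limb. */
        carry = (limbs[i] < flipped);
    }
}
```

<p>Negating the two-limb number 1234 this way gives the limbs 4294966062 and 4294967295, which are the lowest 64 digits of &#8230;111101100101110<sub>2</sub>.</p>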
<h3>Practical applications</h3>
<p>Suppose you are to create a program that requires arbitrarily large integers, and for whatever reason don&#8217;t want to use an existing &#8220;bigint&#8221; implementation, such as <code>BigInteger</code> in Java or <abbr>.Net</abbr>. Like the people that design computer hardware, you have to decide how to represent signed integers. I feel that the 2-adic representation is optimal for this purpose because it extends the well known 2&#8242;s complement method in a natural way, and because approaching a problem from a larger perspective often permits you to use more powerful tools, even if the tools make no sense in the problem domain.</p>
<p>This is a common pattern in mathematics: to solve a problem you may have to express it in a more general setting (for example, <a href="http://www.johndcook.com/blog/2013/02/12/generalized-fourier-transforms/">generalized Fourier transforms</a>), and then you <a href="http://www.johndcook.com/blog/2011/01/25/coming-full-circle/">come full circle</a> when applying the results in the original setting.</p>
<p>One example of this is the calculation of multiplicative inverses. Remember how in m-bit integers every odd number has a multiplicative inverse: for each odd integer x there is an integer x<sup>-1</sup> that fulfills x<sup>-1</sup>x = 1. Working with m bits, that is, in integers modulo 2<sup>m</sup>, the inverse can be calculated using the <a href="http://en.wikipedia.org/wiki/Extended_Euclidean_algorithm">extended Euclidean algorithm</a>, but if we consider the larger setting of 2-adic integers, a result called Hensel&#8217;s Lemma implies the inverse can be calculated using Newton&#8217;s method much faster than the Euclidean algorithm.</p>
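<p>As a sketch of what that looks like in practice for m = 32: if x is an approximate inverse of n, so that nx = 1 + e with e divisible by 2<sup>k</sup>, then x(2 - nx) is a better one, because n·x(2 - nx) = (1 + e)(1 - e) = 1 - e<sup>2</sup> and e<sup>2</sup> is divisible by 2<sup>2k</sup>. The function name below is an assumption of this example, as is the choice of starting value.</p>

```c
#include <stdint.h>

/* Multiplicative inverse of an odd n modulo 2^32, by Newton's iteration
   x <- x * (2 - n*x). Each step doubles the number of correct low bits.
   The result is meaningless for even n, which has no inverse. */
uint32_t inverse_mod_2_32(uint32_t n)
{
    uint32_t x = n;    /* n*n = 1 (mod 8) for odd n: 3 bits correct */
    x *= 2 - n * x;    /* 6 bits */
    x *= 2 - n * x;    /* 12 bits */
    x *= 2 - n * x;    /* 24 bits */
    x *= 2 - n * x;    /* 48 bits, more than the 32 we need */
    return x;
}
```

<p>Four multiplication steps are enough here, whereas the Euclidean algorithm needs a number of division steps that grows with the size of n.</p>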
<h4>Footnotes</h4>
<ol>
<li id="footnote1">This is the same process we use to <a href="http://en.wikipedia.org/wiki/Construction_of_the_real_numbers#Construction_from_Cauchy_sequences">define the real numbers</a> using sequences of rational numbers, except that in this case the numbers we construct extend infinitely to the left rather than to the right.</li>
<li id="footnote2">You can do this derivation in any base p, giving the <a href="http://en.wikipedia.org/wiki/P-adic_number"><em>p-adic integers</em></a>. For everything to work correctly p has to be a prime number.</li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2013/infinite-integers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Why We Use 2&#8242;s Complement</title>
		<link>http://jonisalonen.com/2013/why-we-use-2s-complement/</link>
		<comments>http://jonisalonen.com/2013/why-we-use-2s-complement/#comments</comments>
		<pubDate>Fri, 15 Feb 2013 23:07:02 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=538</guid>
		<description><![CDATA[Previously we discussed the mathematical foundations of computer integers and found that, as far as arithmetic is concerned, we can choose to use any range of N = 2^m numbers. Why is it then that in most programming languages we are limited to only two choices of range, called &#8220;signed&#8221; and &#8220;unsigned&#8221;? The problem is [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://jonisalonen.com/2013/mathematical-foundations-of-computer-integers/" title="Mathematical foundations of computer integers">Previously</a> we discussed the mathematical foundations of computer integers and found that, as far as arithmetic is concerned, we can choose to use any range of N = 2<sup>m</sup> numbers. Why is it then that in most programming languages we are limited to only two choices of range, called &#8220;signed&#8221; and &#8220;unsigned&#8221;?</p>
<p>The problem is that often we need operations that can&#8217;t be well defined on general rings. For example, we want to compare two numbers and see which is &#8220;greater.&#8221; In particular we&#8217;re interested in knowing if a given integer is greater or less than zero. These operations have to be easy to implement in hardware if they are to be efficient, so the simpler the better.</p>
<p>Recall that small negative integers are represented by really large integers. Using m bits for integers, the numbers from 2<sup>m-1</sup> upwards have their highest bit set to one. This gives us the idea of identifying the sign with this bit. Now the range of signed integers becomes [-N/2, N/2-1]: half are negative, half nonnegative. To check if a number is negative you just have to look at its most significant bit. If you chose to use a different range, like [-100, N-101], you would need to investigate much more than a single bit to decide if a number is less than zero.</p>
<p>As you can read on <a href="http://en.wikipedia.org/wiki/Signed_number_representations">Wikipedia</a> there are many ways to represent signed integers on computers, differentiated by how one obtains the representation of -x from x. Let&#8217;s see how one goes about changing the sign of a number in our system.</p>
<p>Let&#8217;s denote the number obtained by flipping all bits of x by ~x: where x has a zero, ~x has a one, and where x has a one, ~x has a zero. This means that x + ~x will consist of all ones. But the number whose binary representation consists of all ones is 2<sup>m</sup>-1, so <code>x + ~x = 2<sup>m</sup>-1</code>. From here we obtain <code>-x = ~x-2<sup>m</sup>+1</code>. Since we&#8217;re working mod 2<sup>m</sup>, we have <code>-x = ~x+1</code>. So our derivation of signed integers coincides with the common 2&#8242;s complement representation.</p>
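<p>Both the negation rule and the sign test can be tried out on unsigned 32-bit integers, where all arithmetic is guaranteed to be modulo 2<sup>32</sup>. This is only a sketch: the function names are invented for this example, and the last function illustrates the hypothetical range [-100, N-101] mentioned earlier.</p>

```c
#include <stdint.h>

/* Negation as derived above: flip all bits, then add one. */
uint32_t negate32(uint32_t x) { return ~x + 1u; }

/* In the range [-N/2, N/2-1] the sign is just the top bit. */
int is_negative32(uint32_t x) { return (x >> 31) != 0; }

/* With a range like [-100, N-101] the test needs a full comparison:
   the bit patterns N-100 .. N-1 stand for -100 .. -1. */
int is_negative_custom(uint32_t x) { return x >= UINT32_MAX - 99u; }
```

<p>The first test needs a single wire from the top bit; the second needs a comparator that looks at all 32 bits.</p>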
<p>By the logic presented here 2&#8242;s complement is the only sane way of representing signed integers. All other methods break in some way: 1&#8242;s complement for example has two distinct representations for 0, and requires separate circuits for addition and subtraction. Why would anyone bother with any other way of representing numbers?</p>
<p>The <a href="http://en.wikipedia.org/wiki/Signed_number_representations">Wikipedia article</a> seems to suggest the reason is usability: The early programmers had to work with machine code and binary memory dumps all day, and seeing negative numbers as sign-magnitude, or at least as 1&#8242;s complement, would have made their jobs a little bit easier.</p>
<p>In the end the 2&#8242;s complement system won because it was simpler and cheaper to create in hardware. On the other hand usability became a non-issue thanks to programming tools getting better.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2013/why-we-use-2s-complement/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Mathematics of computer integers</title>
		<link>http://jonisalonen.com/2013/mathematical-foundations-of-computer-integers/</link>
		<comments>http://jonisalonen.com/2013/mathematical-foundations-of-computer-integers/#comments</comments>
		<pubDate>Tue, 12 Feb 2013 23:47:52 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=464</guid>
		<description><![CDATA[Integers have typically a fixed size in computers, for example 32 or 64 bits. Since there are no such limits on the size of integers in mathematics, the question arises: how do computer integers work, with respect to mathematical theory? The main problem with having a finite number of bits is that we have a [...]]]></description>
				<content:encoded><![CDATA[<p>Integers have typically a fixed size in computers, for example 32 or 64 bits. Since there are no such limits on the size of integers in mathematics, the question arises: how do computer integers work, with respect to mathematical theory?</p>
<p>The main problem with having a finite number of bits is that we have a finite number of integers, which means that if you keep adding 1s the sequence you get 1,2,3,4,&#8230; at some point starts to repeat. Still we&#8217;d like to define addition and multiplication so that all the familiar rules still hold: we expect that for every a, b, c we have a+(b+c)=(a+b)+c, a*(b*c)=(a*b)*c, a*(b+c)=a*b+a*c, and so on. Mathematical objects that do this are called <a href="http://en.wikipedia.org/wiki/Ring_(mathematics)">rings</a>.</p>
<p>A useful way to define a finite ring of N elements is to  take the integers modulo N. Formally this works by defining an equivalence relation for integers, where <i>a</i> and <i>b</i> are equivalent when their difference is a multiple of N:</p>
<p><i>a</i> ~ <i>b</i> if and only if <i>a</i> &#8211; <i>b</i> ≡ 0 (mod N).</p>
<p>For example, for N = 12, 7 ~ 19 because 7-19 = -12 ≡ 0 (mod 12).</p>
<p>It turns out that this equivalence partitions ℤ into N subsets: those that are multiples of <i>N</i>, those that are a multiple of <i>N</i> plus 1, and so on. These subsets, also called <em>congruence classes</em>, will form the elements of the ring we&#8217;re defining. Now we have to define + and * on them in a meaningful way.</p>
<p>Let&#8217;s denote the congruence class of <i>a</i> by [<i>a</i>]. We have:</p>
<p>[<i>a</i>] = {<i>b</i> : <i>a</i> ~ <i>b</i>} = {<i>a</i> + <i>k</i>N : <i>k</i> ∈ ℤ}. Note that [<i>a</i>] = [<i>a</i> + <i>k</i>N] for every integer <i>k</i>.</p>
<p>For example, for N = 12, [-5] = [7] = [19] = {…, -5, 7, 19, 31, …}.</p>
<p>It turns out we can define addition and multiplication on these classes based on the arithmetic of integers, and everything checks out:</p>
<p>[<i>a</i>] + [<i>b</i>] = [<i>a</i> + <i>b</i>], and [<i>a</i>]*[<i>b</i>] = [<i>ab</i>]. Since [<i>a</i>]=[<i>a</i>+N], these operations &#8220;wrap around&#8221; after N elements.</p>
<p>For example, for N = 12, [7]+[6] = [13] = [1]. [5]*[8] = [40] = [4].</p>
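<p>The examples above can be reproduced with a few lines of C. This is a sketch for small moduli only (the products must fit in a <code>long</code>), and the function names are made up here. Note that the <code>%</code> operator in C may return a negative remainder, so the representative has to be moved into the range 0 to N-1 by hand.</p>

```c
/* Canonical representative of the class [a] modulo n, in 0 .. n-1.
   Assumes n > 0. C's % can return a negative remainder for negative a,
   hence the fix-up. */
long class_rep(long a, long n)
{
    long r = a % n;
    return r < 0 ? r + n : r;
}

/* [a] + [b] = [a + b] */
long add_mod(long a, long b, long n)
{
    return class_rep(class_rep(a, n) + class_rep(b, n), n);
}

/* [a] * [b] = [ab] */
long mul_mod(long a, long b, long n)
{
    return class_rep(class_rep(a, n) * class_rep(b, n), n);
}
```

<p>Because the operations are defined on classes, it makes no difference which member of a class is passed in: <code>add_mod(7, 6, 12)</code> and <code>add_mod(-5, 18, 12)</code> give the same result.</p>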
<p>To represent classes we can pick a number from each class: [<i>a</i>] is represented by an arbitrarily chosen <i>a</i>. To encode classes on a computer we pick <i>a</i> from the set {0, 1, …, N-1}, stored in whichever way is appropriate for the machine. This gives us the <strong>unsigned integers</strong>. For example with N = 2<sup>32</sup> we have [2948326455] + [1346640853] = [4294967308] = [12], so a computer working with 32-bit integers calculates 2948326455 + 1346640853 = 12. I&#8217;ll be using N = 2<sup>32</sup> in all of the following examples.</p>
<p>But what about <strong>signed integers</strong>? Since -[<i>a</i>] = [-<i>a</i>] = [N-<i>a</i>], negative integers close to 0 will be stored as large positive numbers. For example we have -[1] = [4294967295], so a computer would store -1 as 4294967295.</p>
<p>Because the arithmetic is defined on congruence classes rather than integers, we see immediately that the range of integers we represent on a computer can be chosen arbitrarily. Commonly we choose to work with numbers [0, N-1] or [-N/2, N/2-1], which gives us the unsigned and signed integers, respectively, but we might as well choose to work with any other range that suits our needs, like [-100, N-101], as long as it contains a representative of each of the N classes. As far as the ring operations of addition, subtraction and multiplication are concerned it really doesn&#8217;t matter. (Operations not generally defined on rings, like division or order, are a different matter.)</p>
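<p>To illustrate, the following sketch reads the same 32-bit pattern as a member of different ranges. The function name and its <code>lo</code> parameter (the lower end of the chosen range) are assumptions made for this example; <code>lo</code> must be small enough that <code>lo</code> + 2<sup>32</sup> fits in 64 bits.</p>

```c
#include <stdint.h>

/* Interprets the 32-bit pattern `bits` as the unique integer in
   [lo, lo + 2^32 - 1] that is congruent to it modulo 2^32. */
int64_t representative(uint32_t bits, int64_t lo)
{
    /* Converting lo to uint32_t reduces it modulo 2^32, and the
       unsigned subtraction then gives (bits - lo) mod 2^32. */
    uint32_t offset = bits - (uint32_t)lo;
    return lo + (int64_t)offset;
}
```

<p>The pattern 4294967295 reads as 4294967295 with <code>lo</code> = 0 (unsigned), and as -1 both with <code>lo</code> = -2147483648 (signed) and with <code>lo</code> = -100.</p>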
<p>The benefit of this theoretical approach is that we can immediately apply well known results from mathematics. For example, every integer in {1, 2, …, N-1} has a multiplicative inverse modulo N provided it has no common factors with N. For our computers with N = 2<sup>m</sup> this means that division by an odd integer, when the result is exact, can be performed as multiplication by another. For example, to divide by 3 you can multiply by -1431655765: -1431655765*15 = 5.</p>
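<p>The division example can be written out in C with unsigned 32-bit arithmetic, where -1431655765 is the same class as 2863311531, or <code>0xAAAAAAAB</code> in hexadecimal. The macro and function names are made up for this sketch.</p>

```c
#include <stdint.h>

/* Inverse of 3 modulo 2^32: 3 * 0xAAAAAAAB = 1 (mod 2^32).
   Read as a signed 32-bit number this is -1431655765. */
#define INV3 0xAAAAAAABu

/* Divides x by 3 using one multiplication.
   The result is correct only when 3 divides x exactly. */
uint32_t div3_exact(uint32_t x)
{
    return x * INV3;
}
```

<p>When x is not a multiple of 3 the result is still the unique class that gives x when multiplied by 3, but it is not the rounded quotient one would expect from integer division.</p>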
<p>Notice that we haven&#8217;t made any assumptions about how computers actually store numbers as bits, or about how signed integers could be encoded. We only assumed that the number of bits available is finite and that addition and multiplication should be defined in a sane way. In upcoming posts I&#8217;ll discuss <a href="http://jonisalonen.com/2013/why-we-use-2s-complement/" title="Why We Use 2′s Complement">how this approach links to the common 2&#8217;s complement representation</a> of signed integers, and <a href="http://jonisalonen.com/2013/infinite-integers/" title="Infinite integers">how the theory behind arbitrary sized integers (bignums) could be explained</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2013/mathematical-foundations-of-computer-integers/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Positive and negative zeros, and MySQL</title>
		<link>http://jonisalonen.com/2013/positive-and-negative-zeros-and-mysql/</link>
		<comments>http://jonisalonen.com/2013/positive-and-negative-zeros-and-mysql/#comments</comments>
		<pubDate>Thu, 17 Jan 2013 13:26:10 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Mathematics]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=466</guid>
		<description><![CDATA[In mathematics zero usually is considered signless; it&#8217;s neither positive nor negative. In computer floating point numbers there are two zeros: one positive, +0.0, and one negative -0.0, at least if we follow the IEEE standard. Most of the time you can&#8217;t tell the difference between the two: As far as basic arithmetics is concerned [...]]]></description>
				<content:encoded><![CDATA[<p>In mathematics zero usually is considered signless; it&#8217;s neither positive nor negative. In computer floating point numbers there are two zeros: one positive, <code>+0.0</code>, and one negative <code>-0.0</code>, at least if we follow the IEEE standard. </p>
<p>Most of the time you can&#8217;t tell the difference between the two: as far as basic arithmetic is concerned both are zero, and the equality operator has been defined so that <code>+0.0 = -0.0</code>. Pretty much the only place where you see a difference between the two is when you divide by zero: <code>1/-0.0</code> results in <em>negative</em> infinity.</p>
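<p>JavaScript numbers follow the same standard, so the behaviour is easy to observe there. The following is just an illustration; <code>float32Bits</code> is a hypothetical helper that shows the bit pattern of a single precision value.</p>

```javascript
var posZero = 0.0, negZero = -0.0;

console.log(posZero === negZero);         // true: equality ignores the sign
console.log(1 / posZero, 1 / negZero);    // Infinity -Infinity
console.log(Object.is(posZero, negZero)); // false: Object.is tells them apart

// The two zeros differ only in the sign bit of their bit pattern.
function float32Bits(x) {
    var view = new DataView(new ArrayBuffer(4));
    view.setFloat32(0, x);
    return view.getUint32(0).toString(16);
}
console.log(float32Bits(posZero), float32Bits(negZero)); // 0 80000000
```

<p>Any code that compares the stored bits instead of using the equality operator will therefore see two different values.</p>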
<p>The MySQL <a href="http://dev.mysql.com/doc/refman/5.6/en/floating-point-types.html">floating point types</a> also have positive and negative zeros, and MySQL treats them as equal when selecting data:</p>
<pre>
mysql> create table t ( f float );
mysql> insert into t values (0.1), (-0.1);
mysql> update t set f = round(f);
mysql> select f from t where f = 0;
+------+
| f    |
+------+
|    0 |
|   -0 |
+------+
</pre>
<p>However, <code>GROUP BY</code> considers them distinct!</p>
<pre>
mysql> select count(*), f from t group by f;
+----------+------+
| count(*) | f    |
+----------+------+
|        1 |    0 |
|        1 |   -0 |
+----------+------+
</pre>
<p>This is something to keep in mind if you are using MySQL floating point types for scientific computing.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2013/positive-and-negative-zeros-and-mysql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Converting decimal numbers to fractions</title>
		<link>http://jonisalonen.com/2012/converting-decimal-numbers-to-ratios/</link>
		<comments>http://jonisalonen.com/2012/converting-decimal-numbers-to-ratios/#comments</comments>
		<pubDate>Sat, 22 Dec 2012 21:22:37 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=469</guid>
		<description><![CDATA[Given that computer numbers have a finite precision, a calculation like 20/3 produces a result like 6.66667, which is slightly off. This begs the question: can we reverse the division operation, to make a routine that outputs &#8220;20/3&#8243; when it&#8217;s given the approximate value? One trick is to take a large number as the denominator. [...]]]></description>
				<content:encoded><![CDATA[<p><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><br />
Given that computer numbers have a finite precision, a calculation like 20/3 produces a result like 6.66667, which is slightly off. This raises the question: can we reverse the division operation, to make a routine that outputs &#8220;20/3&#8221; when it&#8217;s given the approximate value?</p>
<p>One trick is to take a large number as the denominator. For the numerator you multiply the floating point number by that large number and round the result. But this is suboptimal because we would get for example 66667/10000; never the expected 20/3. Can we do any better?</p>
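<p>For reference, the large-denominator trick could look like the sketch below. The function name <code>naiveFraction</code> is made up, and reducing the result with the greatest common divisor is my addition; as the first example shows, it does not rescue the approach.</p>

```javascript
// Naive conversion: fix a large denominator and round the numerator.
function naiveFraction(x, denominator) {
    var numerator = Math.round(x * denominator);
    // Reduce by the greatest common divisor (Euclid's algorithm).
    var a = Math.abs(numerator), b = denominator;
    while (b !== 0) { var t = a % b; a = b; b = t; }
    return (numerator / a) + "/" + (denominator / a);
}

console.log(naiveFraction(6.66667, 10000)); // 66667/10000
console.log(naiveFraction(0.75, 10000));    // 3/4
```

<p>The result is only pretty when the true denominator happens to divide the one that was chosen in advance.</p>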
<h4>Yes, yes we can.</h4>
<p><script src="https://gist.github.com/4569508.js">
</script></p>
<p>The best way to do this is with <a href="http://en.wikipedia.org/wiki/Continued_Fraction">continued fractions</a>, that is, fractions of the form:</p>
<p>\[a_0+\cfrac{1}{a_1+\cfrac{1}{a_2+\cfrac{1}{\ddots}}}\]</p>
<p>It turns out that continued fractions allow the construction of a sequence of rational numbers that approximate a given real number <code>x</code> really well. The first approximation is \(a_0\), the next is \(a_0+1/a_1\), the next \(a_0+\frac{1}{a_1+1/a_2}\), and so on. The process to calculate the sequence of a&#8217;s is really simple: \(a_0\) is just the integral part of x, <code>floor(x)</code>. Then we calculate the inverse of the fractional part. The integral part of the result is \(a_1\), and we can use the fractional part to repeat the process. A recurrence relation for the numerator and denominator allows us to update them in each step rather than having to calculate them using the above formula.</p>
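<p>The process and the recurrence can be sketched with exact integer arithmetic, which avoids rounding questions. This version assumes the number is already given as a ratio <code>p/q</code> of positive integers; the function names <code>terms</code> and <code>convergents</code> are made up for this illustration.</p>

```javascript
// Continued fraction terms a_0, a_1, ... of p/q, computed with Euclid's algorithm.
function terms(p, q) {
    var result = [];
    while (q !== 0) {
        var a = Math.floor(p / q);   // integral part
        result.push(a);
        var r = p - a * q;           // the fractional part is r/q, its inverse is q/r
        p = q;
        q = r;
    }
    return result;
}

// Successive approximations h/k from h_n = a_n*h_(n-1) + h_(n-2), and the same for k.
function convergents(termList) {
    var h1 = 1, h2 = 0, k1 = 0, k2 = 1;
    return termList.map(function (a) {
        var h = a * h1 + h2, k = a * k1 + k2;
        h2 = h1; h1 = h;
        k2 = k1; k1 = k;
        return h + "/" + k;
    });
}

console.log(terms(20, 3));                         // [ 6, 1, 2 ]
console.log(convergents(terms(20, 3)));            // [ '6/1', '7/1', '20/3' ]
console.log(convergents(terms(6666667, 1000000))); // [ '6/1', '7/1', '20/3', '6666667/1000000' ]
```

<p>In the last example the terms are 6, 1, 2, 333333. A very large term like the final one is a sign that the approximation just before it, 20/3, is already extremely close.</p>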
<h4>Why it works</h4>
<p>Intuitively you can think of the process of calculating the a&#8217;s as lopping off as much of the error as possible in each step while limiting ourselves to inverses of integers: the error is the fractional part left over in each step. Indeed, for rational numbers we would eventually get a fractional part of 0, meaning we have found the continued fraction representation of the number.</p>
<p>Formally, there&#8217;s a theorem that states that each approximation is nearer to the true value than any other fraction whose denominator is less than that of the approximation. This means that the approximations produced by this process are as good as they get: other ratios are either longer or less precise.</p>
<h4>Demo</h4>
<p><script type="text/javascript">
// <![CDATA[
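function float2rat(x) {
     x = parseFloat(x);            // the input field passes a string
     var tolerance = 1.e-6;
     var h1=1; var h2=0;           // numerators of the current and previous approximation
     var k1=0; var k2=1;           // denominators of the current and previous approximation
     var b = x;
     var a, aux;
     do {
         a = Math.floor(b);        // next term of the continued fraction
         aux = h1; h1 = a*h1+h2; h2 = aux;
         aux = k1; k1 = a*k1+k2; k2 = aux;
         b = 1/(b-a);              // invert the fractional part
     } while (Math.abs(x-h1/k1) > Math.abs(x)*tolerance);
    return h1+"/"+k1;
}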
// ]]&gt;
</script><label>Convert a decimal number to a fraction: <input type="text" placeholder="6.666667" onchange="alert(float2rat(this.value))" /></label></p>
<h4>Conclusions</h4>
<p>Continued fractions have many interesting properties: Euler used them to prove that <em>e</em> is irrational. The continued fractions that eventually repeat are exactly the quadratic irrationals. The numerators and denominators produced by the process are coprime. The continued fraction representation of the golden ratio \(\varphi\) is all 1&#8217;s, which means it&#8217;s the number that is worst approximated by continued fractions. In a sense this makes \(\varphi\) the most irrational number of all.</p>
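<p>As a small illustration of the golden ratio remark: when every term is 1, the recurrence for the numerators and denominators turns into the Fibonacci recurrence. The function name <code>goldenConvergents</code> is made up for this sketch.</p>

```javascript
// Approximations of [1; 1, 1, ...] from the recurrence h_n = a_n*h_(n-1) + h_(n-2).
function goldenConvergents(count) {
    var h1 = 1, h2 = 0, k1 = 0, k2 = 1, out = [];
    for (var i = 0; i < count; i++) {
        var h = h1 + h2, k = k1 + k2;   // every term a_n is 1
        h2 = h1; h1 = h;
        k2 = k1; k1 = k;
        out.push(h + "/" + k);
    }
    return out;
}

console.log(goldenConvergents(6).join(", ")); // 1/1, 2/1, 3/2, 5/3, 8/5, 13/8
```

<p>The denominators grow as slowly as they possibly can, which is another way of seeing why these approximations improve so slowly.</p>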
<p>You can find out more about continued fractions on the <a href="http://en.wikipedia.org/wiki/Continued_fraction">Wikipedia page</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/converting-decimal-numbers-to-ratios/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>From UTF-16 to UTF-8&#8230; in JavaScript</title>
		<link>http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/</link>
		<comments>http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/#comments</comments>
		<pubDate>Sun, 23 Sep 2012 13:40:34 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Unicode]]></category>
		<category><![CDATA[utf16]]></category>
		<category><![CDATA[utf8]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=428</guid>
		<description><![CDATA[JavaScript Strings are &#8220;finite ordered sequence of zero or more 16-bit unsigned integer values.&#8221; Usually these integer values are UTF-16 code units. The UTF-16 encoding uses one 16-bit unit for Unicode characters from U+0000 to U+FFFF, and two units for characters from U+10000 to U+10FFFF. Unfortunately all the usual String functions length, charAt, charCodeAt, are [...]]]></description>
				<content:encoded><![CDATA[<p>A JavaScript String is a &#8220;finite ordered sequence of zero or more 16-bit unsigned integer values.&#8221; Usually these integer values are UTF-16 code units. The UTF-16 encoding uses one 16-bit unit for Unicode characters from U+0000 to U+FFFF, and two units for characters from U+10000 to U+10FFFF. Unfortunately all the usual String members, such as <code>length</code>, <code>charAt</code> and <code>charCodeAt</code>, are defined with respect to these code units, so characters such as &#119070; (U+1D11E MUSICAL SYMBOL G CLEF) appear as a pair of surrogate characters. This little detail makes it complicated to operate on Strings.</p>
<p>This little JavaScript function encodes a string as an array of integers using UTF-8 encoding while taking surrogate pairs into account:</p>
<p><script src="https://gist.github.com/joni/3760795.js"></script></p>
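<p>In case the embedded gist does not load, here is a minimal sketch of the same idea. It is not necessarily identical to the function in the gist: the name <code>utf8Bytes</code> is made up, and replacing unpaired surrogates with U+FFFD is an assumption on my part.</p>

```javascript
// Encode a JavaScript string as an array of UTF-8 byte values.
function utf8Bytes(str) {
    var out = [];
    for (var i = 0; i < str.length; i++) {
        var c = str.charCodeAt(i);
        if (c >= 0xD800 && c <= 0xDFFF) {
            var d = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
            if (c <= 0xDBFF && d >= 0xDC00 && d <= 0xDFFF) {
                // High surrogate followed by a low surrogate: one code point.
                c = 0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00);
                i++;
            } else {
                c = 0xFFFD;  // assumption: replace unpaired surrogates
            }
        }
        if (c < 0x80) {
            out.push(c);
        } else if (c < 0x800) {
            out.push(0xC0 | (c >> 6), 0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out.push(0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
        } else {
            out.push(0xF0 | (c >> 18), 0x80 | ((c >> 12) & 0x3F),
                     0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
        }
    }
    return out;
}

// U+1D11E is stored as the surrogate pair D834 DD1E and encodes to four bytes.
console.log(utf8Bytes("\uD834\uDD1E")); // [ 240, 157, 132, 158 ]
```

<p>The G clef takes two code units in the string but comes out as the single four byte sequence F0 9D 84 9E, which is what a UTF-8 consumer expects.</p>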
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/from-utf-16-to-utf-8-in-javascript/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Did you know this about AUTO_INCREMENT?</title>
		<link>http://jonisalonen.com/2012/did-you-know-this-about-auto_increment/</link>
		<comments>http://jonisalonen.com/2012/did-you-know-this-about-auto_increment/#comments</comments>
		<pubDate>Sat, 22 Sep 2012 06:26:22 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[auto_increment]]></category>
		<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=441</guid>
		<description><![CDATA[The AUTO_INCREMENT feature of MySQL is the easiest way to generate an automatic sequence of incrementing values for a primary key. It&#8217;s very easy to use, too: create table t ( i integer auto_increment, other_field varchar(100), primary key(id) ); There some features and &#8220;features&#8221; that you have to keep in mind when using AUTO_INCREMENT, though. [...]]]></description>
				<content:encoded><![CDATA[<p>The <code>AUTO_INCREMENT</code> feature of MySQL is the easiest way to generate an automatic sequence of incrementing values for a primary key. It&#8217;s very easy to use, too:</p>
<pre>
create table t (
    i integer auto_increment,
    other_field varchar(100),
    primary key(i)
);
</pre>
<p>There are some features and &#8220;features&#8221; that you have to keep in mind when using AUTO_INCREMENT, though.</p>
<h4>AUTO_INCREMENT never gives a value that is less than one already present in the table.</h4>
<p>If you add a row with ID explicitly set to a higher value, the next value produced by AUTO_INCREMENT will be one higher. Yes, this means that there will be gaps. Yes, this means that if for some reason the greatest possible value of the field is inserted, the following inserts will fail:</p>
<pre>
mysql> insert into t (i) values (2147483647);
Query OK, 1 row affected (0.05 sec)
mysql> insert into t values ();
ERROR 1062 (23000): Duplicate entry '2147483647' for key 'PRIMARY'
</pre>
<h4>How to reset the <code>AUTO_INCREMENT</code> counter.</h4>
<p>The values are produced by a counter associated with the table, and the counter only grows. Except that sometimes you may want to reset the counter to some old value.</p>
<pre>
alter table t auto_increment = 0;
</pre>
<p>If you find yourself doing this after emptying a testing table with <code>DELETE</code>, try using the <code>TRUNCATE</code> statement next time. It deletes all the data, and resets the auto increment counter.</p>
<h4>If records are deleted, AUTO_INCREMENT may jump back on server restart.</h4>
<p>In the InnoDB storage engine the auto increment counter is only <a href="http://dev.mysql.com/doc/refman/5.5/en/innodb-auto-increment-handling.html">stored in memory</a>, not on disk. This means that when MySQL is restarted the counter is re-initialized from the maximum value currently in the column. This may be a problem for applications that use one table to archive data that has been deleted from another: suddenly IDs are repeated.</p>
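<p>The effect is easy to reproduce with a toy model. The sketch below is plain JavaScript, not MySQL code, and <code>ToyTable</code> is entirely made up; it only mimics the behaviour described above, a counter that lives in memory and is recomputed from the rows after a restart.</p>

```javascript
// Toy model only: an in-memory counter that is rebuilt from the stored rows on restart.
function ToyTable() {
    this.ids = [];
    this.counter = 1;
}
ToyTable.prototype.insert = function () {
    var id = this.counter++;
    this.ids.push(id);
    return id;
};
ToyTable.prototype.remove = function (id) {
    this.ids = this.ids.filter(function (x) { return x !== id; });
};
ToyTable.prototype.restart = function () {
    // The counter is not persisted, so it is recomputed as max(id) + 1.
    this.counter = this.ids.length ? Math.max.apply(null, this.ids) + 1 : 1;
};

var t = new ToyTable();
t.insert(); t.insert();
var archived = t.insert();             // ids 1, 2, 3
t.remove(archived);                    // row 3 is moved to an archive table
t.restart();
console.log(t.insert() === archived);  // true: id 3 is handed out again
```

<p>Without the restart the next insert would have received 4, which is why the problem tends to show up only after maintenance or a crash.</p>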
<h4>A table can have only one <code>AUTO_INCREMENT</code> field, and it has to be a key.</h4>
<p>The auto increment column does not have to be <em>the</em> primary key; it&#8217;s enough that there&#8217;s an index of some kind on it. It does not even have to be a unique index.</p>
<p>In the MyISAM and BDB storage engines you can have the auto increment column as a secondary column in a multi-column index, and then the auto increment starts from 1 for each group:</p>
<pre>
create temporary table t (
    v varchar(100), 
    i integer(2) auto_increment, 
    primary key(v, i)
) engine=myisam;
insert into t (v) values ('a'),('a'),('b'),('b'),('b');
select * from t;
+---+---+
| v | i |
+---+---+
| a | 1 |
| a | 2 |
| b | 1 |
| b | 2 |
| b | 3 |
+---+---+
</pre>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/did-you-know-this-about-auto_increment/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Java and File Names With Invalid UTF-8</title>
		<link>http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/</link>
		<comments>http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/#comments</comments>
		<pubDate>Fri, 24 Aug 2012 19:06:04 +0000</pubDate>
		<dc:creator>Joni</dc:creator>
				<category><![CDATA[Java]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://jonisalonen.com/?p=416</guid>
		<description><![CDATA[On Unix systems &#8211; Linux and OS X included &#8211; file names can be arbitrary binary data with very few limitations. This means that in order to make sense of the name a character encoding must be used. Recently UTF-8 has become the default encoding on many systems, but sometimes you have to deal with [...]]]></description>
				<content:encoded><![CDATA[<p>On Unix systems &#8211; Linux and OS X included &#8211; file names can be arbitrary binary data with very few limitations. This means that in order to make sense of the name a character encoding must be used. Recently UTF-8 has become the default encoding on many systems, but sometimes you have to deal with files originating from older systems with names in other encodings. These files are a problem for Java programs because <code>java.io</code> treats file names as strings of Unicode characters rather than bytes, and is unable to open files with incorrectly encoded names.</p>
<h3>Example</h3>
<p>This Java program lists files from the current directory and tells you if they exist. It demonstrates that when Java encounters a file with a problematic name it does report it in <code>listFiles</code>, but any further operations on the file fail.</p>
<pre>
import java.io.File;
import java.io.IOException;

class Ls {
    public static void main(String[] args) throws IOException {
        File d = new File(".");
        for (File f : d.listFiles()) {
            System.out.printf("%s: %b\n", f.getName(), f.exists());
        }
    }
}
</pre>
<p>For example, when it encounters a file with a name encoded in latin1, this is what happens: </p>
<pre>
$ ls -b
ni\361o
$ java Ls
ni�: false
</pre>
<p>You can <a href='http://jonisalonen.com/wp-content/uploads/ls.tar'>download Ls.java with an example file</a> here.</p>
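<p>What presumably goes wrong can be shown outside Java as well. The sketch below uses Node.js and its <code>Buffer</code> class, which is an assumption about the environment, and Java&#8217;s own decoder may differ in the details. It decodes the bytes of the file name once as UTF-8 and once as latin1.</p>

```javascript
// The four bytes of the file name from the listing above, encoded in latin1.
var name = Buffer.from([0x6e, 0x69, 0xf1, 0x6f]);

// Decoded as UTF-8, the byte 0xF1 is not a valid sequence and becomes U+FFFD.
var asUtf8 = name.toString("utf8");
// Decoded as latin1, every byte maps to exactly one character.
var asLatin1 = name.toString("latin1");

console.log(JSON.stringify(asUtf8), asLatin1);

// Encoding the damaged string again does not give back the original bytes,
// so the decoded name no longer identifies the file.
console.log(Buffer.from(asUtf8, "utf8").equals(name));     // false
console.log(Buffer.from(asLatin1, "latin1").equals(name)); // true
```

<p>Once the name has been turned into a string with the wrong encoding, the original bytes cannot be recovered from it, which would explain why listing works but any later operation on the file fails.</p>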
<h3>Setting the default character encoding</h3>
<p>You probably know that Java uses a &#8220;default character encoding&#8221; to convert binary data to <code>String</code>s. To read or write text using another encoding you can use an <code>InputStreamReader</code> or <code>OutputStreamWriter</code>. But for data-to-text conversions deep in the API you have no choice but to change the default encoding.</p>
<p>Java reads the default character encoding from the system language settings. On Unix this means the <code>LANG</code> and <code>LC_CTYPE</code> environment variables; changing one of these is sufficient. For example, to make Java use latin1 you could start the JVM with the following command:</p>
<pre>$ LANG=en_US.iso88591 java Ls
ni�o: true
</pre>
<p>Or, if you want all programs you start from the terminal to use this locale:</p>
<pre>export LANG=en_US.iso88591</pre>
<p>The locale <code>en_US.iso88591</code> has to be installed on the system for these to work, though. You can use the following command to list locales that are available on your system.</p>
<pre>locale -a</pre>
<h3>Defining and installing a new locale</h3>
<p>If you don&#8217;t have a locale with the appropriate encoding installed you can define and install a new one with the <code>localedef</code> program. For example, to create a locale with the <a href="http://en.wikipedia.org/wiki/Windows-1252">Windows Western</a> character encoding you could use the following command.</p>
<pre>sudo localedef -f CP1252 -i en_US en_US.cp1252</pre>
<p>Under this locale Java would correctly process files with all kinds of names, including those whose name contains curly quotes or the euro character €.</p>
<h3>What about <code>file.encoding</code>?</h3>
<p>The <code>file.encoding</code> system property can also be used to set the default character encoding that Java uses for I/O. Unfortunately it seems to have no effect on how file names are decoded into <code>String</code>s.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonisalonen.com/2012/java-and-file-names-with-invalid-characters/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
