How to bind a Tomcat connector to an Apache server that is not on localhost?

Goal

We would like to make sure Tomcat only listens to an Apache server that is not on the localhost address. It is a security measure to protect the Tomcat port.

In order to do that, we need to add the “address” attribute to the connector of the Tomcat port.
A few words about the connector:

The HTTP Connector element represents a Connector component that supports the HTTP/1.1 protocol. It enables Catalina to function as a stand-alone web server, in addition to its ability to execute servlets and JSP pages.

http://tomcat.apache.org/tomcat-7.0-doc/config/http.html

 
<Connector port="8111"
           acceptCount="100"
           address="127.0.0.1"
           connectionTimeout="5000"
           keepAliveTimeout="10000"
           maxKeepAliveRequests="1"
           maxConnections="10000"
           protocol="HTTP/1.1"
           />

A few words about the address attribute:

For servers with more than one IP address, this attribute specifies which address will be used for listening on the specified port. By default, this port will be used on all IP addresses associated with the server.
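
The effect of such an address setting can be illustrated outside Tomcat with plain sockets. The sketch below is only an analogy written in Python, not Tomcat code: the helper name listen_on is hypothetical, and port 0 asks the operating system for any free port.

```python
import socket


def listen_on(address, port=0):
    """Open a listening TCP socket bound to a single local address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((address, port))
    sock.listen(5)
    return sock


# Bound to the loopback address: only clients on the same machine
# that connect to 127.0.0.1 can reach this socket.
loopback_only = listen_on("127.0.0.1")
# Bound to the wildcard address: reachable through every local IP address.
all_interfaces = listen_on("0.0.0.0")

loopback_address = loopback_only.getsockname()[0]
wildcard_address = all_interfaces.getsockname()[0]
print(loopback_address, wildcard_address)

loopback_only.close()
all_interfaces.close()
```

A connector without the address attribute behaves like the second socket; a connector with address="127.0.0.1" behaves like the first one.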

NOTE: this solution uses the HTTP protocol connector to connect to Apache instead of the AJP protocol. The AJP connector is usually preferred between Apache and Tomcat for performance reasons.
https://www.mulesoft.com/tcat/tomcat-connectors

Problem

At first I used the localhost address (127.0.0.1) to make Tomcat listen on this address. I wrongly assumed that the Apache server was reaching Tomcat through the local address.

Apache and Tomcat would start with no errors. However, the application would not start. There were no meaningful errors in the application logs or the Tomcat logs. At last I found some errors in the Apache server error.log:

[Thu Jun 15 17:56:15 2017] [error] (111)Connection refused: proxy: HTTP: attempt to connect to 191.14.12.14:8111 (machine_adress) failed
[Thu Jun 15 17:56:15 2017] [error] ap_proxy_connect_backend disabling worker for (machine_adress)

This error helped me find a solution to the problem.
I checked the IP address 191.14.12.14 and found out in /etc/hosts that the address 191.14.12.14 was linked to a server called apache_instance1.
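
This lookup can also be scripted. The following is a minimal sketch, assuming the usual /etc/hosts layout (an IP address followed by one or more names, with optional # comments). The function name names_for_ip and the sample content are hypothetical and only modelled on the values mentioned above.

```python
def names_for_ip(hosts_text, ip):
    """Return every host name that a hosts file maps to the given IP."""
    names = []
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if fields[0] == ip:
            names.extend(fields[1:])
    return names


# Hypothetical sample; on a real machine, read the text from /etc/hosts.
SAMPLE_HOSTS = """\
127.0.0.1      localhost
191.14.12.14   apache_instance1   # entry modelled on the article
"""

print(names_for_ip(SAMPLE_HOSTS, "191.14.12.14"))
```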

Solution :

I checked the Apache configuration “httpd.conf” and found out that the name of the server is:

ServerName apache_instance1

Therefore, to make the Tomcat port listen only on the address used by the Apache server, I had to modify the address attribute like this:

<Connector port="8111"
           acceptCount="100"
           address="apache_instance1"
           connectionTimeout="5000"
           keepAliveTimeout="10000"
           maxKeepAliveRequests="1"
           maxConnections="10000"
           protocol="HTTP/1.1"
           />

This fixed my problem.

Set up apache reverse proxy with tomcat

This article gives an overview of the relation between Apache HTTP Server and Tomcat, and of the reverse proxy.

For a long time Tomcat/Apache was a black box for me because I did not have to manage it. A few years ago I had the opportunity to gain more knowledge on this subject. The aim of this article is to focus on the big picture of Apache/Tomcat and to present the reverse proxy module for Apache HTTP Server.

What is tomcat ?

Tomcat executes Java servlets and renders JSP (JavaServer Pages) web pages. This guide can help you to understand and run Tomcat: https://tomcat.apache.org/tomcat-3.2-doc/uguide/tomcat_ug.html

Tomcat is a web server used mostly in the Java world. It is also easy to use in a dev environment with Eclipse for quick testing of JSP/JavaScript/HTML/CSS pages.

Install tomcat on linux :
http://www.vogella.com/tutorials/ApacheTomcat/article.html

Why use Apache if tomcat is a web server ?

Apache is more robust for static content (HTML, images). In a production environment, Apache HTTP Server is commonly combined with Tomcat, which handles the dynamic content (JSP).

https://tomcat.apache.org/tomcat-3.2-doc/tomcat-apache-howto.html

How apache and tomcat communicate together ?

I am not going into details since it is documented in the user's guide and also here: https://tomcat.apache.org/tomcat-3.2-doc/tomcat-apache-howto.html

I am just going through the most important steps briefly, giving real-world examples along the way. For information, the examples I give use Tomcat version 8.0.23 and Apache version Apache/2.2.15 (Unix).

What’s required to pull this off?
Answers to the above three questions!
1. Configure Tomcat
2. Install a web server adapter.
3. Modify Apache’s httpd.conf file.

1.Configure Tomcat

1.1 Modify Tomcat’s server.xml file.
-> Create connectors (HTTP/HTTPS/AJP). A “Connector” represents an endpoint by which requests are received and responses are returned.
-> The AJP connector is the mechanism by which Tomcat will communicate with Apache.

1.2 Defining a context.
-> It is NOT recommended to place <Context> elements directly in the server.xml file. Define them in context.xml instead. https://tomcat.apache.org/tomcat-7.0-doc/config/context.html
-> We can also define here the JDBC configuration used to access the database. JDBC example under the context :

The Resources element represents all the resources available to the web application. https://tomcat.apache.org/tomcat-8.0-doc/config/resources.html . Example of resource within the context to mount the web app :

2.Install a web server adapter.

This adapter is not located in apache or tomcat configuration. It answers the question : “How will Apache forward these requests to Tomcat?”.
http://tomcat.apache.org/connectors-doc/webserver_howto/apache.html

mod_jk requires two entities:
mod_jk.xxx – The Apache HTTP Server module. Depending on your operating system, it will be mod_jk.so, mod_jk.nlm or MOD_JK.SRVPGM (see the build section). For example, on our Linux machine:
find / -iname '*MOD_JK*' -print 2>/dev/null
/usr/lib64/httpd/modules/mod_jk-1.2.31-httpd-2.2.x.so
workers.properties – A file that describes the host(s) and port(s) used by the workers (Tomcat processes). A sample workers.properties can be found under the conf directory in the source download.
Also, as with other Apache modules, mod_jk should first be installed in the modules directory of your Apache HTTP Server, i.e. /usr/lib/apache, and you should update your httpd.conf file.
mod_jk.conf – Custom changes to this file are not always needed, but there are situations where we need to make them.

For information, in our installation workers.properties and mod_jk.conf are located under apache/conf.d.

3.Modify Apache’s httpd.conf file.

We need to tell Apache how to load and initialize our adapter, and that certain requests should be handled by this adapter and forwarded onto Tomcat. Tomcat does most of the work for you.

Each time you start Tomcat, after it loads Contexts (both from the server.xml and automatically from $TOMCAT_HOME/webapps), it automagically generates a number of files for you. The two that we’re concerned with are:
tomcat-apache.conf (should really be named mod_jserv.conf-auto)
mod_jk.conf-auto

For example, on my latest project, our httpd.conf file simply has one line that includes all the Apache configuration files, including mod_jk.conf:

Include conf.d/*.conf

NOTE : For this application we do not use mod_jk.conf-auto but our own custom configuration file mod_jk.conf.

More information at chapter “httpd.conf – Apache’s main configuration file” https://tomcat.apache.org/tomcat-3.2-doc/tomcat-apache-howto.html#httpd

Reverse Proxy in Apache

A reverse proxy (or gateway), appears to the client just like an ordinary web server. No special configuration on the client is necessary. The client makes ordinary requests for content in the namespace of the reverse proxy. The reverse proxy then decides where to send those requests and returns the content as if it were itself the origin.

https://httpd.apache.org/docs/2.4/en/mod/mod_proxy.html#access

We wanted to set up a reverse proxy in order to access a remote web server running on a different machine but on the same network. The idea was to access a servlet running in a different application. It saved us from duplicating the servlet and a database for our application. Instead of recreating something that already existed, we could just reuse an existing web server.

As you have seen previously, we load all the .conf files in the conf.d directory, including our reverse_proxy.conf file.

Our configuration file for the reverse proxy is the following :

# Load the proxy module
LoadModule proxy_http_module modules/mod_proxy_http.so

# HTTP
ProxyPass /foo/loadsomeinfo http://192.168.10.1:8080/loadapp
ProxyPassReverse /foo/loadsomeinfo http://192.168.10.1:8080/loadapp

ProxyPass / http://machinea:9000/
ProxyPassReverse / http://machinea:9000/
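
To see what these rules do to an incoming request, here is a rough model in Python. It is a simplification, not how mod_proxy is implemented: it assumes plain prefix matching in configuration order, where the first matching rule wins, and it assumes the backend address is written as host:port. The names PROXY_RULES and map_to_backend are hypothetical.

```python
# (local path prefix, backend URL) pairs, in the same order as the config file
PROXY_RULES = [
    ("/foo/loadsomeinfo", "http://192.168.10.1:8080/loadapp"),
    ("/", "http://machinea:9000/"),
]


def map_to_backend(path, rules=PROXY_RULES):
    """Return the backend URL for a request path, or None if no rule matches."""
    for prefix, target in rules:
        if path.startswith(prefix):
            # Replace the matched prefix with the backend URL
            return target + path[len(prefix):]
    return None


print(map_to_backend("/foo/loadsomeinfo/list"))
print(map_to_backend("/index.html"))
```

Because the first match wins in this model, the more specific /foo/loadsomeinfo rule has to come before the catch-all / rule, as it does in the configuration above.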

Apache module mod_proxy :
https://httpd.apache.org/docs/2.4/en/mod/mod_proxy.html#access

Basic knowledge of Apache HTTP Server

I created this article as a reminder of some basics of Apache HTTP Server, with real examples. I will present how to start Apache, where to look for logs, and give information about the configuration.

The HTTP server handles requests and maps URLs to filesystem locations. More information at https://httpd.apache.org/docs/2.4/en/urlmapping.html

How do we start Apache ?

On Unix, the httpd program is run as a daemon that executes continuously in the background to handle requests.
https://httpd.apache.org/docs/2.4/en/invoking.html

According to the documentation, httpd should be invoked by a script called apachectl.
https://httpd.apache.org/docs/2.4/en/programs/httpd.html

Let’s look at a running Apache daemon. I happen to have an Apache server on a test machine.

ps -edf | grep httpd
root     23154     1  0 10:44 ?        00:00:00 /usr/sbin/httpd.worker -f /var/apache/conf/httpd.conf -f /var/apache/conf/httpd.conf -k start

The command ps -edf tells us where the configuration file used by the running Apache is located. There are many old Apache installations on our machine, so this command helps to find the configuration file actually used by the current Apache server.
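
The same lookup can be scripted. The sketch below is only an illustration in Python: it takes one line of ps output (copied from the example above) and extracts the arguments that follow -f. The function name config_files is hypothetical.

```python
import shlex


def config_files(ps_line):
    """Return the distinct values passed with -f on an httpd command line."""
    args = shlex.split(ps_line)
    found = [args[i + 1] for i, arg in enumerate(args[:-1]) if arg == "-f"]
    # dict.fromkeys removes duplicates while keeping the original order
    return list(dict.fromkeys(found))


PS_LINE = (
    "root     23154     1  0 10:44 ?        00:00:00 /usr/sbin/httpd.worker "
    "-f /var/apache/conf/httpd.conf -f /var/apache/conf/httpd.conf -k start"
)

print(config_files(PS_LINE))
```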

One would wonder why httpd.worker is being run and not httpd? What is httpd.worker?
The answer to this question is here :

http://serverfault.com/questions/213956/what-is-the-difference-between-apachectl-and-httpd-worker

Basically at the installation of our product we install a new Apache/tomcat and deploy the application automatically. More information about how it is installed with shell scripts and deployed from Jenkins at
https://julienprog.wordpress.com/2015/08/05/automate-the-installation-of-a-product-with-bash-scripts/

Here are the commands executed by the shell script to launch our Apache server when installing on our machine :

#----------------------------------------------------------------------------------------------------
# Start the Apache http daemon
#----------------------------------------------------------------------------------------------------
start_httpd()
{
    write_log "Starting Apache..."
    export OPTIONS="-f ${ApacheDir}/conf/httpd.conf"
    /usr/sbin/apachectl $OPTIONS -k start
    ReturnCode=$?
    # more code to handle the response
}

Where are the logs?

In my opinion, this is an important question if we want to troubleshoot problems on the server.
More about logging and how to understand the format of the logs:
http://httpd.apache.org/docs/2.4/en/logs.html

Previously we found out that the configuration of the running httpd is at /var/apache/conf/httpd.conf. If I look into this file, I can find where the logs are located:

ErrorLog logs/error_log

Thus, on my server, the error log is located at /var/apache/logs/error_log

NOTE: If you have several virtual hosts, you can define one log for each of them.
The directory /var/apache/logs happens to contain more logs than just error_log, because we have defined some other configuration in /var/apache/conf.d.

For example, if I want to find which configuration file generates ssl_error_log, I execute on the Linux machine:

grep -rnw '/var/apache' -e "ssl_error_log"
/var/apache/conf.d/mod_jk.conf:57:ErrorLog logs/ssl_error_log

Understand Apache Configuration

The following considerations are about the HTTP Server configuration file “httpd.conf” and other custom-made configuration files.

What is the purpose of DocumentRoot ?

Extract from httpd.conf :

#
# DocumentRoot: The directory out of which you will serve your
# documents. By default, all requests are taken from this directory, but
# symbolic links and aliases may be used to point to other locations.
#
DocumentRoot "/var/apache/www/html"

Files and directories underneath the DocumentRoot make up the basic document tree which will be visible from the web.

Therefore what you see on a website is the tree structure under DocumentRoot.
With the DocumentRoot above, a request for http://www.example.com/fish/ will cause httpd to attempt to serve the file /var/apache/www/html/fish/index.html.
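
The mapping can be written down as a small function. This is a simplified sketch in Python that ignores aliases, symbolic links and rewrite rules. It assumes the DocumentRoot from the extract above and index.html as the directory index. The function name url_path_to_file is hypothetical.

```python
import posixpath


def url_path_to_file(url_path, document_root, directory_index="index.html"):
    """Map the path part of a URL to a file under DocumentRoot."""
    # Normalise the path so that '..' segments cannot climb above the root
    clean = posixpath.normpath("/" + url_path.lstrip("/"))
    file_path = posixpath.join(document_root, clean.lstrip("/"))
    if url_path.endswith("/"):
        # A directory was requested: serve its index file
        file_path = posixpath.join(file_path, directory_index)
    return file_path


DOCUMENT_ROOT = "/var/apache/www/html"
print(url_path_to_file("/fish/", DOCUMENT_ROOT))
print(url_path_to_file("/logo.png", DOCUMENT_ROOT))
```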

What does the IfModule tag do in configuration files ?

Did you notice the

<IfModule>

tag in the configuration files of the Apache server and wonder what exactly it is ?

In the following example, the MimeMagicFile directive will be applied only if mod_mime_magic is available.

<IfModule mod_mime_magic.c>
    MimeMagicFile "conf/magic"
</IfModule>

https://httpd.apache.org/docs/2.4/en/sections.html

Therefore the directives inside this tag are applied only if the named module is available.

What is ServerName ?

Extract from httpd.conf :

#
# ServerName gives the name and port that the server uses to identify itself.
# This can often be determined automatically, but we recommend you specify
# it explicitly to prevent problems during startup.
#
# If this is not set to valid DNS name for your host, server-generated
# redirections will not work.  See also the UseCanonicalName directive.
#
# If your host doesn't have a registered DNS name, enter its IP address here.
# You will have to access it by its address anyway, and this will make
# redirections work in a sensible way.
#
ServerName apache-myinstance:8080

Run the following command in a shell terminal to find the IP address associated with the server name:

cat /etc/hosts

What does the Listen directive do ?

The Listen directive tells the server to accept incoming requests only on the specified port(s) or address-and-port combinations.
https://httpd.apache.org/docs/2.4/en/bind.html

For example, on our server we use the directive twice, once for HTTP and once for HTTPS:

# Virtual HOST HTTP **
Listen apache-myinstance:8080
# Virtual HOST HTTPS **
Listen apache-myinstance:447

What is a Virtual Host ?

httpd is also capable of Virtual Hosting, where the server receives requests for more than one host. For example, the same server could run http://www.mywebsite.com, http://www.myotherwebsite.com, etc.

<VirtualHost apache-myinstance:447>

      JkMount /* myworker

ErrorLog logs/ssl_error_log
CustomLog logs/ssl_access.log java_format
LogLevel warn

SSLEngine on
SSLProtocol all -SSLv2
SSLCipherSuite ALL:!ADH:!EXPORT:!SSLv2:RC4+RSA:+HIGH:+MEDIUM:+LOW
SSLCertificateFile /etc/certs/localhost.crt
SSLCertificateKeyFile /etc/certs/localhost.key

<Files ~ "\.(cgi|shtml|phtml|php3?)$">
    SSLOptions +StdEnvVars
</Files>
<Directory "/var/www/cgi-bin">
    SSLOptions +StdEnvVars
</Directory>

SetEnvIf User-Agent ".*MSIE.*"          nokeepalive ssl-unclean-shutdown          downgrade-1.0 force-response-1.0

</VirtualHost>

How to modify your git credentials when cloning from Git Extensions ?

Problem

I installed Git Extensions https://gitextensions.github.io/ in order to use git on Windows. I had to clone a repository located on a remote server.

But I used a wrong login (mywronglogin). Here is the error :

"C:\Program Files (x86)\Git\bin\git.exe" clone -v --recurse-submodules --progress --branch develop "https://www.example.com/git/myproject.git" "D:/my_repo_git"
Cloning into 'D:/my_repo_git/myproject'...
fatal: remote error: FATAL: R any myproject mywronglogin DENIED by fallthru
(or you mis-spelled the reponame)

Unfortunately, it is not possible to modify the credentials from the Git Extensions user interface !

I tried the following without success :

  • Modifying the login in the Git Extensions UI
  • Looking for a configuration containing my wrong login in C:/Users/mylogin, in the git directory and in the Git Extensions directory
  • Modifying the registry with regedit
  • Uninstalling Git Extensions, but after reinstalling it I got the same issue !

Finally I found out about the credential helper for Windows, which can cache git credentials.

Solution

I had to unset the cached credentials in order to modify this login. Indeed, the helper configured in the “credential.helper” setting stores the credentials to avoid typing them every time. It is a nice feature, except when you want to modify the credentials for some reason.

I had to unset the credential helper like this :

git config --system --unset credential.helper

http://stackoverflow.com/questions/15381198/remove-credentials-from-git

Now, when you clone the repo, you will be asked for the login/password !
If you are not aware of this feature, you can spend quite some time on this problem.
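
Before unsetting anything, it can help to see which configuration file defines the helper, since the setting can exist at system, global or repository level. The sketch below is an illustration in Python that wraps git config --show-origin --get-all credential.helper. It assumes a Git version that supports --show-origin and that origin and value are separated by a tab. The function names and the sample output are hypothetical.

```python
import subprocess


def parse_show_origin(output):
    """Split `git config --show-origin` output into (origin, value) pairs."""
    pairs = []
    for line in output.splitlines():
        if "\t" in line:
            origin, value = line.split("\t", 1)
            pairs.append((origin, value))
    return pairs


def credential_helpers():
    """Ask git where credential.helper is set (requires git on the PATH)."""
    result = subprocess.run(
        ["git", "config", "--show-origin", "--get-all", "credential.helper"],
        capture_output=True, text=True, check=False,
    )
    return parse_show_origin(result.stdout)


# Hypothetical output, only to show the expected shape of the data
SAMPLE_OUTPUT = "file:C:/Program Files (x86)/Git/etc/gitconfig\twincred\n"
print(parse_show_origin(SAMPLE_OUTPUT))
```

Calling credential_helpers() on the affected machine would list each file that sets the helper, which tells you whether --system, --global or --local is the right scope for the unset command.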

How to call web servers asynchronously with JavaScript ?

Problem :

We wanted to display or hide information coming from more than five web servers when loading a homepage. Instead of waiting until all the servers have replied (synchronous method), the customer wanted to see each piece of information as soon as the corresponding server replied. This solution gives a better user experience because the webpage is more responsive.

Solution

I will break the solution down into several parts and explain each step :

  • An HTML/W3.CSS webpage with a button calling the JavaScript method
  • The JavaScript code calling multiple web servers
  • The JavaScript class that calls a web server
  • The callback method which handles the response of the web server
  • The JavaScript function which updates the webpage after obtaining a response from the server.
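
The core idea, handling each reply as soon as it arrives instead of waiting for all of them, does not depend on JavaScript. Here is a minimal sketch of the same pattern in Python with asyncio. The servers are simulated with delays, and the names call_server, load_all and SERVERS are hypothetical.

```python
import asyncio


async def call_server(name, delay):
    """Simulate a web server that answers after `delay` seconds."""
    await asyncio.sleep(delay)
    return name


async def load_all(servers):
    """Start every call at once and handle each reply as soon as it arrives."""
    arrival_order = []
    tasks = [asyncio.create_task(call_server(name, delay))
             for name, delay in servers]
    for finished in asyncio.as_completed(tasks):
        name = await finished
        # This is where the page would be updated for that server
        arrival_order.append(name)
    return arrival_order


SERVERS = [("slow", 0.3), ("fast", 0.05), ("medium", 0.15)]
order = asyncio.run(load_all(SERVERS))
print(order)
```

The replies are processed in order of arrival, not in the order the calls were issued, which is what makes the page feel responsive.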

HTML web page

The html webpage contains a button and a text input. The text input is sent to the web servers when we click the button.

[Figure: main_webpage. This is a simple example of a webpage.]

The links to my JavaScript files :

<script type="text/javascript" src="./webserver.js"></script>
<script type="text/javascript" src="./request.js"></script>

When we click on the button, we call a JavaScript function which uses the text content of the input “myinput”.

<input name="myinput" id="myinput" class="w3-input w3-border w3-light-grey" type="text">
<button class="w3-btn w3-blue-grey" onclick="callWeb()">Load Web Servers</button>

callWeb() is a JavaScript function which retrieves the input value and calls the web servers.

function callWeb() {
    var input = document.getElementById('myinput'),
        myinput = input.value;
    if (myinput) {
        // Pass the text value, not the input element itself
        loadWebServers(myinput);
    } else {
        alert('Please enter an input!');
        input.focus();
    }
}

To avoid the error “The character encoding of the HTML document was not declared”, I added these lines to the HTML :

<meta content="text/html;charset=utf-8" http-equiv="Content-Type">
<meta content="utf-8" http-equiv="encoding">

For information, in this example I use W3.CSS for the style. It is a modern CSS framework with built-in responsiveness, and an alternative to Bootstrap. More information at https://www.w3schools.com/w3css/.

Javascript : Calling multiple web servers

I will present the JavaScript function which sends the request and the callback function which handles the response.

function loadWebServers(input)
{
    var value1 = input;
    var value2 = '';
    var value3 = '';
    loadWebABC(value1);
    loadWebX(value1);
}

In this example we call only two web servers. loadWebABC calls a web service with an XML request. loadWebX calls another web service with a JSON request.

function loadWebABC(value1)
{
    updateLoading();
    // Encode the value so that it is safe inside a form-urlencoded body
    MyClassRequest.sendXmlRequest("http://example.com/service", "myinput=" + encodeURIComponent(value1), "callbackWebABC");
}

When the server replies, the callback will handle the response :

function callbackWebABC(xml)
{
    updateLoading();
    var rootInfo = xml.getElementsByTagName("root-info");
    //parse more information
   infoextracted=…..
    showOrDisplayInfo(infoextracted);
}

Javascript : Detail of the class that call the web server

XMLHttpRequest is the class used to call remote web servers.
https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest

If you use XMLHttpRequest from an extension, you should use it asynchronously. In this case, you receive a callback when the data has been received, which lets the browser continue to work as normal while your request is being handled.

https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Synchronous_and_Asynchronous_Requests

MyClassRequest is the equivalent of a class in Javascript world. It is defined in a file called request.js :

var MyClassRequest = {
	loading : false,

	updaters : new Array(),

	Updater : function(request, backFunction, param, isXml, isJson) {
		this.request = request;
		this.backFunction = backFunction;
		this.param = param;
		this.isXml = isXml;
		this.cancelled = false;
		this.isJson = isJson;
	},

	createRequest : function() {
		var request = false;
		try {
			request = new XMLHttpRequest();
		} catch (trymicrosoft) {
			try {
				request = new ActiveXObject("Msxml2.XMLHTTP");
			} catch (othermicrosoft) {
				try {
					request = new ActiveXObject("Microsoft.XMLHTTP");
				} catch (failed) {
					request = false;
				}
			}
		}
		return request;
	},

	sendXmlRequest : function(url, urlParam, backFunction, param) {
		return this
				.sendRequest(url, urlParam, backFunction, param, true, false);
	},

	sendRequest : function(url, urlParam, backFunction, param, isXml, isJson, timeout) {
		 if (typeof(timeout)==='undefined') {
			 timeout = 0;
		 }
		var request = this.createRequest();

		if (!request) {
			alert("Request not supported");
			return -1;
		} else {
			var index = this.updaters.length;
			this.updaters[index] = new MyClassRequest.Updater(request,
					backFunction, param, isXml, isJson);

			request.open("POST", url, true);
			request.onreadystatechange = MyClassRequest.receiveRequest;
			if (timeout != null && timeout > 0) {
				request.timeout = timeout;
				request.ontimeout = MyClassRequest.timeoutRequest;
			}
			request.setRequestHeader("Content-Type",
					"application/x-www-form-urlencoded; charset=UTF-8");
			request.send(urlParam);

			this.setLoading(true);
			return index;
		}
	},

As you can see, the function sendXmlRequest is a thin wrapper around sendRequest. sendRequest uses the XMLHttpRequest object, or an equivalent ActiveX object on older Microsoft browsers. The important part of this function is :

request.onreadystatechange = MyClassRequest.receiveRequest;

It is the receiveRequest function which triggers the callback function defined previously.

receiveRequest : function() {
		var stillLoading = false;
		for ( var i = MyClassRequest.updaters.length - 1; i >= 0; i--) {
			var updater = MyClassRequest.updaters[i];
			if (updater != null) {
				if (updater.request.readyState == 4) {
					MyClassRequest.updaters[i] = null;
					if (updater.cancelled == false) {
						if (updater.request.status != 200) {
							alert("No response from server");
						} else {
							if (updater.backFunction) {
								callbackFunction(updater);
							}
						}
					}
				} else {
					stillLoading = true;
				}
			}
			MyClassRequest.setLoading(stillLoading);
		}
	},

callbackFunction is the function which actually calls our callback function when we correctly receive a response from the remote web server.

function callbackFunction(updater)
{

var func = new Function("response", "param",
			updater.backFunction
					+ "(response, param)");
	if (updater.isXml) {
		var xml = updater.request.responseXML;
		if (MyClassRequest.checkErrors(xml)) {
			func(false, updater.param);
		} else {
			func(xml, updater.param);
		}
	} else {
		func(updater.request.responseText,
				updater.param);
	}

}

Javascript : update the webpage

In the previous callback method we use a function showOrDisplayInfo(). This function displays or hides elements in the web page. This is a simplified version of the function, but as you can see it modifies an HTML element to show or hide it.

function showOrDisplayInfo(infoextracted) {
    for (var info in infoextracted)
    {
        var idElementToShow = "elementX" + info;
        // show() expects a DOM element, so look the element up by its id
        show(document.getElementById(idElementToShow));
    }
}

function show(element)
{
  if (element)
  {
    element.style.display="";
  }
}

function hide(element)
{
  if (element)
  {
    element.style.display="none";
  }
}

Difference between an HQL join fetch request and a standard SQL request.

Introduction

I recently stumbled upon a problem with an HQL request (HQL is the Hibernate Query Language). The application using the HQL request was not retrieving the information we wanted and we did not know why.

The SQL request retrieving correctly the information

The tables from the database(simplified) :


ALARM                                                         
        ID          NAME    
        122        NAME_X


MY_TABLE
        MY_ID     MY_NAME       VarA 
        231       NAME_Y        value_expected
        Null      NAME_X        value_expected

This SQL request, executed from SQL Developer, retrieves the expected outcome from the tables.

SELECT a FROM MY_TABLE t, ALARM a WHERE
t.VarA = 'value_expected' and
a.NAME=t.MY_NAME and (a.ID=t.MY_ID or a.ID is null)

Outcome :

Null           NAME_Y

The different behavior of a similar-looking HQL request

The HQL request (used with entities) in the Java code was supposed to give the same results as the previous SQL request. Here is the code of the HQL request :

StringBuilder clause = new StringBuilder(
"from MyAlarm a join fetch a.mytable t"
+" where t.VarA = 'value_expected' and"
+"(a.ID=t.MY_ID or a.ID is null)");

The problem with this HQL request is that part of the request is done through entity objects (from MyAlarm a join fetch a.mytable). The join is defined in the entity class MyAlarm :

@Entity
@Table(name = "MY_ALARM")
@Cache(usage = CacheConcurrencyStrategy.NONE)
public class MyAlarm implements Serializable {
    @Column(name = "NAME")
    private String Name;
    @Column(name = "ID")
    private String Id;

    @ManyToOne(targetEntity = MyTable.class, optional = true, fetch = FetchType.LAZY)
    @JoinColumns({
        @JoinColumn(name = "NAME", referencedColumnName = "MY_NAME",
                    nullable = true, insertable = false, updatable = false),
        @JoinColumn(name = "ID", referencedColumnName = "MY_ID",
                    nullable = true, insertable = false, updatable = false)
    })

This join from the entity will never retrieve the row of the table MY_TABLE where the ID is null, because there is no null ID in the table ALARM. Therefore this HQL request is not equivalent to the SQL request above.
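
The underlying SQL behaviour can be reproduced with an in-memory SQLite database. This is only a sketch in Python of the general rule that an equality join never matches a NULL, not a reproduction of the Hibernate mapping. It uses the sample rows from the tables above. One labeled assumption: the second query tests t.MY_ID IS NULL, because in the sample rows the null sits in MY_TABLE, whereas the request in this article tests a.ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ALARM (ID INTEGER, NAME TEXT);
    CREATE TABLE MY_TABLE (MY_ID INTEGER, MY_NAME TEXT, VarA TEXT);
    INSERT INTO ALARM VALUES (122, 'NAME_X');
    INSERT INTO MY_TABLE VALUES (231, 'NAME_Y', 'value_expected');
    INSERT INTO MY_TABLE VALUES (NULL, 'NAME_X', 'value_expected');
""")

# Join on both columns, like the two @JoinColumn entries of the entity:
# 122 = NULL is never true, so the NAME_X row is not matched.
composite_join = conn.execute("""
    SELECT a.ID, a.NAME FROM ALARM a
    JOIN MY_TABLE t ON a.NAME = t.MY_NAME AND a.ID = t.MY_ID
    WHERE t.VarA = 'value_expected'
""").fetchall()

# Join on the name only and accept a NULL id explicitly
# (assumption: the NULL is tested on t.MY_ID).
name_join_with_null = conn.execute("""
    SELECT a.ID, a.NAME FROM ALARM a, MY_TABLE t
    WHERE t.VarA = 'value_expected'
      AND a.NAME = t.MY_NAME
      AND (a.ID = t.MY_ID OR t.MY_ID IS NULL)
""").fetchall()

print(composite_join)
print(name_join_with_null)
conn.close()
```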

Explanation about join fetch

The documentation about join fetch : “14.3. Associations and joins”
https://docs.jboss.org/hibernate/orm/3.3/reference/en/html/queryhql.html

A good explanation of the difference between join and join fetch here :
http://stackoverflow.com/questions/17431312/difference-between-join-and-join-fetch-in-hibernate

How did I find the solution ?

As I said previously, unit testing is the key to resolving coding problems. It enables you to reproduce problems faster and it adds a regression test to your product. More information about unit testing: https://julienprog.wordpress.com/2015/03/15/the-power-of-unit-testing/

I reproduced the problem inside a unit test. It helped me a lot to understand the problem and how to solve it. To write a unit test of a Hibernate request I use hsqldb, the Spring framework, Maven, etc.

The solution

I joined the two entities “MyAlarm” and “MyTable” with the following conditions. The important part was to add (a.ID=t.MY_ID or a.ID is null). This condition is necessary to meet the needs of the customer.

select a from MyAlarm a, MyTable t where
t.VarA = 'value_expected' and
a.NAME=t.MY_NAME and (a.ID=t.MY_ID or a.ID is null)

Basic automatic testing of a machine learning algorithm in Python

Introduction

I will present a basic solution for automatically testing a machine learning algorithm. Many languages are used for machine learning, and Python is one of the most popular. It is not the fastest or the easiest language, but it is a general-purpose language that does a bit of everything.

I am going to use the machine learning code written by Michael A. Nielsen http://neuralnetworksanddeeplearning.com. It is coded in Python. His ebook explains neural networks and deep learning with code examples. It is a really good place to start learning machine learning.

The ebook explains machine learning algorithms (neural networks and deep learning). Michael A. Nielsen uses these algorithms to solve the problem of recognizing handwritten digits.

Solution

I will show the tools I use to code and test in Python. Then I will present some of my basic unit tests for the source code of Michael A. Nielsen. I will also present a short unit test for stochastic gradient descent, and show my solution for launching all the Python tests from Jenkins.

Develop and test Python code in Eclipse

There are many IDEs for coding in Python. I chose Eclipse with the PyDev plugin because it is free and easy to use. I also use “git” for source control.

Once you have installed the PyDev plugin in Eclipse, you need to configure Eclipse if you would like to run unit tests inside it. One of my problems was precisely to run unit tests inside Eclipse.

[Figure: project-neural. Neural networks project in Eclipse from the ebook]

As you can see from the picture, my unit tests are in a folder called “test”. The “src” folder contains the source code of the neural networks algorithm. My initial problem was that my unit tests could not import the source code. For example, I could not import the class Network.

The solution to this problem was to configure the Eclipse PyDev plugin as described in this link: http://stackoverflow.com/questions/4631377/unresolved-import-issues-with-pydev-and-eclipse

Go to the “PyDev – PYTHONPATH” pane of the Python project and add your source code to the external libraries.

[Figure: configure_eclipse_pydev. Configure PyDev to launch unit tests inside Eclipse]

Now I can launch the tests inside Eclipse.

Create Basic Unit Tests with Python

Assuming you are using Python 3.7, you should read about the unittest package in the Python manual: https://docs.python.org/3.7/library/unittest.html

This is the class I would like to test :

import numpy as np


class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
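
The shapes that this constructor produces can be worked out without NumPy, which gives concrete values a unit test could compare against. The helper below is a hypothetical sketch, not part of the ebook's code: it applies the same slicing and zip logic as the constructor but returns only the array shapes.

```python
def layer_shapes(sizes):
    """Return the (rows, columns) shapes of the bias and weight arrays."""
    # One bias column vector per non-input layer
    bias_shapes = [(y, 1) for y in sizes[1:]]
    # One weight matrix per pair of consecutive layers: rows = next layer
    weight_shapes = [(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
    return bias_shapes, weight_shapes


bias_shapes, weight_shapes = layer_shapes([784, 30, 10])
print(bias_shapes)
print(weight_shapes)
```

A test on the real class could then check, for example, that each entry of net.biases and net.weights has the corresponding shape.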

I am going to test only a few parts of the class Network and run a test case from chapter 1 http://neuralnetworksanddeeplearning.com/chap1.html.

This is my test class, which tests the Network class.

import unittest
import network
import mnist_loader



class test_network(unittest.TestCase):
    
    
    def testCaseRecognizeHandWrittenDigits(self):
        #loading the MNIST data
        training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
        #set up a Network with 30 hidden neurons
        net = network.Network([784, 30, 10])
        # Finally, we'll use stochastic gradient descent to learn from the
        # MNIST training_data with a mini-batch size of 10 and a learning
        # rate of eta=3.0. Chapter 1 uses 30 epochs; 3 are used here for speed.
        epochs = 3
        net.SGD(training_data, epochs, 10, 3.0, test_data=test_data)
        

    
    def testnetwork(self):
        print("init network")
        size = [784, 30, 10]
        net = network.Network(size)
        
        # verify the number of items in the collection size
        self.assertEqual(net.num_layers, 3)  

The test “testCaseRecognizeHandWrittenDigits” just launches one test case from chapter 1. It does not assert anything: it only checks that the code runs without raising an error, so we have no idea whether the code is doing something useful.

The test “testnetwork” is a unit test for the Network object. It verifies that the number of layers is correct. When I launch the tests from Eclipse, the results are OK:

run_test_network
Run as Python Unittest from Eclipse
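A unit test like “testnetwork” could also check the shapes of the randomly initialized parameters. Here is a minimal sketch, assuming a stand-in class StubNetwork (hypothetical, written in pure Python so that it runs without NumPy or MNIST) that mirrors the constructor shown earlier. Against the real class one would assert on the shape attribute of each array in net.biases and net.weights instead.

```python
import random
import unittest


class StubNetwork(object):
    """Hypothetical stand-in mirroring Network.__init__ with nested lists."""

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        # one bias column vector (y rows, 1 column) per non-input layer
        self.biases = [[[random.gauss(0, 1)] for _ in range(y)]
                       for y in sizes[1:]]
        # one weight matrix (y rows, x columns) per pair of adjacent layers
        self.weights = [[[random.gauss(0, 1) for _ in range(x)]
                         for _ in range(y)]
                        for x, y in zip(sizes[:-1], sizes[1:])]


class TestNetworkShapes(unittest.TestCase):

    def test_parameter_shapes(self):
        sizes = [784, 30, 10]
        net = StubNetwork(sizes)
        # the input layer has neither biases nor incoming weights
        self.assertEqual(len(net.biases), len(sizes) - 1)
        self.assertEqual(len(net.weights), len(sizes) - 1)
        for b, w, x, y in zip(net.biases, net.weights, sizes[:-1], sizes[1:]):
            self.assertEqual(len(b), y)     # one bias per neuron in the layer
            self.assertEqual(len(w), y)     # one weight row per neuron
            self.assertEqual(len(w[0]), x)  # one weight per input of the layer
```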

As you can see from the previous test case, I only run SGD over three epochs instead of 30. SGD is the method that implements stochastic gradient descent.

System Test of Stochastic Gradient Descent algorithm

Now I am going to test the SGD method of the Network class in the network2 module of the source code.

In practice, stochastic gradient descent(SGD) is a commonly used and powerful technique for learning in neural networks, and it’s the basis for most of the learning techniques we’ll develop in this book.

The test I have created tells us whether the algorithm recognizes handwritten digits with more than 90% accuracy.

import unittest
import network2
import mnist_loader


class test_network2(unittest.TestCase):

    def testCaseRecognizeHandWrittenDigits(self):
        # loading the MNIST data
        training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
        # set up a Network with 30 hidden neurons
        net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
        net.large_weight_initializer()
        # We set the learning rate eta to 0.5 and train for 3 epochs
        # (3 instead of 30, for speed)
        epochs = 3
        global_evaluation_data = net.SGD(
            training_data, epochs, 10, 0.5,
            evaluation_data=test_data,
            monitor_evaluation_cost=True,
            monitor_evaluation_accuracy=True,
            monitor_training_cost=True,
            monitor_training_accuracy=True)
        # SGD returns (evaluation_cost, evaluation_accuracy,
        # training_cost, training_accuracy), so index 1 is the accuracy
        # on the evaluation data (test_data, 10,000 images)
        total_accuracy_evaluation_data = global_evaluation_data[1]

        # verify that the accuracy for every epoch is above 90%
        for accuracy in total_accuracy_evaluation_data:
            print(accuracy)
            percentage = accuracy / 10000.0
            print(percentage)
            self.assertGreater(percentage, 0.9, "accuracy must be above 90 percent for every epoch")

As you can see from the code, the SGD method returns the cost and the accuracy for both the evaluation data and the training data. In the code I only verify that, for every epoch, the accuracy of handwritten digit recognition is above 90%. Note that index 1 of the returned tuple is the accuracy on the evaluation data (here test_data, 10,000 images), which is why the count is divided by 10000.0. The training accuracy is at index 3 and would have to be divided by the size of the training set.
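The accuracies used in the test above are counts of correctly classified images, so they have to be divided by the size of the data set they were measured on; 10000.0 is only right for a 10,000-image set. Here is a minimal sketch of that arithmetic. The helper names accuracy_fractions and all_above are hypothetical, and the counts are made up for illustration (they are not measured results).

```python
def accuracy_fractions(correct_counts, dataset_size):
    """Convert per-epoch counts of correct predictions into fractions."""
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return [count / float(dataset_size) for count in correct_counts]


def all_above(fractions, threshold):
    """Return True when every per-epoch fraction exceeds the threshold."""
    return all(fraction > threshold for fraction in fractions)


# Illustrative counts only: three epochs on a 10,000-image set.
example_counts = [9100, 9300, 9400]
example_fractions = accuracy_fractions(example_counts, 10000)
print(example_fractions)
print(all_above(example_fractions, 0.9))
```

Passing the data set size explicitly avoids a silent error if the same check is later pointed at a data set of a different size.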

This is just one example of a test for a machine learning algorithm. By running this test case every day with Jenkins, we verify that a modification to our algorithm does not reduce the recognition accuracy.

Here is a screenshot of the result when I run the unit test from Eclipse:

eclipse_unit_test_results
The accuracy for all epochs is above 90%

How to run Python tests from Jenkins?

To launch all the Python tests of the project every day, I use Jenkins and nose2.

On Ubuntu it is easy to install nose2. Follow the instructions in this link: https://nose2.readthedocs.io/en/latest/getting_started.html

Once nose2 is installed, go to the top directory of your Python project and launch nose2 to run all the Python tests of the project. Here is an example of the result with the neural network unit tests I created previously:

nose2_results
All four tests I created ran successfully

Finally, create a new job in Jenkins. Configure the job to get the code from a repository (Git, for instance) and then launch all the tests with nose2. See this link for more information about nose2 and Jenkins integration: https://jenkins.io/solutions/python/

Conclusion

We can imagine many more tests for neural network algorithms. We could test whether the algorithm learns fast or slowly. We could check for problems such as overfitting and underfitting.

This article about TDD for machine learning can give us more ideas about what to verify in our machine learning algorithms:
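One such check could be sketched as follows: flag possible overfitting when the training accuracy exceeds the evaluation accuracy by more than a chosen margin. The function names and the default margin of 0.05 are assumptions for illustration, not values taken from the networks above.

```python
def overfitting_gap(training_accuracy, evaluation_accuracy):
    """Difference between training and evaluation accuracy (both in 0..1)."""
    return training_accuracy - evaluation_accuracy


def looks_overfitted(training_accuracy, evaluation_accuracy, max_gap=0.05):
    """True when training accuracy beats evaluation accuracy by more than max_gap."""
    return overfitting_gap(training_accuracy, evaluation_accuracy) > max_gap
```

In a test case this would become an assertFalse on looks_overfitted for the last epoch, with the margin tuned to the project.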

https://www.safaribooksonline.com/library/view/thoughtful-machine-learning/9781449374075/ch01.html

I may do another post later for more unit tests for these machine learning algorithms.