Monday, August 3, 2015

Repository Connector - Apache ManifoldCF

Writing a repository connector for Apache ManifoldCF

Apache ManifoldCF is a framework that lets you connect to source repositories and index their documents. It has a built-in security model that allows your target repository to represent the source repository's security model. The target repository is where your indexes will reside. More information about the technical structure of ManifoldCF can be found here. My aim is to walk through writing a repository connector. I have chosen an Atlassian Confluence repository for the example, and we will use the Confluence REST API to retrieve the Confluence content.

ManifoldCF provides a framework for writing a repository connector, which is a class invoked by jobs that run on a schedule. The framework also lets this class wire up UI elements such as forms. For example, a repository connector for Confluence needs some way of telling ManifoldCF the Confluence API URL of the server and the credentials needed to connect; those values come from the relevant UI forms. To write a repository connector, start by extending the base connector class BaseRepositoryConnector provided by ManifoldCF. There are a few methods that you need to implement.

You can get the source code that is built against ManifoldCF 1.8 here.

Methods to be overridden and implemented

connect() - public void connect(ConfigParams configParams). This method lets you connect to the source repository. The configParams argument carries the values from the repository connector's UI form, and you can use these values to make the connection.


check() - public String check() throws ManifoldCFException. This method lets you check whether the connection is valid for the values collected via the connect method. It returns a string describing the validity of the current connection. For example, if you cannot make the connection, you can simply return the string “Connection Failed”.


isConnected - public boolean isConnected(). Returns true if the connection is currently established; the framework uses it when running the job.

addSeedDocuments - public void addSeedDocuments(ISeedingActivity activities,
    DocumentSpecification spec, long startTime, long endTime,
    int jobMode) throws ManifoldCFException, ServiceInterruption

This method does the actual job of retrieving content from the source repository. The retrieved content takes the form of seeds; processDocuments then uses these seeds to extract metadata and index the documents.

getDocumentVersions - public String[] getDocumentVersions(String[] documentIdentifiers,
    DocumentSpecification spec) throws ManifoldCFException,
    ServiceInterruption

The framework uses these version strings to check whether content needs to be re-crawled. Usually the version is the last-modified date of the document.
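
To make the version check concrete, here is a minimal sketch, independent of ManifoldCF, of how a last-modified timestamp can act as a version string. The class VersionCheckSketch and its method names are hypothetical; the real framework stores and compares the version strings itself, so this only illustrates the idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not part of the ManifoldCF API.
final class VersionCheckSketch {
    // Version strings recorded during the previous crawl, keyed by document identifier.
    private final Map<String, String> lastSeen = new HashMap<>();

    // Derive a version string from the last-modified time (epoch milliseconds).
    static String versionOf(long lastModifiedMillis) {
        return Long.toString(lastModifiedMillis);
    }

    // Returns true when the document is new or its version string has changed,
    // and records the current version for the next comparison.
    boolean needsRecrawl(String documentId, String currentVersion) {
        String previous = lastSeen.put(documentId, currentVersion);
        return !currentVersion.equals(previous);
    }
}
```

A document seen for the first time, or one whose timestamp has moved, is reported as needing a re-crawl; an unchanged timestamp is not.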


processDocuments - public void processDocuments(String[] documentIdentifiers,
    String[] versions, IProcessActivity activities,
    DocumentSpecification spec, boolean[] scanOnly)
    throws ManifoldCFException, ServiceInterruption

This method takes the seeds, extracts the metadata, and indexes each piece of content. The results are typically transferred to your target repository, such as Solr.
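
The per-document decision inside such a method can be sketched without the framework. In the example below, the class ProcessLoopSketch and the action strings are hypothetical; it only mirrors the idea that a document flagged as scan-only is skipped, a document without a version is deleted, and everything else is ingested.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not part of the ManifoldCF API.
final class ProcessLoopSketch {
    // Returns the action chosen for each document, in input order.
    static List<String> plan(String[] ids, String[] versions, boolean[] scanOnly) {
        List<String> actions = new ArrayList<>();
        for (int i = 0; i < ids.length; i++) {
            if (scanOnly[i]) {
                continue; // only the version was wanted, nothing to ingest
            }
            if (versions[i] == null) {
                actions.add("delete:" + ids[i]); // no version: the document is gone
            } else {
                actions.add("ingest:" + ids[i] + "@" + versions[i]);
            }
        }
        return actions;
    }
}
```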

viewConfiguration - public void viewConfiguration(IThreadContext threadContext,
    IHTTPOutput out, Locale locale, ConfigParams parameters)
    throws ManifoldCFException, IOException

UI utility method. Typically you fill the parameters with values saved on an earlier occasion, such as the URL and API credentials that were persisted in the context (usually you would have saved those values when you first configured the connection, using the processConfigurationPost method). The method is called when the UI displays values in “view” mode.

outputConfigurationHeader - public void outputConfigurationHeader(IThreadContext threadContext,
    IHTTPOutput out, Locale locale, ConfigParams parameters,
    List<String> tabsArray) throws ManifoldCFException, IOException

UI method, invoked by the framework to populate the header details in the UI. The implementation typically adds tab information alongside any default tabs.

processConfigurationPost - public String processConfigurationPost(IThreadContext threadContext,
    IPostParameters variableContext, ConfigParams parameters)
    throws ManifoldCFException

Here you save the values posted from the UI, such as the API URL and API credentials. You retrieve the values from variableContext and save them to parameters.
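
The copy-from-post-to-parameters step can be sketched with plain maps standing in for variableContext and parameters. The class ConfigPostSketch, the parameter names, and the validation rule are assumptions for illustration. Returning null on success and an error message otherwise follows my understanding of the ManifoldCF convention, so check it against the framework documentation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; plain maps replace the framework types.
final class ConfigPostSketch {
    static final String HOST_PARAM = "confhost"; // assumed parameter name
    static final String PORT_PARAM = "confport"; // assumed parameter name

    // Copies posted values into the saved parameters.
    // Returns null when everything was accepted, or an error message otherwise.
    static String processPost(Map<String, String> posted, Map<String, String> parameters) {
        String host = posted.get(HOST_PARAM);
        if (host != null) {
            parameters.put(HOST_PARAM, host.trim());
        }
        String port = posted.get(PORT_PARAM);
        if (port != null) {
            String trimmed = port.trim();
            try {
                Integer.parseInt(trimmed);
            } catch (NumberFormatException e) {
                return "Port must be a number: " + port;
            }
            parameters.put(PORT_PARAM, trimmed);
        }
        return null;
    }
}
```

Values that were not posted are left untouched, which keeps previously saved settings intact when only one tab of the form is submitted.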

viewSpecification - public void viewSpecification(IHTTPOutput out, Locale locale,
    DocumentSpecification ds) throws ManifoldCFException, IOException

This method is invoked when you view the job specification details of the repository connector.

processSpecificationPost - public String processSpecificationPost(IPostParameters variableContext,
    DocumentSpecification ds) throws ManifoldCFException

Identical to processConfigurationPost, but the posted values relate to the job rather than the repository. You may process values such as custom parameters for your API queries.

outputSpecificationBody - public void outputSpecificationBody(IHTTPOutput out, Locale locale,
    DocumentSpecification ds, String tabName)
    throws ManifoldCFException, IOException

This method is invoked when you view the specification details of the job.

outputSpecificationHeader - public void outputSpecificationHeader(IHTTPOutput out, Locale locale,
    DocumentSpecification ds, List<String> tabsArray)
    throws ManifoldCFException, IOException

Identical to outputConfigurationHeader, but for the job.

Structure of a Repository connector.

How does ManifoldCF recognize a new connector? It all works on OSGi: you create a jar file containing your repository connector and a security connector (which will be looked at later) and drop it into the connector libraries folder. Once ManifoldCF starts, it will automatically pick up your new connector. Be sure to watch the log file manifoldcf.log, which can be found in the logs folder.

1.       Create the project. Just extend the parent ManifoldCF POM and most of the necessary dependencies will be imported. A sample pom file may look like this

2.       Resource files. These are the HTML and JavaScript files that must be available on the classpath so the framework can pick them up. For example, the file editConfiguration_conf_server.html is loaded to hold your repository connector details. You reference these files explicitly in the relevant UI methods described above.

connect - method
super.connect(configParams);

confprotocol = params.getParameter(ConfluenceConfig.CONF_PROTOCOL_PARAM);
confhost = params.getParameter(ConfluenceConfig.CONF_HOST_PARAM);
confport = params.getParameter(ConfluenceConfig.CONF_PORT_PARAM);
confpath = params.getParameter(ConfluenceConfig.CONF_PATH_PARAM);
confsoapapipath = params.getParameter(ConfluenceConfig.CONF_SOAP_API_PARAM);
clientid = params.getParameter(ConfluenceConfig.CLIENT_ID_PARAM);
clientsecret = params.getObfuscatedParameter(ConfluenceConfig.CLIENT_SECRET_PARAM);

confproxyhost = params.getParameter(ConfluenceConfig.CONF_PROXYHOST_PARAM);
confproxyport = params.getParameter(ConfluenceConfig.CONF_PROXYPORT_PARAM);
confproxydomain = params.getParameter(ConfluenceConfig.CONF_PROXYDOMAIN_PARAM);
confproxyusername = params.getParameter(ConfluenceConfig.CONF_PROXYUSERNAME_PARAM);
confproxypassword = params.getObfuscatedParameter(ConfluenceConfig.CONF_PROXYPASSWORD_PARAM);

try {
    getConfluenceService();
} catch (ManifoldCFException e) {
    Logging.connectors.error(e);
}


Take the values available in configParams and use them to connect to the Confluence server.
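
As an illustration of how those connection values might be combined, the following sketch assembles a base URL from protocol, host, port, and path. The class ConfluenceUrlSketch and the method baseUrl are hypothetical and not part of the connector source, and the host used in the usage note is made up.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not taken from the connector source.
final class ConfluenceUrlSketch {
    // Builds e.g. https://host:port/path from the individual connection values.
    // An empty or missing port is left out; a missing leading slash is added to the path.
    static String baseUrl(String protocol, String host, String port, String path) {
        int portNumber = (port == null || port.trim().isEmpty())
                ? -1
                : Integer.parseInt(port.trim());
        String normalizedPath = (path == null) ? "" : path.trim();
        if (!normalizedPath.isEmpty() && !normalizedPath.startsWith("/")) {
            normalizedPath = "/" + normalizedPath;
        }
        try {
            return new URI(protocol, null, host, portNumber, normalizedPath, null, null)
                    .toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid connection values", e);
        }
    }
}
```

For example, baseUrl("https", "wiki.example.com", "8443", "confluence") yields https://wiki.example.com:8443/confluence.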

check –
try {
    return checkConnection();
} catch (ServiceInterruption e) {
    Logging.connectors.error("Error ", e);
    return "Connection temporarily failed: ";
} catch (ManifoldCFException e) {
    Logging.connectors.error("Error ", e);
    return "Connection failed: ";
}

The check instantiates a separate thread that verifies whether the connection is valid, but you are not required to do this via a thread.

protected String checkConnection() throws ManifoldCFException,
        ServiceInterruption {
    String result = "Unknown";
    getConfluenceService();
    CheckConnectionThread t = new CheckConnectionThread(getSession(),
            service);
    try {
        t.start();
        t.finishUp();
        result = t.result;
    } catch (InterruptedException e) {
        t.interrupt();
        throw new ManifoldCFException("Interrupted: " + e.getMessage(), e,
                ManifoldCFException.INTERRUPTED);
    } catch (java.net.SocketTimeoutException e) {
        handleIOException(e);
    } catch (InterruptedIOException e) {
        t.interrupt();
        handleIOException(e);
    } catch (IOException e) {
        handleIOException(e);
    } catch (ResponseException e) {
        handleResponseException(e);
    }

    return result;
}
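
The CheckConnectionThread class itself is not shown in this post. As a rough idea of the start/finishUp pattern, here is a self-contained sketch in which a worker thread runs a probe and finishUp waits for it and rethrows any failure. The class CheckThreadSketch, the Supplier-based probe, and the message strings are assumptions; the actual class in the source may look different.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a connection-check thread; not ManifoldCF code.
final class CheckThreadSketch extends Thread {
    private final Supplier<String> probe;
    private volatile String result = "Unknown";
    private volatile RuntimeException failure;

    CheckThreadSketch(Supplier<String> probe) {
        this.probe = probe;
        setDaemon(true);
    }

    @Override
    public void run() {
        try {
            result = probe.get(); // e.g. a cheap request against the server
        } catch (RuntimeException e) {
            failure = e; // remembered so the caller can see it in finishUp()
        }
    }

    // Waits for the worker, then rethrows whatever it failed with.
    String finishUp() throws InterruptedException {
        join();
        if (failure != null) {
            throw failure;
        }
        return result;
    }

    // Runs the probe on a worker thread and maps the outcome to a status string.
    static String runCheck(Supplier<String> probe) {
        CheckThreadSketch t = new CheckThreadSketch(probe);
        t.start();
        try {
            return t.finishUp();
        } catch (InterruptedException e) {
            t.interrupt();
            Thread.currentThread().interrupt();
            return "Connection temporarily failed: interrupted";
        } catch (RuntimeException e) {
            return "Connection failed: " + e.getMessage();
        }
    }
}
```

Running the probe on its own thread lets the caller interrupt a check that hangs on the network, which is one reason to prefer this shape over a direct call.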

addSeedDocuments –
GetSeedsThread t = new GetSeedsThread(getSession(), confDriveQuery);
try {
    t.start();

    boolean wasInterrupted = false;
    try {
        XThreadStringBuffer seedBuffer = t.getBuffer();

        while (true) {
            String contentKey = seedBuffer.fetch();
            if (contentKey == null)
                break;
            // Add the pageID to the queue
            activities.addSeedDocument(contentKey);
        }
    } catch (InterruptedException e) {
        wasInterrupted = true;
        throw e;
    } catch (ManifoldCFException e) {
        if (e.getErrorCode() == ManifoldCFException.INTERRUPTED)
            wasInterrupted = true;
        throw e;
    } finally {
        if (!wasInterrupted)
            t.finishUp();
    }
} catch (InterruptedException e) {
    t.interrupt();
    throw new ManifoldCFException("Interrupted: " + e.getMessage(), e,
            ManifoldCFException.INTERRUPTED);
} catch (java.net.SocketTimeoutException e) {
    handleIOException(e);
} catch (InterruptedIOException e) {
    t.interrupt();
    handleIOException(e);
} catch (IOException e) {
    handleIOException(e);
} catch (ResponseException e) {
    handleResponseException(e);
}

Here again a new thread is created to add the seeds, but the framework does not require you to do so.
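
The hand-off between the seeding thread and the caller can be sketched with standard library classes. Below, a LinkedBlockingQueue stands in for XThreadStringBuffer and a sentinel object replaces the null that marks the end of the stream, because a BlockingQueue cannot hold null. The class SeedBufferSketch is hypothetical and only illustrates the producer/consumer shape.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for illustration; not ManifoldCF code.
final class SeedBufferSketch {
    // Unique end-of-stream marker, compared by identity.
    private static final String END = new String("<end-of-seeds>");

    // A producer thread pushes content keys; the caller drains them in order.
    static List<String> collectSeeds(List<String> contentKeys) {
        BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
        Thread producer = new Thread(() -> {
            for (String key : contentKeys) {
                buffer.add(key);
            }
            buffer.add(END);
        });
        producer.start();

        List<String> seeds = new ArrayList<>();
        try {
            while (true) {
                String key = buffer.take();
                if (key == END) { // identity check on the sentinel is intended
                    break;
                }
                seeds.add(key); // the connector would call activities.addSeedDocument here
            }
            producer.join();
        } catch (InterruptedException e) {
            producer.interrupt();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while seeding", e);
        }
        return seeds;
    }
}
```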

processDocuments –
for (int i = 0; i < documentIdentifiers.length; i++) {
    String nodeId = documentIdentifiers[i];
    String version = versions[i];

    long startTime = System.currentTimeMillis();
    String errorCode = "FAILED";
    String errorDesc = StringUtils.EMPTY;
    Long fileSize = null;
    boolean doLog = false;

    try {
        if (Logging.connectors != null) {
            Logging.connectors.debug("Confluence "
                    + ": Processing document identifier '" + nodeId
                    + "'");
        }

        if (!scanOnly[i]) {
            if (version != null) {
                doLog = true;

                try {
                    errorCode = processConfluenceDocuments(nodeId,
                            activities, version, fileSize);
                } catch (Exception e) {
                    if (Logging.connectors != null) {
                        Logging.connectors.error(e);
                    }
                }

            } else {
                activities.deleteDocument(nodeId);
            }
        }
    } finally {
        if (doLog)
            activities.recordActivity(new Long(startTime),
                    ACTIVITY_READ, fileSize, nodeId, errorCode,
                    errorDesc, null);
    }
}

You can simply loop through the available seeds and do whatever is relevant, such as extracting the metadata.
For the sake of brevity I have omitted the other methods, but you can look at the source code to follow the rest.
