Writing a Repository Connector for Apache ManifoldCF
Apache ManifoldCF is a framework that lets you connect to source repositories
and index their documents. It has a built-in security model that allows your
target repositories to represent the security model of the source. The target
repository is where the indexes reside. More information about the technical
structure of ManifoldCF can be found here.
My aim is to walk through writing a repository connector. I have chosen
Atlassian Confluence as the example repository, and we will use the Confluence
REST API to retrieve its contents.
ManifoldCF provides a framework for writing a repository connector, which is a
class that is invoked by jobs running on a schedule. Through this class the
framework also lets you wire in UI elements such as forms. For example, if you
write a repository connector for Confluence, you need some way of telling
ManifoldCF the Confluence API URL of the server and the credentials needed to
connect; those values come from the relevant UI forms. To write a repository
connector, start by inheriting from the base connector class
BaseRepositoryConnector provided by ManifoldCF itself. There are a few methods
for which you need to provide an implementation.
You can get the source code that is built against ManifoldCF 1.8 here.
Methods to be overridden and implemented
connect()
- public void connect(ConfigParams configParams)
This method lets you set up the connection to the source repository. The
configParams are sent from the UI form of the repository connector, and you can
use these values to make the connection.
check()
- public String check() throws ManifoldCFException
This method lets you check whether the connection is valid with respect to the
values collected in the connect method. It returns a string describing the
validity of the current connection. For example, if you cannot make the
connection, you can simply return the string “Connection Failed”.
isConnected
- public boolean isConnected()
Returns true if the current connection is established. The framework uses this
when running the job.
addSeedDocuments
- public void addSeedDocuments(ISeedingActivity activities,
    DocumentSpecification spec, long startTime, long endTime, int jobMode)
    throws ManifoldCFException, ServiceInterruption
This method does the actual job of retrieving content from the source
repository. The retrieved content takes the form of document identifiers called
seeds; processDocuments then uses these seeds to extract the metadata and index
each document.
getDocumentVersions
- public String[] getDocumentVersions(String[] documentIdentifiers,
    DocumentSpecification spec)
    throws ManifoldCFException, ServiceInterruption
The framework uses these version strings to decide whether a document needs to
be re-crawled. Usually the version is the last-modified date of the document.
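To make the version idea concrete, here is a minimal sketch that does not use the ManifoldCF API at all. The class VersionCheck and both of its methods are hypothetical stand-ins: the version string is derived from a last-modified timestamp, and a document is re-crawled only when that string differs from the one stored on the previous run.

```java
// Hypothetical stand-in, not part of ManifoldCF: shows version-string comparison only.
class VersionCheck {

    // Derive a version string from the last-modified time (epoch milliseconds).
    static String versionOf(long lastModifiedMillis) {
        return Long.toString(lastModifiedMillis);
    }

    // A document needs re-crawling when nothing was stored before,
    // or when the stored version differs from the current one.
    static boolean needsRecrawl(String storedVersion, String currentVersion) {
        return storedVersion == null || !storedVersion.equals(currentVersion);
    }

    public static void main(String[] args) {
        String stored = versionOf(1000L);
        System.out.println(needsRecrawl(stored, versionOf(1000L))); // unchanged document
        System.out.println(needsRecrawl(stored, versionOf(2000L))); // modified document
    }
}
```

Any string works as a version as long as it changes whenever the content changes, so a hash or a revision number would serve equally well.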
processDocuments
- public void processDocuments(String[] documentIdentifiers,
    String[] versions, IProcessActivity activities,
    DocumentSpecification spec, boolean[] scanOnly)
    throws ManifoldCFException, ServiceInterruption
This method takes the seeds, extracts the metadata and indexes each document.
The results are typically transferred to your target repository, such as Solr.
viewConfiguration
- public void viewConfiguration(IThreadContext threadContext,
    IHTTPOutput out, Locale locale, ConfigParams parameters)
    throws ManifoldCFException, IOException
A UI utility method. Typically you fill the view with the values in parameters
that were saved on an earlier occasion, such as the URL and API credentials
(usually you would have stored those values when the connection was first
configured, in the processConfigurationPost method). The method is called when
the UI displays the values in “view” mode.
outputConfigurationHeader
- public void outputConfigurationHeader(IThreadContext threadContext,
    IHTTPOutput out, Locale locale, ConfigParams parameters,
    List<String> tabsArray)
    throws ManifoldCFException, IOException
A UI method invoked by the framework to populate the header details in the UI.
The implementation typically adds the connector's tab names to tabsArray
alongside any default ones.
processConfigurationPost
- public String processConfigurationPost(IThreadContext threadContext,
    IPostParameters variableContext, ConfigParams parameters)
    throws ManifoldCFException
Here you save the values posted from the UI, such as the API URL and the API
credentials. You retrieve the values from variableContext and save them into
parameters.
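The copy step can be sketched without the framework. In the sketch below, plain maps stand in for IPostParameters and ConfigParams, and the class name and field names are assumptions made for illustration. In a real connector the read would go through variableContext.getParameter and the write through parameters.setParameter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: plain maps play the roles of the posted form and the saved configuration.
class ConfigPostSketch {

    // Form field names, assumed for this example only.
    static final String[] FIELDS = {"confhost", "confport", "clientid"};

    // Copy every posted field into the saved parameters.
    // A field missing from the post leaves the stored value untouched.
    static void copyPostedValues(Map<String, String> posted, Map<String, String> saved) {
        for (String field : FIELDS) {
            String value = posted.get(field);
            if (value != null) {
                saved.put(field, value);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> posted = new HashMap<>();
        posted.put("confhost", "wiki.example.com");
        Map<String, String> saved = new HashMap<>();
        saved.put("confport", "8090");
        copyPostedValues(posted, saved);
        System.out.println(saved.get("confhost") + ":" + saved.get("confport"));
    }
}
```

Skipping absent fields matters because a form post usually carries only the fields of the tab being edited, so overwriting everything would wipe values from the other tabs.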
viewSpecification
- public void viewSpecification(IHTTPOutput out, Locale locale,
    DocumentSpecification ds)
    throws ManifoldCFException, IOException
This method is invoked when you view the job specification details of the
repository connector.
processSpecificationPost
- public String processSpecificationPost(IPostParameters variableContext,
    DocumentSpecification ds)
    throws ManifoldCFException
Identical to processConfigurationPost, but the posted values belong to the job
rather than to the repository connection. Here you may process values such as
any custom parameters for your API queries.
outputSpecificationBody
- public void outputSpecificationBody(IHTTPOutput out, Locale locale,
    DocumentSpecification ds, String tabName)
    throws ManifoldCFException, IOException
This method is invoked when you edit the specification details of the job. It
outputs the form body for the tab named by tabName.
outputSpecificationHeader
- public void outputSpecificationHeader(IHTTPOutput out, Locale locale,
    DocumentSpecification ds, List<String> tabsArray)
    throws ManifoldCFException, IOException
Identical to outputConfigurationHeader, but for the job.
Structure of a Repository Connector
How does ManifoldCF recognize a new connector? Connectors are loaded from plain
jar files (ManifoldCF does not use OSGi). You need to create a jar file
containing your repository connector and a security connector (which will be
looked at later) and drop it into the connector libraries folder. In a standard
distribution the connector class is also registered in connectors.xml. When
ManifoldCF starts it picks up your new connector, and you should definitely
watch the log file manifoldcf.log, found in the logs folder, for loading errors.
1. Create the project. Just extend the POM from the parent ManifoldCF POM and
most of the necessary dependencies will be imported. A sample POM file may look
like this
2. Resource files. These are the typical HTML and JavaScript files that you
need to make available on the classpath so that the framework can pick them up.
For example, the file editConfiguration_conf_server.html is loaded to render
your repository connector details. You locate these files explicitly in the
relevant UI methods described above.
connect - method
    // super.connect() stores configParams in the inherited params field
    super.connect(configParams);
    confprotocol = params.getParameter(ConfluenceConfig.CONF_PROTOCOL_PARAM);
    confhost = params.getParameter(ConfluenceConfig.CONF_HOST_PARAM);
    confport = params.getParameter(ConfluenceConfig.CONF_PORT_PARAM);
    confpath = params.getParameter(ConfluenceConfig.CONF_PATH_PARAM);
    confsoapapipath = params.getParameter(ConfluenceConfig.CONF_SOAP_API_PARAM);
    clientid = params.getParameter(ConfluenceConfig.CLIENT_ID_PARAM);
    // secrets are stored obfuscated, so they are read with getObfuscatedParameter
    clientsecret = params.getObfuscatedParameter(ConfluenceConfig.CLIENT_SECRET_PARAM);
    confproxyhost = params.getParameter(ConfluenceConfig.CONF_PROXYHOST_PARAM);
    confproxyport = params.getParameter(ConfluenceConfig.CONF_PROXYPORT_PARAM);
    confproxydomain = params.getParameter(ConfluenceConfig.CONF_PROXYDOMAIN_PARAM);
    confproxyusername = params.getParameter(ConfluenceConfig.CONF_PROXYUSERNAME_PARAM);
    confproxypassword = params.getObfuscatedParameter(ConfluenceConfig.CONF_PROXYPASSWORD_PARAM);
    try {
      getConfluenceService();
    } catch (ManifoldCFException e) {
      Logging.connectors.error(e);
    }
Take the values available in configParams and use them to connect to the
Confluence server.
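As an illustration of what "using the values" can mean, the sketch below assembles a base URL from the protocol, host, port and path settings. The class BaseUrlSketch and the method buildBaseUrl are hypothetical and are not taken from the connector source; the host name is a placeholder.

```java
// Hypothetical helper, not from the connector source: joins connection settings into a base URL.
class BaseUrlSketch {

    static String buildBaseUrl(String protocol, String host, String port, String path) {
        StringBuilder url = new StringBuilder();
        url.append(protocol).append("://").append(host);
        // The port is optional; leave it out when the form field was empty.
        if (port != null && !port.isEmpty()) {
            url.append(':').append(port);
        }
        // Make sure the path starts with exactly one slash.
        if (path != null && !path.isEmpty()) {
            if (!path.startsWith("/")) {
                url.append('/');
            }
            url.append(path);
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildBaseUrl("https", "wiki.example.com", "8443", "/confluence"));
    }
}
```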
check –
    try {
      return checkConnection();
    } catch (ServiceInterruption e) {
      Logging.connectors.error("Error ", e);
      // include the cause so the UI shows why the check failed
      return "Connection temporarily failed: " + e.getMessage();
    } catch (ManifoldCFException e) {
      Logging.connectors.error("Error ", e);
      return "Connection failed: " + e.getMessage();
    }
The checkConnection helper instantiates a separate thread that checks whether
the connection is valid, but you are not required to do this via a thread.
    protected String checkConnection()
        throws ManifoldCFException, ServiceInterruption {
      String result = "Unknown";
      getConfluenceService();
      CheckConnectionThread t = new CheckConnectionThread(getSession(), service);
      try {
        t.start();
        t.finishUp();
        result = t.result;
      } catch (InterruptedException e) {
        t.interrupt();
        throw new ManifoldCFException("Interrupted: " + e.getMessage(), e,
            ManifoldCFException.INTERRUPTED);
      } catch (java.net.SocketTimeoutException e) {
        handleIOException(e);
      } catch (InterruptedIOException e) {
        t.interrupt();
        handleIOException(e);
      } catch (IOException e) {
        handleIOException(e);
      } catch (ResponseException e) {
        handleResponseException(e);
      }
      return result;
    }
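The start/finishUp pattern can be shown in isolation. The following is a minimal sketch using only the JDK: CheckThreadSketch is a hypothetical stand-in for CheckConnectionThread, the probe is any Runnable that throws when the server is unreachable, and the result strings are illustrative.

```java
// Hypothetical stand-in for CheckConnectionThread: runs a probe on a worker thread.
class CheckThreadSketch extends Thread {
    private final Runnable probe;              // the actual connectivity test (assumed)
    private volatile RuntimeException failure; // exception raised by the probe, if any
    volatile String result = "Unknown";

    CheckThreadSketch(Runnable probe) {
        this.probe = probe;
        setDaemon(true); // never keep the JVM alive just for a check
    }

    @Override
    public void run() {
        try {
            probe.run();
            result = "Connection working";
        } catch (RuntimeException e) {
            failure = e; // remember it; the caller sees it in finishUp()
        }
    }

    // Wait for the worker, then rethrow its failure on the calling thread.
    void finishUp() throws InterruptedException {
        join();
        if (failure != null) {
            throw failure;
        }
    }

    static String check(Runnable probe) {
        CheckThreadSketch t = new CheckThreadSketch(probe);
        t.start();
        try {
            t.finishUp();
            return t.result;
        } catch (InterruptedException e) {
            t.interrupt();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return "Interrupted";
        } catch (RuntimeException e) {
            return "Connection failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check(() -> { }));
        System.out.println(check(() -> { throw new IllegalStateException("refused"); }));
    }
}
```

The point of the extra thread is that the calling thread stays interruptible while a slow network call is in flight; if that does not matter for your connector, calling the probe directly is simpler.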
addSeedDocuments –
    GetSeedsThread t = new GetSeedsThread(getSession(), confDriveQuery);
    try {
      t.start();
      boolean wasInterrupted = false;
      try {
        XThreadStringBuffer seedBuffer = t.getBuffer();
        while (true) {
          String contentKey = seedBuffer.fetch();
          if (contentKey == null)
            break;
          // Add the pageID to the queue
          activities.addSeedDocument(contentKey);
        }
      } catch (InterruptedException e) {
        wasInterrupted = true;
        throw e;
      } catch (ManifoldCFException e) {
        if (e.getErrorCode() == ManifoldCFException.INTERRUPTED)
          wasInterrupted = true;
        throw e;
      } finally {
        if (!wasInterrupted)
          t.finishUp();
      }
    } catch (InterruptedException e) {
      t.interrupt();
      throw new ManifoldCFException("Interrupted: " + e.getMessage(), e,
          ManifoldCFException.INTERRUPTED);
    } catch (java.net.SocketTimeoutException e) {
      handleIOException(e);
    } catch (InterruptedIOException e) {
      t.interrupt();
      handleIOException(e);
    } catch (IOException e) {
      handleIOException(e);
    } catch (ResponseException e) {
      handleResponseException(e);
    }
Here again a new thread is created to add the seeds, but the framework does not
require you to do so.
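The hand-off between the seeding thread and the caller can be sketched with JDK classes only. In this sketch a BlockingQueue stands in for XThreadStringBuffer and a sentinel object marks the end of the stream, where the connector code uses a null return from fetch(). The class SeedBufferSketch and the page IDs are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for the GetSeedsThread / XThreadStringBuffer pair.
class SeedBufferSketch {

    // End-of-stream marker, compared by identity so it cannot collide with a real ID.
    private static final String END = new String("END");

    static List<String> collectSeeds(final List<String> sourceIds) {
        // Small capacity so the producer blocks until the consumer catches up.
        final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(2);

        Thread producer = new Thread(() -> {
            try {
                for (String id : sourceIds) {
                    buffer.put(id); // in a real connector: IDs read from the repository
                }
                buffer.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        List<String> seeds = new ArrayList<>();
        try {
            while (true) {
                String id = buffer.take();
                if (id == END) {
                    break;
                }
                seeds.add(id); // corresponds to activities.addSeedDocument(contentKey)
            }
            producer.join();
        } catch (InterruptedException e) {
            producer.interrupt();
            Thread.currentThread().interrupt();
        }
        return seeds;
    }

    public static void main(String[] args) {
        System.out.println(collectSeeds(Arrays.asList("101", "102", "103")));
    }
}
```

A bounded buffer keeps memory use flat even when the repository holds far more pages than the framework can queue at once.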
processDocuments –
    for (int i = 0; i < documentIdentifiers.length; i++) {
      String nodeId = documentIdentifiers[i];
      String version = versions[i];
      long startTime = System.currentTimeMillis();
      String errorCode = "FAILED";
      String errorDesc = StringUtils.EMPTY;
      Long fileSize = null;
      boolean doLog = false;
      try {
        if (Logging.connectors != null) {
          Logging.connectors.debug("Confluence: Processing document identifier '"
              + nodeId + "'");
        }
        if (!scanOnly[i]) {
          if (version != null) {
            doLog = true;
            try {
              errorCode = processConfluenceDocuments(nodeId, activities,
                  version, fileSize);
            } catch (Exception e) {
              // keep the failure reason so it appears in the activity record
              errorDesc = e.getMessage();
              if (Logging.connectors != null) {
                Logging.connectors.error(e);
              }
            }
          } else {
            // a null version means the document is gone from the source
            activities.deleteDocument(nodeId);
          }
        }
      } finally {
        if (doLog)
          activities.recordActivity(new Long(startTime), ACTIVITY_READ,
              fileSize, nodeId, errorCode, errorDesc, null);
      }
    }
You can simply loop through the available seeds and do whatever is relevant for
each one, such as extracting its metadata.
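Stripped of logging and error handling, the loop makes one decision per document. The sketch below isolates that decision; the Activities interface is a hypothetical stand-in for the framework callback and is not the real IProcessActivity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the per-document decision made in processDocuments.
class ProcessLoopSketch {

    // Stand-in for the framework callback; the real interface is larger.
    interface Activities {
        void ingest(String id, String version);
        void delete(String id);
    }

    static void process(String[] ids, String[] versions, boolean[] scanOnly,
                        Activities activities) {
        for (int i = 0; i < ids.length; i++) {
            if (scanOnly[i]) {
                continue; // the framework only wants this one scanned, not fetched
            }
            if (versions[i] == null) {
                activities.delete(ids[i]); // no version: the document no longer exists
            } else {
                activities.ingest(ids[i], versions[i]);
            }
        }
    }

    public static void main(String[] args) {
        final List<String> log = new ArrayList<>();
        process(new String[] {"a", "b", "c"},
                new String[] {"1", null, "3"},
                new boolean[] {false, false, true},
                new Activities() {
                    public void ingest(String id, String version) {
                        log.add("ingest " + id);
                    }
                    public void delete(String id) {
                        log.add("delete " + id);
                    }
                });
        System.out.println(log);
    }
}
```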
For the sake of brevity I have omitted the other methods, but you can have a
look at the source code to follow the rest.